Founding Thesis

The dataset biological foundation models need does not yet exist.

The last decade of machine learning has taught a consistent lesson: models are bounded by the quality and structure of their training data, not by their architecture. AlphaFold 2 solved protein structure prediction not because its neural network was novel, but because it had access to the Protein Data Bank — a decades-long, systematic accumulation of structural ground truth.

Biology is now in the phase where the same dynamic applies. In protein folding, the field had to wait for both the models and the data; here, the models are already being built. Xaira committed $1B to building the first virtual cell. CZI launched a platform to simulate every human cell type. Isomorphic Labs raised $600M to understand cellular biology at a mechanistic level. Every one of these initiatives has identified the same bottleneck: large-scale, structured, physiologically relevant perturbation data in the cell types that matter clinically.

The field is not waiting for its AlphaFold. The AlphaFolds are funded and building. What they are waiting for is their Protein Data Bank.

Several large-scale perturbation datasets exist. Each has structural limitations that prevent it from serving as training data for whole-cell simulation models.

LINCS L1000 (~1.3M profiles). Measures ~1,000 landmark genes and infers the rest. Cancer cell lines throughout. Designed to answer drug discovery questions, not to train generative models. Single modality.

JUMP-CP (~116K compounds). Largest public morphological dataset. Uses U2OS osteosarcoma cells, chosen for imaging convenience rather than biological relevance. Single modality. Minimal combination perturbations.

Tahoe-100M (100M profiles). Most recent large-scale atlas. Transcriptomics only. Cancer cell lines exclusively. Designed for pharma partnership revenue, not AI training.

Illumina BioInsight (Billion Cell Atlas). Transcriptomics only. Immortalized cell lines. Owned by Illumina, which gives every buyer a structural reason to want an independent source.

Xaira X-Atlas (~8M single-cell profiles). Strongest public dataset to date. Transcriptomics only. HCT116 and HEK293T cell lines, both immortalized. Xaira has publicly stated that iPSC-derived systems are its next step.

What no existing dataset provides: simultaneous multimodal readouts — morphological and transcriptomic — from identically perturbed cells in matched parallel arms, in a physiologically relevant human cell type. This paired design is what whole-cell simulation models require to learn the mechanistic relationships between cellular subsystems. A model trained on morphology alone cannot learn why a cell looks the way it does. The integration is where the predictive power lives.

If Xaira, Recursion, and Illumina are all working on this problem with hundreds of millions of dollars, why hasn't the gap been filled? The answer is not that they are uninformed. Three structural reasons explain it.

Path dependency. Every major dataset was built around what was experimentally convenient — cancer cell lines that grow fast and transfect easily, compound libraries drawn from pharmaceutical freezers. Large organizations cannot easily deprecate years of infrastructure to start fresh in iPSC-derived systems. On this specific problem, the playing field is closer to level than it appears from the outside.

Misaligned incentives. Recursion optimizes for its drug discovery pipeline. Tahoe optimizes for pharma partnership revenue. Illumina optimizes for sequencing instrument sales. None of them has built a dataset designed purely to maximize its value as AI training data, not because they couldn't, but because they are optimizing for other objectives simultaneously.

The independence argument. Xaira will not buy training data from Isomorphic. Isomorphic will not buy training data from Recursion. A neutral third-party dataset, licensed non-exclusively, is more commercially attractive to each buyer than any competitor's dataset — because an in-house dataset cannot be licensed across the field, while a neutral dataset licenses to everyone simultaneously. This is why the Protein Data Bank became the reference resource rather than any single lab's data collection.

Cloud lab automation has matured. Five years ago, generating a perturbation dataset at the required scale meant a $50M internal facility and years of protocol development. Today, fully automated remote laboratory infrastructure means a small team can design experiments and receive structured data without the capital overhead of a traditional research organization. The cost structure of dataset generation has dropped by an order of magnitude.

The demand signal is explicit. Frontier foundation model labs are not hypothetical future buyers. They are funded, building now, and publicly articulating what they need. The X-Atlas dataset was downloaded over 16,000 times in its first two weeks — the field's appetite for high-quality perturbation data is not in question.

The window is open but not indefinitely. The multimodal, iPSC-derived, AI-optimized dataset does not yet exist. That will change. The question is who defines what the reference dataset looks like — and whether it is built by an organization whose only objective is to make it the highest-quality training resource in the field.

Perturb is building this dataset. Large-scale, multimodal perturbation data — transcriptomics and morphology from identically perturbed cells in matched parallel arms — in physiologically relevant iPSC-derived human cell systems, with perturbation libraries designed from first principles to maximize mechanistic coverage for model training.

The biological foundation models are being built. We are building what they train on.

Get in touch
daniel@perturb.bio