Founding Thesis

The dataset biological foundation models need does not yet exist.

The last decade of machine learning has taught a consistent lesson: models are bounded by the quality and structure of their training data, not by their architecture. AlphaFold 2 solved protein structure prediction not because its neural network was novel, but because it had access to the Protein Data Bank — a decades-long, systematic accumulation of structural ground truth.

Biology is now in the phase where the same dynamic applies. In protein folding, the field had to wait for both the models and the data; here, the models are already being built. Xaira committed $1B to building the first virtual cell. CZI launched a platform to simulate every human cell type. Isomorphic Labs raised $600M to understand cellular biology at a mechanistic level. Every one of these initiatives has identified the same bottleneck: large-scale, structured, physiologically relevant perturbation data in the cell types that matter clinically.

The field is not waiting for its AlphaFold. The AlphaFolds are funded and building. What they are waiting for is their Protein Data Bank.

Several large-scale perturbation datasets exist. Each has structural limitations that prevent it from serving as training data for whole-cell simulation models.

LINCS L1000 (~1.3M profiles). Measures ~1,000 landmark genes and infers the rest. Cancer cell lines throughout. Designed to answer drug discovery questions, not to train generative models. Single modality.

JUMP-CP (~116K compounds). Largest public morphological dataset. Uses U2OS osteosarcoma cells, chosen for imaging convenience rather than biological relevance. Single modality. Minimal combination perturbations.

Tahoe-100M (100M profiles). Most recent large-scale atlas. Transcriptomics only. Cancer cell lines exclusively. Designed for pharma partnership revenue, not AI training.

Illumina BioInsight (Billion Cell Atlas). Transcriptomics only. Immortalized cell lines. Owned by Illumina, which gives every buyer a structural reason to want an independent source.

Xaira X-Atlas (~8M single-cell profiles). Strongest public dataset to date. Transcriptomics only. HCT116 and HEK293T cell lines, both immortalized. Xaira has publicly stated that iPSC-derived systems are its next step.

What no existing dataset provides: simultaneous multimodal readouts — morphological and transcriptomic — from identically perturbed cells in matched parallel arms, in a physiologically relevant human cell type. This paired design is what whole-cell simulation models require to learn the mechanistic relationships between cellular subsystems. A model trained on morphology alone cannot learn why a cell looks the way it does. The integration is where the predictive power lives.

If Xaira, Recursion, and Illumina are all working on this problem with hundreds of millions of dollars, why hasn't the gap been filled? The answer is not that they are uninformed. Three structural reasons explain it.

Path dependency. Every major dataset was built around what was experimentally convenient — cancer cell lines that grow fast and transfect easily, compound libraries drawn from pharmaceutical freezers. Large organizations cannot easily deprecate years of infrastructure to start fresh in iPSC-derived systems. On this specific problem, the playing field is closer to level than it appears from the outside.

Misaligned incentives. Recursion optimizes for its drug discovery pipeline. Tahoe optimizes for pharma partnership revenue. Illumina optimizes for sequencing instrument sales. None of them has built a dataset designed purely to maximize its value as AI training data, not because they couldn't, but because they are optimizing for other objectives simultaneously.

The independence argument. Xaira will not buy training data from Isomorphic. Isomorphic will not buy training data from Recursion. A neutral third-party dataset, licensed non-exclusively, is more commercially attractive to each buyer than any competitor's dataset — because an in-house dataset cannot be licensed across the field, while a neutral dataset licenses to everyone simultaneously. This is why the Protein Data Bank became the reference resource rather than any single lab's data collection.

Cloud lab automation has matured. Five years ago, generating a perturbation dataset at the required scale meant a $50M internal facility and years of protocol development. Today, fully automated remote laboratory infrastructure means a small team can design experiments and receive structured data without the capital overhead of a traditional research organization. The cost structure of dataset generation has dropped by an order of magnitude.

The demand signal is explicit. Frontier foundation model labs are not hypothetical future buyers. They are funded, building now, and publicly articulating what they need. The X-Atlas dataset was downloaded over 16,000 times in its first two weeks — the field's appetite for high-quality perturbation data is not in question.

The window is open but not indefinitely. The multimodal, iPSC-derived, AI-optimized dataset does not yet exist. That will change. The question is who defines what the reference dataset looks like — and whether it is built by an organization whose only objective is to make it the highest-quality training resource in the field.

Perturb is building this dataset. Large-scale, multimodal perturbation data — transcriptomics and morphology from identically perturbed cells in matched parallel arms — in physiologically relevant iPSC-derived human cell systems, with perturbation libraries designed from first principles to maximize mechanistic coverage for model training.

The biological foundation models are being built. We are building what they train on.

Get in touch
daniel@perturb.bio