🎯 Core Concepts
The Challenge
Understanding how cells respond to perturbations (genetic knockouts, drug treatments, etc.) is
fundamental to biology and medicine. However:
- Experimental Limitations: Testing all possible perturbations is prohibitively expensive and time-consuming
- Combinatorial Explosion: For n genes there are n(n-1)/2 possible pairwise combinations, and the number of higher-order combinations grows exponentially (2^n possible subsets)
- Cellular Heterogeneity: Individual cells respond differently to the same perturbation
- Context-Dependence: Effects vary across cell types, conditions, and genetic backgrounds
- Data Quality: Single-cell data is sparse, noisy, and affected by batch effects
Key Biological Facts
- Only ~41% of gene perturbations have measurable transcriptome-wide effects (Replogle et al., 2022)
- A typical gene perturbation affects ~45 genes; essential genes affect >500 genes
- Perturbation effects exhibit mixture distributions - some cells escape the perturbation entirely
- Most effects are small: 86.6% below 0.01 log-fold change
- Network structure matters: 77.3% of direct regulators (distance 1 in the GRN) confer moderate-to-strong effects
The Opportunity
Computational prediction methods aim to:
- Predict cellular responses to unseen perturbations without experiments
- Model combinatorial perturbations (e.g., gene pairs) using models trained only on single perturbations
- Enable rational experimental design by prioritizing promising perturbations
- Uncover mechanisms of action and genetic interactions
- Accelerate drug discovery by predicting compound effects
📈 Large-Scale Datasets & Platforms (2024-2025)
Industrial-scale perturbation atlases provide massive training data for foundation models.
These represent significant experimental advances in scale and throughput.
Tahoe-100M: Giga-Scale Single-Cell Perturbation Atlas
2025 · bioRxiv (Feb 2025) · Dataset (100M cells)
Largest public perturbation atlas: 100M cells, 50 cancer cell lines, 1,100 drugs.
Vevo Therapeutics' Mosaic platform with Parse GigaLab sequencing. Open-sourced on
Arc Virtual Cell Atlas and HuggingFace.
- SNP-based genetic demultiplexing (>98% accuracy)
- 1,786 sublibraries, 1.4 trillion reads
- Context-dependent drug response across diverse cancer backgrounds
- VERIFIED: Real dataset, publicly available
scPerturb: Harmonized Single-Cell Perturbation Data
2024 · Nature Methods · Repository
Harmonized repository of 44 datasets from 25 publications. 32 CRISPR + 9 drug datasets,
average 160K cells per dataset. Introduces Energy statistics for perturbation quantification.
- Standardized processing across diverse experimental platforms
- E-statistics framework for perturbation effect quantification
- Unified access through scperturb.org
Replogle et al. 2022: Genome-Wide Perturb-seq
2022 · Cell · Dataset
Landmark genome-wide study: K562 genome-wide screen (>2.5M cells, 9,867 genes targeted with CRISPRi), plus K562 and RPE1 essential-gene screens.
Key finding: only ~41% of perturbations show measurable transcriptome-wide effects.
- First genome-wide single-cell perturbation screen
- Revealed sparsity of perturbation effects
- Established standards for large-scale Perturb-seq
X-Atlas/Orion: Genome-Wide Perturb-seq via FiCS Platform
2025 · bioRxiv · Platform (8M cells)
Fix-Cryopreserve-ScRNAseq platform achieving 8M cells targeting all human protein-coding genes.
sgRNA abundance as dose-dependent proxy (R=0.91 with KD efficiency).
- DSP fixation + superloading (5x throughput)
- Hamilton automation removes operator variability
- 140+ day cryopreservation stability
Mixscale: Systematic Reconstruction of Molecular Pathway Signatures
2025 · Nature Cell Biology · Method + Dataset
Continuous quantification of perturbation strength replacing binary classification. Systematic
pathway profiling across 6 cell lines × 5 signaling contexts (2.6M cells).
- Mixscale scoring: s_i = p_i × (p̄_r - p̄_NT) (a hedged sketch follows this entry)
- Weighted differential expression (wmvReg)
- 93% replication rate vs 84-89% (standard methods)
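A hedged sketch of the scoring formula exactly as quoted above; the reading of the symbols (p_i as a per-cell perturbation probability, p̄_r and p̄_NT as group means over perturbed-reference and non-targeting control cells) is an assumption and may not match the paper's definitions:

```python
import numpy as np

def mixscale_score(p_cell, p_ref, p_nt):
    """s_i = p_i * (mean(p_ref) - mean(p_NT)), per the formula quoted above (symbol reading assumed)."""
    return np.asarray(p_cell) * (np.mean(p_ref) - np.mean(p_nt))
```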
📈 Key Datasets & Benchmarks (Summary)
Major Perturbation Datasets
- Replogle et al. 2022 (Cell): Largest to date - K562 Genome-Wide (>2.5M cells, 9,867 genes targeted with CRISPRi), K562 Essential (~400K cells), RPE1 Essential (~300K cells). Only 41% of perturbations show measurable effects.
- Norman et al. 2019 (Science): K562, 287 perturbations (100 single + 131 double combinations), CRISPRa, 19,264 genes measured. Primary benchmark for combinatorial perturbations.
- Srivatsan et al. 2020 sci-Plex (Science): Chemical perturbations, 188 compounds × 4 doses, K562/A549/MCF-7 cell lines, ~650K cells total.
- scPerturb (Peidli et al., Nature Methods 2024): Harmonized repository of 44 datasets from 25 publications, 32 CRISPR + 9 drug datasets, average 160K cells per dataset.
Standard Evaluation Metrics
Population-Level (Fit-Based):
- MSE/RMSE on top 1,000-2,000 most expressed genes
- Pearson Correlation (PCC) between predicted and observed profiles
- Pearson Delta: cor(ŷ - y^control, y - y^control), i.e., correlation of predicted vs. observed changes from control
- R² Score: Standard regression explained variance
Distribution-Based:
- Energy Distance: 2·E[||X-Y||] - E[||X-X'||] - E[||Y-Y'||]
- Wasserstein Distance (W2): Optimal transport cost
- MMD (Maximum Mean Discrepancy): With RBF kernel
Rank-Based (Critical for Mode Collapse Detection):
- Rank metric: Measures how well predictions rank-order cells
- Transposed-Rank: More challenging variant
- Matrix Distance: Similarity matrix divergence
- These metrics are ESSENTIAL - many papers report only MSE/R², which misses mode collapse. Minimal sketches of the Pearson delta and energy distance follow below.
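A minimal sketch of two of the metrics above (the function names, the use of per-gene mean profiles for the Pearson delta, and the cells × genes layout are assumptions for illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

def pearson_delta(pred_mean, obs_mean, ctrl_mean):
    """Correlation of predicted vs. observed change from control (1-D per-gene mean profiles)."""
    return np.corrcoef(pred_mean - ctrl_mean, obs_mean - ctrl_mean)[0, 1]

def energy_distance(X, Y):
    """2*E||X-Y|| - E||X-X'|| - E||Y-Y'|| for cells x genes matrices X (predicted) and Y (observed)."""
    return 2 * cdist(X, Y).mean() - cdist(X, X).mean() - cdist(Y, Y).mean()
```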
📚 Method Comparison
| Method | Year | Architecture | Scale | Key Innovation |
|---|---|---|---|---|
| Perturb-Seq | 2016 | Experimental | ~10K-200K cells | CRISPR + scRNA-seq pooled screens |
| scGen | 2019 | VAE | ~50K cells | Latent space vector arithmetic |
| CellOT | 2023 | Neural OT | ~50K cells | ICNN transport maps, single-cell predictions |
| SAMS-VAE | 2023 | VAE | ~120K cells | Sparse additive mechanism shifts |
| CPA | 2023 | VAE | ~100K cells | Compositional perturbation + covariates |
| GEARS | 2023/2024 | GNN | ~600K cells | Dual knowledge graphs (GO + coexpression) |
| CellOracle | 2023 | GRN | ~100K cells | scATAC-seq informed GRN inference |
| scGPT | 2024 | Transformer (100M) | 33M cells pretrain | Generative pretraining (BUT often underperforms) |
| scFoundation | 2024 | Transformer (100M) | 50M cells | Read-depth-aware (BUT mode collapse issues) |
| CIPHER | 2025 | Statistical Physics | ~1.4M cells | Linear response theory (Δx = Σu; see sketch below) |
Note: Publication years reflect official journal publication dates.
Some papers (e.g., GEARS) appeared online earlier than print publication.
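The CIPHER row above quotes the linear-response relation Δx = Σu. A toy sketch of that relation, assuming Σ is the gene-gene covariance estimated from unperturbed cells and u is a one-hot perturbation vector (CIPHER's actual construction of Σ, u, and its priors is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
X_ctrl = rng.poisson(2.0, size=(5000, 200)).astype(float)  # toy control cells x genes
Sigma = np.cov(X_ctrl, rowvar=False)                        # gene-gene covariance from control cells

u = np.zeros(X_ctrl.shape[1])
u[42] = -1.0                 # toy knockdown applied to gene index 42
delta_x = Sigma @ u          # predicted transcriptome-wide shift: delta_x = Sigma @ u
```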
💡 Practical Implementation Guide
Choosing the Right Method
Decision Framework
Step 1: Always implement simple baselines first (minimal sketches follow this list):
- Additive model: Δ_combo = Δ_gene1 + Δ_gene2
- Linear regression with gene features
- Mean prediction (no-change model)
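A minimal sketch of the additive and no-change baselines, assuming `delta[g]` holds the mean expression shift (perturbed minus control) estimated for single perturbation `g` from training data; names and shapes are illustrative:

```python
import numpy as np

def no_change_baseline(ctrl_mean):
    """Mean/no-change model: predict the control profile for any perturbation."""
    return ctrl_mean

def additive_baseline(delta, gene1, gene2, ctrl_mean):
    """Additive model: delta_combo = delta_gene1 + delta_gene2, applied to the control mean."""
    return ctrl_mean + delta[gene1] + delta[gene2]

# toy usage: 3 genes, two single-perturbation shifts
ctrl = np.zeros(3)
delta = {"GENE_A": np.array([1.0, 0.0, -0.5]), "GENE_B": np.array([0.0, 2.0, 0.5])}
pred_combo = additive_baseline(delta, "GENE_A", "GENE_B", ctrl)   # -> [1.0, 2.0, 0.0]
```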
For Drug Response Prediction:
- Chemical perturbations → CellOT (single-cell predictions); CPA if dose information is available
- Cross-cell-type generalization → Test on multiple cell lines
- Dose-response modeling → Methods that model continuous covariates
For Genetic Perturbations:
- Single-gene → GEARS (if the target genes are in the GO graph), CIPHER, or simple linear models
- Combinatorial → Test additive baseline first, then GEARS or SAMS-VAE
- Genome-wide screens → scGPT/scFoundation
For Interpretability:
- Mechanistic understanding → CIPHER (covariance), CellOracle (GRN)
- Uncertainty quantification → GEARS (Bayesian), CIPHER (Horseshoe priors)
- Avoid black-box transformers if interpretability is a priority
Common Pitfalls & Best Practices
- Data Quality: Ensure high UMI counts (>1000), proper doublet removal, and batch correction before modeling. Use established pipelines (Scanpy, Seurat); a minimal Scanpy QC sketch appears at the end of the Software Ecosystem section.
- Evaluation: Use held-out PERTURBATIONS, not held-out cells from training perturbations. This tests generalization properly (see the split sketch after this list).
- Baseline Comparisons: Always compare against simple linear models.
- Mode Collapse Detection: Use rank metrics and visualize prediction distributions. If all predictions look similar, the model has collapsed.
- Generalization: Test cross-dataset, cross-cell-type, and out-of-distribution generalization. Don't just report in-distribution performance.
- Reproducibility: Report all hyperparameters, random seeds, and preprocessing steps, and provide code. Use version control.
- Biological Validation: Computational predictions should be validated with targeted experiments on key predictions.
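A minimal sketch of a perturbation-level split (the `obs["perturbation"]` column name, the "control" label, the file path, and the 20% fraction are assumptions):

```python
import numpy as np
import anndata as ad

adata = ad.read_h5ad("perturb_data.h5ad")   # placeholder path

rng = np.random.default_rng(0)
perts = np.array([p for p in adata.obs["perturbation"].unique() if p != "control"])
rng.shuffle(perts)
test_perts = set(perts[: int(0.2 * len(perts))])    # hold out 20% of perturbations

# All cells from held-out perturbations go to test; control cells stay in training.
is_test = adata.obs["perturbation"].isin(test_perts).to_numpy()
adata_train, adata_test = adata[~is_test].copy(), adata[is_test].copy()
```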
Software Ecosystem
- Scanpy: Standard preprocessing, analysis, visualization (Python)
- scvi-tools: Implementations of scVI, scGen, CellOT
- Pertpy: Perturbation-specific analysis (Mixscape, Augur)
- Method-specific repos: GEARS, SAMS-VAE, scGPT, CPA all on GitHub
- PerturBench: Standardized benchmarking framework (altoslabs, 2024)
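A minimal Scanpy QC/preprocessing sketch covering the Data Quality points above (the UMI cutoff, batch key, and file path are placeholders; doublet removal and Harmony batch correction need the respective optional dependencies installed):

```python
import scanpy as sc

adata = sc.read_h5ad("perturb_data.h5ad")               # placeholder path
sc.pp.filter_cells(adata, min_counts=1000)              # drop low-UMI cells
sc.pp.scrublet(adata)                                   # doublet calls (scanpy >= 1.10; older: sc.external.pp.scrublet)
adata = adata[~adata.obs["predicted_doublet"]].copy()   # remove predicted doublets
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch")
sc.pp.pca(adata)
sc.external.pp.harmony_integrate(adata, key="batch")    # batch correction on the PCA embedding
```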