🧬 Single-Cell Perturbation Modeling & Prediction

Computational Methods for Predicting Cellular Responses to Genetic & Chemical Perturbations

A comprehensive guide to state-of-the-art approaches that combine single-cell transcriptomics, CRISPR screens, and machine learning to predict how cells respond to perturbations. It covers methods from 2016 to 2025, including recent critical evaluations of foundation models.

40+ Research Papers
9 Years of Innovation
100M+ Cells Analyzed
6 Model Categories

🎯 Core Concepts

The Challenge

Understanding how cells respond to perturbations (genetic knockouts, drug treatments, etc.) is fundamental to biology and medicine. However:

Key Biological Facts

The Opportunity

Computational prediction methods aim to:

📊 Method Categories

1. Variational Autoencoder (VAE) Approaches

scGen - Latent Space Arithmetic

2019 Nature Methods VAE

Pioneering VAE-based approach using vector arithmetic in latent space for perturbation prediction. Assumes homogeneous responses and additive perturbation effects.

  • First to use VAEs for perturbation prediction
  • Simple latent arithmetic: perturbed = control + perturbation_vector
  • Struggles with cell-type-specific and heterogeneous responses
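
The latent arithmetic above can be illustrated with a minimal NumPy sketch. The latent codes here are simulated rather than produced by a trained VAE encoder, and all variable names (`z_control_a`, `true_shift`, and so on) are hypothetical; it shows only the vector arithmetic, not scGen's training or decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent codes (rows = cells, columns = latent dimensions).
# In scGen these would come from a trained VAE encoder; here they are simulated.
true_shift = np.array([2.0, -1.0, 0.5])
z_control_a = rng.normal(size=(200, 3))                  # cell type A, control
z_perturbed_a = rng.normal(size=(200, 3)) + true_shift   # cell type A, perturbed
z_control_b = rng.normal(loc=3.0, size=(150, 3))         # cell type B, control only

# Perturbation vector: difference between the mean latent codes of the two conditions.
delta = z_perturbed_a.mean(axis=0) - z_control_a.mean(axis=0)

# Prediction for the unseen cell type: shift every control cell by the same vector.
z_predicted_b = z_control_b + delta
```

Because every cell receives the same shift, the sketch also makes the stated limitation concrete: any cell-type-specific or heterogeneous response is invisible to this kind of prediction.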

SAMS-VAE - Sparse Additive Mechanism Shifts

2023 NeurIPS VAE

Combines compositionality, disentanglement, and interpretability through sparse global perturbation variables. Each perturbation targets a sparse subset of latent dimensions.

  • Sparse additive latent perturbation: z_perturbed = z_control + Σ sparse_effects
  • Disentangled perturbation-specific latent subspaces
  • Outperforms CPA on combinatorial perturbation tasks
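
A rough sketch of the sparse additive shift, assuming each perturbation has a dense embedding gated by a binary mask. In the actual model both are inferred during training; here they are sampled at random purely for illustration, and `perturb_latent` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_latent, n_perts = 8, 3

# Hypothetical "learned" parameters: a dense embedding and a binary sparsity
# mask per perturbation (sampled here, inferred in the real model).
embeddings = rng.normal(size=(n_perts, n_latent))
masks = (rng.random(size=(n_perts, n_latent)) < 0.25).astype(float)
sparse_effects = masks * embeddings   # each perturbation shifts only a few dimensions

def perturb_latent(z_control, applied):
    """Add the sparse shifts of all applied perturbations to a control latent state."""
    dosage = np.zeros(n_perts)
    dosage[list(applied)] = 1.0
    return z_control + dosage @ sparse_effects

z_control = rng.normal(size=n_latent)
z_single = perturb_latent(z_control, [0])      # one perturbation
z_double = perturb_latent(z_control, [0, 2])   # combination: shifts simply add
```

The masked dimensions stay untouched, which is what makes the per-perturbation subspaces readable.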

CPA - Compositional Perturbation Autoencoder

2023 Molecular Systems Biology VAE

Conditional VAE with additive composition of perturbation and covariate embeddings. Predicts counterfactual distributions but requires training data for each perturbation.

  • Compositional: combines perturbation + cell-type + dose embeddings
  • Predicts full distributions, not just means
  • First method to predict combinatorial drug and genetic perturbations

2. Graph Neural Network Methods

GEARS - Graph-Enhanced Gene Activation/Repression Simulator

2024 Nature Biotechnology GNN

Integrates dual knowledge graphs (gene coexpression + Gene Ontology perturbation network) with deep learning.

  • Dual GNN architecture: gene coexpression graph + GO perturbation graph
  • 40% higher precision in predicting genetic interaction subtypes vs baselines
  • Autofocus direction-aware loss + Bayesian uncertainty quantification
  • Limitation: Requires genes in GO knowledge graph; degrades for poorly connected genes
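
One plausible reading of an "autofocus direction-aware loss" is sketched below: an error term with an exponent above 2, so large errors dominate, plus a penalty when the predicted change from control has the wrong sign. The functional form, the parameter names `gamma` and `lam`, and their default values are assumptions made for illustration, not a transcription of the GEARS code.

```python
import numpy as np

def autofocus_direction_loss(pred, truth, control, gamma=2.0, lam=0.1):
    """Illustrative loss: heavy-tailed error term plus a sign-mismatch penalty."""
    # Autofocus term: exponent > 2 puts more weight on genes with large errors.
    autofocus = np.mean(np.abs(pred - truth) ** (2.0 + gamma))
    # Direction term: nonzero when predicted and observed changes differ in sign.
    direction = np.mean((np.sign(pred - control) - np.sign(truth - control)) ** 2)
    return autofocus + lam * direction

control = np.array([1.0, 1.0])
truth = np.array([1.1, 0.9])             # gene 1 goes up, gene 2 goes down
right_direction = np.array([1.3, 0.7])   # overshoots, but with the correct sign
wrong_direction = np.array([0.9, 1.1])   # same absolute error, opposite sign

loss_right = autofocus_direction_loss(right_direction, truth, control)
loss_wrong = autofocus_direction_loss(wrong_direction, truth, control)
```

Both predictions have the same absolute error, so only the direction term separates them.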

3. Optimal Transport Approaches

CellOT - Neural Optimal Transport

2023 Nature Methods Optimal Transport

Uses Input Convex Neural Networks (ICNNs) to learn transport maps between unpaired control and perturbed cell distributions. Predicts single-cell-level responses.

  • ICNNs parameterize convex potentials whose gradients define the transport map, linking the learned map to optimal transport theory (Brenier's theorem)
  • Handles unpaired data - doesn't need matched control/treated cells
  • Predicts individual cell trajectories, not just population means
  • Generalizes to holdout patients and cross-species transfer
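
To build intuition for transport between unpaired populations, consider a single gene. In one dimension with squared-distance cost, the optimal transport map is the monotone rearrangement (quantile matching), which is the derivative of a convex function. The sketch below uses that closed form in place of a neural network; it is a 1-D analogy, not CellOT itself, and `quantile_transport` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unpaired samples: different cells, different counts, no matching between them.
control = rng.normal(loc=0.0, scale=1.0, size=500)
treated = rng.normal(loc=2.0, scale=0.5, size=400)

def quantile_transport(x, source, target):
    """Map values from the source distribution to the target by matching quantiles."""
    source_sorted = np.sort(source)
    # Empirical CDF position of each value within the source sample.
    ranks = np.searchsorted(source_sorted, x, side="right") / source.size
    return np.quantile(target, np.clip(ranks, 0.0, 1.0))

# Each control cell gets its own predicted treated value.
predicted = quantile_transport(control, control, treated)
```

Each control cell is mapped individually, so the output is a per-cell prediction rather than a single population mean.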

4. Transformer & Foundation Models

scGPT - Generative Pretrained Transformer for scRNA-seq

2024 Nature Methods Transformer

Foundation model pretrained on 33+ million cells. Uses masked gene modeling with binned expression values. Fine-tuned for perturbation prediction with adapter layers.

  • 100M+ parameters, generative pre-training on massive scale
  • Gene-specific embeddings + expression value binning

scFoundation - 100M Parameter Foundation Model

2024 Nature Methods Transformer

Large-scale foundation model with read-depth-aware pretraining on 50M cells. Designed for multiple downstream tasks including perturbation prediction.

  • 100M parameters trained on diverse cell types and tissues
  • Read-depth normalization incorporated into architecture

State: Predicting Cellular Responses Across Diverse Contexts

2025 bioRxiv (June 2025) Transformer

Set-based transformer trained on 100M+ perturbation cells across 70 cell lines and 167M observational cells. Arc Institute's first virtual cell model using bidirectional attention over cell populations.

  • State Embedding (SE) + State Transition (ST) architecture
  • 50%+ improvement in perturbation effect discrimination on Tahoe-100M
  • 2× accuracy in identifying true differentially expressed genes
  • NOTE: Released June 2025 - too recent for independent validation

5. GRN-Based & Physics-Inspired Methods

CellOracle - GRN-Based Perturbation Prediction

2023 Nature GRN

Two-stage pipeline: (1) infer base GRN from scATAC-seq, (2) refine with scRNA-seq using Bayesian/Bagging Ridge regression. Linear propagation through GRN.

  • Integrates chromatin accessibility (scATAC-seq) for regulatory links
  • Successfully predicted zebrafish embryogenesis trajectories
  • Linear assumption: effect propagates through GRN edges
  • Interpretable - directly maps to biological regulatory networks
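
The linear propagation idea can be sketched with a toy three-gene network. The coefficient matrix, gene names, number of steps, and the choice to clamp the perturbed gene at each step are illustrative assumptions, not CellOracle's actual API or defaults.

```python
import numpy as np

genes = ["TF_A", "gene_B", "gene_C"]

# coef[i, j]: hypothetical linear effect of regulator i on target j.
coef = np.array([
    [0.0, 0.8, 0.0],    # TF_A activates gene_B
    [0.0, 0.0, -0.5],   # gene_B represses gene_C
    [0.0, 0.0, 0.0],    # gene_C regulates nothing
])

def propagate(delta_input, coef, n_steps=3):
    """Push an expression shift through the GRN edges for a few steps."""
    perturbed = delta_input != 0
    delta = delta_input.copy()
    for _ in range(n_steps):
        delta = delta @ coef                        # one hop along regulatory edges
        delta[perturbed] = delta_input[perturbed]   # keep the perturbed gene clamped
    return delta

delta_input = np.array([-1.0, 0.0, 0.0])   # knock down TF_A by one unit
delta = propagate(delta_input, coef)
```

In this toy network the knockdown lowers gene_B and, one hop later, raises gene_C, since gene_B represses it.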

CIPHER - Linear Response Theory for Perturbations

2025 bioRxiv Statistical Physics

Physics-inspired approach using linear response theory: Δx = Σu, where Σ is the covariance matrix estimated from control cell fluctuations and u encodes the applied perturbation. Bayesian inference with Horseshoe priors.

  • Covariance matrix Σ_ij = ⟨δx_i δx_j⟩ from control cells
  • R² up to 1.0 on synthetic networks with known ground truth
  • Soft mode analysis reveals ~3 dominant regulatory modules
  • Interpretable - connects to equilibrium statistical mechanics
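
A minimal numerical sketch of the relation Δx = Σu on simulated control cells. The data, the choice of a single-gene perturbation vector, and the plain linear solve used to recover u are assumptions for illustration; the method described above uses Bayesian inference with Horseshoe priors instead of a direct solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 5000, 4

# Simulated control cells with correlated gene-gene fluctuations.
mixing = rng.normal(size=(n_genes, n_genes))
control = rng.normal(size=(n_cells, n_genes)) @ mixing.T

# Covariance of control fluctuations: sigma[i, j] = <dx_i dx_j>.
sigma = np.cov(control, rowvar=False)

# Perturbation acting directly on gene 0 only.
u = np.zeros(n_genes)
u[0] = -1.0

# Linear response: the predicted mean shift of every gene.
delta_x = sigma @ u

# Inverse problem: given an observed shift, recover the perturbation.
u_hat = np.linalg.solve(sigma, delta_x)
```

Genes that co-fluctuate with gene 0 in control cells are predicted to shift with it, which is the core assumption of the approach.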

📈 Large-Scale Datasets & Platforms (2024-2025)

Industrial-scale perturbation atlases provide massive training data for foundation models. These represent significant experimental advances in scale and throughput.

Tahoe-100M: Giga-Scale Single-Cell Perturbation Atlas

2025 bioRxiv (Feb 2025) Dataset (100M cells)

Largest public perturbation atlas: 100M cells, 50 cancer cell lines, 1,100 drugs. Vevo Therapeutics' Mosaic platform with Parse GigaLab sequencing. Open-sourced on Arc Virtual Cell Atlas and HuggingFace.

  • SNP-based genetic demultiplexing (>98% accuracy)
  • 1,786 sublibraries, 1.4 trillion reads
  • Context-dependent drug response across diverse cancer backgrounds
  • VERIFIED: Real dataset, publicly available

scPerturb: Harmonized Single-Cell Perturbation Data

2024 Nature Methods Repository

Harmonized repository of 44 datasets from 25 publications. Includes 32 CRISPR and 9 drug datasets, averaging 160K cells per dataset. Introduces energy statistics (E-statistics) for perturbation quantification.

  • Standardized processing across diverse experimental platforms
  • E-statistics framework for perturbation effect quantification
  • Unified access through scperturb.org
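
As a sketch of what an energy-statistic comparison looks like, the block below computes the standard energy distance between two sets of cells: twice the mean between-group distance minus the two mean within-group distances. Plain Euclidean distance on simulated data is an assumption here; the scPerturb implementation may use a different distance, embedding, or normalization.

```python
import numpy as np

def pairwise_mean_distance(a, b):
    """Mean Euclidean distance over all pairs (one cell from a, one from b)."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()

def energy_distance(x, y):
    """Standard energy distance between two samples of cells."""
    return (2.0 * pairwise_mean_distance(x, y)
            - pairwise_mean_distance(x, x)
            - pairwise_mean_distance(y, y))

rng = np.random.default_rng(0)
control = rng.normal(size=(300, 10))                  # e.g. cells in a 10-D embedding
no_effect = rng.normal(size=(300, 10))                # perturbation with no shift
strong_effect = rng.normal(loc=1.0, size=(300, 10))   # perturbation with a clear shift

e_none = energy_distance(control, no_effect)
e_strong = energy_distance(control, strong_effect)
```

A value near zero means the perturbed cells are indistinguishable from control, while larger values indicate a stronger population-level shift.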

Replogle et al. 2022: Genome-Wide Perturb-seq

2022 Cell Dataset

Landmark genome-wide study in K562 cells (>2.5M cells, 9,867 genes targeted with CRISPRi), with an additional essential-gene screen in RPE1 cells. Key finding: only 41% of perturbations show measurable transcriptome-wide effects.

  • First genome-wide single-cell perturbation screen
  • Revealed sparsity of perturbation effects
  • Established standards for large-scale Perturb-seq

X-Atlas/Orion: Genome-Wide Perturb-seq via FiCS Platform

2025 bioRxiv Platform (8M cells)

Fix-Cryopreserve-ScRNAseq platform achieving 8M cells targeting all human protein-coding genes. sgRNA abundance as dose-dependent proxy (R=0.91 with KD efficiency).

  • DSP fixation + superloading (5x throughput)
  • Hamilton automation removes operator variability
  • 140+ day cryopreservation stability

Mixscale: Systematic Reconstruction of Molecular Pathway Signatures

2025 Nature Cell Biology Method + Dataset

Continuous quantification of perturbation strength replacing binary classification. Systematic pathway profiling across 6 cell lines × 5 signaling contexts (2.6M cells).

  • Mixscale scoring: s_i = p_i × (p̄_r - p̄_NT)
  • Weighted differential expression (wmvReg)
  • 93% replication rate vs 84-89% (standard methods)

📊 Benchmarking & Evaluation (2024-2025)

Ahlmann-Eltze, Huber & Anders: Critical Benchmark of Foundation Models

2025 Nature Methods Benchmark

  • Linear additive model often has LOWER MSE than foundation models
  • Detected mode collapse: predictions don't vary across perturbations
  • For double perturbations, the additive baseline outperformed all deep learning models tested
  • Changed how the field evaluates perturbation prediction methods
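
One simple way to check for the collapse described above is to compare how much predictions vary across perturbations with how much the observed profiles vary. The `spread_ratio` diagnostic below is a hypothetical illustration on simulated data, not the metric used in the benchmark.

```python
import numpy as np

def spread_ratio(pred, truth):
    """Ratio of across-perturbation variance in predictions vs. observations.

    Rows are perturbations, columns are genes. Values near 0 mean the model
    predicts almost the same profile for every perturbation.
    """
    pred_spread = np.var(pred, axis=0).mean()
    truth_spread = np.var(truth, axis=0).mean()
    return pred_spread / truth_spread

rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 200))                  # 50 perturbations, 200 genes
collapsed = np.tile(truth.mean(axis=0), (50, 1))    # identical prediction everywhere
informative = truth + rng.normal(scale=0.3, size=truth.shape)

ratio_collapsed = spread_ratio(collapsed, truth)
ratio_informative = spread_ratio(informative, truth)
```

A collapsed model can still reach a respectable MSE by predicting the average response, which is why a variation check of this kind is worth running alongside fit-based metrics.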

Virtual Cell Challenge

2025 Cell (Commentary) Competition

Arc Institute's recurring benchmark competition for perturbation prediction. Aims to establish rigorous standards, as CASP did for protein structure prediction. First challenge focuses on context generalization (H1 hESC).

  • $100K grand prize, sponsored by NVIDIA, 10x Genomics, Ultima
  • Training data: 500M+ cells from Arc Virtual Cell Atlas
  • Focus on genetic perturbations (single-gene knockdowns)
  • Aims to surface best practices and data quality standards

📈 Key Datasets & Benchmarks (Summary)

Major Perturbation Datasets

Standard Evaluation Metrics

Population-Level (Fit-Based):

Distribution-Based:

Rank-Based (Critical for Mode Collapse Detection):

📚 Method Comparison

| Method | Year | Architecture | Scale | Key Innovation |
|---|---|---|---|---|
| Perturb-Seq | 2016 | Experimental | ~10K-200K cells | CRISPR + scRNA-seq pooled screens |
| scGen | 2019 | VAE | ~50K cells | Latent space vector arithmetic |
| CellOT | 2023 | Neural OT | ~50K cells | ICNN transport maps, single-cell predictions |
| SAMS-VAE | 2023 | VAE | ~120K cells | Sparse additive mechanism shifts |
| CPA | 2023 | VAE | ~100K cells | Compositional perturbation + covariates |
| GEARS | 2023/2024 | GNN | ~600K cells | Dual knowledge graphs (GO + coexpression) |
| CellOracle | 2023 | GRN | ~100K cells | scATAC-seq informed GRN inference |
| scGPT | 2024 | Transformer (100M) | 33M cells pretrain | Generative pretraining (BUT often underperforms) |
| scFoundation | 2024 | Transformer (100M) | 50M cells | Read-depth-aware (BUT mode collapse issues) |
| CIPHER | 2025 | Statistical Physics | ~1.4M cells | Linear response theory (Δx = Σu) |

Note: Publication years reflect official journal publication dates. Some papers (e.g., GEARS) appeared online earlier than print publication.

💡 Practical Implementation Guide

Choosing the Right Method

Decision Framework

Step 1: Always implement simple baselines first:

  • Additive model: Δ_combo = Δ_gene1 + Δ_gene2
  • Linear regression with gene features
  • No-change model (predict the control profile) or mean prediction (predict the average training response)
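
The additive baseline is short enough to write out in full. The sketch below works on mean expression profiles; the numbers, the gene names `geneA` and `geneB`, and the helper functions are made up for illustration.

```python
import numpy as np

# Mean expression profiles (one value per measured gene); toy numbers.
control_mean = np.array([5.0, 3.0, 1.0, 0.0])
single_means = {
    "geneA": np.array([2.0, 3.0, 1.5, 0.0]),   # mean profile after perturbing geneA
    "geneB": np.array([5.0, 4.0, 0.5, 0.2]),   # mean profile after perturbing geneB
}

def no_change_baseline(control_mean):
    """Predict that the perturbation does nothing."""
    return control_mean.copy()

def additive_baseline(control_mean, single_means, gene1, gene2):
    """Predict a double perturbation as control plus the two single-gene shifts."""
    delta1 = single_means[gene1] - control_mean
    delta2 = single_means[gene2] - control_mean
    return control_mean + delta1 + delta2

pred_double = additive_baseline(control_mean, single_means, "geneA", "geneB")
pred_null = no_change_baseline(control_mean)
```

Any learned model should be scored against both predictions on the same held-out perturbations before its results are interpreted.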

For Drug Response Prediction:

  • Chemical perturbations → CellOT (single-cell), or CPA if dose information is available
  • Cross-cell-type generalization → Test on multiple cell lines
  • Dose-response modeling → Methods that model continuous covariates

For Genetic Perturbations:

  • Single-gene → GEARS (if genes in GO), CIPHER, or simple linear models
  • Combinatorial → Test additive baseline first, then GEARS or SAMS-VAE
  • Genome-wide screens → scGPT/scFoundation

For Interpretability:

  • Mechanistic understanding → CIPHER (covariance), CellOracle (GRN)
  • Uncertainty quantification → GEARS (Bayesian), CIPHER (Horseshoe priors)
  • Avoid black-box transformers if interpretability is a priority

Common Pitfalls & Best Practices

Software Ecosystem