Optimal Transport & Wasserstein Distance in Single-Cell Biology
Mathematical & Algorithmic Foundations for Cellular Distribution Analysis
A rigorous guide to optimal transport theory and Wasserstein metrics in computational biology.
From mathematical foundations (Monge, Kantorovich, entropic regularization) to modern applications
in trajectory inference, multi-omics integration, and distributional comparison. Covering landmark
methods (Waddington-OT, moscot, SCOT) and theoretical advances (2019-2025).
12 research papers · 6 years of development · 1.7M+ cells processed · 5 application domains
🎯 Core Concepts
What is Optimal Transport?
Optimal transport theory provides a mathematical framework for comparing probability distributions
by computing the minimal "cost" of transforming one distribution into another. Originally formulated
by Gaspard Monge in 1781 and generalized by Leonid Kantorovich in 1942, the theory asks:
What is the most efficient way to move a pile of dirt from one location to match a desired shape elsewhere?
The Monge Problem: Moving Mass Optimally
Intuition: Each mass point (circle) must be transported to the target distribution along optimal paths (arrows) that minimize total cost.
The Wasserstein Distance
The Wasserstein distance (also called Earth Mover's Distance) quantifies the dissimilarity between
two probability distributions as the minimal cost to transport mass from one to the other. For
single-cell biology:
Wasserstein Distance: Measuring Distribution Dissimilarity
Key Insight: The Wasserstein distance is the mass-weighted sum of transport path lengths: small for similar distributions, large for dissimilar ones (a minimal numeric sketch follows the list below).
Cell-Cell Similarity: When gene expression profiles are treated as probability distributions, the
Wasserstein distance captures how much "work" is needed to transform one cell's expression into another's
Population Alignment: Compares entire cell populations by finding optimal couplings
that preserve local neighborhood structure
Trajectory Inference: Models cellular differentiation as probability distributions
flowing through gene expression space over time
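To make the definition concrete, the following minimal sketch computes the exact transport cost between two toy "cells" treated as probability distributions. It assumes NumPy and the POT package; the four-gene expression vectors and the 1-D gene "positions" used for the ground cost are purely illustrative.

```python
# Minimal sketch: exact transport cost between two toy "cells", each treated as
# a probability distribution over the same genes. Assumes NumPy and the POT
# package (pip install pot); the expression vectors and the 1-D gene "positions"
# used for the ground cost are purely illustrative.
import numpy as np
import ot  # Python Optimal Transport

cell_a = np.array([5.0, 1.0, 0.0, 2.0])   # hypothetical expression profile
cell_b = np.array([0.0, 3.0, 4.0, 1.0])

# Normalize so each cell's mass sums to 1.
a = cell_a / cell_a.sum()
b = cell_b / cell_b.sum()

# Ground cost: squared distance between gene "positions" in a toy 1-D layout;
# real analyses use gene-gene or cell-cell costs computed in PCA space.
positions = np.arange(4, dtype=float).reshape(-1, 1)
M = ot.dist(positions, positions)         # squared Euclidean cost matrix

cost = ot.emd2(a, b, M)                   # exact Kantorovich transport cost
print(f"Transport cost between the two cells: {cost:.4f}")
```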
Why Optimal Transport for Biology?
Natural Framework: Biological processes (cell differentiation, population dynamics)
naturally involve mass transport with growth and death
Handles Unpaired Data: Unlike supervised methods, OT aligns datasets without requiring
matched cells or shared features
Probabilistic Interpretation: Coupling matrices provide interpretable cell-cell
correspondence probabilities for downstream analysis
Single-Cell Biology Applications
Summary: OT provides a unified mathematical framework for diverse single-cell problems: comparing cells, aligning populations, inferring trajectories, and integrating multi-omics data.
Key Challenges Addressed
Computational Tractability: Exact OT solvers scale roughly as O(n³) in the number of cells;
entropic regularization (the Sinkhorn algorithm) brings this down to near-O(n²), since each iteration costs only two matrix-vector products (a minimal sketch follows this list)
Unbalanced Mass: Biological systems have growth/death - unbalanced OT with KL
penalties accommodates varying cell numbers
Cross-Modality Integration: Different omics have distinct feature spaces -
Gromov-Wasserstein compares via pairwise distances instead of direct feature matching
Scalability to Atlases: Linear memory implementations (moscot) enable analysis
of 1.7M+ cell datasets
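The Sinkhorn iteration mentioned in the list above is compact enough to write out in full. The code below is a plain NumPy sketch of entropically regularized OT, not any particular package's solver; the toy PCA coordinates, ε, and the fixed iteration count are illustrative, and production code would iterate in log-space when ε is very small.

```python
# Plain NumPy sketch of the Sinkhorn algorithm for entropic OT. Each iteration
# is two matrix-vector products (O(n^2) work). Toy PCA coordinates, epsilon and
# the fixed iteration count are illustrative choices.
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Entropic OT coupling between histograms a and b with cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # enforce row marginals
        v = b / (K.T @ u)                 # enforce column marginals
    return u[:, None] * K * v[None, :]    # coupling pi = diag(u) K diag(v)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 30))            # e.g. 100 cells at time t (30D PCA)
y = rng.normal(size=(120, 30))            # 120 cells at time t+1
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
C /= C.mean()                             # rescale so eps is on a comparable scale
a = np.full(100, 1 / 100)
b = np.full(120, 1 / 120)
pi = sinkhorn(a, b, C)
print(pi.shape, round(pi.sum(), 4))       # coupling mass sums to ~1
```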
📈 Major Application Domains
1. Trajectory Inference & Temporal Dynamics
Waddington-OT: Developmental Trajectory Inference
2019 · Cell · Temporal OT
Pioneering application of OT to trajectory inference from time-series scRNA-seq. Models
cellular differentiation as probability distributions flowing through expression space,
explicitly accounting for differential growth/death rates.
🔬 How Optimal Transport is Applied:
Unbalanced Kantorovich Formulation: Extends classical OT by adding growth/death terms through KL divergence penalties (λ₁=1, λ₂=50), allowing cell populations to change mass between timepoints.
Entropic Regularization (ε=0.05): Uses Sinkhorn algorithm to efficiently solve OT problems in 30D PCA space, trading exact optimality for computational tractability.
Temporal Coupling Maps: Computes transport couplings π(t,t+Δt) between consecutive timepoints using squared Euclidean distance as the cost function, representing probabilistic cell fate relationships.
Iterative Growth Rate Refinement: Initializes from proliferation/apoptosis gene signatures, then iteratively updates growth rates based on transport maps to ensure consistency with observed cell count changes.
315,000 cells across 39 timepoints during iPSC reprogramming
Revealed diverse developmental programs: stromal, pluripotent, trophoblast, neural
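The unbalanced Kantorovich formulation above can be sketched with a small change to the Sinkhorn loop: KL marginal penalties turn the exact marginal projections into damped (power) updates with exponent λ/(λ+ε). The code below is a generic NumPy sketch of unbalanced entropic OT between two timepoints, not the Waddington-OT implementation; the toy PCA coordinates and placeholder growth rates are assumptions, while ε, λ₁, and λ₂ echo the defaults quoted above.

```python
# Generic NumPy sketch of unbalanced entropic OT between two timepoints: the KL
# marginal penalties turn the exact Sinkhorn projections into damped (power)
# updates with exponent lambda / (lambda + eps). Not the Waddington-OT code;
# toy PCA coordinates and placeholder growth rates are assumptions.
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, lam1=1.0, lam2=50.0, n_iter=1000):
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    f1 = lam1 / (lam1 + eps)              # damping for the source marginal
    f2 = lam2 / (lam2 + eps)              # damping for the target marginal
    for _ in range(n_iter):
        u = (a / (K @ v)) ** f1
        v = (b / (K.T @ u)) ** f2
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
xt = rng.normal(size=(200, 30))           # cells at day t (toy 30D PCA)
xt1 = rng.normal(size=(250, 30))          # cells at day t+1
C = ((xt[:, None, :] - xt1[None, :, :]) ** 2).sum(-1)
C /= C.mean()

g = np.ones(200)                          # placeholder growth rates (from signatures)
a = g / g.sum()                           # growth-weighted source marginal
b = np.full(250, 1 / 250)
pi = unbalanced_sinkhorn(a, b, C)
fate_probs = pi / pi.sum(axis=1, keepdims=True)   # descendant distribution per cell
```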
PRESCIENT: Generative Modeling of Cell Trajectories
2021 · Nature Communications · Generative OT
Models differentiation as diffusion process over potential landscape. Uses regularized
Wasserstein loss to fit neural network-parameterized drift and growth terms, enabling
in silico perturbation experiments.
🔬 How Optimal Transport is Applied:
Wasserstein-2 Loss for Neural Network Training: Minimizes W₂² distance between predicted and observed cell distributions using squared Euclidean cost in 50D PCA space.
Potential Landscape Parameterization: Neural networks learn drift F(x) = -∇U(x) + h(x) and growth g(x) terms that minimize the Wasserstein distance to empirical timepoints.
Proliferation Weighting: Each cell weighted by expected number of descendants at final time, computed via forward integration of learned growth function g(x).
Stochastic Sampling for Prediction: Generates cell trajectories via Euler-Maruyama simulation of SDE dx = F(x)dt + σdW, enabling in silico perturbation screens by modifying F or g.
Potential landscape framework connects to Waddington's epigenetic landscape
Incorporates cell proliferation by weighting based on expected descendants
Outperforms alternatives on fate bias prediction with lineage tracing validation
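A PRESCIENT-style forward simulation reduces to Euler-Maruyama integration of a drift defined as the negative gradient of a scalar potential. The PyTorch sketch below is a generic illustration of that step; the potential network, step size, noise level, and toy initial cells are placeholders rather than PRESCIENT's published architecture or hyperparameters.

```python
# Sketch of Euler-Maruyama simulation of dx = -grad U(x) dt + sigma dW, where
# the scalar potential U is a small neural network. Architecture, step size and
# noise scale are illustrative placeholders.
import torch
import torch.nn as nn

dim = 50                                          # e.g. 50D PCA space
potential = nn.Sequential(                        # U_theta: R^dim -> R
    nn.Linear(dim, 128), nn.Softplus(),
    nn.Linear(128, 1),
)

def drift(x):
    """F(x) = -grad U(x), computed with autograd."""
    xg = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(potential(xg).sum(), xg)
    return -grad.detach()

def simulate(x0, n_steps=100, dt=0.1, sigma=0.2):
    """Euler-Maruyama rollout of the SDE starting from cells x0."""
    x = x0.clone()
    for _ in range(n_steps):
        x = x + drift(x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x

x0 = torch.randn(256, dim)                        # toy "day 0" cells in PCA space
xT = simulate(x0)                                 # predicted cells at the final time
```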
Uses Wasserstein-Fisher-Rao distance to simultaneously capture gene expression velocity
and population growth. Neural ODEs solve high-dimensional OT efficiently with meshless formulation.
🔬 How Optimal Transport is Applied:
Wasserstein-Fisher-Rao (WFR) Metric: Combines transport (W₂) with growth/death (Fisher-Rao) using parameter α to balance contribution: d_WFR = √(W₂² + α·FR²).
Neural ODE Velocity Field: Parameterizes velocity v(x,t) with neural networks, solving continuity equation ∂ₜρ + ∇·(vρ) = g·ρ where g represents growth rate.
Dynamic Formulation (Benamou-Brenier): Minimizes kinetic energy ∫₀ᵀ ∫|v(x,t)|²ρ(x,t)dxdt subject to matching marginals at observed timepoints.
Meshless Particle-Based Solver: Discretizes distributions as weighted particles moving along learned velocity field, scaling to 100K+ cells without spatial discretization.
WFR distance separates transport (Wasserstein) from growth (Fisher-Rao)
Infers time-varying gene regulatory networks via velocity field gradients
Reconstructs expression at unmeasured timepoints and cell-cell communication dynamics
Preserves trajectories and pseudo-time with high fidelity
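The transport-with-growth dynamics above admit a simple particle (Lagrangian) discretization: particle positions follow the velocity field while particle weights grow or shrink with the local growth rate. The NumPy sketch below illustrates that scheme with placeholder velocity and growth functions; it is not the neural-ODE solver described above.

```python
# Schematic particle (Lagrangian) discretization of transport with growth:
# positions follow dx/dt = v(x, t) and weights follow dw/dt = g(x, t) * w,
# which is the particle view of d rho/dt + div(v rho) = g rho. The velocity
# and growth functions are placeholders, not trained neural networks.
import numpy as np

def v(x, t):                              # placeholder velocity field
    return -0.1 * x

def g(x, t):                              # placeholder growth rate per unit time
    return 0.05 * np.ones(len(x))

def evolve(x, w, t0=0.0, t1=1.0, n_steps=100):
    """Forward-Euler update of weighted particles (x_i, w_i)."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + dt * v(x, t)              # transport (Wasserstein part)
        w = w * np.exp(dt * g(x, t))      # growth/death (Fisher-Rao part)
        t += dt
    return x, w / w.sum()                 # renormalize to compare distributions

rng = np.random.default_rng(2)
x0 = rng.normal(size=(1000, 30))          # initial cells (toy 30D PCA)
w0 = np.full(1000, 1 / 1000)              # initially uniform weights
x1, w1 = evolve(x0, w0)
```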
moscot: Unified Framework for Temporal-Spatial Mapping
2025 · Nature · Scalable OT
Scalable unified framework supporting temporal, spatial, and spatiotemporal OT with
multimodal integration. Linear memory complexity enables atlas-scale analysis.
🔬 How Optimal Transport is Applied:
Modular Problem Composition: Constructs cost matrices C from multiple sources (expression, spatial coordinates, lineage, time) and combines via C_total = Σᵢ αᵢCᵢ for flexible problem specification.
Low-Rank Sinkhorn (ε=0.005-0.01): Factorizes the coupling matrix as π ≈ Q diag(1/g) Rᵀ with rank r=50-100 (Q is n×r, R is m×r), reducing memory from O(n²) to O(nr).
Quadratic Problems (Gromov-Wasserstein): For cross-modality (RNA→ATAC), minimizes ∑ᵢⱼₖₗ (C^X_ik - C^Y_jl)²πᵢⱼπₖₗ using entropic regularization with ε_quad=0.1.
Temporal & Spatial Extensions: Handles temporal couplings with growth models (unbalanced OT with τ=0.5-1.0) and spatial problems using Euclidean/geodesic costs with fused GW (α=0.5).
Processes 1.7M cells (mouse embryogenesis) where competitors fail
10-100× faster with linear memory through online cost evaluation
Supports W-type, GW-type, FGW-type formulations in unified API
Validated NEUROD2 as epsilon cell regulator in human iPSC-islets
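The modular cost composition C_total = Σᵢ αᵢCᵢ and the memory argument for low-rank couplings can be illustrated directly. The sketch below is plain NumPy, not the moscot API; the α weights, toy expression/spatial coordinates, and rank value are arbitrary choices.

```python
# Plain NumPy sketch of modular cost composition, C_total = sum_i alpha_i * C_i,
# combining an expression cost and a spatial cost, plus the memory arithmetic
# behind a rank-r coupling factorization pi ~ Q diag(1/g) R^T. Not the moscot
# API; the alpha weights, toy coordinates and rank are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
n_src, n_tgt = 500, 600
expr_src = rng.normal(size=(n_src, 30))   # 30D PCA of expression
expr_tgt = rng.normal(size=(n_tgt, 30))
xy_src = rng.uniform(size=(n_src, 2))     # spatial coordinates, slide 1
xy_tgt = rng.uniform(size=(n_tgt, 2))     # spatial coordinates, slide 2

def sq_dist(x, y):
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

C_expr = sq_dist(expr_src, expr_tgt)
C_space = sq_dist(xy_src, xy_tgt)

# Rescale each term so no single modality dominates, then combine.
alphas = {"expression": 0.7, "spatial": 0.3}
C_total = (alphas["expression"] * C_expr / C_expr.mean()
           + alphas["spatial"] * C_space / C_space.mean())

# A full coupling needs O(n^2) memory; a rank-r factorization stores O(n * r).
r = 50
full_entries = n_src * n_tgt
low_rank_entries = (n_src + n_tgt) * r + r
print(C_total.shape, full_entries, low_rank_entries)
```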
SCOT: Gromov-Wasserstein for Multi-Omics Alignment
2020 · ICML Workshop · GW-OT
First application of Gromov-Wasserstein to single-cell multi-omics. Aligns datasets with
unmatched features by comparing k-NN graph geodesic distances instead of feature values.
🔬 How Optimal Transport is Applied:
Gromov-Wasserstein (GW) for Feature Mismatch: Minimizes ∑ᵢⱼₖₗ |d_X(xᵢ,xₖ) - d_Y(yⱼ,yₗ)|²πᵢⱼπₖₗ where d_X and d_Y are intra-dataset distances, enabling cross-modality alignment (e.g., scRNA-seq ↔ scATAC-seq).
k-NN Graph Geodesic Distances: Constructs k=30 nearest neighbor graphs in each modality, computes shortest path distances as cost matrices C^X and C^Y to capture local manifold geometry.
Entropic GW Solver (ε=0.1): Uses projected gradient descent with Sinkhorn iterations for inner entropic OT subproblems, converging in 100-500 iterations.
Self-Tuning for Unbalanced Data: Employs KL divergence marginal penalties with λ=0.1 to handle different numbers of cells per dataset without strict mass conservation.
15-50× faster than alternatives with only 2 hyperparameters vs 4-5
Lowest FOSCTTM on all 3 simulated datasets
Successfully aligns RNA-ATAC, RNA-methylation without common features
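A SCOT-style alignment can be assembled from standard tooling: k-NN geodesic cost matrices per modality, matched with entropic Gromov-Wasserstein. The sketch below assumes scikit-learn, SciPy, and the POT package; k, ε, and the toy embeddings are illustrative, and this is not the SCOT codebase itself.

```python
# Sketch of a SCOT-style alignment: k-NN geodesic cost matrices per modality,
# matched with entropic Gromov-Wasserstein. Assumes scikit-learn, SciPy and the
# POT package; k, epsilon and the toy embeddings are illustrative.
import numpy as np
import ot
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_cost(X, k=30):
    """Intra-dataset geodesic distances on a k-NN graph, normalized to [0, 1]."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="distance")
    D = shortest_path(graph, directed=False)
    D[np.isinf(D)] = D[np.isfinite(D)].max()   # bridge disconnected components
    return D / D.max()

rng = np.random.default_rng(4)
rna = rng.normal(size=(300, 50))          # toy scRNA-seq embedding
atac = rng.normal(size=(250, 20))         # toy scATAC-seq embedding

C_rna, C_atac = geodesic_cost(rna), geodesic_cost(atac)
p = np.full(300, 1 / 300)
q = np.full(250, 1 / 250)

coupling = ot.gromov.entropic_gromov_wasserstein(
    C_rna, C_atac, p, q, loss_fun="square_loss", epsilon=0.1
)
# Barycentric projection maps each RNA cell into the ATAC embedding.
rna_in_atac = (coupling / coupling.sum(axis=1, keepdims=True)) @ atac
```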
Extends GW-OT to partial alignment scenarios where datasets have distinct cell types.
Virtual points absorb dataset-specific cells while aligning shared structures.
🔬 How Optimal Transport is Applied:
Partial Gromov-Wasserstein (PGW): Extends GW by adding virtual "dust bin" points and transporting only a fraction s of the total mass (0 < s ≤ 1); dataset-specific cells are absorbed by the virtual points instead of being forced into the alignment.
SPL Curve for Mass Selection: Sweeps s from 0 to 1, plots SPL = ∑ᵢⱼ πᵢⱼ·similarity(i,j). Sharp drop indicates dataset-specific cells; plateau gives optimal shared mass.
Manifold-Preserving Costs: Uses diffusion distance (t=1, k=15 neighbors) to compute C^X and C^Y, capturing global manifold structure beyond local k-NN graphs.
Iterative Refinement with sc-GEM: Alternates between (1) partial GW alignment and (2) guided entry of gene modules to improve biological consistency of coupling.
First method to explicitly handle dataset-specific cell types
SPL curve estimates shared cell number automatically
Alignment Score 0.834-0.907 on partial alignment benchmarks
Identified DNMT3B as key regulator in iPSC reprogramming (sc-GEM)
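The shared-mass idea above can be illustrated with a partial Gromov-Wasserstein solver: sweep the transported mass fraction s and check how much mass the presumed dataset-specific cells receive. The sketch below assumes POT's ot.partial module; plain Euclidean intra-dataset distances stand in for the diffusion distances described above, and the toy data and threshold are arbitrary.

```python
# Sketch of the shared-mass idea with a partial Gromov-Wasserstein solver:
# sweep the transported mass fraction s and count target cells that receive
# almost no mass (candidate dataset-specific cells). Assumes POT's ot.partial
# module; toy data, Euclidean intra-dataset costs and threshold are arbitrary.
import numpy as np
from ot.partial import partial_gromov_wasserstein

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 20))                        # dataset 1
Y = np.vstack([rng.normal(size=(100, 20)),            # shared-like cells
               rng.normal(loc=5.0, size=(50, 20))])   # dataset-specific cells

def intra_cost(Z):
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return D / D.max()

C_x, C_y = intra_cost(X), intra_cost(Y)
p = np.full(len(X), 1 / len(X))
q = np.full(len(Y), 1 / len(Y))

for s in (0.5, 0.7, 0.9):
    pi = partial_gromov_wasserstein(C_x, C_y, p, q, m=s)
    received = pi.sum(axis=0)                         # mass each Y cell receives
    n_specific = int((received < 0.1 / len(Y)).sum())
    print(f"s={s}: {n_specific} target cells left essentially unmatched")
```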
Incorporates perturbation labels as constraints in GW/COOT formulations. Restricting the coupling
to label-compatible pairs yields an L-fold speedup (L = number of perturbation labels), enabling accurate cross-modality prediction.
🔬 How Optimal Transport is Applied:
Label-Constrained Coupling Set: Defines C^l_{p,q} = {π ∈ C_{p,q} | πᵢⱼ > 0 ⟹ label(xᵢ)=label(yⱼ)}, restricting transport to same-perturbation cells.
Labeled COOT Variant: Extends Co-Optimal Transport by learning global feature transport G with per-label sample transport {πₗ}, combining linear and quadratic OT.
Predictor Training: Uses learned coupling π* to construct barycentric projection: ŷᵢ = ∑ⱼ (πᵢⱼ*/∑ₖπᵢₖ*)yⱼ, then trains regression model f: X→Y on aligned pairs.
Label-compatible coupling: T_ij > 0 only if labels match
O(nm/L) complexity per Sinkhorn iteration vs O(nm)
Predicts RNA from ATAC for 11 kinase inhibitors with dose-response
Enriches true protein-RNA pairs in coupling matrix
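The label constraint amounts to solving one small OT problem per perturbation label and assembling a block-structured coupling, which is where the O(nm/L) saving comes from; a barycentric projection then yields aligned pseudo-pairs for a downstream predictor. The sketch below assumes the POT package and scikit-learn, places both modalities in one shared embedding for the cost (the labeled COOT variant instead learns a feature coupling), and uses toy data and a ridge regressor as placeholders.

```python
# Sketch of label-constrained OT: transport is only allowed between cells with
# the same perturbation label, so one small OT problem is solved per label and
# assembled into a block-structured coupling; a barycentric projection then
# yields aligned pseudo-pairs for a predictor. Toy data throughout; for
# simplicity both modalities share one embedding for the cost.
import numpy as np
import ot
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n_x, n_y, n_labels = 300, 300, 3
X = rng.normal(size=(n_x, 20))            # source cells (e.g. ATAC) in shared embedding
Y_emb = rng.normal(size=(n_y, 20))        # target cells (e.g. RNA) in the same embedding
Y_expr = rng.normal(size=(n_y, 50))       # target-side measurements to predict
lab_x = rng.integers(0, n_labels, size=n_x)
lab_y = rng.integers(0, n_labels, size=n_y)

pi = np.zeros((n_x, n_y))
for lab in range(n_labels):
    ix = np.where(lab_x == lab)[0]
    iy = np.where(lab_y == lab)[0]
    C = ot.dist(X[ix], Y_emb[iy])                     # squared Euclidean cost
    a = np.full(len(ix), 1 / len(ix))
    b = np.full(len(iy), 1 / len(iy))
    pi[np.ix_(ix, iy)] = ot.sinkhorn(a, b, C / C.mean(), reg=0.05)

# Barycentric projection gives a pseudo-target per source cell, then a predictor.
Y_hat = (pi / pi.sum(axis=1, keepdims=True)) @ Y_expr
predictor = Ridge(alpha=1.0).fit(X, Y_hat)
```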
Uses debiased Sinkhorn divergence as cell-cell similarity metric. Outperforms Euclidean,
Pearson, and Cosine across 13 datasets (scRNA-seq, scATAC-seq, methylation).
🔬 How Optimal Transport is Applied:
Entropic OT Cell Distance (ε=0.5): For cell pair (l,m), computes W̄_C,ε(aₗ,aₘ) where cells are normalized to probability distributions aᵢ = xᵢ/||xᵢ||₁ over 10K most variable features.
Debiased Sinkhorn Divergence: W̄(a,b) = W_ε(a,b) - [W_ε(a,a) + W_ε(b,b)]/2 ensures zero distance for identical cells and satisfies metric axioms.
Data-Driven Ground Cost: Computes gene-gene distance matrix C using Pearson correlation (RNA-seq/methylation) or Cosine similarity (ATAC-seq), capturing feature relationships.
GPU-Accelerated Sinkhorn: PyTorch implementation with automatic differentiation computes all pairwise distances for 10K cells × 10K features in minutes on single GPU.
Higher C-index and Silhouette scores than all baselines
Particularly strong for overlapping clusters and rare populations
GPU-accelerated PyTorch implementation for scalability
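The debiasing formula is a three-term combination of entropic OT costs, so it is short to implement. The sketch below assumes the POT package, whose ot.sinkhorn2 returns the regularized transport cost; the gamma-distributed toy counts, the correlation-based gene-gene ground cost, and ε are illustrative choices rather than the published settings.

```python
# Sketch of the debiased Sinkhorn divergence between cells, treating each
# normalized expression profile as a distribution over genes and using a
# data-driven gene-gene ground cost. Toy counts, correlation-based cost and
# epsilon are illustrative choices.
import numpy as np
import ot

def debiased_divergence(a, b, C, eps=0.5):
    """W_eps(a,b) - [W_eps(a,a) + W_eps(b,b)] / 2; zero when a == b."""
    w_ab = ot.sinkhorn2(a, b, C, reg=eps)
    w_aa = ot.sinkhorn2(a, a, C, reg=eps)
    w_bb = ot.sinkhorn2(b, b, C, reg=eps)
    return w_ab - 0.5 * (w_aa + w_bb)

rng = np.random.default_rng(6)
expr = rng.gamma(shape=2.0, size=(100, 200))      # 100 cells x 200 genes (toy)
probs = expr / expr.sum(axis=1, keepdims=True)    # cells as distributions over genes

corr = np.corrcoef(expr.T)                        # gene-gene correlation (200 x 200)
C = 1.0 - corr                                    # data-driven ground cost

print(float(debiased_divergence(probs[0], probs[1], C)))   # > 0 for distinct cells
print(float(debiased_divergence(probs[0], probs[0], C)))   # = 0 for identical cells
```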
CellOT: Neural Optimal Transport for Perturbations
2023 · Nature Methods · Neural OT
Uses Input Convex Neural Networks (ICNNs) to learn transport maps between control and
perturbed distributions. Predicts single-cell responses from unpaired data.
🔬 How Optimal Transport is Applied:
Monge Map Parameterization: Learns transport map T_k: ρ_control → ρ_perturbed as T_k(x) = x + ∇g_θ(x) where g_θ is Input Convex Neural Network (ICNN).
Wasserstein-2 Training Objective: Minimizes W₂²(T_k#ρ_c, ρ_k) = 𝔼[||T_k(x) - y||²] over transported control cells and observed perturbed cells.
ICNN Architecture: Ensures g_θ is convex through non-negative weights between layers and convex activation functions, guaranteeing T_k is optimal transport map.
Unpaired Data Handling: No cell-cell correspondence needed; learns population-level transformation from separate control/treated measurements, enabling cross-patient generalization.
ICNNs ensure convexity for guaranteed optimality of learned maps
Handles unpaired data - no matched control/treated cells needed
Predicts individual cell trajectories, not just population means
Generalizes to holdout patients and cross-species transfer
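A minimal Input Convex Neural Network and the induced transport map can be sketched in a few dozen lines of PyTorch. The architecture, layer sizes, and clamping-based weight constraint below are illustrative choices, not CellOT's published model; training against the Wasserstein-2 objective is omitted.

```python
# Minimal sketch of an Input Convex Neural Network (ICNN) potential and the
# induced transport map. Hidden-to-hidden and output weights are kept
# non-negative (by clamping after each optimizer step) and the activation is
# convex and non-decreasing, which makes g(x) convex in x. Sizes and the
# clamping scheme are illustrative; W2 training is omitted.
import torch
import torch.nn as nn

class ICNN(nn.Module):
    def __init__(self, dim, hidden=64, n_layers=3):
        super().__init__()
        self.x_layers = nn.ModuleList(
            [nn.Linear(dim, hidden)]
            + [nn.Linear(dim, hidden, bias=False) for _ in range(n_layers - 1)]
        )
        self.z_layers = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(n_layers - 1)]
        )
        self.out = nn.Linear(hidden, 1, bias=False)
        self.act = nn.Softplus()

    def clamp_weights(self):
        """Project hidden-to-hidden and output weights onto the non-negative orthant."""
        for layer in list(self.z_layers) + [self.out]:
            layer.weight.data.clamp_(min=0.0)

    def forward(self, x):
        z = self.act(self.x_layers[0](x))
        for x_lin, z_lin in zip(self.x_layers[1:], self.z_layers):
            z = self.act(x_lin(x) + z_lin(z))     # non-negative z-weights keep convexity
        return self.out(z)                        # scalar convex potential g(x)

def transport(g, x):
    """T(x) = x + grad g(x): the gradient of the convex function ||x||^2/2 + g(x)."""
    x = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(g(x).sum(), x)
    return (x + grad).detach()

g = ICNN(dim=50)
g.clamp_weights()                                 # also call after every optimizer step
control_cells = torch.randn(128, 50)              # toy control population
predicted_treated = transport(g, control_cells)   # pushed-forward (predicted) cells
```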
Learns conditional transport plans via flow matching on noise-to-target couplings.
Supports linear, GW, and FGW formulations with unbalanced variants.
🔬 How Optimal Transport is Applied:
Conditional Flow Matching: Learns velocity field v_θ(x,t|x₀,x₁) that transports noise distribution to target via ODE dx/dt = v_θ(x,t), matching entropic OT couplings estimated with Sinkhorn.
Entropic Linear OT (GENOT-L): For trajectory inference, uses W_ε with geodesic cost on cell manifold, preserving biologically plausible paths (TSI metric improvement).
Unbalanced OT (U-GENOT, τ<1): For perturbations involving cell death, relaxes mass conservation: the hard marginal constraints are replaced by KL divergence penalties on the coupling's marginals (strength set by τ), allowing growth and death to be learned.
Fused GW (GENOT-F, α=0.5): For cross-modality (ATAC→RNA), combines feature cost (Euclidean) and structural cost (GW) as C_total = (1-α)C_lin + αC_quad, reducing GW non-uniqueness.
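A generic minibatch OT flow-matching training step looks as follows: sample pairs from an entropic OT coupling between source and target minibatches, then regress a velocity network onto the straight-line displacement. This PyTorch/POT sketch illustrates the general technique, not the GENOT implementation, which additionally conditions the velocity field on the source sample and supports GW/FGW couplings; the network, ε, and toy data are placeholders.

```python
# Sketch of one minibatch OT flow-matching training step: sample index pairs
# from an entropic OT coupling between source and target minibatches, then
# regress a velocity network onto the straight-line displacement. Generic
# illustration; network, epsilon and toy data are placeholders.
import numpy as np
import ot
import torch
import torch.nn as nn

dim = 30
v_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
optim = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def ot_pairs(x0, x1, eps=0.05):
    """Sample index pairs (i, j) from the entropic OT coupling of two minibatches."""
    C = ot.dist(x0.numpy(), x1.numpy())
    a = np.full(len(x0), 1 / len(x0))
    b = np.full(len(x1), 1 / len(x1))
    pi = ot.sinkhorn(a, b, C / C.mean(), reg=eps)
    flat = pi.ravel() / pi.sum()
    idx = np.random.choice(flat.size, size=len(x0), p=flat)
    return np.unravel_index(idx, pi.shape)

x0 = torch.randn(256, dim)                    # source cells (e.g. noise or controls)
x1 = torch.randn(256, dim)                    # target cells
i, j = ot_pairs(x0, x1)
src, tgt = x0[torch.as_tensor(i)], x1[torch.as_tensor(j)]

t = torch.rand(len(src), 1)
x_t = (1 - t) * src + t * tgt                 # point on the straight path
target_velocity = tgt - src
pred = v_net(torch.cat([x_t, t], dim=1))
loss = ((pred - target_velocity) ** 2).mean()

optim.zero_grad()
loss.backward()
optim.step()
```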
Causal GAN that generates realistic scRNA-seq data while imposing user-defined gene
regulatory networks. Uses Wasserstein GAN framework for stable training.
🔬 How Optimal Transport is Applied:
Wasserstein GAN with Gradient Penalty (WGAN-GP): Trains causal controller to generate TF expression by minimizing Earth Mover's Distance approximated via Kantorovich-Rubinstein duality: W(p_real, p_gen) = sup_||f||_L≤1 𝔼[f(x_real)] - 𝔼[f(x_gen)].
Two-Stage Training: (1) Pre-train WGAN-GP causal controller on TF expressions independently of GRN; (2) Train target generators with architectural constraints per GRN topology.
Critic as Wasserstein Distance Estimator: Lipschitz-constrained critic network provides gradient signal from estimated W-distance between simulated and real data distributions.
Causality via Architecture: Each target gene generator receives only its regulating TFs (per GRN), ensuring causal edges are structurally imposed, not just learned correlations—validated via knockout perturbations.
Bridges simulation-experiment benchmark gap for GRN inference
Architectural constraints impose causal structure (not just correlations)
66.5% of imposed TF-gene edges show significant knockout effects
Preserves cell trajectories, pseudo-time, and biological/technical noise
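The critic side of the WGAN-GP objective is the part that directly estimates the Wasserstein distance; a minimal PyTorch sketch is below. The critic architecture, penalty weight, and placeholder expression batches are illustrative, not the published model.

```python
# Sketch of the WGAN-GP critic loss: the critic approximates the Wasserstein
# distance through the Kantorovich-Rubinstein dual, and a gradient penalty on
# interpolated samples enforces the 1-Lipschitz constraint. Architecture,
# penalty weight and placeholder batches are illustrative.
import torch
import torch.nn as nn

n_genes = 100
critic = nn.Sequential(nn.Linear(n_genes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def critic_loss(real, fake, gp_weight=10.0):
    # Wasserstein term: critic maximizes E[f(real)] - E[f(fake)], so minimize the negative.
    w_term = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolates between real and generated cells.
    alpha = torch.rand(real.size(0), 1)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    (grads,) = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    penalty = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
    return w_term + gp_weight * penalty

real_cells = torch.randn(64, n_genes)     # placeholder for real expression profiles
fake_cells = torch.randn(64, n_genes)     # placeholder for generator output
loss = critic_loss(real_cells, fake_cells)
loss.backward()
```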
Data Preprocessing: Normalize to probability distributions (sum to 1) for OT;
log-transform scRNA-seq counts and apply an appropriate transform for other modalities (see the sketch at the end of this list)
Growth Rate Estimation: Use proliferation/apoptosis signatures or cell cycle scoring;
iterative refinement improves accuracy
Hyperparameter Tuning: Entropy regularization ε (0.001-0.1) trades off between
exact OT and regularized smoothness; use grid search with interpolation validation
Ground Cost Selection: Pearson correlation for RNA-seq, Cosine for ATAC-seq/methylation;
geodesic distances on manifolds when available
Computational Efficiency: Use local PCA (30-50D) instead of full gene space;
GPU acceleration for Sinkhorn iterations; hierarchical coarse-graining for very large datasets
Validation: Geodesic interpolation for trajectory methods; Label Transfer Accuracy
and Alignment Score for integration; compare to simple baselines (mean, linear models)
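As a companion to the preprocessing and efficiency items above, the sketch below walks through one common pipeline on toy counts: library-size normalization, log1p transform, PCA to 30 dimensions for the transport cost, and per-cell probability normalization for gene-level ground costs. It uses NumPy and scikit-learn; on real AnnData objects the equivalent scanpy steps (normalize_total, log1p, pca) would typically be used instead.

```python
# Sketch of a common preprocessing path for OT analyses on toy counts:
# library-size normalization, log1p transform, PCA to 30 dimensions for the
# transport cost, and per-cell probability normalization for gene-level
# ground costs. Toy Poisson counts stand in for a real count matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
counts = rng.poisson(lam=1.0, size=(1000, 2000)).astype(float)   # cells x genes

# 1) Library-size normalization and log transform (scRNA-seq convention).
libsize = counts.sum(axis=1, keepdims=True)
norm = counts / np.clip(libsize, 1, None) * 1e4
logged = np.log1p(norm)

# 2) Reduced space for the transport cost (30-50 dimensions is typical).
embedding = PCA(n_components=30).fit_transform(logged)

# 3) Cells as probability distributions over genes, for gene-level ground costs.
probs = norm / norm.sum(axis=1, keepdims=True)

print(embedding.shape, probs.sum(axis=1)[:3])   # each cell's mass sums to 1
```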