Optimal Transport & Wasserstein Distance in Single-Cell Biology
Mathematical & Algorithmic Foundations for Cellular Distribution Analysis
A rigorous guide to optimal transport theory and Wasserstein metrics in computational biology.
From mathematical foundations (Monge, Kantorovich, entropic regularization) to modern applications
in trajectory inference, multi-omics integration, and distributional comparison. Covering landmark
methods (Waddington-OT, moscot, SCOT) and theoretical advances (2019-2025).
12 research papers · 6 years of development · 1.7M+ cells processed · 5 application domains
🎯 Core Concepts
What is Optimal Transport?
Optimal transport theory provides a mathematical framework for comparing probability distributions
by computing the minimal "cost" of transforming one distribution into another. Originally formulated
by Gaspard Monge in 1781 and generalized by Leonid Kantorovich in 1942, the theory asks:
What is the most efficient way to move a pile of dirt from one location to match a desired shape elsewhere?
The Monge Problem: Moving Mass Optimally
Intuition: Each mass point (circle) must be transported to the target distribution along optimal paths (arrows) that minimize total cost.
The Wasserstein Distance
The Wasserstein distance (also called Earth Mover's Distance) quantifies the dissimilarity between
two probability distributions as the minimal cost to transport mass from one to the other. For
single-cell biology:
Wasserstein Distance: Measuring Distribution Dissimilarity
Key Insight: The Wasserstein distance is the mass-weighted sum of transport path lengths: small for similar distributions, large for dissimilar ones (a minimal numeric sketch follows the list below).
Cell-Cell Similarity: When gene expression profiles are treated as probability distributions, the
Wasserstein distance captures how much "work" is needed to transform one cell's expression into another's
Population Alignment: Compares entire cell populations by finding optimal couplings
that preserve local neighborhood structure
Trajectory Inference: Models cellular differentiation as probability distributions
flowing through gene expression space over time
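To make the definition concrete, the following minimal sketch computes the exact transport cost between two toy "cells" treated as probability distributions. It assumes NumPy and the POT package; the four-gene expression vectors and the 1-D gene "positions" used for the ground cost are purely illustrative.

```python
# Minimal sketch: exact transport cost between two toy "cells", each treated as
# a probability distribution over the same genes. Assumes NumPy and the POT
# package (pip install pot); the expression vectors and the 1-D gene "positions"
# used for the ground cost are purely illustrative.
import numpy as np
import ot  # Python Optimal Transport

cell_a = np.array([5.0, 1.0, 0.0, 2.0])   # hypothetical expression profile
cell_b = np.array([0.0, 3.0, 4.0, 1.0])

# Normalize so each cell's mass sums to 1.
a = cell_a / cell_a.sum()
b = cell_b / cell_b.sum()

# Ground cost: squared distance between gene "positions" in a toy 1-D layout;
# real analyses use gene-gene or cell-cell costs computed in PCA space.
positions = np.arange(4, dtype=float).reshape(-1, 1)
M = ot.dist(positions, positions)         # squared Euclidean cost matrix

cost = ot.emd2(a, b, M)                   # exact Kantorovich transport cost
print(f"Transport cost between the two cells: {cost:.4f}")
```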
Why Optimal Transport for Biology?
Natural Framework: Biological processes (cell differentiation, population dynamics)
naturally involve mass transport with growth and death
Handles Unpaired Data: Unlike supervised methods, OT aligns datasets without requiring
matched cells or shared features
Probabilistic Interpretation: Coupling matrices provide interpretable cell-cell
correspondence probabilities for downstream analysis
Single-Cell Biology Applications
Summary: OT provides a unified mathematical framework for diverse single-cell problems: comparing cells, aligning populations, inferring trajectories, and integrating multi-omics data.
Key Challenges Addressed
Computational Tractability: Exact OT solvers scale roughly as O(n³) in the number of cells;
entropic regularization (the Sinkhorn algorithm) brings this down to near-O(n²), since each iteration costs only two matrix-vector products (a minimal sketch follows this list)
Unbalanced Mass: Biological systems have growth/death - unbalanced OT with KL
penalties accommodates varying cell numbers
Cross-Modality Integration: Different omics have distinct feature spaces -
Gromov-Wasserstein compares via pairwise distances instead of direct feature matching
Scalability to Atlases: Linear memory implementations (moscot) enable analysis
of 1.7M+ cell datasets
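The Sinkhorn iteration mentioned in the list above is compact enough to write out in full. The code below is a plain NumPy sketch of entropically regularized OT, not any particular package's solver; the toy PCA coordinates, ε, and the fixed iteration count are illustrative, and production code would iterate in log-space when ε is very small.

```python
# Plain NumPy sketch of the Sinkhorn algorithm for entropic OT. Each iteration
# is two matrix-vector products (O(n^2) work). Toy PCA coordinates, epsilon and
# the fixed iteration count are illustrative choices.
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Entropic OT coupling between histograms a and b with cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # enforce row marginals
        v = b / (K.T @ u)                 # enforce column marginals
    return u[:, None] * K * v[None, :]    # coupling pi = diag(u) K diag(v)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 30))            # e.g. 100 cells at time t (30D PCA)
y = rng.normal(size=(120, 30))            # 120 cells at time t+1
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
C /= C.mean()                             # rescale so eps is on a comparable scale
a = np.full(100, 1 / 100)
b = np.full(120, 1 / 120)
pi = sinkhorn(a, b, C)
print(pi.shape, round(pi.sum(), 4))       # coupling mass sums to ~1
```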
📈 Major Application Domains
1. Trajectory Inference & Temporal Dynamics
Waddington-OT: Developmental Trajectory Inference
2019 · Cell · Temporal OT
Pioneering application of OT to trajectory inference from time-series scRNA-seq. Models
cellular differentiation as probability distributions flowing through expression space,
explicitly accounting for differential growth/death rates.
🔬 How Optimal Transport is Applied:
Unbalanced Kantorovich Formulation: Extends classical OT by adding growth/death terms through KL divergence penalties (λ₁=1, λ₂=50), allowing cell populations to change mass between timepoints.
Entropic Regularization (ε=0.05): Uses Sinkhorn algorithm to efficiently solve OT problems in 30D PCA space, trading exact optimality for computational tractability.
Temporal Coupling Maps: Computes transport couplings π(t,t+Δt) between consecutive timepoints using squared Euclidean distance as the cost function, representing probabilistic cell fate relationships.
Iterative Growth Rate Refinement: Initializes from proliferation/apoptosis gene signatures, then iteratively updates growth rates based on transport maps to ensure consistency with observed cell count changes.
315,000 cells across 39 timepoints during iPSC reprogramming
Revealed diverse developmental programs: stromal, pluripotent, trophoblast, neural
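The unbalanced Kantorovich formulation above can be sketched with a small change to the Sinkhorn loop: KL marginal penalties turn the exact marginal projections into damped (power) updates with exponent λ/(λ+ε). The code below is a generic NumPy sketch of unbalanced entropic OT between two timepoints, not the Waddington-OT implementation; the toy PCA coordinates and placeholder growth rates are assumptions, while ε, λ₁, and λ₂ echo the defaults quoted above.

```python
# Generic NumPy sketch of unbalanced entropic OT between two timepoints: the KL
# marginal penalties turn the exact Sinkhorn projections into damped (power)
# updates with exponent lambda / (lambda + eps). Not the Waddington-OT code;
# toy PCA coordinates and placeholder growth rates are assumptions.
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, lam1=1.0, lam2=50.0, n_iter=1000):
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    f1 = lam1 / (lam1 + eps)              # damping for the source marginal
    f2 = lam2 / (lam2 + eps)              # damping for the target marginal
    for _ in range(n_iter):
        u = (a / (K @ v)) ** f1
        v = (b / (K.T @ u)) ** f2
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
xt = rng.normal(size=(200, 30))           # cells at day t (toy 30D PCA)
xt1 = rng.normal(size=(250, 30))          # cells at day t+1
C = ((xt[:, None, :] - xt1[None, :, :]) ** 2).sum(-1)
C /= C.mean()

g = np.ones(200)                          # placeholder growth rates (from signatures)
a = g / g.sum()                           # growth-weighted source marginal
b = np.full(250, 1 / 250)
pi = unbalanced_sinkhorn(a, b, C)
fate_probs = pi / pi.sum(axis=1, keepdims=True)   # descendant distribution per cell
```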
PRESCIENT: Generative Modeling of Cell Trajectories
2021 · Nature Communications · Generative OT
Models differentiation as diffusion process over potential landscape. Uses regularized
Wasserstein loss to fit neural network-parameterized drift and growth terms, enabling
in silico perturbation experiments.
🔬 How Optimal Transport is Applied:
Wasserstein-2 Loss for Neural Network Training: Minimizes W₂² distance between predicted and observed cell distributions using squared Euclidean cost in 50D PCA space.
Potential Landscape Parameterization: Neural networks learn drift F(x) = -∇U(x) + h(x) and growth g(x) terms that minimize the Wasserstein distance to empirical timepoints.
Proliferation Weighting: Each cell weighted by expected number of descendants at final time, computed via forward integration of learned growth function g(x).
Stochastic Sampling for Prediction: Generates cell trajectories via Euler-Maruyama simulation of SDE dx = F(x)dt + σdW, enabling in silico perturbation screens by modifying F or g.
Potential landscape framework connects to Waddington's epigenetic landscape
Incorporates cell proliferation by weighting based on expected descendants
Outperforms alternatives on fate bias prediction with lineage tracing validation
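A PRESCIENT-style forward simulation reduces to Euler-Maruyama integration of a drift defined as the negative gradient of a scalar potential. The PyTorch sketch below is a generic illustration of that step; the potential network, step size, noise level, and toy initial cells are placeholders rather than PRESCIENT's published architecture or hyperparameters.

```python
# Sketch of Euler-Maruyama simulation of dx = -grad U(x) dt + sigma dW, where
# the scalar potential U is a small neural network. Architecture, step size and
# noise scale are illustrative placeholders.
import torch
import torch.nn as nn

dim = 50                                          # e.g. 50D PCA space
potential = nn.Sequential(                        # U_theta: R^dim -> R
    nn.Linear(dim, 128), nn.Softplus(),
    nn.Linear(128, 1),
)

def drift(x):
    """F(x) = -grad U(x), computed with autograd."""
    xg = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(potential(xg).sum(), xg)
    return -grad.detach()

def simulate(x0, n_steps=100, dt=0.1, sigma=0.2):
    """Euler-Maruyama rollout of the SDE starting from cells x0."""
    x = x0.clone()
    for _ in range(n_steps):
        x = x + drift(x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x

x0 = torch.randn(256, dim)                        # toy "day 0" cells in PCA space
xT = simulate(x0)                                 # predicted cells at the final time
```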
Uses Wasserstein-Fisher-Rao distance to simultaneously capture gene expression velocity
and population growth. Neural ODEs solve high-dimensional OT efficiently with meshless formulation.
🔬 How Optimal Transport is Applied:
Wasserstein-Fisher-Rao (WFR) Metric: Combines transport (W₂) with growth/death (Fisher-Rao) using parameter α to balance contribution: d_WFR = √(W₂² + α·FR²).
Neural ODE Velocity Field: Parameterizes velocity v(x,t) with neural networks, solving continuity equation ∂ₜρ + ∇·(vρ) = g·ρ where g represents growth rate.
Dynamic Formulation (Benamou-Brenier): Minimizes kinetic energy ∫₀ᵀ ∫|v(x,t)|²ρ(x,t)dxdt subject to matching marginals at observed timepoints.
Meshless Particle-Based Solver: Discretizes distributions as weighted particles moving along learned velocity field, scaling to 100K+ cells without spatial discretization.
WFR distance separates transport (Wasserstein) from growth (Fisher-Rao)
Infers time-varying gene regulatory networks via velocity field gradients
Reconstructs expression at unmeasured timepoints and cell-cell communication dynamics
Preserves trajectories and pseudo-time with high fidelity
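The transport-with-growth dynamics above admit a simple particle (Lagrangian) discretization: particle positions follow the velocity field while particle weights grow or shrink with the local growth rate. The NumPy sketch below illustrates that scheme with placeholder velocity and growth functions; it is not the neural-ODE solver described above.

```python
# Schematic particle (Lagrangian) discretization of transport with growth:
# positions follow dx/dt = v(x, t) and weights follow dw/dt = g(x, t) * w,
# which is the particle view of d rho/dt + div(v rho) = g rho. The velocity
# and growth functions are placeholders, not trained neural networks.
import numpy as np

def v(x, t):                              # placeholder velocity field
    return -0.1 * x

def g(x, t):                              # placeholder growth rate per unit time
    return 0.05 * np.ones(len(x))

def evolve(x, w, t0=0.0, t1=1.0, n_steps=100):
    """Forward-Euler update of weighted particles (x_i, w_i)."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + dt * v(x, t)              # transport (Wasserstein part)
        w = w * np.exp(dt * g(x, t))      # growth/death (Fisher-Rao part)
        t += dt
    return x, w / w.sum()                 # renormalize to compare distributions

rng = np.random.default_rng(2)
x0 = rng.normal(size=(1000, 30))          # initial cells (toy 30D PCA)
w0 = np.full(1000, 1 / 1000)              # initially uniform weights
x1, w1 = evolve(x0, w0)
```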
moscot: Unified Framework for Temporal-Spatial Mapping
2025 · Nature · Scalable OT
Scalable unified framework supporting temporal, spatial, and spatiotemporal OT with
multimodal integration. Linear memory complexity enables atlas-scale analysis.
🔬 How Optimal Transport is Applied:
Modular Problem Composition: Constructs cost matrices C from multiple sources (expression, spatial coordinates, lineage, time) and combines via C_total = Σᵢ αᵢCᵢ for flexible problem specification.
Low-Rank Sinkhorn (ε=0.005-0.01): Factorizes the coupling matrix as π ≈ Q diag(1/g) Rᵀ with rank r=50-100 (Q is n×r, R is m×r), reducing memory from O(n²) to O(nr).
Quadratic Problems (Gromov-Wasserstein): For cross-modality (RNA→ATAC), minimizes ∑ᵢⱼₖₗ (C^X_ik - C^Y_jl)²πᵢⱼπₖₗ using entropic regularization with ε_quad=0.1.
Temporal & Spatial Extensions: Handles temporal couplings with growth models (unbalanced OT with τ=0.5-1.0) and spatial problems using Euclidean/geodesic costs with fused GW (α=0.5).
Processes 1.7M cells (mouse embryogenesis) where competitors fail
10-100× faster with linear memory through online cost evaluation
Supports W-type, GW-type, FGW-type formulations in unified API
Validated NEUROD2 as epsilon cell regulator in human iPSC-islets
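The modular cost composition C_total = Σᵢ αᵢCᵢ and the memory argument for low-rank couplings can be illustrated directly. The sketch below is plain NumPy, not the moscot API; the α weights, toy expression/spatial coordinates, and rank value are arbitrary choices.

```python
# Plain NumPy sketch of modular cost composition, C_total = sum_i alpha_i * C_i,
# combining an expression cost and a spatial cost, plus the memory arithmetic
# behind a rank-r coupling factorization pi ~ Q diag(1/g) R^T. Not the moscot
# API; the alpha weights, toy coordinates and rank are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
n_src, n_tgt = 500, 600
expr_src = rng.normal(size=(n_src, 30))   # 30D PCA of expression
expr_tgt = rng.normal(size=(n_tgt, 30))
xy_src = rng.uniform(size=(n_src, 2))     # spatial coordinates, slide 1
xy_tgt = rng.uniform(size=(n_tgt, 2))     # spatial coordinates, slide 2

def sq_dist(x, y):
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

C_expr = sq_dist(expr_src, expr_tgt)
C_space = sq_dist(xy_src, xy_tgt)

# Rescale each term so no single modality dominates, then combine.
alphas = {"expression": 0.7, "spatial": 0.3}
C_total = (alphas["expression"] * C_expr / C_expr.mean()
           + alphas["spatial"] * C_space / C_space.mean())

# A full coupling needs O(n^2) memory; a rank-r factorization stores O(n * r).
r = 50
full_entries = n_src * n_tgt
low_rank_entries = (n_src + n_tgt) * r + r
print(C_total.shape, full_entries, low_rank_entries)
```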
SCOT: Gromov-Wasserstein for Multi-Omics Alignment
2020 · ICML Workshop · GW-OT
First application of Gromov-Wasserstein to single-cell multi-omics. Aligns datasets with
unmatched features by comparing k-NN graph geodesic distances instead of feature values.
🔬 How Optimal Transport is Applied:
Gromov-Wasserstein (GW) for Feature Mismatch: Minimizes ∑ᵢⱼₖₗ |d_X(xᵢ,xₖ) - d_Y(yⱼ,yₗ)|²πᵢⱼπₖₗ where d_X and d_Y are intra-dataset distances, enabling cross-modality alignment (e.g., scRNA-seq ↔ scATAC-seq).
k-NN Graph Geodesic Distances: Constructs k=30 nearest neighbor graphs in each modality, computes shortest path distances as cost matrices C^X and C^Y to capture local manifold geometry.
Entropic GW Solver (ε=0.1): Uses projected gradient descent with Sinkhorn iterations for inner entropic OT subproblems, converging in 100-500 iterations.
Self-Tuning for Unbalanced Data: Employs KL divergence marginal penalties with λ=0.1 to handle different numbers of cells per dataset without strict mass conservation.
15-50× faster than alternatives with only 2 hyperparameters vs 4-5
Lowest FOSCTTM on all 3 simulated datasets
Successfully aligns RNA-ATAC, RNA-methylation without common features
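A SCOT-style alignment can be assembled from standard tooling: k-NN geodesic cost matrices per modality, matched with entropic Gromov-Wasserstein. The sketch below assumes scikit-learn, SciPy, and the POT package; k, ε, and the toy embeddings are illustrative, and this is not the SCOT codebase itself.

```python
# Sketch of a SCOT-style alignment: k-NN geodesic cost matrices per modality,
# matched with entropic Gromov-Wasserstein. Assumes scikit-learn, SciPy and the
# POT package; k, epsilon and the toy embeddings are illustrative.
import numpy as np
import ot
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_cost(X, k=30):
    """Intra-dataset geodesic distances on a k-NN graph, normalized to [0, 1]."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="distance")
    D = shortest_path(graph, directed=False)
    D[np.isinf(D)] = D[np.isfinite(D)].max()   # bridge disconnected components
    return D / D.max()

rng = np.random.default_rng(4)
rna = rng.normal(size=(300, 50))          # toy scRNA-seq embedding
atac = rng.normal(size=(250, 20))         # toy scATAC-seq embedding

C_rna, C_atac = geodesic_cost(rna), geodesic_cost(atac)
p = np.full(300, 1 / 300)
q = np.full(250, 1 / 250)

coupling = ot.gromov.entropic_gromov_wasserstein(
    C_rna, C_atac, p, q, loss_fun="square_loss", epsilon=0.1
)
# Barycentric projection maps each RNA cell into the ATAC embedding.
rna_in_atac = (coupling / coupling.sum(axis=1, keepdims=True)) @ atac
```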
Extends GW-OT to partial alignment scenarios where datasets have distinct cell types.
Virtual points absorb dataset-specific cells while aligning shared structures.
🔬 How Optimal Transport is Applied:
Partial Gromov-Wasserstein (PGW): Extends GW by adding virtual "dust bin" points and transporting only a fraction s of the total mass (0 < s ≤ 1); dataset-specific cells are absorbed by the virtual points instead of being forced into the alignment.
SPL Curve for Mass Selection: Sweeps s from 0 to 1, plots SPL = ∑ᵢⱼ πᵢⱼ·similarity(i,j). Sharp drop indicates dataset-specific cells; plateau gives optimal shared mass.
Manifold-Preserving Costs: Uses diffusion distance (t=1, k=15 neighbors) to compute C^X and C^Y, capturing global manifold structure beyond local k-NN graphs.
Iterative Refinement with sc-GEM: Alternates between (1) partial GW alignment and (2) guided entry of gene modules to improve biological consistency of coupling.
First method to explicitly handle dataset-specific cell types
SPL curve estimates shared cell number automatically
Alignment Score 0.834-0.907 on partial alignment benchmarks
Identified DNMT3B as key regulator in iPSC reprogramming (sc-GEM)
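The shared-mass idea above can be illustrated with a partial Gromov-Wasserstein solver: sweep the transported mass fraction s and check how much mass the presumed dataset-specific cells receive. The sketch below assumes POT's ot.partial module; plain Euclidean intra-dataset distances stand in for the diffusion distances described above, and the toy data and threshold are arbitrary.

```python
# Sketch of the shared-mass idea with a partial Gromov-Wasserstein solver:
# sweep the transported mass fraction s and count target cells that receive
# almost no mass (candidate dataset-specific cells). Assumes POT's ot.partial
# module; toy data, Euclidean intra-dataset costs and threshold are arbitrary.
import numpy as np
from ot.partial import partial_gromov_wasserstein

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 20))                        # dataset 1
Y = np.vstack([rng.normal(size=(100, 20)),            # shared-like cells
               rng.normal(loc=5.0, size=(50, 20))])   # dataset-specific cells

def intra_cost(Z):
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return D / D.max()

C_x, C_y = intra_cost(X), intra_cost(Y)
p = np.full(len(X), 1 / len(X))
q = np.full(len(Y), 1 / len(Y))

for s in (0.5, 0.7, 0.9):
    pi = partial_gromov_wasserstein(C_x, C_y, p, q, m=s)
    received = pi.sum(axis=0)                         # mass each Y cell receives
    n_specific = int((received < 0.1 / len(Y)).sum())
    print(f"s={s}: {n_specific} target cells left essentially unmatched")
```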
Incorporates perturbation labels as constraints in GW/COOT formulations. Restricting the coupling
to label-compatible pairs yields an L-fold speedup (L = number of perturbation labels), enabling accurate cross-modality prediction.
🔬 How Optimal Transport is Applied:
Label-Constrained Coupling Set: Defines C^l_{p,q} = {π ∈ C_{p,q} | πᵢⱼ > 0 ⟹ label(xᵢ)=label(yⱼ)}, restricting transport to same-perturbation cells.
Labeled COOT Variant: Extends Co-Optimal Transport by learning global feature transport G with per-label sample transport {πₗ}, combining linear and quadratic OT.
Predictor Training: Uses learned coupling π* to construct barycentric projection: ŷᵢ = ∑ⱼ (πᵢⱼ*/∑ₖπᵢₖ*)yⱼ, then trains regression model f: X→Y on aligned pairs.
Label-compatible coupling: T_ij > 0 only if labels match
O(nm/L) complexity per Sinkhorn iteration vs O(nm)
Predicts RNA from ATAC for 11 kinase inhibitors with dose-response
Enriches true protein-RNA pairs in coupling matrix
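The label constraint amounts to solving one small OT problem per perturbation label and assembling a block-structured coupling, which is where the O(nm/L) saving comes from; a barycentric projection then yields aligned pseudo-pairs for a downstream predictor. The sketch below assumes the POT package and scikit-learn, places both modalities in one shared embedding for the cost (the labeled COOT variant instead learns a feature coupling), and uses toy data and a ridge regressor as placeholders.

```python
# Sketch of label-constrained OT: transport is only allowed between cells with
# the same perturbation label, so one small OT problem is solved per label and
# assembled into a block-structured coupling; a barycentric projection then
# yields aligned pseudo-pairs for a predictor. Toy data throughout; for
# simplicity both modalities share one embedding for the cost.
import numpy as np
import ot
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n_x, n_y, n_labels = 300, 300, 3
X = rng.normal(size=(n_x, 20))            # source cells (e.g. ATAC) in shared embedding
Y_emb = rng.normal(size=(n_y, 20))        # target cells (e.g. RNA) in the same embedding
Y_expr = rng.normal(size=(n_y, 50))       # target-side measurements to predict
lab_x = rng.integers(0, n_labels, size=n_x)
lab_y = rng.integers(0, n_labels, size=n_y)

pi = np.zeros((n_x, n_y))
for lab in range(n_labels):
    ix = np.where(lab_x == lab)[0]
    iy = np.where(lab_y == lab)[0]
    C = ot.dist(X[ix], Y_emb[iy])                     # squared Euclidean cost
    a = np.full(len(ix), 1 / len(ix))
    b = np.full(len(iy), 1 / len(iy))
    pi[np.ix_(ix, iy)] = ot.sinkhorn(a, b, C / C.mean(), reg=0.05)

# Barycentric projection gives a pseudo-target per source cell, then a predictor.
Y_hat = (pi / pi.sum(axis=1, keepdims=True)) @ Y_expr
predictor = Ridge(alpha=1.0).fit(X, Y_hat)
```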
Uses debiased Sinkhorn divergence as cell-cell similarity metric. Outperforms Euclidean,
Pearson, and Cosine across 13 datasets (scRNA-seq, scATAC-seq, methylation).
🔬 How Optimal Transport is Applied:
Entropic OT Cell Distance (ε=0.5): For cell pair (l,m), computes W̄_C,ε(aₗ,aₘ) where cells are normalized to probability distributions aᵢ = xᵢ/||xᵢ||₁ over 10K most variable features.
Debiased Sinkhorn Divergence: W̄(a,b) = W_ε(a,b) - [W_ε(a,a) + W_ε(b,b)]/2 ensures zero distance for identical cells and satisfies metric axioms.
Data-Driven Ground Cost: Computes gene-gene distance matrix C using Pearson correlation (RNA-seq/methylation) or Cosine similarity (ATAC-seq), capturing feature relationships.
GPU-Accelerated Sinkhorn: PyTorch implementation with automatic differentiation computes all pairwise distances for 10K cells × 10K features in minutes on single GPU.
Higher C-index and Silhouette scores than all baselines
Particularly strong for overlapping clusters and rare populations
GPU-accelerated PyTorch implementation for scalability
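The debiasing formula is a three-term combination of entropic OT costs, so it is short to implement. The sketch below assumes the POT package, whose ot.sinkhorn2 returns the regularized transport cost; the gamma-distributed toy counts, the correlation-based gene-gene ground cost, and ε are illustrative choices rather than the published settings.

```python
# Sketch of the debiased Sinkhorn divergence between cells, treating each
# normalized expression profile as a distribution over genes and using a
# data-driven gene-gene ground cost. Toy counts, correlation-based cost and
# epsilon are illustrative choices.
import numpy as np
import ot

def debiased_divergence(a, b, C, eps=0.5):
    """W_eps(a,b) - [W_eps(a,a) + W_eps(b,b)] / 2; zero when a == b."""
    w_ab = ot.sinkhorn2(a, b, C, reg=eps)
    w_aa = ot.sinkhorn2(a, a, C, reg=eps)
    w_bb = ot.sinkhorn2(b, b, C, reg=eps)
    return w_ab - 0.5 * (w_aa + w_bb)

rng = np.random.default_rng(6)
expr = rng.gamma(shape=2.0, size=(100, 200))      # 100 cells x 200 genes (toy)
probs = expr / expr.sum(axis=1, keepdims=True)    # cells as distributions over genes

corr = np.corrcoef(expr.T)                        # gene-gene correlation (200 x 200)
C = 1.0 - corr                                    # data-driven ground cost

print(float(debiased_divergence(probs[0], probs[1], C)))   # > 0 for distinct cells
print(float(debiased_divergence(probs[0], probs[0], C)))   # = 0 for identical cells
```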
CellOT: Neural Optimal Transport for Perturbations
2023 · Nature Methods · Neural OT
Uses Input Convex Neural Networks (ICNNs) to learn transport maps between control and
perturbed distributions. Predicts single-cell responses from unpaired data.
🔬 How Optimal Transport is Applied:
Monge Map Parameterization: Learns transport map T_k: ρ_control → ρ_perturbed as T_k(x) = x + ∇g_θ(x) where g_θ is Input Convex Neural Network (ICNN).
Wasserstein-2 Training Objective: Minimizes W₂²(T_k#ρ_c, ρ_k) = 𝔼[||T_k(x) - y||²] over transported control cells and observed perturbed cells.
ICNN Architecture: Ensures g_θ is convex through non-negative weights between layers and convex activation functions, guaranteeing T_k is optimal transport map.
Unpaired Data Handling: No cell-cell correspondence needed; learns population-level transformation from separate control/treated measurements, enabling cross-patient generalization.
ICNNs ensure convexity for guaranteed optimality of learned maps
Handles unpaired data - no matched control/treated cells needed
Predicts individual cell trajectories, not just population means
Generalizes to holdout patients and cross-species transfer
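A minimal Input Convex Neural Network and the induced transport map can be sketched in a few dozen lines of PyTorch. The architecture, layer sizes, and clamping-based weight constraint below are illustrative choices, not CellOT's published model; training against the Wasserstein-2 objective is omitted.

```python
# Minimal sketch of an Input Convex Neural Network (ICNN) potential and the
# induced transport map. Hidden-to-hidden and output weights are kept
# non-negative (by clamping after each optimizer step) and the activation is
# convex and non-decreasing, which makes g(x) convex in x. Sizes and the
# clamping scheme are illustrative; W2 training is omitted.
import torch
import torch.nn as nn

class ICNN(nn.Module):
    def __init__(self, dim, hidden=64, n_layers=3):
        super().__init__()
        self.x_layers = nn.ModuleList(
            [nn.Linear(dim, hidden)]
            + [nn.Linear(dim, hidden, bias=False) for _ in range(n_layers - 1)]
        )
        self.z_layers = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(n_layers - 1)]
        )
        self.out = nn.Linear(hidden, 1, bias=False)
        self.act = nn.Softplus()

    def clamp_weights(self):
        """Project hidden-to-hidden and output weights onto the non-negative orthant."""
        for layer in list(self.z_layers) + [self.out]:
            layer.weight.data.clamp_(min=0.0)

    def forward(self, x):
        z = self.act(self.x_layers[0](x))
        for x_lin, z_lin in zip(self.x_layers[1:], self.z_layers):
            z = self.act(x_lin(x) + z_lin(z))     # non-negative z-weights keep convexity
        return self.out(z)                        # scalar convex potential g(x)

def transport(g, x):
    """T(x) = x + grad g(x): the gradient of the convex function ||x||^2/2 + g(x)."""
    x = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(g(x).sum(), x)
    return (x + grad).detach()

g = ICNN(dim=50)
g.clamp_weights()                                 # also call after every optimizer step
control_cells = torch.randn(128, 50)              # toy control population
predicted_treated = transport(g, control_cells)   # pushed-forward (predicted) cells
```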
Learns conditional transport plans via flow matching on noise-to-target couplings.
Supports linear, GW, and FGW formulations with unbalanced variants.
🔬 How Optimal Transport is Applied:
Conditional Flow Matching: Learns velocity field v_θ(x,t|x₀,x₁) that transports noise distribution to target via ODE dx/dt = v_θ(x,t), matching entropic OT couplings estimated with Sinkhorn.
Entropic Linear OT (GENOT-L): For trajectory inference, uses W_ε with geodesic cost on cell manifold, preserving biologically plausible paths (TSI metric improvement).
Unbalanced OT (U-GENOT, τ<1): For perturbations involving cell death, relaxes mass conservation: the hard marginal constraints are replaced by KL divergence penalties on the coupling's marginals (strength set by τ), allowing growth and death to be learned.
Fused GW (GENOT-F, α=0.5): For cross-modality (ATAC→RNA), combines feature cost (Euclidean) and structural cost (GW) as C_total = (1-α)C_lin + αC_quad, reducing GW non-uniqueness.
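A generic minibatch OT flow-matching training step looks as follows: sample pairs from an entropic OT coupling between source and target minibatches, then regress a velocity network onto the straight-line displacement. This PyTorch/POT sketch illustrates the general technique, not the GENOT implementation, which additionally conditions the velocity field on the source sample and supports GW/FGW couplings; the network, ε, and toy data are placeholders.

```python
# Sketch of one minibatch OT flow-matching training step: sample index pairs
# from an entropic OT coupling between source and target minibatches, then
# regress a velocity network onto the straight-line displacement. Generic
# illustration; network, epsilon and toy data are placeholders.
import numpy as np
import ot
import torch
import torch.nn as nn

dim = 30
v_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
optim = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def ot_pairs(x0, x1, eps=0.05):
    """Sample index pairs (i, j) from the entropic OT coupling of two minibatches."""
    C = ot.dist(x0.numpy(), x1.numpy())
    a = np.full(len(x0), 1 / len(x0))
    b = np.full(len(x1), 1 / len(x1))
    pi = ot.sinkhorn(a, b, C / C.mean(), reg=eps)
    flat = pi.ravel() / pi.sum()
    idx = np.random.choice(flat.size, size=len(x0), p=flat)
    return np.unravel_index(idx, pi.shape)

x0 = torch.randn(256, dim)                    # source cells (e.g. noise or controls)
x1 = torch.randn(256, dim)                    # target cells
i, j = ot_pairs(x0, x1)
src, tgt = x0[torch.as_tensor(i)], x1[torch.as_tensor(j)]

t = torch.rand(len(src), 1)
x_t = (1 - t) * src + t * tgt                 # point on the straight path
target_velocity = tgt - src
pred = v_net(torch.cat([x_t, t], dim=1))
loss = ((pred - target_velocity) ** 2).mean()

optim.zero_grad()
loss.backward()
optim.step()
```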
Causal GAN that generates realistic scRNA-seq data while imposing user-defined gene
regulatory networks. Uses Wasserstein GAN framework for stable training.
🔬 How Optimal Transport is Applied:
Wasserstein GAN with Gradient Penalty (WGAN-GP): Trains causal controller to generate TF expression by minimizing Earth Mover's Distance approximated via Kantorovich-Rubinstein duality: W(p_real, p_gen) = sup_||f||_L≤1 𝔼[f(x_real)] - 𝔼[f(x_gen)].
Two-Stage Training: (1) Pre-train WGAN-GP causal controller on TF expressions independently of GRN; (2) Train target generators with architectural constraints per GRN topology.
Critic as Wasserstein Distance Estimator: Lipschitz-constrained critic network provides gradient signal from estimated W-distance between simulated and real data distributions.
Causality via Architecture: Each target gene generator receives only its regulating TFs (per GRN), ensuring causal edges are structurally imposed, not just learned correlations—validated via knockout perturbations.
Bridges simulation-experiment benchmark gap for GRN inference
Architectural constraints impose causal structure (not just correlations)
66.5% of imposed TF-gene edges show significant knockout effects
Preserves cell trajectories, pseudo-time, and biological/technical noise
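The critic side of the WGAN-GP objective is the part that directly estimates the Wasserstein distance; a minimal PyTorch sketch is below. The critic architecture, penalty weight, and placeholder expression batches are illustrative, not the published model.

```python
# Sketch of the WGAN-GP critic loss: the critic approximates the Wasserstein
# distance through the Kantorovich-Rubinstein dual, and a gradient penalty on
# interpolated samples enforces the 1-Lipschitz constraint. Architecture,
# penalty weight and placeholder batches are illustrative.
import torch
import torch.nn as nn

n_genes = 100
critic = nn.Sequential(nn.Linear(n_genes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def critic_loss(real, fake, gp_weight=10.0):
    # Wasserstein term: critic maximizes E[f(real)] - E[f(fake)], so minimize the negative.
    w_term = critic(fake).mean() - critic(real).mean()

    # Gradient penalty on random interpolates between real and generated cells.
    alpha = torch.rand(real.size(0), 1)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    (grads,) = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    penalty = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
    return w_term + gp_weight * penalty

real_cells = torch.randn(64, n_genes)     # placeholder for real expression profiles
fake_cells = torch.randn(64, n_genes)     # placeholder for generator output
loss = critic_loss(real_cells, fake_cells)
loss.backward()
```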
Data Preprocessing: Normalize to probability distributions (sum to 1) for OT;
log-transform scRNA-seq counts and apply an appropriate transform for other modalities (see the sketch at the end of this list)
Growth Rate Estimation: Use proliferation/apoptosis signatures or cell cycle scoring;
iterative refinement improves accuracy
Hyperparameter Tuning: Entropy regularization ε (0.001-0.1) trades off between
exact OT and regularized smoothness; use grid search with interpolation validation
Ground Cost Selection: Pearson correlation for RNA-seq, Cosine for ATAC-seq/methylation;
geodesic distances on manifolds when available
Computational Efficiency: Use local PCA (30-50D) instead of full gene space;
GPU acceleration for Sinkhorn iterations; hierarchical coarse-graining for very large datasets
Validation: Geodesic interpolation for trajectory methods; Label Transfer Accuracy
and Alignment Score for integration; compare to simple baselines (mean, linear models)
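As a companion to the preprocessing and efficiency items above, the sketch below walks through one common pipeline on toy counts: library-size normalization, log1p transform, PCA to 30 dimensions for the transport cost, and per-cell probability normalization for gene-level ground costs. It uses NumPy and scikit-learn; on real AnnData objects the equivalent scanpy steps (normalize_total, log1p, pca) would typically be used instead.

```python
# Sketch of a common preprocessing path for OT analyses on toy counts:
# library-size normalization, log1p transform, PCA to 30 dimensions for the
# transport cost, and per-cell probability normalization for gene-level
# ground costs. Toy Poisson counts stand in for a real count matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
counts = rng.poisson(lam=1.0, size=(1000, 2000)).astype(float)   # cells x genes

# 1) Library-size normalization and log transform (scRNA-seq convention).
libsize = counts.sum(axis=1, keepdims=True)
norm = counts / np.clip(libsize, 1, None) * 1e4
logged = np.log1p(norm)

# 2) Reduced space for the transport cost (30-50 dimensions is typical).
embedding = PCA(n_components=30).fit_transform(logged)

# 3) Cells as probability distributions over genes, for gene-level ground costs.
probs = norm / norm.sum(axis=1, keepdims=True)

print(embedding.shape, probs.sum(axis=1)[:3])   # each cell's mass sums to 1
```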