∫ Optimal Transport & Wasserstein Distance in Single-Cell Biology

Mathematical & Algorithmic Foundations for Cellular Distribution Analysis

A rigorous guide to optimal transport theory and Wasserstein metrics in computational biology. From mathematical foundations (Monge, Kantorovich, entropic regularization) to modern applications in trajectory inference, multi-omics integration, and distributional comparison. Covering landmark methods (Waddington-OT, moscot, SCOT) and theoretical advances (2019-2025).

12
Research Papers
6
Years of Development
1.7M+
Cells Processed
5
Application Domains

🎯 Core Concepts

What is Optimal Transport?

Optimal transport theory provides a mathematical framework for comparing probability distributions by computing the minimal "cost" of transforming one distribution into another. Originally formulated by Gaspard Monge in 1781 and generalized by Leonid Kantorovich in 1942, the theory asks: What is the most efficient way to move a pile of dirt from one location to match a desired shape elsewhere?

The Monge Problem: Moving Mass Optimally

Source μ Initial distribution Optimal Transport Plan min ∫ c(x,y) dπ(x,y) Target ν Target distribution Transport Cost W(μ,ν) = optimal cost • Distance moved • Mass transported • Efficiency metric

Intuition: Each mass point (circle) must be transported to the target distribution along optimal paths (arrows) that minimize total cost.

The Wasserstein Distance

The Wasserstein distance (also called Earth Mover's Distance) quantifies the dissimilarity between two probability distributions as the minimal cost to transport mass from one to the other. For single-cell biology:

Wasserstein Distance: Measuring Distribution Dissimilarity

Similar Distributions Small W-distance ✓ Similar cells/populations Different Distributions Large W-distance ✗ Dissimilar cells/populations

Key Insight: Wasserstein distance is proportional to transport path lengths—small distance for similar distributions, large for dissimilar ones.

Why Optimal Transport for Biology?

Single-Cell Biology Applications

Cell-Cell Similarity Cell A Cell B Gene Expression W(A,B) = cell distance Population Alignment Dataset 1 Coupling π Dataset 2 Match cell populations Trajectory Inference t₀ t₁ t₂ Temporal cell dynamics Cross-Modality Alignment (Gromov-Wasserstein) scRNA-seq Genes: GAPDH, TP53... Compare internal distances, not features directly scATAC-seq Peaks: chr1:1000... Entropic Regularization (Sinkhorn) Classical OT (ε=0) Sparse, O(n³) +entropy Entropic OT (ε>0) Smooth, O(n²log n) Key Advantages for Single-Cell Biology ✓ Handles unpaired data • ✓ Cross-modality integration • ✓ Preserves manifold geometry ✓ Probabilistic interpretations • ✓ Scalable algorithms • ✓ Natural for temporal dynamics

Summary: OT provides a unified mathematical framework for diverse single-cell problems: comparing cells, aligning populations, inferring trajectories, and integrating multi-omics data.

Key Challenges Addressed

📈 Major Application Domains

1. Trajectory Inference & Temporal Dynamics

Waddington-OT: Developmental Trajectory Inference

2019 Cell Temporal OT

Pioneering application of OT to trajectory inference from time-series scRNA-seq. Models cellular differentiation as probability distributions flowing through expression space, explicitly accounting for differential growth/death rates.

🔬 How Optimal Transport is Applied:

Unbalanced Kantorovich Formulation: Extends classical OT by adding growth/death terms through KL divergence penalties (λ₁=1, λ₂=50), allowing cell populations to change mass between timepoints.

Entropic Regularization (ε=0.05): Uses Sinkhorn algorithm to efficiently solve OT problems in 30D PCA space, trading exact optimality for computational tractability.

Temporal Coupling Maps: Computes transport couplings π(t,t+Δt) between consecutive timepoints using squared Euclidean distance as the cost function, representing probabilistic cell fate relationships.

Iterative Growth Rate Refinement: Initializes from proliferation/apoptosis gene signatures, then iteratively updates growth rates based on transport maps to ensure consistency with observed cell count changes.

  • 315,000 cells across 39 timepoints during iPSC reprogramming
  • Revealed diverse developmental programs: stromal, pluripotent, trophoblast, neural
  • Validated predictions: Obox6 TF and GDF9 cytokine enhance reprogramming 2-5 fold
  • Interpolation accuracy near batch-to-batch baseline variation

PRESCIENT: Generative Modeling of Cell Trajectories

2021 Nature Comm Generative OT

Models differentiation as diffusion process over potential landscape. Uses regularized Wasserstein loss to fit neural network-parameterized drift and growth terms, enabling in silico perturbation experiments.

🔬 How Optimal Transport is Applied:

Wasserstein-2 Loss for Neural Network Training: Minimizes W₂² distance between predicted and observed cell distributions using squared Euclidean cost in 50D PCA space.

Potential Landscape Parameterization: Neural networks learn drift F(x) = -∇U(x) + h(x) and growth g(x) terms that minimize the Wasserstein distance to empirical timepoints.

Proliferation Weighting: Each cell weighted by expected number of descendants at final time, computed via forward integration of learned growth function g(x).

Stochastic Sampling for Prediction: Generates cell trajectories via Euler-Maruyama simulation of SDE dx = F(x)dt + σdW, enabling in silico perturbation screens by modifying F or g.

  • Potential landscape framework connects to Waddington's epigenetic landscape
  • Incorporates cell proliferation by weighting based on expected descendants
  • Outperforms alternatives on fate bias prediction with lineage tracing validation
  • Enables large-scale combinatorial perturbation screens

TIGON: Growth and Dynamic Trajectories

2023 Nature Mach Intel WFR Distance

Uses Wasserstein-Fisher-Rao distance to simultaneously capture gene expression velocity and population growth. Neural ODEs solve high-dimensional OT efficiently with meshless formulation.

🔬 How Optimal Transport is Applied:

Wasserstein-Fisher-Rao (WFR) Metric: Combines transport (W₂) with growth/death (Fisher-Rao) using parameter α to balance contribution: d_WFR = √(W₂² + α·FR²).

Neural ODE Velocity Field: Parameterizes velocity v(x,t) with neural networks, solving continuity equation ∂ₜρ + ∇·(vρ) = g·ρ where g represents growth rate.

Dynamic Formulation (Benamou-Brenier): Minimizes kinetic energy ∫₀ᵀ ∫|v(x,t)|²ρ(x,t)dxdt subject to matching marginals at observed timepoints.

Meshless Particle-Based Solver: Discretizes distributions as weighted particles moving along learned velocity field, scaling to 100K+ cells without spatial discretization.

  • WFR distance separates transport (Wasserstein) from growth (Fisher-Rao)
  • Infers time-varying gene regulatory networks via velocity field gradients
  • Reconstructs未测时间点 expression and cell-cell communication dynamics
  • Preserves trajectories and pseudo-time with high fidelity

moscot: Unified Framework for Temporal-Spatial Mapping

2025 Nature Scalable OT

Scalable unified framework supporting temporal, spatial, and spatiotemporal OT with multimodal integration. Linear memory complexity enables atlas-scale analysis.

🔬 How Optimal Transport is Applied:

Modular Problem Composition: Constructs cost matrices C from multiple sources (expression, spatial coordinates, lineage, time) and combines via C_total = Σᵢ αᵢCᵢ for flexible problem specification.

Low-Rank Sinkhorn (ε=0.005-0.01): Factorizes coupling matrix as π ≈ diag(u)KL^T diag(v) with rank r=50-100, reducing memory from O(n²) to O(nr).

Quadratic Problems (Gromov-Wasserstein): For cross-modality (RNA→ATAC), minimizes ∑ᵢⱼₖₗ (C^X_ik - C^Y_jl)²πᵢⱼπₖₗ using entropic regularization with ε_quad=0.1.

Temporal & Spatial Extensions: Handles temporal couplings with growth models (unbalanced OT with τ=0.5-1.0) and spatial problems using Euclidean/geodesic costs with fused GW (α=0.5).

  • Processes 1.7M cells (mouse embryogenesis) where competitors fail
  • 10-100× faster with linear memory through online cost evaluation
  • Supports W-type, GW-type, FGW-type formulations in unified API
  • Validated NEUROD2 as epsilon cell regulator in human iPSC-islets

2. Multi-Omics Integration & Cross-Modality Alignment

SCOT: Gromov-Wasserstein for Multi-Omics Alignment

2020 ICML Workshop GW-OT

First application of Gromov-Wasserstein to single-cell multi-omics. Aligns datasets with unmatched features by comparing k-NN graph geodesic distances instead of feature values.

🔬 How Optimal Transport is Applied:

Gromov-Wasserstein (GW) for Feature Mismatch: Minimizes ∑ᵢⱼₖₗ |d_X(xᵢ,xₖ) - d_Y(yⱼ,yₗ)|²πᵢⱼπₖₗ where d_X and d_Y are intra-dataset distances, enabling cross-modality alignment (e.g., scRNA-seq ↔ scATAC-seq).

k-NN Graph Geodesic Distances: Constructs k=30 nearest neighbor graphs in each modality, computes shortest path distances as cost matrices C^X and C^Y to capture local manifold geometry.

Entropic GW Solver (ε=0.1): Uses projected gradient descent with Sinkhorn iterations for inner entropic OT subproblems, converging in 100-500 iterations.

Self-Tuning for Unbalanced Data: Employs KL divergence marginal penalties with λ=0.1 to handle different numbers of cells per dataset without strict mass conservation.

  • 15-50× faster than alternatives with only 2 hyperparameters vs 4-5
  • Lowest FOSCTTM on all 3 simulated datasets
  • Successfully aligns RNA-ATAC, RNA-methylation without common features
  • Probabilistic coupling matrix enables uncertainty-aware downstream analysis

Pamona: Partial Manifold Alignment

2022 Bioinformatics Partial GW

Extends GW-OT to partial alignment scenarios where datasets have distinct cell types. Virtual points absorb dataset-specific cells while aligning shared structures.

🔬 How Optimal Transport is Applied:

Partial Gromov-Wasserstein (PGW): Extends GW by adding virtual "dust bins" with mass s (0

SPL Curve for Mass Selection: Sweeps s from 0 to 1, plots SPL = ∑ᵢⱼ πᵢⱼ·similarity(i,j). Sharp drop indicates dataset-specific cells; plateau gives optimal shared mass.

Manifold-Preserving Costs: Uses diffusion distance (t=1, k=15 neighbors) to compute C^X and C^Y, capturing global manifold structure beyond local k-NN graphs.

Iterative Refinement with sc-GEM: Alternates between (1) partial GW alignment and (2) guided entry of gene modules to improve biological consistency of coupling.

  • First method to explicitly handle dataset-specific cell types
  • SPL curve estimates shared cell number automatically
  • Alignment Score 0.834-0.907 on partial alignment benchmarks
  • Identified DNMT3B as key regulator in iPSC reprogramming (sc-GEM)

Labeled GWOT: Cross-Modality Perturbation Prediction

2025 AISTATS Label-Aware OT

Incorporates perturbation labels as constraints in GW/COOT formulations. L-fold speedup by restricting coupling to label-compatible pairs, enabling accurate cross-modality prediction.

🔬 How Optimal Transport is Applied:

Label-Constrained Coupling Set: Defines C^l_{p,q} = {π ∈ C_{p,q} | πᵢⱼ > 0 ⟹ label(xᵢ)=label(yⱼ)}, restricting transport to same-perturbation cells.

Entropic Labeled GWOT (ε=0.05): Solves min_π ∑ᵢⱼₖₗ |C^X_ik - C^Y_jl|²πᵢⱼπₖₗ + εH(π) s.t. π ∈ C^l, achieving L-fold speedup where L = # labels.

Labeled COOT Variant: Extends Co-Optimal Transport by learning global feature transport G with per-label sample transport {πₗ}, combining linear and quadratic OT.

Predictor Training: Uses learned coupling π* to construct barycentric projection: ŷᵢ = ∑ⱼ (πᵢⱼ*/∑ₖπᵢₖ*)yⱼ, then trains regression model f: X→Y on aligned pairs.

  • Label-compatible coupling: T_ij > 0 only if labels match
  • O(nm/L) complexity per Sinkhorn iteration vs O(nm)
  • Predicts RNA from ATAC for 11 kinase inhibitors with dose-response
  • Enriches true protein-RNA pairs in coupling matrix

3. Cell Similarity & Clustering

OT-scOmics: Improved Cell-Cell Similarity

2022 Bioinformatics Sinkhorn Divergence

Uses debiased Sinkhorn divergence as cell-cell similarity metric. Outperforms Euclidean, Pearson, and Cosine across 13 datasets (scRNA-seq, scATAC-seq, methylation).

🔬 How Optimal Transport is Applied:

Entropic OT Cell Distance (ε=0.5): For cell pair (l,m), computes W̄_C,ε(aₗ,aₘ) where cells are normalized to probability distributions aᵢ = xᵢ/||xᵢ||₁ over 10K most variable features.

Debiased Sinkhorn Divergence: W̄(a,b) = W_ε(a,b) - [W_ε(a,a) + W_ε(b,b)]/2 ensures zero distance for identical cells and satisfies metric axioms.

Data-Driven Ground Cost: Computes gene-gene distance matrix C using Pearson correlation (RNA-seq/methylation) or Cosine similarity (ATAC-seq), capturing feature relationships.

GPU-Accelerated Sinkhorn: PyTorch implementation with automatic differentiation computes all pairwise distances for 10K cells × 10K features in minutes on single GPU.

  • Higher C-index and Silhouette scores than all baselines
  • Particularly strong for overlapping clusters and rare populations
  • GPU-accelerated PyTorch implementation for scalability
  • Generalizes across modalities without adaptation

4. Neural Optimal Transport & Deep Learning

CellOT: Neural Optimal Transport for Perturbations

2023 Nature Methods Neural OT

Uses Input Convex Neural Networks (ICNNs) to learn transport maps between control and perturbed distributions. Predicts single-cell responses from unpaired data.

🔬 How Optimal Transport is Applied:

Monge Map Parameterization: Learns transport map T_k: ρ_control → ρ_perturbed as T_k(x) = x + ∇g_θ(x) where g_θ is Input Convex Neural Network (ICNN).

Wasserstein-2 Training Objective: Minimizes W₂²(T_k#ρ_c, ρ_k) = 𝔼[||T_k(x) - y||²] over transported control cells and observed perturbed cells.

ICNN Architecture: Ensures g_θ is convex through non-negative weights between layers and convex activation functions, guaranteeing T_k is optimal transport map.

Unpaired Data Handling: No cell-cell correspondence needed; learns population-level transformation from separate control/treated measurements, enabling cross-patient generalization.

  • ICNNs ensure convexity for guaranteed optimality of learned maps
  • Handles unpaired data - no matched control/treated cells needed
  • Predicts individual cell trajectories, not just population means
  • Generalizes to holdout patients and cross-species transfer

GENOT: Entropic (Gromov)-Wasserstein Flow Matching

2024 NeurIPS Flow Matching

Learns conditional transport plans via flow matching on noise-to-target couplings. Supports linear, GW, and FGW formulations with unbalanced variants.

🔬 How Optimal Transport is Applied:

Conditional Flow Matching: Learns velocity field v_θ(x,t|x₀,x₁) that transports noise distribution to target via ODE dx/dt = v_θ(x,t), matching entropic OT couplings estimated with Sinkhorn.

Entropic Linear OT (GENOT-L): For trajectory inference, uses W_ε with geodesic cost on cell manifold, preserving biologically plausible paths (TSI metric improvement).

Unbalanced OT (U-GENOT, τ<1): For perturbation with cell death, relaxes mass conservation: marginal constraints become KL(π·1|p) ≤ τ·KL(p|p), allowing learned growth/death.

Fused GW (GENOT-F, α=0.5): For cross-modality (ATAC→RNA), combines feature cost (Euclidean) and structural cost (GW) as C_total = (1-α)C_lin + αC_quad, reducing GW non-uniqueness.

  • Stochastic trajectories capture biological variability
  • Arbitrary cost functions (geodesic, graph-based) for domain-specific priors
  • Unbalanced mass support for noisy/growing cell populations
  • 50% improvement in perturbation discrimination on Tahoe-100M

GRouNdGAN: GRN-Guided Simulation with Causal GANs

2024 Nature Comm Simulation

Causal GAN that generates realistic scRNA-seq data while imposing user-defined gene regulatory networks. Uses Wasserstein GAN framework for stable training.

🔬 How Optimal Transport is Applied:

Wasserstein GAN with Gradient Penalty (WGAN-GP): Trains causal controller to generate TF expression by minimizing Earth Mover's Distance approximated via Kantorovich-Rubinstein duality: W(p_real, p_gen) = sup_||f||_L≤1 𝔼[f(x_real)] - 𝔼[f(x_gen)].

Two-Stage Training: (1) Pre-train WGAN-GP causal controller on TF expressions independently of GRN; (2) Train target generators with architectural constraints per GRN topology.

Critic as Wasserstein Distance Estimator: Lipschitz-constrained critic network provides gradient signal from estimated W-distance between simulated and real data distributions.

Causality via Architecture: Each target gene generator receives only its regulating TFs (per GRN), ensuring causal edges are structurally imposed, not just learned correlations—validated via knockout perturbations.

  • Bridges simulation-experiment benchmark gap for GRN inference
  • Architectural constraints impose causal structure (not just correlations)
  • 66.5% of imposed TF-gene edges show significant knockout effects
  • Preserves cell trajectories, pseudo-time, and biological/technical noise

📚 Method Comparison

Method Year OT Type Primary Use Case Key Innovation Scale
Waddington-OT 2019 Unbalanced W Trajectory Inference Temporal coupling with growth modeling 315K cells, 39 timepoints
SCOT 2020 GW Multi-Omics Alignment k-NN graph geodesic distances ~1K cells
PRESCIENT 2021 Regularized W Generative Trajectories Potential landscape + perturbations ~100K cells
Pamona 2022 Partial GW Partial Alignment Virtual points for dataset-specific cells ~5K cells
OT-scOmics 2022 Sinkhorn Div Cell Similarity Superior clustering metric ~10K cells
CellOT 2023 Neural W Perturbation Prediction ICNN-parameterized transport maps ~50K cells
TIGON 2023 WFR Growth + Trajectories Separates transport from growth dynamics ~5K cells
GENOT 2024 Flow Match W/GW/FGW Stochastic Trajectories Conditional flows with arbitrary costs ~100K cells
GRouNdGAN 2024 WGAN GRN Simulation Causal GAN with regulatory constraints ~100K cells
Labeled GWOT 2025 Label-GW/COOT Cross-Modality Prediction Label constraints for L-fold speedup ~10K cells
moscot 2025 W/GW/FGW Atlas-Scale Mapping Linear memory, unified framework 1.7M cells

💡 Practical Implementation Guide

Common Pitfalls & Best Practices

Software Ecosystem

📖 Learning Resources

Theoretical Foundations

Tutorials & Code Examples

Benchmark Datasets