🧬 TranscriptFormer Dual Decoder Heads

Understanding how the model jointly predicts genes and their expression counts

Core architecture component for generative single-cell modeling

Architecture overview (diagram): Transformer Encoder → contextual state z_j^(L) → two parallel heads: the Gene Decoder (categorical over the gene vocabulary) and the Count Decoder (zero-truncated Poisson over expression counts).
🎯 Gene Decoder Head (Categorical)

Purpose: Predicts which gene to select next in the cell sentence

ω_j = softmax(MLP_ω(z_j^(L)))

Key Properties:

  • Output: Probability distribution over entire gene vocabulary (~25K-247K genes)
  • Architecture: Two-layer MLP + softmax normalization
  • Loss: Standard categorical cross-entropy
  • Context-aware: Depends on all previously selected genes

Example Output:

GAPDH: 0.23 · ACTB: 0.18 · TP53: 0.12 · all others: 0.47
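A minimal PyTorch sketch of such a head, assuming illustrative sizes (d_model = 512, a 25K-gene vocabulary); the layer names and dimensions are placeholders, not TranscriptFormer's actual implementation:

```python
import torch
import torch.nn as nn

class GeneDecoderHead(nn.Module):
    """Two-layer MLP producing logits over the gene vocabulary."""

    def __init__(self, d_model: int = 512, vocab_size: int = 25_000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, vocab_size),  # one logit per gene
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq, d_model) contextual states z_j^(L) from the encoder
        return self.mlp(z)  # softmax is folded into the cross-entropy loss

gene_head = GeneDecoderHead()
z = torch.randn(2, 16, 512)                   # dummy encoder output
targets = torch.randint(0, 25_000, (2, 16))   # observed next-gene indices
loss_gene = nn.functional.cross_entropy(
    gene_head(z).flatten(0, 1), targets.flatten()
)
```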

📊 Count Decoder Head (Zero-Truncated Poisson)

Purpose: Predicts expression level (count) for the selected gene

c_j | g_j, context ∼ ZTP(λ_j)

Key Properties:

  • Output: A positive rate λ_j; sampled counts are always ≥ 1
  • Architecture: MLP with normalization to total count
  • Loss: Zero-truncated Poisson negative log-likelihood
  • Constraint: Counts sum to observed total transcripts in cell

Example Output:

GAPDH: 156 · ACTB: 89 · TP53: 23
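A companion sketch of the count head under the same illustrative assumptions; softplus is one common way to keep the rate positive, though the actual parameterization may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CountDecoderHead(nn.Module):
    """MLP predicting a strictly positive Poisson rate per position."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # softplus keeps lambda_j > 0; the normalization of rates to the
        # cell's observed total (noted above) is omitted in this sketch
        return F.softplus(self.mlp(z)).squeeze(-1)

count_head = CountDecoderHead()
z = torch.randn(2, 16, 512)
lam = count_head(z)                           # (batch, seq) rates
expected = lam / (1.0 - torch.exp(-lam))      # ZTP mean, always >= 1
```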

🔗 Joint Training & Coupling

Why Both Heads Are Essential:

L = L_gene + L_count
  • Sequential Dependency: Count decoder uses gene identity from gene decoder
  • Biological Realism: Models the fact that different genes have different typical expression levels
  • Generative Capability: Can sample both gene identity and count jointly
  • Context Sensitivity: Both decoders condition on cell context

Innovation: Unlike discriminative models that only classify, this architecture can generate realistic cell profiles by sampling from both distributions sequentially.
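A sketch of the joint objective, reusing the illustrative gene_head and count_head modules from the sketches above; the zero-truncated Poisson NLL follows the pmf given in the "Why Zero-Truncated Poisson?" section below:

```python
import torch
import torch.nn.functional as F

def ztp_nll(lam, k):
    """Negative log-likelihood of counts k >= 1 under ZTP(lam).

    -log P(k | lam) = lam - k*log(lam) + log(k!) + log(1 - exp(-lam))
    """
    return lam - k * torch.log(lam) + torch.lgamma(k + 1.0) + torch.log(-torch.expm1(-lam))

# Dummy batch: states, next-gene targets, and their observed counts
z = torch.randn(2, 16, 512)                       # encoder states z_j^(L)
gene_ids = torch.randint(0, 25_000, (2, 16))      # observed next genes
counts = torch.randint(1, 200, (2, 16)).float()   # observed counts (>= 1)

loss_gene = F.cross_entropy(gene_head(z).flatten(0, 1), gene_ids.flatten())
loss_count = ztp_nll(count_head(z), counts).mean()
loss = loss_gene + loss_count                     # L = L_gene + L_count
```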

🎮 Generation Demo

How the dual decoders work together to generate a cell sentence: starting from a 🧬 [START] token, the model alternately samples the next gene from the categorical head and its count from the zero-truncated Poisson head.
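A hedged sketch of this loop, assuming hypothetical encoder, gene_head, count_head, and START_ID names (the actual model's decoding procedure may differ). The ZTP draw uses simple rejection sampling:

```python
import torch

@torch.no_grad()
def generate_cell_sentence(encoder, gene_head, count_head, max_genes=64):
    """Alternate between sampling a gene and sampling its count."""
    tokens, counts = [START_ID], []           # START_ID: hypothetical start token
    for _ in range(max_genes):
        z = encoder(torch.tensor([tokens]))   # contextual states for the prefix
        probs = torch.softmax(gene_head(z)[0, -1], dim=-1)
        gene = torch.multinomial(probs, 1).item()
        lam = count_head(z)[0, -1]
        count = 0
        while count == 0:                     # rejection-sample the ZTP:
            count = int(torch.poisson(lam))   # redraw Poisson samples equal to 0
        tokens.append(gene)
        counts.append(count)
    return tokens[1:], counts
```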

🔬 Why Zero-Truncated Poisson?

Biological Motivation:

  • Gene expression counts are discrete, non-negative integers
  • Poisson distribution naturally models count data
  • Zero-truncation ensures only expressed genes appear in sequences
  • Context-dependent rates λ_j let the marginal count distribution capture the overdispersion common in single-cell data

Mathematical Form:

P(c = k | λ) = λ^k e^(−λ) / (k! · (1 − e^(−λ))),   k ≥ 1
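To make the formula concrete, a small self-contained check that the truncated pmf normalizes to one and that its mean equals λ / (1 − e^(−λ)):

```python
import math

def ztp_pmf(k: int, lam: float) -> float:
    """P(c = k | lam) for k >= 1 under a zero-truncated Poisson."""
    return lam**k * math.exp(-lam) / (math.factorial(k) * (1.0 - math.exp(-lam)))

lam = 2.5
support = range(1, 100)
print(sum(ztp_pmf(k, lam) for k in support))        # ~1.0: zero is excluded
print(sum(k * ztp_pmf(k, lam) for k in support))    # mean under the pmf
print(lam / (1.0 - math.exp(-lam)))                 # closed-form mean, matches
```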

⚡ Computational Advantages

Efficiency Benefits:

  • Shared Encoder: Single transformer processes both tasks
  • Parallel Computation: Both losses computed simultaneously
  • Parameter Sharing: Contextualized representations used by both heads
  • End-to-End Training: Joint optimization improves both tasks

vs. Separate Models: one shared encoder serves both heads, roughly halving the parameters and compute that two independently trained gene and count predictors would require