Tokenization Strategies
Translating Biology into AI-Readable Language
Understanding how biological data is converted into tokens is fundamental to building effective foundation models. This guide explores the major tokenization strategies used in single-cell multi-omics, protein modeling, and DNA sequence analysis, using intuitive LEGO analogies to make these concepts accessible.
The Problem: Biology is Messy, AI Needs Order
The LEGO Analogy: Imagine a single cell as a giant bucket full of unsorted LEGO bricks.
Each type of brick (color/shape) represents a different Gene.
The number of bricks of that specific type represents its Expression Level (how active that gene is).
An AI model (like a Transformer) is a master builder, but it can't just grab a handful from the messy bucket. It needs the bricks sorted, labeled, and handed to it in a specific sequence. Tokenization is the process of organizing that messy bucket into a neat line of inputs the builder can use.
Depending on what we want the model to learn, we use different strategies to sort and present these "bricks".
Visual Guide to Core Tokenization Concepts
Before diving into the details, let's understand the core concepts visually using our LEGO analogy.
The Basics: Identity & Count
A token is a single unit of information for the model. In single-cell data, we usually need two pieces of information combined:
- What is it? (The Gene Identity, e.g., "TP53" or a red brick).
- How much? (The Expression Value, e.g., "50 counts" or a stack of 50 bricks).
The challenge is how to combine these two very different types of information into a single vector representing the token.
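One common fusion, sketched below, is to sum a learned gene-ID embedding with a small projection of the expression value. The class name, dimensions, and gene IDs here are illustrative, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class GeneValueToken(nn.Module):
    """Fuse gene identity and expression value into one token vector by
    summing a learned gene-ID embedding with a projected value embedding."""

    def __init__(self, n_genes: int, d_model: int = 64):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)   # "what is it?"
        self.value_proj = nn.Sequential(                 # "how much?"
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, gene_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # gene_ids: (cells, genes) int64; values: (cells, genes) float32
        return self.gene_emb(gene_ids) + self.value_proj(values.unsqueeze(-1))


tokens = GeneValueToken(n_genes=20_000)(
    torch.tensor([[11, 42, 7]]), torch.tensor([[50.0, 3.0, 0.0]])
)
print(tokens.shape)  # torch.Size([1, 3, 64]) -> one vector per gene token
```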
Strategy: Rank-Based Tokenization
The Problem: Sometimes we have "technical noise" (batch effects). One experiment might yield giant stacks of bricks (high sequencing depth) while another yields tiny stacks, even for the same biological cell type.
The Solution: Ignore the exact height. Just line them up from tallest to shortest. As long as the relative order is preserved (Red is taller than Blue), the resulting sequence of tokens is the same, making the model immune to batch differences in sequencing depth.
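A minimal sketch of rank-based tokenization, assuming a per-cell count vector and a gene-name list (both illustrative): genes are ordered by expression and only the IDs of the top-K survive, so scaling every count up or down leaves the token sequence unchanged.

```python
import numpy as np

def rank_tokenize(counts: np.ndarray, gene_names: list[str], top_k: int = 5) -> list[str]:
    """Order genes from highest to lowest expression and keep the top_k gene IDs.
    The exact counts are discarded; only the relative order survives."""
    order = np.argsort(-counts, kind="stable")       # tallest stack first
    expressed = order[counts[order] > 0][:top_k]     # drop zero-count genes
    return [gene_names[i] for i in expressed]

genes   = ["TP53", "GAPDH", "CD4", "ACTB", "MYC"]
shallow = np.array([5, 40, 0, 55, 2])      # low sequencing depth
deep    = np.array([50, 400, 0, 550, 20])  # same cell, 10x more reads
print(rank_tokenize(shallow, genes))  # ['ACTB', 'GAPDH', 'TP53', 'MYC']
print(rank_tokenize(deep, genes))     # identical token sequence
```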
Strategy: Expression Binning
The Problem: Standard language models work best with a fixed dictionary of words (categorical data). Actual expression counts span a huge numeric range (1, 2, 50, 1,000, ...) rather than a small fixed set of categories.
The Solution: We create buckets (bins) for different ranges of heights. Instead of saying "Height 45", we throw it into the "Medium Height Bucket". Now the token isn't a number; it's a category: [GeneID] + [MediumBin].
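A small sketch of per-cell quantile binning, one of several possible binning schemes; the bin count and edge choice here are illustrative assumptions.

```python
import numpy as np

def bin_expression(counts: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Map nonzero expression values to categorical bin IDs (1..n_bins)
    using per-cell quantile edges; zero counts stay in bin 0."""
    binned = np.zeros_like(counts, dtype=int)
    nonzero = counts > 0
    if nonzero.any():
        edges = np.quantile(counts[nonzero], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned[nonzero] = np.digitize(counts[nonzero], edges) + 1
    return binned

counts = np.array([0, 1, 3, 45, 1000])
print(bin_expression(counts))  # [0 1 2 4 5] -> each gene becomes [GeneID] + [bin token]
```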
Strategy: Read-Depth-Aware (RDA) Tokenization
The Problem: Different cells have different sequencing depths—some have 10,000 total counts, others only 1,000. This makes expression values hard to compare directly.
The Solution: Keep the continuous expression values (don't discretize!), but add special "depth tokens" that tell the model: "This cell was supposed to have T=10,000 counts, but we only sampled S=1,000." The model learns to mentally "scale up" the values during pretraining.
Key Benefit: Enables expression enhancement—the model can predict what gene expression would look like at higher sequencing depth, effectively denoising sparse data.
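A simplified sketch of the read-depth-aware input described above; the log scaling and the exact layout of the S/T depth tokens are illustrative assumptions rather than a specific model's recipe.

```python
import numpy as np

def rda_input(counts: np.ndarray, target_depth: float) -> dict:
    """Keep continuous (log-scaled) expression values and prepend two depth
    indicators: S = observed total counts, T = desired total counts.
    During pretraining T can be set higher than S so the model learns to
    'scale up' shallow profiles."""
    observed_depth = counts.sum()
    values = np.log1p(counts)  # continuous, not discretized into bins
    return {
        "depth_tokens": {"S": np.log1p(observed_depth), "T": np.log1p(target_depth)},
        "gene_values": values,
    }

cell = np.array([3, 0, 12, 1])                       # shallow profile (S = 16)
print(rda_input(cell, target_depth=10_000))          # ask the model to enhance to T = 10,000
```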
Strategy: Genome-Coordinate Tokenization
The Problem: In chromatin data (ATAC-seq), we don't have predefined "genes". We just have regions on the genome that are "open" (accessible). These regions change depending on the cell type.
The Solution: Imagine the genome as a giant LEGO baseplate ruler. We don't define the brick type; we define where the bricks are placed. The token isn't a name; it's a set of coordinates: Chromosome number, Start position on the ruler, and End position.
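A rough sketch of a coordinate-based token, assuming a one-hot chromosome encoding, transformer-style sinusoidal encodings of the start/end positions, and a single accessibility scalar; all dimensions are illustrative.

```python
import numpy as np

def sinusoidal(position: int, d: int = 16) -> np.ndarray:
    """Transformer-style sinusoidal encoding of a genomic coordinate."""
    i = np.arange(d // 2)
    angles = position / (10_000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def coordinate_token(chrom: int, start: int, end: int, accessibility: float) -> np.ndarray:
    """One scATAC-seq peak as a token: where it is (chromosome + start/end
    encodings) plus how open it is (a single accessibility scalar)."""
    chrom_onehot = np.eye(24)[chrom - 1]  # chr1..chr22, X, Y
    return np.concatenate(
        [chrom_onehot, sinusoidal(start), sinusoidal(end), [accessibility]]
    )

tok = coordinate_token(chrom=7, start=5_530_600, end=5_531_100, accessibility=3.0)
print(tok.shape)  # (57,) = 24 + 16 + 16 + 1
```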
Strategy: Cell2Sentence (C2S)
The Idea: Instead of stacking bricks to show expression level, lay them out in a horizontal line. High expression? Repeat that brick many times. Low expression? Just one or two repeats.
Why It Matters: This transforms cell data into text-like "sentences" that standard large language models (GPT, LLaMA) can understand. You can even ask natural language questions about cells!
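A toy sketch of the repetition-style encoding described above; the log-based repeat count and its cap are illustrative choices, not the published C2S recipe.

```python
import numpy as np

def cell_to_sentence(counts: np.ndarray, gene_names: list[str], max_repeats: int = 5) -> str:
    """Write a cell as a plain-text 'sentence': genes are listed from highest
    to lowest expression, and more highly expressed genes are repeated more
    often (capped at max_repeats)."""
    order = np.argsort(-counts)
    words = []
    for i in order:
        if counts[i] <= 0:
            break
        repeats = min(int(np.ceil(np.log2(counts[i] + 1))), max_repeats)
        words.extend([gene_names[i]] * repeats)
    return " ".join(words)

genes = ["CD3D", "GAPDH", "TP53"]
print(cell_to_sentence(np.array([40, 3, 0]), genes))
# "CD3D CD3D CD3D CD3D CD3D GAPDH GAPDH" -> readable by a standard LLM
```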
Strategy: Amino Acid Tokenization
The Idea: Proteins are chains of just 20 different building blocks (amino acids). Each amino acid is one token, like having 20 different LEGO brick types that snap together in a chain.
Why It Works: Just like language models learn word patterns ("the" often follows "in"), protein models learn amino acid patterns that determine protein function and structure.
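Amino acid tokenization is essentially a character-level lookup; here is a minimal sketch with an assumed set of special tokens.

```python
# The 20 standard amino acids, one token each, plus a few special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3}
VOCAB.update({aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)})

def tokenize_protein(sequence: str) -> list[int]:
    """Map each residue to its token ID, bracketed by <cls> and <eos>."""
    ids = [VOCAB.get(aa, VOCAB["<unk>"]) for aa in sequence.upper()]
    return [VOCAB["<cls>"], *ids, VOCAB["<eos>"]]

print(tokenize_protein("MKTAYIAK"))  # 8 residues -> 10 token IDs
```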
Strategy: Codon-Level Tokenization
The Idea: Instead of tokenizing single DNA letters (A, T, G, C), group them into triplets called codons. There are 64 possible triplet combinations (4^3).
Why It Matters: Multiple codons can code for the same amino acid (synonymous codons), but the choice affects translation speed and mRNA stability. "Silent" mutations can still cause disease!
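A minimal sketch of codon-level tokenization of a coding sequence, assuming the sequence starts in frame.

```python
def codon_tokenize(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping triplets (codons).
    The vocabulary holds at most 4**3 = 64 codon tokens."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(codon_tokenize("ATGGCTTGA"))  # ['ATG', 'GCT', 'TGA']
```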
Strategy: BPE + IUPAC Encoding for Diploid Genomes
The Problem: Reference genome models ignore individual genetic variation. How do you encode personalized genomes with heterozygous variants (where you inherited different alleles from each parent)?
The Solution: Use IUPAC ambiguity codes to represent heterozygous sites directly in the sequence (e.g., Y for C/T, R for A/G). Then apply Byte-Pair Encoding (BPE) to learn variable-length subword tokens that capture regulatory motifs.
Why It Works: This enables native diploid genome modeling without separate haplotype processing. The model learns the biological significance of both homozygous and heterozygous positions.
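A small sketch of the IUPAC step, assuming simple biallelic SNVs at known positions within a reference window; in practice a BPE tokenizer would then be trained on the resulting personalized sequences. The function name and variant format are illustrative.

```python
# IUPAC two-allele ambiguity codes used to mark heterozygous positions.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GC"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def personalize(reference: str, variants: dict[int, tuple[str, str]]) -> str:
    """Overlay SNVs onto a reference window: homozygous alternate alleles
    replace the base, heterozygous sites get the IUPAC code for both alleles."""
    seq = list(reference)
    for pos, (allele1, allele2) in variants.items():
        seq[pos] = allele1 if allele1 == allele2 else IUPAC[frozenset(allele1 + allele2)]
    return "".join(seq)

# position 2 is heterozygous C/T -> 'Y'; position 5 is homozygous G -> 'G'
print(personalize("AACGTAGG", {2: ("C", "T"), 5: ("G", "G")}))  # 'AAYGTGGG'
```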
Strategy: K-mer Tokenization
The Idea: Instead of reading DNA one letter at a time (A, T, G, C), slide a window of K letters along the sequence. Each window position becomes one token. For 6-mers, there are 4^6 = 4,096 possible tokens.
Why It Works: It's like reading words instead of individual letters. "ATGCAT" carries more biological meaning than six separate letters, capturing local motifs and regulatory elements.
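A minimal k-mer tokenizer sketch (overlapping windows with stride 1; both parameters are illustrative).

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Slide a window of k bases along the sequence; each window is one token.
    With k = 6 the vocabulary holds up to 4**6 = 4,096 distinct tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCATGC", k=6))
# ['ATGCAT', 'TGCATG', 'GCATGC']  (overlapping 6-mers, stride 1)
```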
Strategy: Macrogene Tokenization
The Idea: Instead of using individual genes as tokens, group genes from ALL species into "macrogenes" based on protein sequence similarity (via ESM2). This creates a universal vocabulary that works across species.
Why It Works: Genes with similar protein functions cluster together regardless of species, enabling cross-species integration WITHOUT requiring one-to-one homolog mappings. Human CD4 and mouse Cd4 end up in the same macrogene!
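A toy sketch of the clustering idea using scikit-learn's KMeans on made-up 2-D "protein embeddings"; real pipelines use ~5,120-dim ESM2 embeddings, and SATURN additionally learns gene-to-macrogene weights rather than relying on hard cluster assignments alone.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "protein embeddings" (illustrative stand-ins for ESM2 embeddings).
# Functionally similar genes from different species land close together.
genes = ["human_CD4", "mouse_Cd4", "human_TP53", "mouse_Trp53", "human_INS"]
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0],   # CD4 orthologs: nearly identical proteins
    [5.0, 5.1], [5.1, 5.0],   # TP53 orthologs
    [9.0, 0.0],               # insulin: no close partner in this toy set
])

# Cluster ALL genes from ALL species into macrogenes; the cluster ID is the token.
macrogene_id = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print(dict(zip(genes, macrogene_id)))
# human_CD4/mouse_Cd4 share one macrogene; human_TP53/mouse_Trp53 share another.
```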
Key Papers Implementing These Strategies
iSEEEK: Integration via Gene Rankings
Tokenization Strategy
The 126 most highly expressed genes per cell, ranked by expression level. Uses [CLS] and [SEP] tokens with an MLM objective. Vocabulary: 20,706 protein-coding genes.
Scale: 11.9M cells | Context: 128 tokens | Params: ~10M
scGPT: Multi-task Foundation Model
Tokenization Strategy
Gene tokens paired with binned expression values (51 bins). Special condition tokens for perturbation modeling. Context limited to the ~1,200 most highly variable genes.
Scale: 33M cells | Context: 1,200 | Params: ~100M
Nicheformer: Spatial-Aware Foundation Model
Tokenization Strategy
Rank-based with technology-specific mean normalization. Contextual tokens: <ORGANISM>, <ASSAY>, <MODALITY>. Cross-species gene mapping via orthologs.
Scale: 110M cells | Context: 1,500 | Params: 49.3M
ChromFound: scATAC-seq Foundation Model
Tokenization Strategy
Chromosome embedding + sinusoidal positional encoding of genomic coordinates (start/end). Linear accessibility embedding. Vocabulary-free approach for dynamic OCR (open chromatin region) landscapes.
Scale: 1.97M cells | Context: 440K OCRs | Params: 450K
ESM3: Multimodal Protein Language Model
Tokenization Strategy
Separate token tracks for sequence (amino acids), structure (discrete autoencoder), and function (keywords from InterPro/GO). All modalities fused in shared latent space with masked language modeling.
Scale: 2.78B proteins | Context: Sequence + 3D Structure | Params: 1.4B-98B
C2S-Scale: LLM-Scale Single-Cell Foundation
Tokenization Strategy
Cell2Sentence: expression encoded via token repetition (high expr = more repeats). GRPO refinement for biological task optimization. 8,192 token context.
Scale: 5.7M cells | Context: 8,192 | Params: 157M-27B
ProGen2: Protein Language Model Scaling
Tokenization Strategy
Standard amino acid tokenization with rotary positional encodings. Causal language modeling with next-token prediction. Context: 1,024-2,048 tokens.
Scale: UniRef90+BFD | Context: 2,048 AAs | Params: 151M-6.4B
VariantFormer: Personalized Gene Expression from Diploid Genomes
Tokenization Strategy
IUPAC ambiguity codes for heterozygous sites (R=A/G, Y=C/T, etc.) embedded into reference genome. BPE tokenizer (500 vocab) trained on cCREs. Hierarchical cross-attention between CRE (±1Mb) and gene body windows.
Scale: 2,330 donors, 50K genes | Context: >2Mb | Params: 1.2B
SATURN: Universal Cross-Species Embeddings
Tokenization Strategy
Genes clustered into ~2,000 macrogenes via k-means on ESM2 protein embeddings (5,120-dim). Gene-to-macrogene weights learned from protein similarity. Enables integration of species separated by ~350 million years of divergence.
Scale: 335K cells (3 species) | Context: ~2,000 macrogenes | Params: ~10M
Comparisons & Trade-offs
Tokenization Strategy Comparison
| Strategy | Data Type | Key Advantage | Limitation | Representative Model |
|---|---|---|---|---|
| Gene Rank-Based | scRNA-seq | Batch-insensitive, captures relative expression patterns | Loses absolute expression magnitude | iSEEEK, Geneformer |
| Expression Binning | scRNA-seq | Preserves expression magnitude, compatible with NLP architectures | Information loss from discretization | scGPT, scBERT |
| Genome-Coordinate | scATAC-seq | Vocabulary-free, handles novel regions | Requires reference genome alignment | ChromFound |
| K-mer Tokenization | DNA sequences | Captures local sequence patterns | Large vocabulary (4^k tokens) | Nucleotide Transformer |
| BPE + IUPAC | Diploid DNA + Variants | Native heterozygous encoding; personalized genome modeling | Requires phased VCF; expanded alphabet | VariantFormer |
| Amino Acid + Multimodal | Protein sequences + structure + function | Simple sequence tokens; multimodal enables structure/function reasoning | Ignores codon usage effects | ESM3, ProGen2 |
| Cell2Sentence | scRNA-seq | Compatible with standard LLMs, enables NL queries | Long sequences from repetition encoding | C2S-Scale |
| Macrogene | scRNA-seq (cross-species) | Enables cross-species integration without homologs via protein embeddings | Requires reference proteomes; loses gene-level resolution | SATURN |
Context Length vs Model Scale Trade-offs
Different tokenization strategies and architectures dictate the maximum sequence length (context window) a model can handle, which impacts the biological scope it can capture.
| Model | Tokenization Strategy | Context Length | Parameters |
|---|---|---|---|
| iSEEEK | Rank-Based (Top-K) | 128 tokens | ~10M |
| scGPT | Binning (High Variance Genes) | ~1,200 genes | ~100M |
| Nicheformer | Rank-Based (Top-K) | 1,500 tokens | 49.3M |
| AIDO.Cell | Auto-Discretization (Full Transcriptome) | 19,264 (full) | 650M |
| ChromFound | Genome-Coordinate (OCRs) | 440K OCRs (via Mamba) | 450K |
| ESM3 | Amino Acid + Structure + Function | Full protein (multimodal) | 98B |
| VariantFormer | BPE + IUPAC (Diploid) | >2 Mb (CRE ±1Mb + gene body) | 1.2B |
| C2S-Scale | Cell2Sentence (Repetition) | 8,192 tokens | 27B |
| SATURN | Macrogene (ESM2-based) | ~2,000 macrogenes | ~10M |
Choosing the Right Tokenization Strategy
| Use Case | Recommended Strategy | Why |
|---|---|---|
| Large-scale integration (>1M cells) across many labs | Gene Rank-Based | Naturally batch-insensitive; focuses on robust relative signals. |
| Perturbation modeling (predicting gene knockout effects) | Expression Binning | Preserves the absolute expression magnitude needed to model dosage changes. |
| Chromatin accessibility analysis (scATAC-seq) | Genome-Coordinate Tokenization | Handles dynamic open chromatin regions varying across cell types without a fixed vocabulary. |
| Protein fitness or structure prediction | Amino Acid Tokenization | Standard approach that effectively captures evolutionary constraints in protein sequences. |
| Personalized gene expression prediction from WGS | BPE + IUPAC | Encodes heterozygous variants natively; enables variant effect prediction from individual genomes. |
| Interacting with cell data using natural language | Cell2Sentence | Converts biological data into a format understood by standard Large Language Models. |