Tokenization Strategies

Translating Biology into AI-Readable Language

Understanding how biological data is converted into tokens is fundamental to building effective foundation models. This guide explores the major tokenization strategies used in single-cell multi-omics, protein modeling, and DNA sequence analysis, using intuitive LEGO analogies to make these concepts accessible.

The Problem: Biology is Messy, AI Needs Order

The LEGO Analogy: Imagine a single cell as a giant bucket full of unsorted LEGO bricks.
Each type of brick (color/shape) represents a different Gene.
The number of bricks of that specific type represents its Expression Level (how active that gene is).

An AI model (like a Transformer) is a master builder, but it can't just grab a handful from the messy bucket. It needs the bricks sorted, labeled, and handed to it in a specific sequence. Tokenization is the process of organizing that messy bucket into a neat line of inputs the builder can use.

Depending on what we want the model to learn, we use different strategies to sort and present these "bricks".

Visual Guide to Core Tokenization Concepts

Before diving into the details, let's understand the core concepts visually using our LEGO analogy.

[Figure: Gene Identity vs Expression Level. Colored bricks represent gene identities and stack heights represent expression counts; tokenization turns the raw bucket (Gene A: 50, Gene B: 5) into (gene, value) tokens.]

The Basics: Identity & Count

A token is a single unit of information for the model. In single-cell data, we usually need two pieces of information combined:

  1. What is it? (The Gene Identity, e.g., "TP53" or a red brick).
  2. How much? (The Expression Value, e.g., "50 counts" or a stack of 50 bricks).

The challenge is how to combine these two very different types of information into a single vector representing the token.
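As a concrete illustration, here is a minimal sketch (not any specific model's code) of one common way to fuse the two pieces: sum a learned gene-identity embedding with a learned expression-value embedding to produce a single token vector. The dimensions, random initialization, and summation are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: fuse "what" (gene identity) and "how much" (expression bin)
# by summing two learned embedding lookups into one token vector.
rng = np.random.default_rng(0)
n_genes, n_bins, dim = 20_000, 51, 64

gene_embedding = rng.normal(size=(n_genes, dim))   # one row per gene ID
value_embedding = rng.normal(size=(n_bins, dim))   # one row per expression level/bin

def token_vector(gene_id: int, expression_bin: int) -> np.ndarray:
    """Combine gene identity and expression level into a single token vector."""
    return gene_embedding[gene_id] + value_embedding[expression_bin]

vec = token_vector(gene_id=42, expression_bin=17)
print(vec.shape)  # (64,)
```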

[Figure: Rank-Based Tokenization. Two cells with different absolute counts but the same relative ordering (A > B > C) produce the identical token sequence Gene A -> Gene B -> Gene C.]

Strategy: Rank-Based Tokenization

The Problem: Sometimes we have "technical noise" (batch effects). One experiment might yield giant stacks of bricks (high sequencing depth), and another yields tiny stacks, even for the same biological cell type.

The Solution: Ignore the exact height. Just line them up from tallest to shortest. As long as the relative order is preserved (Red is taller than Blue), the resulting sequence of tokens is the same, making the model immune to batch differences in sequencing depth.

Used in: iSEEEK, Geneformer
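A minimal sketch of the idea (in the spirit of rank-based models like iSEEEK and Geneformer, not their actual code): only the relative ordering of genes is kept, so two cells with different sequencing depths but the same ordering yield the same token sequence.

```python
# Rank-based tokenization: keep gene order, discard absolute counts.
def rank_tokenize(counts: dict[str, float], top_k: int = 5) -> list[str]:
    """Return expressed gene names ordered from highest to lowest count."""
    expressed = {gene: c for gene, c in counts.items() if c > 0}
    return sorted(expressed, key=expressed.get, reverse=True)[:top_k]

deep_cell    = {"GeneA": 100, "GeneB": 60, "GeneC": 30, "GeneD": 0}
shallow_cell = {"GeneA": 10,  "GeneB": 5,  "GeneC": 2,  "GeneD": 0}

print(rank_tokenize(deep_cell))     # ['GeneA', 'GeneB', 'GeneC']
print(rank_tokenize(shallow_cell))  # ['GeneA', 'GeneB', 'GeneC']  (same tokens)
```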
[Figure: Expression Binning. Continuous counts (5, 45, 120) are sorted into LOW (1-20), MEDIUM (21-80), and HIGH (81+) buckets.]

Strategy: Expression Binning

The Problem: Standard language models work best with a fixed dictionary of words (categorical data). Actual expression counts are continuous numbers (1, 2, 50, 1000...).

The Solution: We create buckets (bins) for different ranges of heights. Instead of saying "Height 45", we throw it into the "Medium Height Bucket". Now, the token isn't a number, it's a category: [GeneID] + [MediumBin].

Used in: scGPT, scBERT
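Here is a minimal sketch of binning (scGPT-style in spirit only): continuous counts are discretized so each token becomes a (gene ID, bin ID) pair. The equal-frequency binning and the number of bins are illustrative choices, not the published settings.

```python
import numpy as np

def bin_expression(counts: np.ndarray, n_bins: int = 3) -> np.ndarray:
    """Assign each nonzero count to an equal-frequency bin (1..n_bins); zeros stay 0."""
    bins = np.zeros_like(counts, dtype=int)
    nonzero = counts > 0
    edges = np.quantile(counts[nonzero], np.linspace(0, 1, n_bins + 1))
    bins[nonzero] = np.clip(np.digitize(counts[nonzero], edges[1:-1]) + 1, 1, n_bins)
    return bins

counts = np.array([0, 5, 45, 120])
print(bin_expression(counts))  # [0 1 2 3]: unexpressed, LOW, MEDIUM, HIGH
```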
[Figure: Read-Depth-Aware Tokenization. Continuous expression values are paired with depth tokens (T = 10,000 original counts, S = 1,000 sampled counts), so the model learns to predict the deeply sequenced cell from the shallow one.]

Strategy: Read-Depth-Aware (RDA) Tokenization

The Problem: Different cells have different sequencing depths—some have 10,000 total counts, others only 1,000. This makes expression values hard to compare directly.

The Solution: Keep the continuous expression values (don't discretize!), but add special "depth tokens" that tell the model: "This cell was supposed to have T=10,000 counts, but we only sampled S=1,000." The model learns to mentally "scale up" the values during pretraining.

Key Benefit: Enables expression enhancement—the model can predict what gene expression would look like at higher sequencing depth, effectively denoising sparse data.

Used in: scFoundation, AIDO.Cell
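A minimal sketch of RDA-style inputs (in the spirit of scFoundation, not its actual code): the expression values stay continuous, and two depth indicators are attached, S for the observed total counts and T for the target total counts the model should "scale up" to. The log1p transform is an illustrative assumption.

```python
import numpy as np

def rda_inputs(counts: np.ndarray, target_depth: float) -> dict:
    """Pair continuous expression values with observed (S) and target (T) depth tokens."""
    observed_depth = float(counts.sum())
    return {
        "values": np.log1p(counts),           # continuous values, not binned
        "S_token": np.log1p(observed_depth),  # how deeply this cell was actually sequenced
        "T_token": np.log1p(target_depth),    # the depth the model should recover
    }

shallow_cell = np.array([10.0, 6.0, 2.0, 0.0])
print(rda_inputs(shallow_cell, target_depth=10_000))
```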
[Figure: Genome-Coordinate Tokenization. Open chromatin regions on a genome "ruler" become location-based tokens such as [Chr1, 100, 180] and [Chr1, 700, 760]; no fixed gene vocabulary is needed, and regions can vary by cell type.]

Strategy: Genome-Coordinate Tokenization

The Problem: In chromatin data (ATAC-seq), we don't have predefined "genes". We just have regions on the genome that are "open" (accessible). These regions change depending on the cell type.

The Solution: Imagine the genome as a giant LEGO baseplate ruler. We don't define the brick type; we define where the bricks are placed. The token isn't a name, it's a set of coordinates: Chromosome number, Start position on the ruler, and End position.

Used in: ChromFound
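A minimal sketch of coordinate-based tokens (ChromFound-like in spirit): each accessible region is described by its chromosome, start, end, and accessibility signal rather than by a name. The sinusoidal encoding below is a generic positional encoding for illustration, not the paper's exact scheme.

```python
import numpy as np

def positional_encoding(position: int, dim: int = 16) -> np.ndarray:
    """Generic sinusoidal encoding of a genomic coordinate."""
    i = np.arange(dim // 2)
    angles = position / (10_000 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def region_token(chrom: int, start: int, end: int, accessibility: float) -> np.ndarray:
    """Build a vocabulary-free token from (chromosome, start, end, accessibility)."""
    chrom_onehot = np.zeros(24)          # chr1..22, X, Y
    chrom_onehot[chrom - 1] = 1.0
    return np.concatenate([
        chrom_onehot,
        positional_encoding(start),
        positional_encoding(end),
        [accessibility],
    ])

tok = region_token(chrom=1, start=100, end=180, accessibility=3.2)
print(tok.shape)  # (57,)
```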
[Figure: Cell2Sentence. Stack height becomes token repetition: "GeneA GeneA GeneA GeneA GeneA GeneB GeneB ...", a sentence that standard LLMs (GPT, LLaMA) can process. C2S-Scale reaches 27B parameters with an 8,192-token context.]

Strategy: Cell2Sentence (C2S)

The Idea: Instead of stacking bricks to show expression level, lay them out in a horizontal line. High expression? Repeat that brick many times. Low expression? Just one or two repeats.

Why It Matters: This transforms cell data into text-like "sentences" that standard large language models (GPT, LLaMA) can understand. You can even ask natural language questions about cells!

Used in: C2S-Scale (27B params)
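A toy sketch of the repetition idea described above (illustrative only; the published Cell2Sentence pipeline has its own preprocessing): genes are laid out in order of expression and repeated according to how highly they are expressed, yielding a text "sentence".

```python
def cell_to_sentence(counts: dict[str, int], max_repeats: int = 5) -> str:
    """Turn a cell's expression profile into a gene-name sentence."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    words = []
    for gene, count in ordered:
        if count > 0:
            words.extend([gene] * min(count, max_repeats))  # higher expression = more repeats
    return " ".join(words)

cell = {"GeneA": 5, "GeneB": 2, "GeneC": 0}
print(cell_to_sentence(cell))
# GeneA GeneA GeneA GeneA GeneA GeneB GeneB
```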
[Figure: Amino Acid Tokenization. A vocabulary of 20 amino acid "bricks" chained into a protein sequence, one token per residue; the model learns which bricks tend to appear together, like grammar in language.]

Strategy: Amino Acid Tokenization

The Idea: Proteins are chains of just 20 different building blocks (amino acids). Each amino acid is one token, like having 20 different LEGO brick types that snap together in a chain.

Why It Works: Just like language models learn word patterns ("the" often follows "in"), protein models learn amino acid patterns that determine protein function and structure.

Used in: ESM3, ProGen2
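A minimal sketch of amino acid tokenization: one token per residue over a 20-letter vocabulary (real models add special tokens such as padding and mask, which are omitted here).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_protein(sequence: str) -> list[int]:
    """Map each amino acid letter to its integer token ID."""
    return [AA_TO_ID[aa] for aa in sequence.upper()]

print(tokenize_protein("MAVLK"))  # [10, 0, 17, 9, 8]
```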
[Figure: Codon-Level Tokenization. Single nucleotides (A, T, G, C) are grouped into triplets, giving 4^3 = 64 codon tokens; synonymous codons produce the same amino acid (e.g., Leucine) but differ in translation speed, so codon-level models capture translation efficiency and mRNA stability.]

Strategy: Codon-Level Tokenization

The Idea: Instead of tokenizing single DNA letters (A, T, G, C), group them into triplets called codons. There are 64 possible triplet combinations (4^3).

Why It Matters: Multiple codons can code for the same amino acid (synonymous codons), but the choice affects translation speed and mRNA stability. "Silent" mutations can still cause disease!

Used in: CodonFM, EnCodon
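A minimal sketch of codon-level tokenization: read the coding sequence in non-overlapping triplets, giving a fixed vocabulary of 4^3 = 64 codon tokens.

```python
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons
CODON_TO_ID = {codon: i for i, codon in enumerate(CODONS)}

def tokenize_codons(cds: str) -> list[int]:
    """Split a coding sequence into codon token IDs (length must be a multiple of 3)."""
    assert len(cds) % 3 == 0, "coding sequence length must be divisible by 3"
    return [CODON_TO_ID[cds[i:i + 3]] for i in range(0, len(cds), 3)]

print(len(CODONS))                   # 64
print(tokenize_codons("ATGCTTCTG"))  # ATG, CTT, CTG -> three codon tokens
```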
[Figure: BPE + IUPAC Encoding for Personalized Diploid Genomes. Heterozygous variants are written directly into the reference sequence as IUPAC codes (Y = C/T, R = A/G), and BPE then learns variable-length subword tokens (500-token vocabulary) over the personalized sequence.]

Strategy: BPE + IUPAC Encoding for Diploid Genomes

The Problem: Reference genome models ignore individual genetic variation. How do you encode personalized genomes with heterozygous variants (where you inherited different alleles from each parent)?

The Solution: Use IUPAC ambiguity codes to represent heterozygous sites directly in the sequence (e.g., Y for C/T, R for A/G). Then apply Byte-Pair Encoding (BPE) to learn variable-length subword tokens that capture regulatory motifs.

Why It Works: This enables native diploid genome modeling without separate haplotype processing. The model learns the biological significance of both homozygous and heterozygous positions.

Used in: VariantFormer
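A minimal sketch of the IUPAC step (illustrative; the positions and alleles below are made up, and a real pipeline would read phased VCFs). BPE would then be trained on the personalized sequences, for example with a standard subword tokenizer library; that step is not shown here.

```python
# Map an unordered pair of alleles to its IUPAC ambiguity code.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def personalize(reference: str, het_sites: dict[int, tuple[str, str]]) -> str:
    """Replace heterozygous positions in the reference with IUPAC ambiguity codes."""
    seq = list(reference)
    for pos, (allele1, allele2) in het_sites.items():
        seq[pos] = IUPAC[frozenset({allele1, allele2})] if allele1 != allele2 else allele1
    return "".join(seq)

reference = "ATGCATG"
het_sites = {3: ("C", "T"), 6: ("A", "G")}   # hypothetical heterozygous SNPs
print(personalize(reference, het_sites))     # ATGYATR
```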
[Figure: K-mer Tokenization. A sliding window of 6 nucleotides moves along the DNA sequence, turning each window position into one compound token drawn from 4^6 = 4,096 possibilities; overlapping k-mers capture local context that single nucleotides miss.]

Strategy: K-mer Tokenization

The Idea: Instead of reading DNA one letter at a time (A, T, G, C), slide a window of K letters along the sequence. Each window position becomes one token. For 6-mers, there are 4^6 = 4,096 possible tokens.

Why It Works: It's like reading words instead of individual letters. "ATGCAT" carries more biological meaning than six separate letters, capturing local motifs and regulatory elements.

Used in: Nucleotide Transformer, DNABERT
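A minimal sketch of overlapping k-mer tokenization: slide a window of k nucleotides with stride 1, so each position contributes one compound token from a 4^k vocabulary.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Extract overlapping k-mers with a stride of 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ATGCATGC"))
# ['ATGCAT', 'TGCATG', 'GCATGC']
```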
[Figure: Macrogene Tokenization. Human, mouse, and frog genes (e.g., CD4/Cd4/cd4 and MYOC/Myoc/myoc) are encoded with the ESM2 protein language model (5120-dim embeddings) and grouped by k-means into shared macrogenes, enabling cross-species integration without homolog mappings, even for species hundreds of millions of years apart.]

Strategy: Macrogene Tokenization

The Idea: Instead of using individual genes as tokens, group genes from ALL species into "macrogenes" based on protein sequence similarity (via ESM2). This creates a universal vocabulary that works across species.

Why It Works: Genes with similar protein functions cluster together regardless of species, enabling cross-species integration WITHOUT requiring one-to-one homolog mappings. Human CD4 and mouse Cd4 end up in the same macrogene!

Used in: SATURN
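A minimal sketch of macrogene construction (SATURN-like in spirit, not its code): random vectors stand in for ESM2 protein embeddings, and k-means groups genes from all species into shared clusters. SATURN's real pipeline uses 5120-dim ESM2 embeddings and learned gene-to-macrogene weights.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
genes = ["human:CD4", "mouse:Cd4", "frog:cd4", "human:MYOC", "mouse:Myoc", "frog:myoc"]

# Stand-in "protein embeddings": CD4-like and MYOC-like proteins form two groups.
base_cd4, base_myoc = rng.normal(size=64), rng.normal(size=64)
protein_embeddings = np.stack(
    [base_cd4 + 0.05 * rng.normal(size=64) for _ in range(3)]
    + [base_myoc + 0.05 * rng.normal(size=64) for _ in range(3)]
)

# Cluster genes from all species into macrogenes.
macrogene_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(protein_embeddings)
for gene, mg in zip(genes, macrogene_ids):
    print(gene, "-> macrogene", mg)
```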

Key Papers Implementing These Strategies

iSEEEK: Integration via Gene Rankings

Shen et al. | Briefings in Bioinformatics 2022
scRNA-seq | Transformer | Rank-Based
Pioneered gene rank-based tokenization treating single-cell transcriptomes as "sentences" of ranked genes. Processes 11.9M cells without explicit batch correction by focusing on relative gene rankings.
Tokenization Strategy

Top 126 expressing genes per cell, ranked by expression level. Uses [CLS] and [SEP] tokens with MLM objective. Vocabulary: 20,706 protein-coding genes.

Scale: 11.9M cells | Context: 128 tokens | Params: ~10M

scGPT: Multi-task Foundation Model

Cui et al. | Nature Methods 2024
scRNA-seq | Transformer | Binning + Special Tokens
Combines gene tokens with expression value tokens in a generative framework. Supports multiple tasks including batch correction, perturbation prediction, and multi-omics integration.
Tokenization Strategy

Gene tokens paired with binned expression values (51 bins). Special condition tokens for perturbation modeling. Context limited to ~1,200 most variable genes.

Scale: 33M cells | Context: 1,200 | Params: ~100M

Nicheformer: Spatial-Aware Foundation Model

Tejada-Lapuerta et al. | Nature Methods 2025
scRNA-seq + Spatial | Transformer | Contextual Tokens
First foundation model jointly learning from dissociated (57M cells) and spatial (53M cells) transcriptomics. Technology-specific normalization handles modality differences.
Tokenization Strategy

Rank-based with technology-specific mean normalization. Contextual tokens: <ORGANISM>, <ASSAY>, <MODALITY>. Cross-species gene mapping via orthologs.

Scale: 110M cells | Context: 1,500 | Params: 49.3M

ChromFound: scATAC-seq Foundation Model

Jiao et al. | NeurIPS 2025
scATAC-seq | Mamba-Transformer | Genome-Coordinate
First foundation model for chromatin accessibility. Uses genome-coordinate tokenization instead of fixed vocabularies, enabling representation of tissue-specific open chromatin regions (OCRs).
Tokenization Strategy

Chromosome embedding + sinusoidal positional encoding of genomic coordinates (start/end). Linear accessibility embedding. Vocabulary-free approach for dynamic OCR landscapes.

Scale: 1.97M cells | Context: 440K OCRs | Params: 450K

ESM3: Multimodal Protein Language Model

Hayes et al. | Science 2025
Protein Seq + Structure + Function | Masked Transformer | Multimodal Tokens
Frontier multimodal generative model reasoning over sequence, structure, and function. Generated esmGFP, a novel fluorescent protein separated from its closest known relatives by an estimated 500 million years of evolution.
Tokenization Strategy

Separate token tracks for sequence (amino acids), structure (discrete autoencoder), and function (keywords from InterPro/GO). All modalities fused in shared latent space with masked language modeling.

Scale: 2.78B proteins | Context: Sequence + 3D Structure | Params: 1.4B-98B

C2S-Scale: LLM-Scale Single-Cell Foundation

Rizvi et al. | bioRxiv 2025
scRNA-seq | Decoder-only LLM | Cell2Sentence
Scales single-cell foundation models to 27B parameters using Cell2Sentence representation. Enables natural language question answering and experimentally validated drug discovery.
Tokenization Strategy

Cell2Sentence: expression encoded via token repetition (high expr = more repeats). GRPO refinement for biological task optimization. 8,192 token context.

Scale: 5.7M cells | Context: 8,192 | Params: 157M-27B

ProGen2: Protein Language Model Scaling

Nijkamp et al. | Cell Systems 2023
Protein Sequences | Autoregressive Transformer | Amino Acid
Scales protein language models to 6.4B parameters, demonstrating that data distribution matters more than model size for fitness prediction. Generated sequences adopt natural folds.
Tokenization Strategy

Standard amino acid tokenization with rotary positional encodings. Causal language modeling with next-token prediction. Context: 1,024-2,048 tokens.

Scale: UniRef90+BFD | Context: 2,048 AAs | Params: 151M-6.4B

VariantFormer: Personalized Gene Expression from Diploid Genomes

Ghosal et al. | bioRxiv 2025
DNA + Variants | Hierarchical Transformer | BPE + IUPAC
First foundation model to predict tissue-specific gene expression from personalized diploid genomes by integrating individual genetic variants directly into DNA sequences using IUPAC ambiguity codes.
Tokenization Strategy

IUPAC ambiguity codes for heterozygous sites (R=A/G, Y=C/T, etc.) embedded into reference genome. BPE tokenizer (500 vocab) trained on cCREs. Hierarchical cross-attention between CRE (±1Mb) and gene body windows.

Scale: 2,330 donors, 50K genes | Context: >2Mb | Params: 1.2B

SATURN: Universal Cross-Species Embeddings

Rosen et al. | Nature Methods 2024
scRNA-seq (Multi-Species) | Autoencoder + Metric Learning | Macrogene
Universal cell embeddings via protein language models (ESM2) coupled with RNA expression. Enables cross-species integration WITHOUT one-to-one gene homologs by grouping genes into macrogenes.
Tokenization Strategy

Genes clustered into ~2000 macrogenes via k-means on ESM2 protein embeddings (5120-dim). Gene-to-macrogene weights learned from protein similarity. Enables 350M-year divergent species integration.

Scale: 335K cells (3 species) | Context: ~2,000 macrogenes | Params: ~10M

Comparisons & Trade-offs

Tokenization Strategy Comparison

Strategy | Data Type | Key Advantage | Limitation | Representative Models
Gene Rank-Based | scRNA-seq | Batch-insensitive, captures relative expression patterns | Loses absolute expression magnitude | iSEEEK, Geneformer
Expression Binning | scRNA-seq | Preserves expression magnitude, compatible with NLP architectures | Information loss from discretization | scGPT, scBERT
Genome-Coordinate | scATAC-seq | Vocabulary-free, handles novel regions | Requires reference genome alignment | ChromFound
K-mer Tokenization | DNA sequences | Captures local sequence patterns | Large vocabulary (4^k tokens) | Nucleotide Transformer
BPE + IUPAC | Diploid DNA + variants | Native heterozygous encoding; personalized genome modeling | Requires phased VCF; expanded alphabet | VariantFormer
Amino Acid + Multimodal | Protein sequence + structure + function | Simple sequence tokens; multimodality enables structure/function reasoning | Ignores codon usage effects | ESM3, ProGen2
Cell2Sentence | scRNA-seq | Compatible with standard LLMs, enables NL queries | Long sequences from repetition encoding | C2S-Scale
Macrogene | scRNA-seq (cross-species) | Enables cross-species integration without homologs via protein embeddings | Requires reference proteomes; loses gene-level resolution | SATURN

Context Length vs Model Scale Trade-offs

Different tokenization strategies and architectures dictate the maximum sequence length (context window) a model can handle, which impacts the biological scope it can capture.

Model | Tokenization Strategy | Context Length | Parameters
iSEEEK | Rank-Based (Top-K) | 128 tokens | ~10M
scGPT | Binning (high-variance genes) | ~1,200 genes | ~100M
Nicheformer | Rank-Based (Top-K) | 1,500 tokens | 49.3M
AIDO.Cell | Auto-Discretization (full transcriptome) | 19,264 genes (full) | 650M
ChromFound | Genome-Coordinate (OCRs) | 440K OCRs (via Mamba) | 450K
ESM3 | Amino Acid + Structure + Function | Full protein (multimodal) | 98B
VariantFormer | BPE + IUPAC (diploid) | >2 Mb (CRE ±1Mb + gene body) | 1.2B
C2S-Scale | Cell2Sentence (repetition) | 8,192 tokens | 27B
SATURN | Macrogene (ESM2-based) | ~2,000 macrogenes | ~10M

Choosing the Right Tokenization Strategy

Use Case | Recommended Strategy | Why
Large-scale integration (>1M cells) across many labs | Gene Rank-Based | Naturally batch-insensitive; focuses on robust relative signals.
Perturbation modeling (predicting gene knockout effects) | Expression Binning | Preserves the absolute expression magnitude needed to model dosage changes.
Chromatin accessibility analysis (scATAC-seq) | Genome-Coordinate | Handles dynamic open chromatin regions that vary across cell types without a fixed vocabulary.
Protein fitness or structure prediction | Amino Acid Tokenization | Standard approach that effectively captures evolutionary constraints in protein sequences.
Personalized gene expression prediction from WGS | BPE + IUPAC | Encodes heterozygous variants natively; enables variant effect prediction from individual genomes.
Interacting with cell data using natural language | Cell2Sentence | Converts biological data into a format understood by standard large language models.