Tokenization Strategies
Translating Biology into AI-Readable Language
Understanding how biological data is converted into tokens is fundamental to building effective foundation models. This guide explores the major tokenization strategies used in single-cell multi-omics, protein modeling, and DNA sequence analysis, using intuitive LEGO analogies to make these concepts accessible.
The Problem: Biology is Messy, AI Needs Order
The LEGO Analogy: Imagine a single cell as a giant bucket full of unsorted LEGO bricks.
Each type of brick (color/shape) represents a different Gene.
The number of bricks of that specific type represents its Expression Level (how active that gene is).
An AI model (like a Transformer) is a master builder, but it can't just grab a handful from the messy bucket. It needs the bricks sorted, labeled, and handed to it in a specific sequence. Tokenization is the process of organizing that messy bucket into a neat line of inputs the builder can use.
Depending on what we want the model to learn, we use different strategies to sort and present these "bricks".
Visual Guide to Core Tokenization Concepts
Before diving into the details, let's understand the core concepts visually using our LEGO analogy.
The Basics: Identity & Count
A token is a single unit of information for the model. In single-cell data, we usually need two pieces of information combined:
- What is it? (The Gene Identity, e.g., "TP53" or a red brick).
- How much? (The Expression Value, e.g., "50 counts" or a stack of 50 bricks).
The challenge is how to combine these two very different types of information into a single vector representing the token.
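One common fusion, sketched below, is to sum a learned gene-ID embedding with a small projection of the expression value. The class name, dimensions, and gene IDs here are illustrative, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class GeneValueToken(nn.Module):
    """Fuse gene identity and expression value into one token vector by
    summing a learned gene-ID embedding with a projected value embedding."""

    def __init__(self, n_genes: int, d_model: int = 64):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)   # "what is it?"
        self.value_proj = nn.Sequential(                 # "how much?"
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, gene_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # gene_ids: (cells, genes) int64; values: (cells, genes) float32
        return self.gene_emb(gene_ids) + self.value_proj(values.unsqueeze(-1))


tokens = GeneValueToken(n_genes=20_000)(
    torch.tensor([[11, 42, 7]]), torch.tensor([[50.0, 3.0, 0.0]])
)
print(tokens.shape)  # torch.Size([1, 3, 64]) -> one vector per gene token
```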
Strategy: Rank-Based Tokenization
The Problem: Sometimes we have "technical noise" (batch effects). One experiment might yield giant stacks of bricks (high sequencing depth) while another yields tiny stacks, even for the same biological cell type.
The Solution: Ignore the exact height. Just line them up from tallest to shortest. As long as the relative order is preserved (Red is taller than Blue), the resulting sequence of tokens is the same, making the model immune to batch differences in sequencing depth.
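A minimal sketch of rank-based tokenization, assuming a per-cell count vector and a gene-name list (both illustrative): genes are ordered by expression and only the IDs of the top-K survive, so scaling every count up or down leaves the token sequence unchanged.

```python
import numpy as np

def rank_tokenize(counts: np.ndarray, gene_names: list[str], top_k: int = 5) -> list[str]:
    """Order genes from highest to lowest expression and keep the top_k gene IDs.
    The exact counts are discarded; only the relative order survives."""
    order = np.argsort(-counts, kind="stable")       # tallest stack first
    expressed = order[counts[order] > 0][:top_k]     # drop zero-count genes
    return [gene_names[i] for i in expressed]

genes   = ["TP53", "GAPDH", "CD4", "ACTB", "MYC"]
shallow = np.array([5, 40, 0, 55, 2])      # low sequencing depth
deep    = np.array([50, 400, 0, 550, 20])  # same cell, 10x more reads
print(rank_tokenize(shallow, genes))  # ['ACTB', 'GAPDH', 'TP53', 'MYC']
print(rank_tokenize(deep, genes))     # identical token sequence
```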
Strategy: Expression Binning
The Problem: Standard language models work best with a fixed dictionary of words (categorical data). Actual expression counts span a huge numeric range (1, 2, 50, 1,000, ...) rather than a small fixed set of categories.
The Solution: We create buckets (bins) for different ranges of heights. Instead of saying "Height 45", we throw it into the "Medium Height Bucket". Now the token isn't a number; it's a category: [GeneID] + [MediumBin].
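A small sketch of per-cell quantile binning, one of several possible binning schemes; the bin count and edge choice here are illustrative assumptions.

```python
import numpy as np

def bin_expression(counts: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Map nonzero expression values to categorical bin IDs (1..n_bins)
    using per-cell quantile edges; zero counts stay in bin 0."""
    binned = np.zeros_like(counts, dtype=int)
    nonzero = counts > 0
    if nonzero.any():
        edges = np.quantile(counts[nonzero], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned[nonzero] = np.digitize(counts[nonzero], edges) + 1
    return binned

counts = np.array([0, 1, 3, 45, 1000])
print(bin_expression(counts))  # [0 1 2 4 5] -> each gene becomes [GeneID] + [bin token]
```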
Strategy: Read-Depth-Aware (RDA) Tokenization
The Problem: Different cells have different sequencing depths—some have 10,000 total counts, others only 1,000. This makes expression values hard to compare directly.
The Solution: Keep the continuous expression values (don't discretize!), but add special "depth tokens" that tell the model: "This cell was supposed to have T=10,000 counts, but we only sampled S=1,000." The model learns to mentally "scale up" the values during pretraining.
Key Benefit: Enables expression enhancement—the model can predict what gene expression would look like at higher sequencing depth, effectively denoising sparse data.
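A simplified sketch of the read-depth-aware input described above; the log scaling and the exact layout of the S/T depth tokens are illustrative assumptions rather than a specific model's recipe.

```python
import numpy as np

def rda_input(counts: np.ndarray, target_depth: float) -> dict:
    """Keep continuous (log-scaled) expression values and prepend two depth
    indicators: S = observed total counts, T = desired total counts.
    During pretraining T can be set higher than S so the model learns to
    'scale up' shallow profiles."""
    observed_depth = counts.sum()
    values = np.log1p(counts)  # continuous, not discretized into bins
    return {
        "depth_tokens": {"S": np.log1p(observed_depth), "T": np.log1p(target_depth)},
        "gene_values": values,
    }

cell = np.array([3, 0, 12, 1])                       # shallow profile (S = 16)
print(rda_input(cell, target_depth=10_000))          # ask the model to enhance to T = 10,000
```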
Strategy: Genome-Coordinate Tokenization
The Problem: In chromatin data (ATAC-seq), we don't have predefined "genes". We just have regions on the genome that are "open" (accessible). These regions change depending on the cell type.
The Solution: Imagine the genome as a giant LEGO baseplate ruler. We don't define the brick type; we define where the bricks are placed. The token isn't a name; it's a set of coordinates: Chromosome number, Start position on the ruler, and End position.
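A rough sketch of a coordinate-based token, assuming a one-hot chromosome encoding, transformer-style sinusoidal encodings of the start/end positions, and a single accessibility scalar; all dimensions are illustrative.

```python
import numpy as np

def sinusoidal(position: int, d: int = 16) -> np.ndarray:
    """Transformer-style sinusoidal encoding of a genomic coordinate."""
    i = np.arange(d // 2)
    angles = position / (10_000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def coordinate_token(chrom: int, start: int, end: int, accessibility: float) -> np.ndarray:
    """One scATAC-seq peak as a token: where it is (chromosome + start/end
    encodings) plus how open it is (a single accessibility scalar)."""
    chrom_onehot = np.eye(24)[chrom - 1]  # chr1..chr22, X, Y
    return np.concatenate(
        [chrom_onehot, sinusoidal(start), sinusoidal(end), [accessibility]]
    )

tok = coordinate_token(chrom=7, start=5_530_600, end=5_531_100, accessibility=3.0)
print(tok.shape)  # (57,) = 24 + 16 + 16 + 1
```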
Strategy: Cell2Sentence (C2S)
The Idea: Instead of stacking bricks to show expression level, lay them out in a horizontal line. High expression? Repeat that brick many times. Low expression? Just one or two repeats.
Why It Matters: This transforms cell data into text-like "sentences" that standard large language models (GPT, LLaMA) can understand. You can even ask natural language questions about cells!
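A toy sketch of the repetition-style encoding described above; the log-based repeat count and its cap are illustrative choices, not the published C2S recipe.

```python
import numpy as np

def cell_to_sentence(counts: np.ndarray, gene_names: list[str], max_repeats: int = 5) -> str:
    """Write a cell as a plain-text 'sentence': genes are listed from highest
    to lowest expression, and more highly expressed genes are repeated more
    often (capped at max_repeats)."""
    order = np.argsort(-counts)
    words = []
    for i in order:
        if counts[i] <= 0:
            break
        repeats = min(int(np.ceil(np.log2(counts[i] + 1))), max_repeats)
        words.extend([gene_names[i]] * repeats)
    return " ".join(words)

genes = ["CD3D", "GAPDH", "TP53"]
print(cell_to_sentence(np.array([40, 3, 0]), genes))
# "CD3D CD3D CD3D CD3D CD3D GAPDH GAPDH" -> readable by a standard LLM
```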
Strategy: Amino Acid Tokenization
The Idea: Proteins are chains of just 20 different building blocks (amino acids). Each amino acid is one token, like having 20 different LEGO brick types that snap together in a chain.
Why It Works: Just like language models learn word patterns ("the" often follows "in"), protein models learn amino acid patterns that determine protein function and structure.
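Amino acid tokenization is essentially a character-level lookup; here is a minimal sketch with an assumed set of special tokens.

```python
# The 20 standard amino acids, one token each, plus a few special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3}
VOCAB.update({aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)})

def tokenize_protein(sequence: str) -> list[int]:
    """Map each residue to its token ID, bracketed by <cls> and <eos>."""
    ids = [VOCAB.get(aa, VOCAB["<unk>"]) for aa in sequence.upper()]
    return [VOCAB["<cls>"], *ids, VOCAB["<eos>"]]

print(tokenize_protein("MKTAYIAK"))  # 8 residues -> 10 token IDs
```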
Strategy: Codon-Level Tokenization
The Idea: Instead of tokenizing single DNA letters (A, T, G, C), group them into triplets called codons. There are 64 possible triplet combinations (4^3).
Why It Matters: Multiple codons can code for the same amino acid (synonymous codons), but the choice affects translation speed and mRNA stability. "Silent" mutations can still cause disease!
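A minimal sketch of codon-level tokenization of a coding sequence, assuming the sequence starts in frame.

```python
def codon_tokenize(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping triplets (codons).
    The vocabulary holds at most 4**3 = 64 codon tokens."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(codon_tokenize("ATGGCTTGA"))  # ['ATG', 'GCT', 'TGA']
```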
Strategy: BPE + IUPAC Encoding for Diploid Genomes
The Problem: Reference genome models ignore individual genetic variation. How do you encode personalized genomes with heterozygous variants (where you inherited different alleles from each parent)?
The Solution: Use IUPAC ambiguity codes to represent heterozygous sites directly in the sequence (e.g., Y for C/T, R for A/G). Then apply Byte-Pair Encoding (BPE) to learn variable-length subword tokens that capture regulatory motifs.
Why It Works: This enables native diploid genome modeling without separate haplotype processing. The model learns the biological significance of both homozygous and heterozygous positions.
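A small sketch of the IUPAC step, assuming simple biallelic SNVs at known positions within a reference window; in practice a BPE tokenizer would then be trained on the resulting personalized sequences. The function name and variant format are illustrative.

```python
# IUPAC two-allele ambiguity codes used to mark heterozygous positions.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GC"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def personalize(reference: str, variants: dict[int, tuple[str, str]]) -> str:
    """Overlay SNVs onto a reference window: homozygous alternate alleles
    replace the base, heterozygous sites get the IUPAC code for both alleles."""
    seq = list(reference)
    for pos, (allele1, allele2) in variants.items():
        seq[pos] = allele1 if allele1 == allele2 else IUPAC[frozenset(allele1 + allele2)]
    return "".join(seq)

# position 2 is heterozygous C/T -> 'Y'; position 5 is homozygous G -> 'G'
print(personalize("AACGTAGG", {2: ("C", "T"), 5: ("G", "G")}))  # 'AAYGTGGG'
```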
Strategy: K-mer Tokenization
The Idea: Instead of reading DNA one letter at a time (A, T, G, C), slide a window of K letters along the sequence. Each window position becomes one token. For 6-mers, there are 4^6 = 4,096 possible tokens.
Why It Works: It's like reading words instead of individual letters. "ATGCAT" carries more biological meaning than six separate letters, capturing local motifs and regulatory elements.
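A minimal k-mer tokenizer sketch (overlapping windows with stride 1; both parameters are illustrative).

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Slide a window of k bases along the sequence; each window is one token.
    With k = 6 the vocabulary holds up to 4**6 = 4,096 distinct tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCATGC", k=6))
# ['ATGCAT', 'TGCATG', 'GCATGC']  (overlapping 6-mers, stride 1)
```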
Strategy: Macrogene Tokenization
The Idea: Instead of using individual genes as tokens, group genes from ALL species into "macrogenes" based on protein sequence similarity (via ESM2). This creates a universal vocabulary that works across species.
Why It Works: Genes with similar protein functions cluster together regardless of species, enabling cross-species integration WITHOUT requiring one-to-one homolog mappings. Human CD4 and mouse Cd4 end up in the same macrogene!
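A toy sketch of the clustering idea using scikit-learn's KMeans on made-up 2-D "protein embeddings"; real pipelines use ~5,120-dim ESM2 embeddings, and SATURN additionally learns gene-to-macrogene weights rather than relying on hard cluster assignments alone.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "protein embeddings" (illustrative stand-ins for ESM2 embeddings).
# Functionally similar genes from different species land close together.
genes = ["human_CD4", "mouse_Cd4", "human_TP53", "mouse_Trp53", "human_INS"]
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0],   # CD4 orthologs: nearly identical proteins
    [5.0, 5.1], [5.1, 5.0],   # TP53 orthologs
    [9.0, 0.0],               # insulin: no close partner in this toy set
])

# Cluster ALL genes from ALL species into macrogenes; the cluster ID is the token.
macrogene_id = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print(dict(zip(genes, macrogene_id)))
# human_CD4/mouse_Cd4 share one macrogene; human_TP53/mouse_Trp53 share another.
```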
Key Papers Implementing These Strategies
iSEEEK: Integration via Gene Rankings
Tokenization Strategy
The 126 most highly expressed genes per cell, ranked by expression level. Uses [CLS] and [SEP] tokens with an MLM objective. Vocabulary: 20,706 protein-coding genes.
Scale: 11.9M cells | Context: 128 tokens | Params: ~10M
scGPT: Multi-task Foundation Model
Tokenization Strategy
Gene tokens paired with binned expression values (51 bins). Special condition tokens for perturbation modeling. Context limited to the ~1,200 most highly variable genes.
Scale: 33M cells | Context: 1,200 | Params: ~100M
Nicheformer: Spatial-Aware Foundation Model
Tokenization Strategy
Rank-based with technology-specific mean normalization. Contextual tokens: <ORGANISM>, <ASSAY>, <MODALITY>. Cross-species gene mapping via orthologs.
Scale: 110M cells | Context: 1,500 | Params: 49.3M
ChromFound: scATAC-seq Foundation Model
Tokenization Strategy
Chromosome embedding + sinusoidal positional encoding of genomic coordinates (start/end). Linear accessibility embedding. Vocabulary-free approach for dynamic OCR (open chromatin region) landscapes.
Scale: 1.97M cells | Context: 440K OCRs | Params: 450K
ESM3: Multimodal Protein Language Model
Tokenization Strategy
Separate token tracks for sequence (amino acids), structure (discrete autoencoder), and function (keywords from InterPro/GO). All modalities fused in shared latent space with masked language modeling.
Scale: 2.78B proteins | Context: Sequence + 3D Structure | Params: 1.4B-98B
C2S-Scale: LLM-Scale Single-Cell Foundation
Tokenization Strategy
Cell2Sentence: expression encoded via token repetition (high expr = more repeats). GRPO refinement for biological task optimization. 8,192 token context.
Scale: 5.7M cells | Context: 8,192 | Params: 157M-27B
ProGen2: Protein Language Model Scaling
Tokenization Strategy
Standard amino acid tokenization with rotary positional encodings. Causal language modeling with next-token prediction. Context: 1,024-2,048 tokens.
Scale: UniRef90+BFD | Context: 2,048 AAs | Params: 151M-6.4B
VariantFormer: Personalized Gene Expression from Diploid Genomes
Tokenization Strategy
IUPAC ambiguity codes for heterozygous sites (R=A/G, Y=C/T, etc.) embedded into reference genome. BPE tokenizer (500 vocab) trained on cCREs. Hierarchical cross-attention between CRE (±1Mb) and gene body windows.
Scale: 2,330 donors, 50K genes | Context: >2Mb | Params: 1.2B
SATURN: Universal Cross-Species Embeddings
Tokenization Strategy
Genes clustered into ~2,000 macrogenes via k-means on ESM2 protein embeddings (5,120-dim). Gene-to-macrogene weights learned from protein similarity. Enables integration of species separated by ~350 million years of divergence.
Scale: 335K cells (3 species) | Context: ~2,000 macrogenes | Params: ~10M
Comparisons & Trade-offs
Tokenization Strategy Comparison
| Strategy | Data Type | Key Advantage | Limitation | Representative Model |
|---|---|---|---|---|
| Gene Rank-Based | scRNA-seq | Batch-insensitive, captures relative expression patterns | Loses absolute expression magnitude | iSEEEK, Geneformer |
| Expression Binning | scRNA-seq | Preserves expression magnitude, compatible with NLP architectures | Information loss from discretization | scGPT, scBERT |
| Genome-Coordinate | scATAC-seq | Vocabulary-free, handles novel regions | Requires reference genome alignment | ChromFound |
| K-mer Tokenization | DNA sequences | Captures local sequence patterns | Large vocabulary (4^k tokens) | Nucleotide Transformer |
| BPE + IUPAC | Diploid DNA + Variants | Native heterozygous encoding; personalized genome modeling | Requires phased VCF; expanded alphabet | VariantFormer |
| Amino Acid + Multimodal | Protein sequences + structure + function | Simple sequence tokens; multimodal enables structure/function reasoning | Ignores codon usage effects | ESM3, ProGen2 |
| Cell2Sentence | scRNA-seq | Compatible with standard LLMs, enables NL queries | Long sequences from repetition encoding | C2S-Scale |
| Macrogene | scRNA-seq (cross-species) | Enables cross-species integration without homologs via protein embeddings | Requires reference proteomes; loses gene-level resolution | SATURN |
Context Length vs Model Scale Trade-offs
Different tokenization strategies and architectures dictate the maximum sequence length (context window) a model can handle, which impacts the biological scope it can capture.
| Model | Tokenization Strategy | Context Length | Parameters |
|---|---|---|---|
| iSEEEK | Rank-Based (Top-K) | 128 tokens | ~10M |
| scGPT | Binning (High Variance Genes) | ~1,200 genes | ~100M |
| Nicheformer | Rank-Based (Top-K) | 1,500 tokens | 49.3M |
| AIDO.Cell | Auto-Discretization (Full Transcriptome) | 19,264 (full) | 650M |
| ChromFound | Genome-Coordinate (OCRs) | 440K OCRs (via Mamba) | 450K |
| ESM3 | Amino Acid + Structure + Function | Full protein (multimodal) | 98B |
| VariantFormer | BPE + IUPAC (Diploid) | >2 Mb (CRE ±1Mb + gene body) | 1.2B |
| C2S-Scale | Cell2Sentence (Repetition) | 8,192 tokens | 27B |
| SATURN | Macrogene (ESM2-based) | ~2,000 macrogenes | ~10M |
Choosing the Right Tokenization Strategy
| Use Case | Recommended Strategy | Why |
|---|---|---|
| Large-scale integration (>1M cells) across many labs | Gene Rank-Based | Naturally batch-insensitive; focuses on robust relative signals. |
| Perturbation modeling (predicting gene knockout effects) | Expression Binning | Preserves the absolute expression magnitude needed to model dosage changes. |
| Chromatin accessibility analysis (scATAC-seq) | Genome-Coordinate Tokenization | Handles dynamic open chromatin regions varying across cell types without a fixed vocabulary. |
| Protein fitness or structure prediction | Amino Acid Tokenization | Standard approach that effectively captures evolutionary constraints in protein sequences. |
| Personalized gene expression prediction from WGS | BPE + IUPAC | Encodes heterozygous variants natively; enables variant effect prediction from individual genomes. |
| Interacting with cell data using natural language | Cell2Sentence | Converts biological data into a format understood by standard Large Language Models. |