Overview
Cell annotation assigns biological identities (cell types, states) to individual cells based on their transcriptomic profiles. This page provides a comprehensive guide to available tools, helping you choose the right method for your analysis.
General Annotation Workflow
- Preprocess: QC & normalize
- Select Tool: Based on data & needs
- Annotate: Run classification
- Validate: Check markers
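The final "Validate" step can be sketched as a simple marker check: for each predicted label, compute the fraction of cells that express at least one expected marker. This is a minimal illustration in plain Python, not the procedure of any tool on this page; the function name `marker_support`, the toy cells, and the marker lists are all assumptions.

```python
def marker_support(cells, labels, markers):
    """Fraction of cells per predicted label expressing >= 1 expected marker.

    cells   : list of dicts mapping gene -> expression value (toy data)
    labels  : predicted label for each cell, same order as cells
    markers : dict mapping label -> list of expected marker genes
    """
    hits, totals = {}, {}
    for cell, label in zip(cells, labels):
        totals[label] = totals.get(label, 0) + 1
        expected = markers.get(label, [])
        if any(cell.get(gene, 0) > 0 for gene in expected):
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


# Hypothetical example: the second "T cell" call lacks CD3E and looks suspicious.
cells = [{"CD3E": 4}, {"CD3E": 0, "MS4A1": 5}, {"MS4A1": 3}]
labels = ["T cell", "T cell", "B cell"]
markers = {"T cell": ["CD3E"], "B cell": ["MS4A1"]}
support = marker_support(cells, labels, markers)
```

A low fraction for a label suggests that its annotations deserve a closer look before downstream analysis.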
Quick Reference
| Tool | Approach | Best For |
|---|---|---|
| scmap | Projection to reference | Fast large-scale annotation |
| SingleR | Correlation-based | Bulk reference compatibility |
| CellAssign | Probabilistic marker-based | Multi-batch studies with markers |
| SCINA | Semi-supervised mixture | Simple marker-based annotation |
| SingleCellNet | Random forest | Cross-platform robustness |
| CHETAH | Hierarchical tree | Uncertain/intermediate cells |
| CellTypist | Logistic regression | Immune cell typing |
| Azimuth | Reference mapping (Seurat) | Seurat ecosystem users |
| scGPT | Foundation model (Transformer) | Multi-task learning |
| scFoundation | Foundation model (100M params) | Low-depth data enhancement |
| CytoTRACE 2 | Interpretable deep learning | Developmental potential |
| scCATCH | Cluster-based + database | Automatic cluster annotation |
| scDeepSort | Pre-trained GNN | Reference-free annotation |
| scBERT | BERT-based transformer | Deep learning annotation |
| Concerto | Contrastive learning | Million-scale mapping |
| CellHint | Harmonization (PCT) | Cross-dataset standardization |
| UCE | Universal embeddings | Zero-shot cross-species |
| scInterpreter | LLM-based (Llama) | Gene knowledge integration |
| TCellSI | T cell state scoring | 8 T cell functional states |
| AIDO.Cell | Full-transcriptome transformer | 19K gene context |
| Nicheformer | Spatial + dissociated FM | Spatial context prediction |
Reference-Based Methods
Map query cells to pre-annotated reference datasets
scmap
Best for: Fast large-scale annotation
2018
Nature Methods | Wellcome Sanger Institute
First scalable projection method for single-cell reference mapping. Offers two modes: scmap-cluster (fast, maps to cluster centroids) and scmap-cell (precise, maps to individual cells).
- Dropout-based feature selection
- Handles "unassigned" cells gracefully
- Low memory footprint
- Requires pre-computed indices for references
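The idea behind scmap-cluster (compare a query cell to cluster centroids and leave it "unassigned" when nothing is similar enough) can be sketched in a few lines. This is a simplified illustration, not scmap's implementation: it uses cosine similarity only, and the function names, the toy centroids, and the `threshold=0.7` cutoff are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def project_to_centroids(cell, centroids, threshold=0.7):
    """Return (label, similarity); label is 'unassigned' below the threshold."""
    best_label, best_sim = None, -1.0
    for label, centroid in centroids.items():
        sim = cosine(cell, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return "unassigned", best_sim
    return best_label, best_sim


# Toy centroids over three selected genes (hypothetical values).
centroids = {"T cell": [5.0, 0.0, 1.0], "B cell": [0.0, 4.0, 1.0]}
label_a, sim_a = project_to_centroids([4.0, 0.5, 1.0], centroids)
label_b, sim_b = project_to_centroids([1.0, 1.0, 5.0], centroids)
```

Here the first query resembles the T cell centroid, while the second resembles neither and is left unassigned.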
SingleR
Best for: Bulk reference compatibility
2019
Nature Immunology | UCSF
Correlates single-cell profiles with reference datasets (bulk or single-cell). Iteratively refines annotations using variable genes between top candidates.
- Works with bulk RNA-seq references
- Pre-built references (HPCA, Blueprint, Monaco)
- Iterative fine-tuning refinement
- Can be slow on very large datasets
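The correlation step can be illustrated with a small sketch: compute the Spearman correlation between a query cell and each reference profile, then pick the best-scoring label. This covers only the first pass and omits the iterative fine-tuning; it is not SingleR's code, and the helper names and toy reference profiles are assumptions.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def annotate(cell, reference):
    """Score the cell against every reference profile and return the best label."""
    scores = {label: spearman(cell, profile) for label, profile in reference.items()}
    return max(scores, key=scores.get), scores


# Toy reference profiles over five genes (hypothetical values).
reference = {
    "monocyte": [9.0, 7.0, 1.0, 0.5, 2.0],
    "NK cell": [1.0, 0.5, 8.0, 9.0, 2.0],
}
label, scores = annotate([8.0, 5.0, 0.0, 1.0, 3.0], reference)
```

Because the comparison is rank-based, it is insensitive to monotone scale differences, which is part of why bulk references can be used.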
Azimuth
Best for: Seurat ecosystem users
2021
Cell (Seurat v4) | Satija Lab
Automated reference-based annotation using HuBMAP atlases. Single-command RunAzimuth() provides multi-resolution labels (l1, l2) with confidence scores. Supports ATAC-seq via bridge integration.
- No-code web interface available
- HuBMAP reference atlases (PBMC, lung, kidney...)
- Bridge integration for ATAC-seq
- Restricted to provided HuBMAP references
Marker Gene-Based Methods
Leverage prior knowledge of cell type marker genes
CellAssign
Best for: Multi-batch studies with markers
2019
Nature Methods | BC Cancer
Probabilistic framework using marker genes with explicit batch effect modeling. Provides uncertainty quantification and is robust to ~30% marker misspecification.
- Hierarchical Bayesian model (negative binomial)
- Explicit batch/patient/sample modeling
- GPU acceleration via TensorFlow
- Performance sensitive to marker list quality/specificity
SCINA
Best for: Simple marker-based annotation
2019
Genes | UT Southwestern
Semi-supervised annotation using bimodal Gaussian mixture models for marker genes. Simple input: just provide marker gene lists per cell type.
- Bimodal on/off expression model
- No reference dataset required
- Easy to use - just marker lists
- Relies heavily on assumption of bimodal gene expression
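The bimodal on/off assumption can be made concrete with a two-component Gaussian mixture: given an expression value, compute the posterior probability that the marker is "on". This is a sketch with fixed, hypothetical component parameters; SCINA estimates its parameters from the data, which is not shown here.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def prob_on(x, off=(0.5, 0.5), on=(3.0, 0.8), weight_on=0.5):
    """Posterior probability that expression x comes from the 'on' component.

    off, on   : (mean, sd) of each component (hypothetical values)
    weight_on : prior mixing weight of the 'on' component
    """
    p_on = weight_on * gaussian_pdf(x, *on)
    p_off = (1 - weight_on) * gaussian_pdf(x, *off)
    return p_on / (p_on + p_off)


high = prob_on(3.0)  # value near the 'on' mean
low = prob_on(0.5)   # value near the 'off' mean
```

When a marker's expression is not clearly bimodal, the two components overlap and the posterior stays close to the prior, which is the limitation noted above.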
CHETAH
Best for: Uncertain/intermediate cells
2019
Nucleic Acids Research
Hierarchical classification using reference tree structure. Stops at appropriate resolution when confidence drops, flagging uncertain cells.
- Auto-determines annotation granularity
- Identifies intermediate/ambiguous states
- Confidence thresholds at each tree level
- High rate of "unassigned" cells if reference is biologically distinct
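The "stop when confidence drops" behavior can be sketched as a walk down a classification tree that halts at the current node once the best branch is no longer clearly ahead of the runner-up. This is an illustration only: the margin-based confidence, the `threshold=0.1` default, the tree, and the per-branch scores are assumptions, not CHETAH's actual scoring.

```python
def classify_down_tree(tree, branch_scores, threshold=0.1):
    """Descend the tree while the best branch beats the runner-up by >= threshold.

    tree          : nested dicts with 'name' and 'children' (leaves have [])
    branch_scores : dict mapping node name -> score for this cell (toy values)
    Returns the list of node names visited; the last one is the assigned label.
    """
    node = tree
    path = [node["name"]]
    while node["children"]:
        ranked = sorted(
            node["children"],
            key=lambda child: branch_scores[child["name"]],
            reverse=True,
        )
        best = ranked[0]
        runner_up = branch_scores[ranked[1]["name"]] if len(ranked) > 1 else 0.0
        confidence = branch_scores[best["name"]] - runner_up
        if confidence < threshold:
            break  # ambiguous: stay at the intermediate node
        node = best
        path.append(node["name"])
    return path


tree = {
    "name": "root",
    "children": [
        {"name": "immune", "children": [
            {"name": "T cell", "children": []},
            {"name": "B cell", "children": []},
        ]},
        {"name": "stromal", "children": []},
    ],
}
clear = classify_down_tree(
    tree, {"immune": 0.9, "stromal": 0.1, "T cell": 0.9, "B cell": 0.1})
ambiguous = classify_down_tree(
    tree, {"immune": 0.9, "stromal": 0.1, "T cell": 0.52, "B cell": 0.48})
```

The ambiguous cell is labeled at the intermediate "immune" node instead of being forced into a leaf type.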
scCATCH
Best for: Automatic cluster annotation
2020
iScience | Zhejiang University
Automatic cluster annotation using CellMatch database (353 cell types, 20,792 markers across 184 tissues). Evidence-based scoring ranks candidates by marker matches and literature support.
- Paired cluster comparison reduces false positives
- CellMatch integrates CellMarker, MCA, CancerSEA
- 83% average accuracy across tissues
- Database-dependent; novel cell types may not be recognized
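Evidence-based ranking of candidates can be sketched as follows: count how many of a cluster's marker genes match each cell type in a marker database, and break ties by literature support. The scoring rule, the function name, and the tiny database below are hypothetical stand-ins for illustration; they are not CellMatch content or scCATCH's actual formula.

```python
def rank_cell_types(cluster_markers, database):
    """Rank candidate cell types by matched markers, then by paper count.

    cluster_markers : marker genes detected for one cluster
    database        : dict cell type -> {'markers': [...], 'n_papers': int}
    Returns a list of (cell_type, n_matched, n_papers), best candidate first.
    """
    cluster = set(cluster_markers)
    ranked = []
    for cell_type, entry in database.items():
        matched = cluster & set(entry["markers"])
        if matched:
            ranked.append((cell_type, len(matched), entry["n_papers"]))
    ranked.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return ranked


# Toy database with made-up literature counts.
database = {
    "T cell": {"markers": ["CD3D", "CD3E", "CD2"], "n_papers": 12},
    "NK cell": {"markers": ["NKG7", "GNLY", "CD2"], "n_papers": 8},
    "B cell": {"markers": ["CD79A", "MS4A1"], "n_papers": 10},
}
candidates = rank_cell_types(["CD3D", "CD3E", "CD2", "IL7R"], database)
```

A cluster whose markers match nothing in the database returns an empty list, which mirrors the limitation that novel cell types may go unrecognized.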
Machine Learning Classifiers
Traditional ML approaches for supervised classification
SingleCellNet
Best for: Cross-platform robustness
2019
Cell Systems | Johns Hopkins
Random forest classifier using "top-pair" gene features that are robust to technical variation. Provides calibrated scores for quality assessment.
- Rank-based gene pairs for cross-platform robustness
- Train custom classifiers from your reference
- Calibrated probability scores
- Requires training data that closely matches target data type
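Why gene-pair features resist technical variation can be shown with a short sketch: each feature records only whether gene A is expressed above gene B within the same cell, so any order-preserving distortion of the measurements leaves the features unchanged. The gene pairs and expression values below are hypothetical, and the selection of informative pairs and the random forest itself are not shown.

```python
def pair_features(expression, gene_pairs):
    """Binary features: 1 if the first gene exceeds the second in this cell."""
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


gene_pairs = [("CD3E", "CD19"), ("CD19", "LYZ"), ("LYZ", "CD3E")]

# The same hypothetical cell as seen on two platforms; platform B applies a
# monotone distortion (scaled square root) to every measurement.
cell_platform_a = {"CD3E": 12.0, "CD19": 1.0, "LYZ": 3.0}
cell_platform_b = {gene: 0.1 * value ** 0.5 for gene, value in cell_platform_a.items()}

features_a = pair_features(cell_platform_a, gene_pairs)
features_b = pair_features(cell_platform_b, gene_pairs)
```

Both platforms yield identical feature vectors, so a classifier trained on one can be applied to the other without rescaling.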
CellTypist
Best for: Immune cell typing
2022
Science | Wellcome Sanger Institute
Fast logistic regression trained on cross-tissue immune atlas (360K cells, 16 tissues, 101 cell types). Continuously updatable as new data becomes available.
- F1: 0.95 (high) / 0.89 (low hierarchy)
- Hierarchical: 32 broad → 91 fine types
- SGD training - fast and scalable
- Currently optimized primarily for immune cells
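Prediction with a logistic regression classifier reduces to a weighted sum of gene expression passed through a sigmoid, one score per cell type. The sketch below shows that scoring step with hypothetical weights and biases; it does not use CellTypist's trained models or its API.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(cell, models):
    """Score a cell with one-vs-rest logistic models and return the best label.

    cell   : dict gene -> normalized expression
    models : dict label -> (weights dict, bias); all values here are made up
    """
    probs = {}
    for label, (weights, bias) in models.items():
        z = bias + sum(w * cell.get(gene, 0.0) for gene, w in weights.items())
        probs[label] = sigmoid(z)
    best = max(probs, key=probs.get)
    return best, probs


models = {
    "T cell": ({"CD3E": 1.5, "MS4A1": -1.0}, -1.0),
    "B cell": ({"CD3E": -1.0, "MS4A1": 1.5}, -1.0),
}
best, probs = predict({"CD3E": 2.0, "MS4A1": 0.0}, models)
```

Since each prediction is a single dot product per class, scoring scales linearly with the number of cells, consistent with the speed noted above.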
scDeepSort
Best for: Reference-free annotation
2021
Nucleic Acids Research | Zhejiang University
First pre-trained graph neural network for cell type annotation. Treats cells and genes as graph nodes with expression as weighted edges. No additional reference required at prediction time.
- 83.79% accuracy across 265,489 cells
- Weighted graph aggregator handles batch effects
- Outperforms 12 existing methods
- Pre-trained on specific cell types; may struggle with novel types
scBERT
Best for: Deep learning annotation
2022
Nature Machine Intelligence | Tencent
BERT-based model adapted for scRNA-seq with Performer encoder. Two-stage pre-training + fine-tuning paradigm. Attention weights enable discovery of cell-type-specific genes.
- Strong batch effect resistance
- Interpretable attention for gene discovery
- Novel cell type detection capability
- Requires GPU for efficient training/inference
Concerto
Best for: Million-scale mapping
2022
Nature Machine Intelligence | BGI
Contrastive learning framework using asymmetric teacher-student self-distillation. Among the earliest applications of contrastive learning to single-cell data. Supports multimodal RNA + protein integration.
- Rapid mapping to million-scale atlases
- NOTA (None-of-the-above) rejection for novel types
- Hierarchical fine-grained annotation
- Requires large training datasets for optimal performance
CellHint
Best for: Cross-dataset standardization
2023
Cell | Wellcome Sanger Institute
Predictive clustering tree (PCT) tool for harmonizing cell type annotations across datasets. Creates hierarchical cell type relationships for standardized Human Cell Atlas integration.
- Overcomes batch effects in cross-dataset comparisons
- Annotation-aware data integration
- From CellTypist team - compatible ecosystem
- Requires multiple datasets with existing annotations
Foundation Models
Large-scale pre-trained models for general single-cell analysis
scGPT
Best for: Multi-task learning
2024
Nature Methods | University of Toronto
Foundation model trained on 33M human cells. Simultaneously learns cell and gene representations. Fine-tune for annotation, perturbation prediction, batch integration, and more.
- Transformer with specialized attention masks
- <cls> token for cell-level representation
- Gene network inference from attention
- Requires significant GPU resources for fine-tuning
scFoundation
Best for: Low-depth data enhancement
2024
Nature Methods | BioMap
100M parameter model with xTrimoGene architecture and Read-Depth Aware (RDA) pre-training. Excels at gene expression enhancement and handles variable sequencing depths.
- Asymmetric encoder-decoder (non-zero only)
- RDA task handles low-depth data
- Zero-shot expression enhancement
- Very large model size makes local deployment challenging
UCE
Best for: Zero-shot cross-species
2023
bioRxiv | Stanford & CZI
Universal Cell Embeddings using "Bags of RNA" approach with ESM2 protein embeddings. Zero-shot cell type classification across species without homolog mapping. 33-layer transformer on Integrated Mega-scale Atlas.
- No cell type annotations needed for training
- Cross-species annotation without fine-tuning
- 1280-dimensional universal embeddings
- Currently transcriptomics only; preprint status
scInterpreter
Best for: Gene knowledge integration
2024
arXiv Preprint | Chinese Academy of Sciences
Adapts LLMs (Llama-13b) to interpret scRNA-seq data, using GPT-3.5 embeddings of NCBI gene descriptions as gene representations. Bridges biological knowledge from language models with expression profiles.
- LLM-encoded biological knowledge integration
- Frozen LLM + lightweight projection layers
- Gene-level semantic grounding via NCBI
- Preprint; code not publicly available yet
AIDO.Cell
Best for: Full 19K gene context
2024
NeurIPS | GenBio AI
First single-cell FM to process the entire ~19K-gene transcriptome without truncation using dense Transformer + FlashAttention-2. Scaling study from 3M to 650M params on 50M cells.
- No gene truncation/sampling - full context
- Auto-discretization learns flexible expression embeddings
- Read-depth aware pre-training
- Training used 256 H100 GPUs; inference is still heavy
Nicheformer
Best for: Spatial context prediction
2025
Nature Methods | Helmholtz Munich
First foundation model trained on both dissociated (57M) and spatial (53M) transcriptomics. Learns spatially-aware representations enabling transfer of spatial context to scRNA-seq.
- Predicts cellular microenvironments from expression
- SpatialCorpus-110M across multiple technologies
- Transfers spatial annotations to dissociated data
- Spatial predictions may not generalize to all tissue contexts
Developmental Potential & Specialized
Methods for predicting cell potency and developmental states
CytoTRACE 2
Best for: Developmental potential
2025
Nature Methods | Stanford
Predicts absolute developmental potential (0-1 scale) and potency categories (totipotent → differentiated) using interpretable Gene Set Binary Networks. Works across datasets without batch correction.
- 6 potency categories (totipotent to differentiated)
- Binary weights enable gene set extraction
- τ = 0.82 correlation with ground truth
- Focuses on potency state, not specific cell type identity
TCellSI
Best for: T cell functional states
2024
iMeta | Huazhong University of Science and Technology
Specialized tool for inferring 8 distinct T cell functional states from scRNA-seq: Quiescence, Regulating, Proliferation, Helper, Cytotoxicity, Progenitor exhaustion, Terminal exhaustion, and Senescence.
- 8 curated T cell functional state signatures
- Mann-Whitney U statistics for robust scoring
- Works with bulk RNA-seq and scRNA-seq
- Specific to T cells; not generalizable to other cell types
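Rank-based signature scoring with the Mann-Whitney U statistic can be sketched as follows: within one sample, count how often a signature gene is expressed above a non-signature gene, then normalize to a 0-1 score. This is a simplified illustration, not TCellSI's exact formula; the function name, gene sets, and expression values are assumptions.

```python
def mann_whitney_score(expression, signature):
    """Normalized U statistic of signature genes versus all other genes.

    expression : dict gene -> expression in one sample (toy values)
    signature  : set of genes defining one functional state
    Returns U / (n_inside * n_outside): 1.0 means every signature gene
    outranks every other gene, 0.5 means no separation.
    """
    inside = [v for g, v in expression.items() if g in signature]
    outside = [v for g, v in expression.items() if g not in signature]
    if not inside or not outside:
        raise ValueError("need genes both inside and outside the signature")
    u = 0.0
    for a in inside:
        for b in outside:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5  # ties count as half
    return u / (len(inside) * len(outside))


expression = {"GZMB": 8.0, "PRF1": 6.0, "NKG7": 7.0,
              "CCR7": 1.0, "SELL": 2.0, "TCF7": 6.0}
cytotoxic = mann_whitney_score(expression, {"GZMB", "PRF1", "NKG7"})
quiescent = mann_whitney_score(expression, {"CCR7", "SELL", "TCF7"})
```

Because only ranks within a sample matter, the same scoring applies to a bulk profile or to a single cell.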
Which Tool Should I Use?
Have a Reference Atlas?
- Seurat user: Azimuth
- Immune cells: CellTypist
- Bulk reference: SingleR
- Fast/simple: scmap
Have Marker Genes?
- Multi-batch data: CellAssign
- Simple/quick: SCINA
- Uncertain cells: CHETAH
Multiple Tasks / Foundation Models?
- Annotation + Integration + Perturbation: scGPT
- Low-depth data: scFoundation
- Cross-species zero-shot: UCE
- Full 19K gene context: AIDO.Cell
- Spatial + scRNA-seq: Nicheformer
Specialized Cell Types/States?
- Developmental potency: CytoTRACE 2
- T cell functional states: TCellSI
- Cross-dataset harmonization: CellHint
- Million-scale mapping: Concerto
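The decision guide above can be encoded as a small lookup table, which is convenient for scripting tool selection in a pipeline. The key names (`"reference"`, `"immune"`, and so on) and the function name are hypothetical labels chosen for this sketch; the recommendations themselves are taken directly from the guide.

```python
GUIDE = {
    "reference": {
        "seurat_user": "Azimuth",
        "immune": "CellTypist",
        "bulk_reference": "SingleR",
        "fast": "scmap",
    },
    "markers": {
        "multi_batch": "CellAssign",
        "simple": "SCINA",
        "uncertain_cells": "CHETAH",
    },
    "foundation": {
        "multi_task": "scGPT",
        "low_depth": "scFoundation",
        "cross_species": "UCE",
        "full_gene_context": "AIDO.Cell",
        "spatial": "Nicheformer",
    },
    "specialized": {
        "potency": "CytoTRACE 2",
        "t_cell_states": "TCellSI",
        "harmonization": "CellHint",
        "million_scale": "Concerto",
    },
}


def recommend_tool(resource, need):
    """Look up the suggested tool for a resource category and a specific need."""
    try:
        return GUIDE[resource][need]
    except KeyError:
        raise ValueError(f"no recommendation for {resource!r} / {need!r}") from None


choice = recommend_tool("reference", "immune")
```

An unknown combination raises `ValueError` instead of silently returning a default, so gaps in the guide are visible.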