Cell Annotation & Type Identification

A Guide to Computational Tools for Single-Cell Analysis

Overview

Cell annotation assigns biological identities (cell types and states) to individual cells based on their transcriptomic profiles. This page surveys 21 tools, grouped by approach, to help you choose a method suited to your data and analysis goals.

General Annotation Workflow

  • Preprocess: QC & normalize
  • Select Tool: Based on data & needs
  • Annotate: Run classification
  • Validate: Check markers
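
A minimal sketch of the first workflow step, assuming raw counts are held in a plain dictionary: it drops low-depth cells, scales each remaining cell to a fixed library size, and applies log1p. The function name, the `min_counts` cutoff, and the `target_sum` value are illustrative assumptions rather than settings recommended by any tool listed here.

```python
import math

def qc_and_normalize(counts, min_counts=500, target_sum=10_000):
    """Drop low-depth cells, then scale each cell to target_sum and log1p.

    counts: dict mapping cell id -> list of raw gene counts (toy layout).
    min_counts and target_sum are illustrative defaults, not recommendations.
    """
    normalized = {}
    for cell, row in counts.items():
        total = sum(row)
        if total < min_counts:  # QC: skip cells with too few total counts
            continue
        normalized[cell] = [math.log1p(c / total * target_sum) for c in row]
    return normalized

raw = {
    "cell_a": [900, 100, 0],  # 1000 counts in total, passes QC
    "cell_b": [3, 1, 0],      # 4 counts in total, filtered out
}
norm = qc_and_normalize(raw)
print(sorted(norm))
```

In practice this step is usually handled by an analysis framework; the sketch only shows what "QC & normalize" means before a classifier sees the data.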

Quick Reference

| Tool | Year | Approach | Best For | Links |
| --- | --- | --- | --- | --- |
| scmap | 2018 | Projection to reference | Fast large-scale annotation | Paper, GitHub |
| SingleR | 2019 | Correlation-based | Bulk reference compatibility | Paper, GitHub |
| CellAssign | 2019 | Probabilistic marker-based | Multi-batch studies with markers | Paper, GitHub |
| SCINA | 2019 | Semi-supervised mixture | Simple marker-based | Paper, GitHub |
| SingleCellNet | 2019 | Random forest | Cross-platform robustness | Paper, GitHub |
| CHETAH | 2019 | Hierarchical tree | Uncertain/intermediate cells | Paper, GitHub |
| CellTypist | 2022 | Logistic regression | Immune cell typing | Paper, GitHub |
| Azimuth | 2023 | Reference mapping (Seurat) | Seurat ecosystem users | Paper, GitHub |
| scGPT | 2024 | Foundation model (Transformer) | Multi-task learning | Paper, GitHub |
| scFoundation | 2024 | Foundation model (100M params) | Low-depth data enhancement | Paper, GitHub |
| CytoTRACE 2 | 2025 | Interpretable deep learning | Developmental potential | Paper, GitHub |
| scCATCH | 2020 | Cluster-based + database | Automatic cluster annotation | Paper, GitHub |
| scDeepSort | 2021 | Pre-trained GNN | Reference-free annotation | Paper, GitHub |
| scBERT | 2022 | BERT-based transformer | Deep learning annotation | Paper, GitHub |
| Concerto | 2022 | Contrastive learning | Million-scale mapping | Paper, GitHub |
| CellHint | 2023 | Harmonization (PCT) | Cross-dataset standardization | Paper, GitHub |
| UCE | 2023 | Universal embeddings | Zero-shot cross-species | Paper, GitHub |
| scInterpreter | 2024 | LLM-based (Llama) | Gene knowledge integration | Paper (arXiv) |
| TCellSI | 2024 | T cell state scoring | 8 T cell functional states | Paper, GitHub |
| AIDO.Cell | 2024 | Full-transcriptome transformer | 19K gene context | Paper, GitHub |
| Nicheformer | 2025 | Spatial + dissociated FM | Spatial context prediction | Paper, GitHub |

Reference-Based Methods

Map query cells to pre-annotated reference datasets

scmap Best for: Fast large-scale annotation 2018
Nature Methods | Wellcome Sanger Institute
Fast Projection Bioconductor
First scalable projection method for single-cell reference mapping. Offers two modes: scmap-cluster (fast, maps to cluster centroids) and scmap-cell (precise, maps to individual cells).
  • Dropout-based feature selection
  • Handles "unassigned" cells gracefully
  • Low memory footprint
  • Requires pre-computed indices for references
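
The centroid-mapping idea behind scmap-cluster can be sketched as follows. This is a simplified illustration, not the scmap implementation: it uses cosine similarity only, and the function names, the toy centroids, and the 0.7 cutoff are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_to_centroid(cell, centroids, threshold=0.7):
    """Return the label of the most similar centroid, or 'unassigned'.

    A cell is labelled only if its best similarity exceeds the threshold
    (an assumed cutoff for this sketch).
    """
    best_label, best_sim = "unassigned", threshold
    for label, centroid in centroids.items():
        sim = cosine(cell, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

# Toy reference: one centroid (mean expression over selected genes) per cluster
centroids = {"T cell": [5.0, 0.1, 0.2], "B cell": [0.1, 4.0, 0.3]}

print(assign_to_centroid([4.0, 0.2, 0.1], centroids))  # T cell
print(assign_to_centroid([1.0, 1.0, 5.0], centroids))  # unassigned
```

The second query resembles neither centroid, so it is left unassigned rather than forced into the nearest cluster.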
SingleR Best for: Bulk reference compatibility 2019
Nature Immunology | UCSF
Bulk Compatible Correlation Pre-built Refs
Correlates single-cell profiles with reference datasets (bulk or single-cell). Iteratively refines annotations using variable genes between top candidates.
  • Works with bulk RNA-seq references
  • Pre-built references (HPCA, Blueprint, Monaco)
  • Iterative fine-tuning refinement
  • Can be slow on very large datasets
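
As a rough sketch of correlation-based labelling, the example below scores one cell against reference profiles with a rank (Spearman) correlation and keeps the best match. It deliberately omits the iterative fine-tuning step, and all function names and toy profiles are assumptions for illustration, not the SingleR API.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def best_reference_label(cell, references):
    """Score the cell against every reference profile and pick the top label."""
    scores = {label: spearman(cell, profile) for label, profile in references.items()}
    return max(scores, key=scores.get), scores

# Toy bulk-like reference profiles over the same four genes
references = {"monocyte": [9.0, 7.0, 1.0, 0.5], "NK cell": [0.5, 1.0, 8.0, 6.0]}
label, scores = best_reference_label([30, 12, 2, 0], references)
print(label)
```

Because only ranks are compared, the raw scale of the query (counts) and of the reference (bulk-like values) does not need to match, which is the property that makes bulk references usable.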
Azimuth Best for: Seurat ecosystem users 2023
Cell (Seurat v4) | Satija Lab
Web App Seurat Multi-Modal ATAC-seq
Automated reference-based annotation using HuBMAP atlases. Single-command RunAzimuth() provides multi-resolution labels (l1, l2) with confidence scores. Supports ATAC-seq via bridge integration.
  • No-code web interface available
  • HuBMAP reference atlases (PBMC, lung, kidney...)
  • Bridge integration for ATAC-seq
  • Restricted to provided HuBMAP references

Marker Gene-Based Methods

Leverage prior knowledge of cell type marker genes

CellAssign Best for: Multi-batch studies with markers 2019
Nature Methods | BC Cancer
Batch Correction Probabilistic TensorFlow
Probabilistic framework using marker genes with explicit batch effect modeling. Provides uncertainty quantification and is robust to ~30% marker misspecification.
  • Hierarchical Bayesian model (negative binomial)
  • Explicit batch/patient/sample modeling
  • GPU acceleration via TensorFlow
  • Performance sensitive to marker list quality/specificity
SCINA Best for: Simple marker-based annotation 2019
Genes | UT Southwestern
Simple Semi-Supervised EM Algorithm
Semi-supervised annotation using bimodal Gaussian mixture models for marker genes. Simple input: just provide marker gene lists per cell type.
  • Bimodal on/off expression model
  • No reference dataset required
  • Easy to use - just marker lists
  • Relies heavily on assumption of bimodal gene expression
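
The bimodal on/off assumption can be illustrated with a two-component Gaussian mixture fitted by EM on a single marker gene. This is a one-gene toy sketch under assumed names and settings (`fit_bimodal`, `n_iter`, `min_var`), not SCINA's multi-gene model.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_bimodal(values, n_iter=50, min_var=1e-3):
    """Fit a two-component 1D Gaussian mixture ('off' vs 'on') with EM.

    Returns the component means and, per value, the posterior probability
    of belonging to the 'on' (high-expression) component.
    """
    means = [min(values), max(values)]  # initialise at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each value is in the 'on' mode
        resp = []
        for x in values:
            p_off = weights[0] * gaussian_pdf(x, means[0], variances[0])
            p_on = weights[1] * gaussian_pdf(x, means[1], variances[1])
            total = p_off + p_on
            resp.append(p_on / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean, and variance of each component
        for k, r_k in enumerate(([1 - r for r in resp], resp)):
            n_k = sum(r_k)
            if n_k == 0:
                continue
            weights[k] = n_k / len(values)
            means[k] = sum(r * x for r, x in zip(r_k, values)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(r_k, values)) / n_k
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
    return means, resp

# Toy log-expression of one marker across eight cells
marker = [0.0, 0.1, 0.2, 0.1, 3.9, 4.1, 4.0, 0.0]
means, resp = fit_bimodal(marker)
marker_positive = [r > 0.5 for r in resp]
print(marker_positive)
```

If a marker is not actually bimodal in the data, the two components overlap and the posteriors become uninformative, which is the limitation noted in the last bullet.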
CHETAH Best for: Uncertain/intermediate cells 2019
Nucleic Acids Research
Hierarchical Uncertainty Tree-Based
Hierarchical classification using reference tree structure. Stops at appropriate resolution when confidence drops, flagging uncertain cells.
  • Auto-determines annotation granularity
  • Identifies intermediate/ambiguous states
  • Confidence thresholds at each tree level
  • High rate of "unassigned" cells if reference is biologically distinct
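
The "stop when confidence drops" behaviour can be sketched as a walk down a classification tree. The confidence measure used here (gap between the best and second-best child score), the 0.1 cutoff, and all names are illustrative assumptions, not CHETAH's actual scoring.

```python
def classify_hierarchically(scores, tree, root="root", threshold=0.1):
    """Descend the tree while the best child clearly beats the runner-up.

    tree:   dict node -> list of child nodes (leaves have no entry).
    scores: dict node -> dict child -> similarity score for this cell.
    Returns a leaf label, or an intermediate node if the cell is ambiguous.
    """
    node = root
    while tree.get(node):
        ranked = sorted(scores[node].items(), key=lambda kv: kv[1], reverse=True)
        best_child, best_score = ranked[0]
        second_score = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_score - second_score < threshold:
            break  # too ambiguous: report the intermediate node
        node = best_child
    return node

tree = {
    "root": ["immune", "stromal"],
    "immune": ["T cell", "B cell"],
    "stromal": ["fibroblast", "endothelial"],
}

clear_cell = {"root": {"immune": 0.9, "stromal": 0.2},
              "immune": {"T cell": 0.9, "B cell": 0.3}}
ambiguous_cell = {"root": {"immune": 0.9, "stromal": 0.2},
                  "immune": {"T cell": 0.55, "B cell": 0.50}}

print(classify_hierarchically(clear_cell, tree))      # T cell
print(classify_hierarchically(ambiguous_cell, tree))  # immune
```

The ambiguous cell is still reported as "immune" rather than discarded, which is how an intermediate label differs from a plain "unassigned" call.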
scCATCH Best for: Automatic cluster annotation 2020
iScience | Zhejiang University
Cluster-Based CellMatch DB Evidence-Based
Automatic cluster annotation using CellMatch database (353 cell types, 20,792 markers across 184 tissues). Evidence-based scoring ranks candidates by marker matches and literature support.
  • Paired cluster comparison reduces false positives
  • CellMatch integrates CellMarker, MCA, CancerSEA
  • 83% average accuracy across tissues
  • Database-dependent; novel cell types may not be recognized

Machine Learning Classifiers

Supervised classifiers trained on labeled reference data, from classical ML to deep learning

SingleCellNet Best for: Cross-platform robustness 2019
Cell Systems | Johns Hopkins
Cross-Platform Random Forest Calibrated
Random forest classifier using "top-pair" gene features that are robust to technical variation. Provides calibrated scores for quality assessment.
  • Rank-based gene pairs for cross-platform robustness
  • Train custom classifiers from your reference
  • Calibrated probability scores
  • Requires training data that closely matches target data type
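
A minimal sketch of why gene-pair features tolerate platform differences: each feature only records which of two genes is higher within a cell, so any monotone rescaling of the values leaves the features unchanged. The gene list, the `top_pair_features` helper, and the simulated second platform are assumptions for illustration, not SingleCellNet's code.

```python
from itertools import combinations

def top_pair_features(expression, gene_pairs):
    """Binary feature per pair: 1 if the first gene exceeds the second."""
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]

genes = ["CD3E", "MS4A1", "LYZ"]
# All pairs for the toy example; in practice only discriminative pairs are kept
pairs = list(combinations(genes, 2))

platform_a = {"CD3E": 120.0, "MS4A1": 3.0, "LYZ": 40.0}
# Same cell measured on a hypothetical platform with a different, monotone scale
platform_b = {g: (v ** 0.5) * 7 for g, v in platform_a.items()}

features_a = top_pair_features(platform_a, pairs)
features_b = top_pair_features(platform_b, pairs)
print(features_a, features_b)
```

These binary vectors would then be the input to a classifier such as a random forest.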
CellTypist Best for: Immune cell typing 2022
Science | Wellcome Sanger Institute
Immune Cells Cross-Tissue 360K Cells 101 Types
Fast logistic regression trained on cross-tissue immune atlas (360K cells, 16 tissues, 101 cell types). Continuously updatable as new data becomes available.
  • F1: 0.95 (high-hierarchy labels) / 0.89 (low-hierarchy labels)
  • Hierarchical: 32 broad → 91 fine types
  • SGD training - fast and scalable
  • Currently optimized primarily for immune cells
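
To illustrate what "logistic regression with SGD training" means, the sketch below trains a binary (one cell type versus the rest) logistic classifier with per-sample gradient updates in plain Python. It is not CellTypist's code or API; the function names, learning rate, epoch count, and toy data are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Binary logistic regression fitted by stochastic gradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit samples in a new random order each epoch
        for i in idx:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log-loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict_proba(x, w, b):
    """Probability that a cell belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy training set: two genes per cell, label 1 = T cell, 0 = other
X = [[3.0, 0.1], [2.5, 0.0], [0.2, 2.8], [0.0, 3.1]]
y = [1, 1, 0, 0]
w, b = train_logistic_sgd(X, y)

print(round(predict_proba([2.8, 0.2], w, b), 3))
print(round(predict_proba([0.1, 3.0], w, b), 3))
```

Because each update touches a single cell, training cost grows linearly with the number of cells, which is what makes this family of models practical on large atlases.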
scDeepSort Best for: Reference-free annotation 2021
Nucleic Acids Research | Zhejiang University
Pre-trained GNN 265K Cells
First pre-trained graph neural network for cell type annotation. Treats cells and genes as graph nodes with expression as weighted edges. No additional reference required at prediction time.
  • 83.79% accuracy across 265,489 cells
  • Weighted graph aggregator handles batch effects
  • Outperforms 12 existing methods
  • Pre-trained on specific cell types; may struggle with novel types
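
The cell-gene graph described above can be sketched as a weighted edge list, with a cell's representation computed as the expression-weighted mean of its neighbouring gene embeddings. This shows one generic aggregation step under assumed names and toy embeddings; it is not scDeepSort's trained aggregator.

```python
def build_cell_gene_edges(expression):
    """One weighted edge per non-zero (cell, gene) expression value."""
    return [
        (cell, gene, value)
        for cell, genes in expression.items()
        for gene, value in genes.items()
        if value > 0
    ]

def aggregate_cell(cell, edges, gene_embeddings):
    """Expression-weighted mean of the embeddings of a cell's gene neighbours."""
    neighbours = [(g, w) for c, g, w in edges if c == cell]
    total = sum(w for _, w in neighbours)
    dim = len(next(iter(gene_embeddings.values())))
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * gene_embeddings[g][d] for g, w in neighbours) / total
        for d in range(dim)
    ]

# Toy data: two cells, two genes, 2-dimensional gene embeddings (hypothetical)
expression = {
    "cell_1": {"CD3E": 3.0, "MS4A1": 1.0},
    "cell_2": {"CD3E": 0.0, "MS4A1": 2.0},
}
gene_embeddings = {"CD3E": [1.0, 0.0], "MS4A1": [0.0, 1.0]}

edges = build_cell_gene_edges(expression)
print(aggregate_cell("cell_1", edges, gene_embeddings))  # [0.75, 0.25]
print(aggregate_cell("cell_2", edges, gene_embeddings))  # [0.0, 1.0]
```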
scBERT Best for: Deep learning annotation 2022
Nature Machine Intelligence | Tencent
BERT Transformer Performer
BERT-based model adapted for scRNA-seq with Performer encoder. Two-stage pre-training + fine-tuning paradigm. Attention weights enable discovery of cell-type-specific genes.
  • Strong batch effect resistance
  • Interpretable attention for gene discovery
  • Novel cell type detection capability
  • Requires GPU for efficient training/inference
Concerto Best for: Million-scale mapping 2022
Nature Machine Intelligence | BGI
Contrastive Learning Self-Distillation Multi-Modal
Contrastive learning framework using asymmetric teacher-student self-distillation. An early application of contrastive learning to single-cell data. Supports multimodal RNA + protein integration.
  • Rapid mapping to million-scale atlases
  • NOTA (None-of-the-above) rejection for novel types
  • Hierarchical fine-grained annotation
  • Requires large training datasets for optimal performance
CellHint Best for: Cross-dataset standardization 2023
Cell | Wellcome Sanger Institute
Harmonization HCA PCT
Predictive clustering tree (PCT) tool for harmonizing cell type annotations across datasets. Creates hierarchical cell type relationships for standardized Human Cell Atlas integration.
  • Overcomes batch effects in cross-dataset comparisons
  • Annotation-aware data integration
  • From CellTypist team - compatible ecosystem
  • Requires multiple datasets with existing annotations

Foundation Models

Large-scale pre-trained models for general single-cell analysis

scGPT Best for: Multi-task learning 2024
Nature Methods | University of Toronto
Multi-Task Transformer 33M Cells Fine-Tunable
Foundation model trained on 33M human cells. Simultaneously learns cell and gene representations. Fine-tune for annotation, perturbation prediction, batch integration, and more.
  • Transformer with specialized attention masks
  • <cls> token for cell-level representation
  • Gene network inference from attention
  • Requires significant GPU resources for fine-tuning
scFoundation Best for: Low-depth data enhancement 2024
Nature Methods | BioMap
100M Params 50M Cells Low-Depth xTrimoGene
100M parameter model with xTrimoGene architecture and Read-Depth Aware (RDA) pre-training. Excels at gene expression enhancement and handles variable sequencing depths.
  • Asymmetric encoder-decoder (non-zero only)
  • RDA task handles low-depth data
  • Zero-shot expression enhancement
  • Very large model size makes local deployment challenging
UCE Best for: Zero-shot cross-species 2023
bioRxiv | Stanford & CZI
Cross-Species 650M Params 36M Cells ESM2
Universal Cell Embeddings using a "Bags of RNA" approach with ESM2 protein embeddings. Zero-shot cell type classification across species without homolog mapping. A 33-layer transformer trained on the Integrated Mega-scale Atlas.
  • No cell type annotations needed for training
  • Cross-species annotation without fine-tuning
  • 1280-dimensional universal embeddings
  • Currently transcriptomics only; preprint status
scInterpreter Best for: Gene knowledge integration 2024
arXiv Preprint | Chinese Academy of Sciences
LLM-Based Llama-13b GPT Embeddings Preprint
Adapts LLMs (Llama-13b) to interpret scRNA-seq using GPT-3.5 embeddings of NCBI gene descriptions. Bridges biological knowledge from language models with expression profiles.
  • LLM-encoded biological knowledge integration
  • Frozen LLM + lightweight projection layers
  • Gene-level semantic grounding via NCBI
  • Preprint; code not publicly available yet
AIDO.Cell Best for: Full 19K gene context 2024
NeurIPS | GenBio AI
Full Transcriptome 650M Params FlashAttention 19K Context
First single-cell FM to process the entire ~19K-gene transcriptome without truncation, using a dense Transformer with FlashAttention-2. Scaling study from 3M to 650M parameters on 50M cells.
  • No gene truncation/sampling - full context
  • Auto-discretization learns flexible expression embeddings
  • Read-depth aware pre-training
  • Pre-training reportedly used 256 H100 GPUs; inference is still resource-heavy
Nicheformer Best for: Spatial context prediction 2025
Nature Methods | Helmholtz Munich
Spatial + Dissociated 110M Cells Multi-Tech
First foundation model trained on both dissociated (57M) and spatial (53M) transcriptomics. Learns spatially-aware representations enabling transfer of spatial context to scRNA-seq.
  • Predicts cellular microenvironments from expression
  • SpatialCorpus-110M across multiple technologies
  • Transfers spatial annotations to dissociated data
  • Spatial predictions may not generalize to all tissue contexts

Developmental Potential & Specialized

Methods for predicting cell potency and developmental states

CytoTRACE 2 Best for: Developmental potential 2025
Nature Methods | Stanford
Interpretable Potency Cross-Dataset GSBN
Predicts absolute developmental potential (0-1 scale) and potency categories (totipotent → differentiated) using interpretable Gene Set Binary Networks. Works across datasets without batch correction.
  • 6 potency categories (totipotent to differentiated)
  • Binary weights enable gene set extraction
  • τ = 0.82 correlation with ground truth
  • Focuses on potency state, not specific cell type identity
TCellSI Best for: T cell functional states 2024
iMeta | Huazhong University of Science and Technology
T Cells State Inference 8 States R Package
Specialized tool for inferring 8 distinct T cell functional states from scRNA-seq: Quiescence, Regulating, Proliferation, Helper, Cytotoxicity, Progenitor exhaustion, Terminal exhaustion, and Senescence.
  • 8 curated T cell functional state signatures
  • Mann-Whitney U statistics for robust scoring
  • Works with bulk RNA-seq and scRNA-seq
  • Specific to T cells; not generalizable to other cell types
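
A rank-based state score in the spirit of the Mann-Whitney U bullet above can be sketched as follows: within one sample, signature genes are compared against all remaining genes, and the U statistic is scaled to the 0-1 range. The exact TCellSI formula is not reproduced here; the helper names and the three-gene "cytotoxicity" signature are illustrative assumptions, not the tool's curated signatures.

```python
def mann_whitney_auc(signature_values, background_values):
    """U statistic of signature vs background, scaled to [0, 1].

    Counts the pairs in which the signature gene is higher (ties count 0.5).
    """
    u = 0.0
    for s in signature_values:
        for b in background_values:
            if s > b:
                u += 1.0
            elif s == b:
                u += 0.5
    return u / (len(signature_values) * len(background_values))

def state_score(expression, signature_genes):
    """Score one sample: how strongly signature genes outrank the other genes."""
    sig = [v for g, v in expression.items() if g in signature_genes]
    bg = [v for g, v in expression.items() if g not in signature_genes]
    return mann_whitney_auc(sig, bg)

# Toy log-expression profile of a single T cell
sample = {"GZMB": 8.0, "PRF1": 7.5, "NKG7": 9.0, "ACTB": 7.8,
          "SELL": 1.0, "TCF7": 0.5, "MKI67": 0.2}
cytotoxicity_signature = {"GZMB", "PRF1", "NKG7"}  # hypothetical signature

score = state_score(sample, cytotoxicity_signature)
print(round(score, 3))  # 0.917
```

Because the score depends only on within-sample ranks, it can be computed the same way for a single cell or a bulk sample.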

Which Tool Should I Use?

Have a Reference Atlas?

  • Seurat user: Azimuth
  • Immune cells: CellTypist
  • Bulk reference: SingleR
  • Fast/simple: scmap

Have Marker Genes?

  • Multi-batch data: CellAssign
  • Simple/quick: SCINA
  • Uncertain cells: CHETAH

Multiple Tasks / Foundation Models?

  • Annotation + Integration + Perturbation: scGPT
  • Low-depth data: scFoundation
  • Cross-species zero-shot: UCE
  • Full 19K gene context: AIDO.Cell
  • Spatial + scRNA-seq: Nicheformer

Specialized Cell Types/States?

  • Developmental potency: CytoTRACE 2
  • T cell functional states: TCellSI
  • Cross-dataset harmonization: CellHint
  • Million-scale mapping: Concerto
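
The decision guide above can also be written as a simple lookup. The `suggest_tool` helper and its lowercase keys are hypothetical conveniences for this page; the need-to-tool pairs are taken directly from the lists above.

```python
def suggest_tool(need):
    """Return the tool this guide suggests for a stated need, or None."""
    guide = {
        # Have a reference atlas?
        "seurat user": "Azimuth",
        "immune cells": "CellTypist",
        "bulk reference": "SingleR",
        "fast/simple": "scmap",
        # Have marker genes?
        "multi-batch data": "CellAssign",
        "simple/quick": "SCINA",
        "uncertain cells": "CHETAH",
        # Multiple tasks / foundation models?
        "annotation + integration + perturbation": "scGPT",
        "low-depth data": "scFoundation",
        "cross-species zero-shot": "UCE",
        "full 19k gene context": "AIDO.Cell",
        "spatial + scrna-seq": "Nicheformer",
        # Specialized cell types/states?
        "developmental potency": "CytoTRACE 2",
        "t cell functional states": "TCellSI",
        "cross-dataset harmonization": "CellHint",
        "million-scale mapping": "Concerto",
    }
    return guide.get(need.strip().lower())

print(suggest_tool("Immune cells"))    # CellTypist
print(suggest_tool("Bulk reference"))  # SingleR
```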