Single-Cell Quality Control & Preprocessing
A Comprehensive Methods Repository (2018-2026)
From raw sequencing data to analysis-ready matrices. Whether you need to detect doublets, remove ambient RNA contamination, impute missing values, or leverage foundation models for quality control, find the right computational tool for your single-cell preprocessing workflow.
Core QC Pipelines & Frameworks
SCTK (singleCellTK)
Comprehensive R/Bioconductor pipeline integrating 7 doublet detection and 3 decontamination algorithms into a unified framework. Features automated parameter optimization, consensus-based filtering, and supports both interactive (Shiny GUI) and command-line workflows.
Doublet Detection
What a doublet is — and how simulation catches it
Two cells captured in one droplet share a barcode, so they look like a single cell with a blended profile. Detectors spot them by simulating artificial doublets and asking which real cells resemble them.
Key insight: tools like Scrublet and DoubletFinder generate synthetic doublets by combining random pairs of real cells (Scrublet sums their counts; DoubletFinder averages their profiles), embed everything together, and score each real cell by how many simulated doublets sit nearby. Cells in dense simulated-doublet neighborhoods are flagged and removed before analysis.
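The simulate-and-score idea can be sketched in a few lines of NumPy. This is a toy illustration, not the Scrublet or DoubletFinder implementation: the function name simulated_doublet_scores, the brute-force neighbor search (real tools use PCA plus a proper kNN index), and the two-cell-type toy data are all assumptions made for the example.

```python
import numpy as np

def simulated_doublet_scores(counts, n_sim=None, k=10, seed=0):
    """Fraction of simulated doublets among each real cell's k nearest neighbors."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    n_sim = n_sim or n_cells
    # Simulate doublets by summing the counts of random pairs of real cells.
    i = rng.integers(0, n_cells, n_sim)
    j = rng.integers(0, n_cells, n_sim)
    both = np.vstack([counts, counts[i] + counts[j]]).astype(float)
    # Depth-normalise and log-transform so library size alone does not drive distances.
    both = np.log1p(both / both.sum(axis=1, keepdims=True) * 1e4)
    is_sim = np.r_[np.zeros(n_cells, bool), np.ones(n_sim, bool)]
    # Brute-force squared Euclidean distances from every real cell to every profile.
    real = both[:n_cells]
    d2 = ((real[:, None, :] - both[None, :, :]) ** 2).sum(-1)
    d2[np.arange(n_cells), np.arange(n_cells)] = np.inf  # a cell is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    return is_sim[nn].mean(axis=1)

# Toy data (hypothetical): two cell types with opposite marker genes,
# plus five true doublets formed from one cell of each type.
rng = np.random.default_rng(1)
type_a = rng.poisson([20, 20, 20, 1, 1, 1], size=(40, 6))
type_b = rng.poisson([1, 1, 1, 20, 20, 20], size=(40, 6))
true_doublets = type_a[:5] + type_b[:5]
counts = np.vstack([type_a, type_b, true_doublets])

scores = simulated_doublet_scores(counts, k=10)
print("mean score, singlets:", scores[:80].mean())
print("mean score, true doublets:", scores[80:].mean())
```

The true doublets sit in the same region as the heterotypic simulated doublets, so their scores should come out well above those of the singlets. Homotypic doublets (two cells of the same type) are not separated this way, which is a known limitation of simulation-based detectors.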
DoubletFinder
Identifies doublets by generating artificial doublets and computing, for each real cell, the proportion of artificial nearest neighbors (pANN) in gene expression space. A parameter sweep over pN (the proportion of artificial doublets generated) and pK (the neighborhood size) calibrates the method to each dataset. Integrated into Seurat workflows.
Ambient RNA & Contamination Removal
SoupX
Estimates and removes ambient RNA contamination using empty droplet profiles. Uses cluster-specific marker genes to estimate the contamination fraction (by default a single global rate per channel rather than a per-cell value). Works with 10x Genomics data and integrates with Seurat.
DecontX
Bayesian method for decontamination using variational inference. Models each cell's expression as a mixture of native and contaminating RNA. Provides uncertainty estimates and does not require user-supplied cluster labels: when none are given, it derives them internally.
scCLEAN
Molecular (wet-lab) method, not a computational tool. Single-cell CRISPRclean uses CRISPR/Cas9 to target and remove highly abundant, uninformative transcripts (both genomic and transcriptomic) from prepared cDNA before sequencing, redistributing roughly half of reads toward lower-abundance transcripts. Improves detection of rare/low-expression genes in a way that cannot be matched by deeper sequencing alone; complements — rather than replaces — computational decontamination.
Ambient RNA Analysis (snRNA-seq)
Comprehensive analysis of ambient RNA contamination specifically in single-nucleus RNA-seq (snRNA-seq). Provides guidelines for contamination assessment and benchmarks decontamination tools for nuclear preparations.
Feature Selection & Dimensionality Reduction
DELVE
Dynamic feature selection method that identifies genes with coherent expression patterns across local cell neighborhoods. Uses Laplacian scores to select features that capture continuous biological processes like differentiation trajectories.
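To make the Laplacian-score idea concrete, the sketch below ranks features by how smoothly they vary over a cell-cell kNN graph (lower score means smoother). It is a generic Laplacian score, not DELVE's code: the function name laplacian_scores is hypothetical, and the graph is built here from a known one-dimensional trajectory coordinate as a stand-in for the graph DELVE derives from its own seed features.

```python
import numpy as np

def laplacian_scores(features, graph_coords, k=5):
    """Laplacian score of each feature column on a symmetrised kNN graph."""
    n = features.shape[0]
    # Build a binary kNN adjacency matrix from the supplied coordinates.
    d2 = ((graph_coords[:, None, :] - graph_coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)          # make the graph undirected
    deg = W.sum(axis=1)
    L = np.diag(deg) - W            # unnormalised graph Laplacian
    scores = np.empty(features.shape[1])
    for g, f in enumerate(features.T):
        f = f - (f @ deg) / deg.sum()               # remove the degree-weighted mean
        scores[g] = (f @ L @ f) / (f @ (deg * f))   # local variation / global variance
    return scores

# Toy data (hypothetical): gene 0 changes smoothly along a trajectory, gene 1 is noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 60)
features = np.column_stack([t, rng.normal(size=60)])

scores = laplacian_scores(features, t[:, None])
print(dict(zip(["smooth_gene", "noise_gene"], scores.round(3))))
```

The smooth gene receives a score close to zero while the noise gene scores near one, so a selection step would keep the former.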
Foundation Models for Single-Cell Analysis
scPRINT
Large foundation model pre-trained on 50M+ cells for diverse single-cell tasks. Provides cell and gene embeddings enabling zero-shot annotation, denoising, and batch correction. Supports multi-task learning across preprocessing applications.
SIGnature
Applies gradient-based attribution methods (Integrated Gradients, Input×Gradient, DeepLIFT) to scRNA-seq foundation models to score per-gene importance at the single-cell level. Attribution scores are strikingly robust to the technical artifacts that QC tries to manage: they correlate far less with per-cell UMI counts and library complexity than raw expression, automatically demote ribosomal and mitochondrial genes, and remain 93% stable under 50% simulated dropout. This makes SIGnature's gene-importance signal substantially less confounded by sequencing depth, dropout events, and ambient contamination than log-normalized counts alone.
- Attributions promote cell-type marker genes and lineage TFs (GATA3, RORC, FOXP3) while demoting ribo/mito genes vs expression (Wilcoxon P<0.01)
- Top-ranked genes 93% stable under 50% dropout simulation — robust to the very missingness QC filters aim to reduce
- Attribution-based NMF gene programs less batch/study-specific than expression-based; FOXP3 a top Treg gene in 98% of attribution configs vs 4% for expression
- Attribution signature scoring tops F1 in 23/32 tasks, beating ANS, Scanpy, UCell, JASMINE, and mean expression
- Scales to ~22 million cells / 412 studies in minutes; enables large-scale QC-aware gene-program discovery
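The contrast between expression and attribution can be illustrated with Input×Gradient on a deliberately tiny model. Everything here is an assumption for illustration: a hand-set logistic "Treg score" stands in for a foundation model, the weights and expression values are invented, and RPL13 and MT-CO1 are simply example ribosomal and mitochondrial gene names.

```python
import numpy as np

def input_x_gradient(x, w, b=0.0):
    """Input x Gradient attribution for the model p = sigmoid(w . x + b)."""
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    grad = p * (1.0 - p) * w   # derivative of the output with respect to each input
    return x * grad

genes = ["RPL13", "MT-CO1", "FOXP3"]
x = np.array([8.0, 6.0, 1.5])   # hypothetical log-expression of one cell
w = np.array([0.0, 0.0, 2.0])   # the toy model only relies on FOXP3
attr = input_x_gradient(x, w, b=-2.0)

print("top gene by expression: ", genes[int(np.argmax(x))])
print("top gene by attribution:", genes[int(np.argmax(attr))])
```

The highly expressed ribosomal and mitochondrial genes dominate the expression ranking but receive zero attribution because the model's output does not depend on them, which is the qualitative behavior the bullets above describe.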
Metacell & Statistical Methods
MetaQ
Metacell inference via deep single-cell quantization. Quantizes cells into a discrete codebook where each entry (a metacell) can reconstruct the similar cells it represents, reducing complexity from exponential to linear with constant memory. Scales to arbitrarily large datasets and supports both uni- and multi-omics data while preserving biological heterogeneity.
Data Repositories & Standardization
scBaseCount (formerly scBaseCamp)
AI agent-curated, continually expanding repository — now over 500M uniformly processed cells across 27 organisms and 75 tissues (the largest freely accessible single-cell collection). An AI agent (SRAgent) automates dataset discovery and metadata extraction from the SRA, and the scRecounter Nextflow pipeline performs standardized reprocessing. Released as part of the Arc Virtual Cell Atlas. Note: first circulated under the name "scBaseCamp" (230M cells); later renamed scBaseCount as it expanded.
⚖️ Side-by-Side: Ambient RNA Removal — Subtraction vs Bayesian Mixture
SoupX and DecontX both correct ambient RNA contamination in droplet-based scRNA-seq, but they diverge on the fundamental statistical question: should contamination be estimated via cluster-driven marker subtraction (SoupX) or inferred as a per-cell Bayesian mixture of native and contaminating transcriptomes (DecontX)?
The same starting point (a raw UMI count matrix plus an empty-droplet "soup" profile) can yield two very different contamination estimates depending on the modeling assumption.
The split traces back to one root decision: global rate subtraction vs per-cell mixture inference. SoupX is faster and integrates directly into Seurat clustering workflows, but its accuracy depends on the quality of the cluster assignment used to select marker genes; DecontX does not need user-supplied cluster labels and returns a per-cell contamination fraction (ε) that can itself be used as a QC covariate.
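A toy calculation shows why the two assumptions diverge. The sketch below is neither SoupX nor DecontX: subtract_global and per_cell_fraction are hypothetical helpers, the native profile is assumed known (DecontX infers it), and the per-cell fraction is found by a simple grid search over a multinomial likelihood rather than variational inference.

```python
import numpy as np

def subtract_global(counts, soup, rho):
    """Remove a fixed fraction rho of each cell's UMIs, spread according to the soup profile."""
    ambient = rho * counts.sum(axis=1, keepdims=True) * soup
    return np.clip(counts - ambient, 0, None)

def per_cell_fraction(cell, native, soup, grid=np.linspace(0, 0.99, 100)):
    """Mixing weight that maximises the multinomial log-likelihood of one cell."""
    mix = (1 - grid)[:, None] * native + grid[:, None] * soup
    loglik = (cell * np.log(mix)).sum(axis=1)
    return grid[np.argmax(loglik)]

soup = np.array([0.10, 0.10, 0.80])     # ambient profile (gene 3 dominates the soup)
native = np.array([0.60, 0.39, 0.01])   # assumed-known native profile of the cell type
# Two cells of 1,000 UMIs each, built with 10% and 40% contamination respectively.
cells = np.array([[550.0, 361.0, 89.0],
                  [400.0, 274.0, 326.0]])

corrected = subtract_global(cells, soup, rho=0.25)   # one global rate for both cells
fractions = [per_cell_fraction(c, native, soup) for c in cells]

print("gene 3 after global subtraction:", corrected[:, 2])
print("per-cell contamination fractions:", fractions)
```

With a single global rate of 25%, the lightly contaminated cell is over-corrected (its gene 3 counts are clipped to zero) and the heavily contaminated one keeps most of its ambient signal, whereas the per-cell estimate recovers the two different fractions.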
Quick Reference: Choosing the Right Tool
By Preprocessing Task
- Doublet Removal: DoubletFinder (Seurat) or Scrublet (Scanpy)
- Ambient RNA: SoupX (simple), DecontX (Bayesian) — for molecular depletion of abundant transcripts, see scCLEAN
- Feature Selection: DELVE for trajectory-preserving gene selection
- Comprehensive QC: SCTK for unified pipeline with multiple algorithms
By Data Scale
- <100K cells: Standard tools (DoubletFinder, SoupX)
- 100K-1M cells: SnapATAC2, Scrublet
- >1M cells: scBaseCount ecosystem, scPRINT foundation model
By Technology
- 10x scRNA-seq: Full toolkit support
- snRNA-seq: Ambient RNA guidelines, specialized parameters
- scATAC-seq: SnapATAC2
🛠️ Hands-On Practice
The steps below walk through a minimal but realistic single-cell QC run using Python/Scanpy — from a cell-called 10x Genomics matrix to a clean, doublet-filtered AnnData object ready for clustering. All code runs in a standard conda environment and is designed to be self-contained.
Environment & packages
Install the core Python QC stack into a fresh environment. Scrublet handles doublet detection; miQC (via rpy2) or adaptive MAD thresholds replace hard cutoffs for cell filtering.
# conda / mamba recommended
conda create -n scqc python=3.10 -y
conda activate scqc
pip install scanpy anndata scrublet scvi-tools doubletdetection
# optional ambient-RNA correction (R bridge)
# pip install rpy2
# R: install.packages(c("SoupX", "BiocManager")); BiocManager::install("celda") # celda (Bioconductor) provides decontX
Hardware. A standard laptop (8 GB RAM) handles datasets up to ~50k cells; for 100k+ cells use a compute node with 32–64 GB RAM. A GPU only matters if you run scVI-SOLO doublet detection, which trains a neural network and is slow on CPU.
Data structures & formats
- AnnData: central object with adata.X (count matrix), adata.obs (per-cell metadata), adata.var (per-gene metadata)
- Per-cell QC metrics computed by sc.pp.calculate_qc_metrics(): n_genes_by_counts, total_counts, pct_counts_mt
- Raw vs filtered matrix: 10x output contains raw_feature_bc_matrix/ (all barcodes) and filtered_feature_bc_matrix/ (cell-called barcodes); load raw for ambient correction, filtered for standard QC
- Doublet scores: stored in adata.obs["doublet_score"] and adata.obs["predicted_doublet"] after Scrublet
- h5ad: HDF5-backed AnnData format; save with adata.write_h5ad() for downstream sharing
Minimal code walkthrough
Load the cell-called (filtered) 10x matrix, apply a basic floor filter, annotate mitochondrial genes, compute QC metrics, visualise distributions, apply log-space MAD adaptive thresholds, run Scrublet, and write a clean h5ad.
import scanpy as sc
import numpy as np
from scipy.stats import median_abs_deviation
# 1. Load the CELL-CALLED matrix for standard QC.
# Use filtered_feature_bc_matrix (CellRanger's called cells). The
# raw_feature_bc_matrix holds ALL barcodes (mostly empty droplets) and is
# only needed for ambient-RNA correction (SoupX / decontX), not cell QC.
# read_10x_mtx loads genes x barcodes and returns it as cells x genes.
adata = sc.read_10x_mtx(
    "filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True,
)
adata.var_names_make_unique()
# 2. Basic floor filter: drop near-empty cells and never-detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# 3. Annotate mitochondrial genes (human "MT-"; mouse uses lowercase "mt-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# 4. Compute per-cell QC metrics (log1p=True adds the log-scaled columns used below)
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=True, inplace=True
)
# 5. Visualise distributions (spot outlier thresholds)
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
             jitter=0.4, multi_panel=True)
# 6. MAD-based adaptive filtering. Count metrics are right-skewed, so threshold
# on the log1p values; 5 MADs from the median is the common sc convention.
def is_outlier(metric, n_mads=5):
    med = np.median(metric)
    mad = median_abs_deviation(metric)
    return (metric < med - n_mads * mad) | (metric > med + n_mads * mad)
outlier = (
    is_outlier(adata.obs["log1p_total_counts"]) |
    is_outlier(adata.obs["log1p_n_genes_by_counts"])
)
keep = ~outlier & (adata.obs["pct_counts_mt"] < 20) # hard upper cap for mito
adata = adata[keep].copy()
print(f"Cells after QC filter: {adata.n_obs}")
# 7. Doublet detection with Scrublet (scanpy wrapper). Needs RAW counts —
# adata.X has not been normalised yet. Set expected_doublet_rate from the
# 10x loading chart (roughly 0.8% per 1,000 cells recovered).
sc.pp.scrublet(adata, expected_doublet_rate=0.06)
adata = adata[~adata.obs["predicted_doublet"]].copy()
print(f"Cells after doublet removal: {adata.n_obs}")
# 8. Save QC-filtered object
adata.write_h5ad("adata_qc.h5ad")
print("Saved adata_qc.h5ad")
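The expected_doublet_rate passed in step 7 follows from the rule of thumb quoted in the code comment (roughly 0.8% doublets per 1,000 cells recovered). A small helper makes the arithmetic explicit; the function name and the default slope are illustrative, and the 10x loading chart for your chemistry remains the authoritative source.

```python
def expected_doublet_rate(cells_recovered, rate_per_1000=0.008):
    """Linear rule of thumb: about 0.8% doublets per 1,000 cells recovered."""
    return rate_per_1000 * cells_recovered / 1000

# Example recoveries (hypothetical run sizes).
for n_cells in (3000, 7500, 10000):
    print(n_cells, round(expected_doublet_rate(n_cells), 3))
```

A run that recovers about 7,500 cells gives the 0.06 used in the walkthrough; recompute the value per sample rather than reusing one number across runs with different loading.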
Common pitfalls & tips
- One-size-fits-all thresholds hurt. A hard cutoff of >200 genes / <5% mito fails for hepatocytes (high mito) or RBCs (few genes). Use MAD-based adaptive thresholds per sample or per cell type.
- Ambient correction before cell filtering. Run SoupX or decontX on the raw (unfiltered) matrix first; correcting already-filtered data underestimates the ambient profile from empty droplets.
- Doublet rate scales with cell loading. Scrublet's default expected doublet rate (~5%) underestimates for high-density 10x runs (>10k cells loaded); set expected_doublet_rate explicitly based on the 10x loading chart.
- Batch-wise QC is essential. Pool-level thresholds mask lane-specific technical artifacts; run QC independently per sample/batch before integration.
- Do not remove biologically high-mito cells blindly. Cardiomyocytes, platelets, and cells under oxidative stress have genuinely high mitochondrial fractions; inspect marker genes before discarding.
- Scrublet needs unnormalised counts. Always pass raw integer counts (adata.X before sc.pp.normalize_total) to Scrublet or SOLO; normalised or log-transformed input distorts the simulated doublet distribution.
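The batch-wise tip can be illustrated with the same MAD rule applied per sample instead of on the pooled cells. This is a minimal NumPy sketch with invented values: the helper names and the two toy samples are assumptions, and in a real run the inputs would be columns such as adata.obs["log1p_total_counts"] and a sample identifier column.

```python
import numpy as np

def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > n_mads * mad

def per_sample_outliers(values, sample_ids, n_mads=5.0):
    """Apply the MAD rule separately within each sample."""
    values = np.asarray(values, dtype=float)
    sample_ids = np.asarray(sample_ids)
    flagged = np.zeros(values.shape[0], dtype=bool)
    for s in np.unique(sample_ids):
        idx = sample_ids == s
        flagged[idx] = mad_outliers(values[idx], n_mads)
    return flagged

# Toy log-counts (hypothetical): sample A is sequenced deeper than sample B,
# and each sample contains one aberrant cell.
log_counts = np.array([8.0, 8.1, 7.9, 8.2, 8.05, 5.0,
                       6.0, 6.1, 5.9, 6.2, 6.05, 9.5])
samples = np.array(["A"] * 6 + ["B"] * 6)

pooled = mad_outliers(log_counts)
per_sample = per_sample_outliers(log_counts, samples)
print("flagged when pooled:    ", int(pooled.sum()))
print("flagged per sample:     ", int(per_sample.sum()))
```

Pooling the two samples inflates the MAD because of the depth difference between them, so neither aberrant cell is flagged; thresholding within each sample catches both.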