Single-Cell Quality Control & Preprocessing

Core QC Pipelines & Frameworks

Goal: "I need a comprehensive, end-to-end quality control workflow that handles multiple QC tasks in a unified framework."

SCTK (singleCellTK)

2022 QC Pipeline Nat Commun

Comprehensive R/Bioconductor pipeline integrating 7 doublet detection and 3 decontamination algorithms into a unified framework. Features automated parameter optimization, consensus-based filtering, and supports both interactive (Shiny GUI) and command-line workflows.

Paper GitHub

Reference Optimization

2023 Alignment Nat Methods

Systematic study showing that reference selection significantly impacts scRNA-seq analysis. Demonstrates how to optimize gene annotations and transcriptome models for improved quantification accuracy, especially for intron-containing reads.

Paper GitHub

Doublet Detection

What a doublet is — and how simulation catches it

Two cells captured in one droplet share a barcode, so they look like a single cell with a blended profile. Detectors spot them by simulating artificial doublets and asking which real cells resemble them.

Key insight: tools like Scrublet and DoubletFinder generate synthetic doublets by combining (summing or averaging) the expression profiles of random pairs of real cells, embed real and simulated cells together, and score each real cell by how many simulated doublets sit nearby. Cells in dense simulated-doublet neighborhoods are flagged and removed before analysis.
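The simulate-embed-score idea above can be sketched in a few lines of NumPy. This is a minimal illustration on toy Poisson data, not the Scrublet or DoubletFinder implementation: the function name, the defaults (two simulated doublets per real cell, k=10, two principal components), and the brute-force neighbour search are all assumptions chosen for readability.

```python
import numpy as np

def simulated_doublet_scores(counts, n_sim=None, k=10, n_pcs=2, seed=0):
    """Fraction of simulated doublets among each real cell's k nearest neighbours."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    n_sim = n_sim or 2 * n_cells  # hypothetical default, not taken from any tool

    # 1. Simulate doublets by summing the raw counts of random cell pairs.
    pairs = rng.integers(0, n_cells, size=(n_sim, 2))
    doublets = counts[pairs[:, 0]] + counts[pairs[:, 1]]

    # 2. Depth-normalise, log-transform, and embed real + simulated cells with PCA.
    combined = np.vstack([counts, doublets]).astype(float)
    combined = combined / combined.sum(axis=1, keepdims=True) * 1e4
    combined = np.log1p(combined)
    combined -= combined.mean(axis=0)
    _, _, vt = np.linalg.svd(combined, full_matrices=False)
    embedding = combined @ vt[:n_pcs].T

    # 3. For every real cell, count simulated doublets among its k nearest
    #    neighbours (brute-force distances; the cell itself is excluded).
    is_sim = np.arange(n_cells + n_sim) >= n_cells
    real = embedding[:n_cells]
    d2 = ((real[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=2)
    d2[np.arange(n_cells), np.arange(n_cells)] = np.inf
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    return is_sim[nn].mean(axis=1)

# Toy data: two cell types with opposite gene programs plus 10 real doublets.
rng = np.random.default_rng(1)
rates_a = np.r_[np.full(20, 5.0), np.full(20, 0.2)]
rates_b = rates_a[::-1]
type_a = rng.poisson(rates_a, size=(100, 40))
type_b = rng.poisson(rates_b, size=(100, 40))
real_doublets = rng.poisson(rates_a, size=(10, 40)) + rng.poisson(rates_b, size=(10, 40))
counts = np.vstack([type_a, type_b, real_doublets])

scores = simulated_doublet_scores(counts)
print("mean singlet score:", scores[:200].mean())
print("mean doublet score:", scores[200:].mean())
```

Real tools replace the brute-force distance matrix with approximate nearest-neighbour search and convert the raw neighbour fraction into a calibrated score before thresholding.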

Goal: "I need to identify and remove multiplets (droplets containing two or more cells) from my single-cell data to ensure analysis quality."

DoubletFinder

2019 Doublet Detection Cell Systems

Identifies doublets by generating artificial doublets and scoring each real cell by its proportion of artificial nearest neighbors (pANN) in gene expression space. Supports parameter sweeps over pN and pK for dataset-specific calibration. Integrated into Seurat workflows.

Paper GitHub

Scrublet

2019 Doublet Detection Cell Systems

Python-based doublet detection using simulated doublet enrichment analysis. Computes doublet scores based on k-nearest neighbor density in PCA space. Fast, scalable, and well-integrated with Scanpy ecosystem.

Paper GitHub

Ambient RNA & Contamination Removal

Goal: "I need to remove cell-free RNA contamination (soup) from droplet-based single-cell data to improve signal-to-noise ratio."

SoupX

2020 Ambient RNA GigaScience

Estimates and removes ambient RNA contamination using empty droplet profiles. Uses cluster-specific marker genes to estimate the contamination fraction (by default a single global value per channel) and then subtracts the expected ambient counts from each cell. Works with 10x Genomics data and integrates with Seurat.

Paper GitHub

DecontX

2020 Ambient RNA Genome Biology

Bayesian method for decontamination using variational inference. Models each cell's expression as a mixture of native and contaminating RNA. Returns a per-cell contamination estimate and does not require the user to supply cluster labels in advance.

Paper GitHub

scCLEAN

2025 Molecular / Library Prep Nat Commun

Molecular (wet-lab) method, not a computational tool. Single-cell CRISPRclean uses CRISPR/Cas9 to target and remove highly abundant, uninformative transcripts (both genomic and transcriptomic) from prepared cDNA before sequencing, redistributing roughly half of reads toward lower-abundance transcripts. Improves detection of rare/low-expression genes in a way that cannot be matched by deeper sequencing alone; complements — rather than replaces — computational decontamination.

Paper GitHub

Ambient RNA Analysis (snRNA-seq)

2022 Review/Analysis Neuron

Comprehensive analysis of ambient RNA contamination specifically in single-nucleus RNA-seq (snRNA-seq). Provides guidelines for contamination assessment and benchmarks decontamination tools for nuclear preparations.

Paper

Feature Selection & Dimensionality Reduction

Goal: "I need to select informative features or reduce dimensionality while preserving biological signal for downstream analysis."

DELVE

2024 Feature Selection Nat Commun

Dynamic feature selection method that identifies genes with coherent expression patterns across local cell neighborhoods. Uses Laplacian scores to select features that capture continuous biological processes like differentiation trajectories.
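To make the Laplacian-score idea concrete, here is a minimal NumPy sketch of the classic Laplacian score computed on a k-nearest-neighbour graph of cells. It is not the DELVE implementation: the function name, the unweighted symmetric kNN graph, and the toy pseudotime coordinate are assumptions for illustration. A low score means the feature varies smoothly across neighbouring cells.

```python
import numpy as np

def laplacian_scores(features, graph_coords, k=5):
    """Laplacian score of each feature column on a symmetric kNN graph (lower = smoother)."""
    n = graph_coords.shape[0]

    # Build an unweighted, symmetrised kNN adjacency matrix from the coordinates.
    d2 = ((graph_coords[:, None, :] - graph_coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)

    deg = W.sum(axis=1)
    L = np.diag(deg) - W  # graph Laplacian

    scores = np.empty(features.shape[1])
    for j in range(features.shape[1]):
        f = features[:, j]
        f = f - (f @ deg) / deg.sum()              # remove the degree-weighted mean
        scores[j] = (f @ L @ f) / (f @ (deg * f))  # smoothness / weighted variance
    return scores

# Toy example: one gene changes smoothly along a trajectory, one is pure noise.
rng = np.random.default_rng(0)
pseudotime = np.sort(rng.uniform(0, 1, 200))
smooth_gene = np.sin(2 * np.pi * pseudotime)
noise_gene = rng.normal(size=200)
features = np.column_stack([smooth_gene, noise_gene])

scores = laplacian_scores(features, pseudotime[:, None], k=5)
print("smooth gene:", scores[0], "noise gene:", scores[1])
```

Ranking genes by ascending score keeps those that track the trajectory and drops those that fluctuate independently of the neighbourhood structure.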

Paper GitHub

SnapATAC2

2024 Dimensionality Reduction Nat Methods

Fast, scalable framework for single-cell ATAC-seq analysis with novel dimensionality reduction using spectral embedding. Handles millions of cells efficiently with out-of-core computation. Also applicable to RNA-seq data.

Paper GitHub

Foundation Models for Single-Cell Analysis

Goal: "I want to leverage large pre-trained models for zero-shot or transfer learning capabilities in quality control and preprocessing tasks."

scPRINT

2025 Foundation Model Nat Commun

Large foundation model pre-trained on 50M+ cells for diverse single-cell tasks. Provides cell and gene embeddings enabling zero-shot annotation, denoising, and batch correction. Supports multi-task learning across preprocessing applications.

Paper GitHub

SIGnature

2026 Explainable AI Nat Biotechnol

Applies gradient-based attribution methods (Integrated Gradients, Input×Gradient, DeepLIFT) to scRNA-seq foundation models to score per-gene importance at the single-cell level. Attribution scores are strikingly robust to the technical artifacts that QC tries to manage: they correlate far less with per-cell UMI counts and library complexity than raw expression, automatically demote ribosomal and mitochondrial genes, and remain 93% stable under 50% simulated dropout. This makes SIGnature's gene-importance signal substantially less confounded by sequencing depth, dropout events, and ambient contamination than log-normalized counts alone.

Attributions promote cell-type marker genes and lineage TFs (GATA3, RORC, FOXP3) while demoting ribo/mito genes vs expression (Wilcoxon P<0.01)
Top-ranked genes 93% stable under 50% dropout simulation — robust to the very missingness QC filters aim to reduce
Attribution-based NMF gene programs less batch/study-specific than expression-based; FOXP3 a top Treg gene in 98% of attribution configs vs 4% for expression
Attribution signature scoring tops F1 in 23/32 tasks, beating ANS, Scanpy, UCell, JASMINE, and mean expression
Scales to ~22 million cells / 412 studies in minutes; enables large-scale QC-aware gene-program discovery
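For readers unfamiliar with gradient-based attribution, the sketch below shows Input×Gradient and Integrated Gradients on a deliberately tiny stand-in model (a tanh of a weighted sum of three "genes"). It is not SIGnature's code and involves no foundation model: the model, weights, zero baseline, and midpoint integration are assumptions made purely to illustrate why attribution can differ from raw expression.

```python
import numpy as np

def model(x, w):
    """Toy scalar model standing in for a network output."""
    return np.tanh(x @ w)

def model_grad(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    return (1.0 - np.tanh(x @ w) ** 2) * w

def input_x_gradient(x, w):
    return x * model_grad(x, w)

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint-rule approximation of the path integral from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([model_grad(p, w) for p in path])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([1.5, -0.5, 0.0])   # the third gene has no influence on the output
x = np.array([1.0, 2.0, 3.0])    # ... even though it is the most highly expressed
baseline = np.zeros(3)

ig = integrated_gradients(x, baseline, w)
ixg = input_x_gradient(x, w)
print("expression:          ", x)
print("input x gradient:    ", ixg)
print("integrated gradients:", ig)
print("sum(IG) vs F(x)-F(baseline):", ig.sum(), model(x, w) - model(baseline, w))
```

The highly expressed but uninformative third gene receives zero attribution, and the Integrated Gradients values sum to the change in model output, which is the completeness property that makes them comparable across cells.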

Paper

Metacell & Statistical Methods

Goal: "I need to aggregate cells into metacells for noise reduction or perform robust statistical analysis on single-cell data."

MetaQ

2025 Metacell Nat Commun

Metacell inference via deep single-cell quantization. Quantizes cells into a discrete codebook where each entry (a metacell) can reconstruct the similar cells it represents, reducing complexity from exponential to linear with constant memory. Scales to arbitrarily large datasets and supports both uni- and multi-omics data while preserving biological heterogeneity.

Paper GitHub

Memento

2024 Statistical Method Cell

Method-of-moments framework for differential expression that accounts for technical noise and sampling variability. Provides rigorous statistical inference for mean and variance differences with proper false discovery control.

Paper GitHub

Data Repositories & Standardization

Goal: "I need access to large-scale, uniformly processed single-cell datasets or want to standardize my data processing pipeline."

scBaseCount (formerly scBaseCamp)

2025 Repository bioRxiv

AI agent-curated, continually expanding repository — now over 500M uniformly processed cells across 27 organisms and 75 tissues (the largest freely accessible single-cell collection). An AI agent (SRAgent) automates dataset discovery and metadata extraction from the SRA, and the scRecounter Nextflow pipeline performs standardized reprocessing. Released as part of the Arc Virtual Cell Atlas. Note: first circulated under the name "scBaseCamp" (230M cells); later renamed scBaseCount as it expanded.

Paper SRAgent scRecounter

⚖️ Side-by-Side: Ambient RNA Removal — Subtraction vs Bayesian Mixture

SoupX and DecontX both correct ambient RNA contamination in droplet-based scRNA-seq, but they diverge on the fundamental statistical question: should contamination be estimated via cluster-driven marker subtraction (SoupX) or inferred as a per-cell Bayesian mixture of native and contaminating transcriptomes (DecontX)?

The same starting point — a raw UMI count matrix plus an empty-droplet "soup" profile — yields two very different contamination estimates depending on the modeling assumption:

SoupX

cluster-marker subtraction (2020)

↓

Estimate global soup fraction (ρ)

use cluster-specific marker genes to infer a single contamination rate for the channel

↓

Subtract ρ × soup profile

scale expected ambient counts by gene, subtract from each cell; requires cluster pre-assignment

↓

corrected_counts ρ_per_cluster …

frequentist · requires clustering · gene-specific subtraction · Seurat-native

DecontX

Bayesian mixture model (2020)

↓

Model cell as mixture: native + ambient

each cell's counts = (1−ε)×native + ε×ambient; ε inferred per cell via variational Bayes

↓

Posterior estimate of true expression

ELBO-optimized latent topics; does not require pre-clustering; returns a per-cell contamination estimate ε

↓

decontX_counts ε_per_cell …

Bayesian · clustering-free · per-cell contamination estimate · Bioconductor-native

The split traces back to one root decision — global rate subtraction vs per-cell mixture inference. SoupX is faster and integrates directly into Seurat/Seurat-cluster workflows, but its accuracy depends on the quality of the cluster assignment used to select marker genes; DecontX's variational inference sidesteps that dependency and returns a per-cell contamination fraction (ε) that can itself be used as a QC covariate.
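The two modeling assumptions can be compared on toy numbers. The sketch below is only arithmetic on a three-gene example, assuming a known soup profile and contamination fraction; the function names are hypothetical, and neither SoupX's count-adjustment procedure nor DecontX's variational inference is reproduced here.

```python
import numpy as np

def subtract_ambient(counts, soup_profile, rho):
    """Subtraction view: remove rho * (cell total UMIs) * (soup fraction per gene)."""
    totals = counts.sum(axis=1, keepdims=True)
    expected_ambient = rho * totals * soup_profile
    return np.clip(counts - expected_ambient, 0, None)

def mixture_expectation(native_profile, soup_profile, eps, total):
    """Mixture view: expected counts = total * ((1 - eps) * native + eps * ambient)."""
    return total * ((1 - eps) * native_profile + eps * soup_profile)

soup = np.array([0.5, 0.3, 0.2])     # ambient expression fractions per gene
native = np.array([0.0, 0.5, 0.5])   # this cell truly expresses no gene 0

# A cell with 1,000 UMIs of which 10% are ambient.
observed = mixture_expectation(native, soup, eps=0.1, total=1000.0)
corrected = subtract_ambient(observed[None, :], soup, rho=0.1)

print("observed :", observed)    # gene 0 appears only because of the soup
print("corrected:", corrected)   # subtraction recovers the native counts
```

When the contamination fraction is correct, both views agree: subtracting ρ × soup inverts the mixture exactly. The practical difference lies in how the fraction is obtained, one value per channel from marker genes versus one value per cell from the fitted mixture.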

Quick Reference: Choosing the Right Tool

By Preprocessing Task

Doublet Removal: DoubletFinder (Seurat) or Scrublet (Scanpy)
Ambient RNA: SoupX (simple), DecontX (Bayesian); for molecular depletion of abundant transcripts, see scCLEAN
Feature Selection: DELVE for trajectory-preserving gene selection
Comprehensive QC: SCTK for unified pipeline with multiple algorithms

By Data Scale

<100K cells: Standard tools (DoubletFinder, SoupX)
100K-1M cells: SnapATAC2, Scrublet
>1M cells: scBaseCount ecosystem, scPRINT foundation model

By Technology

10x scRNA-seq: Full toolkit support
snRNA-seq: Ambient RNA guidelines, specialized parameters
scATAC-seq: SnapATAC2

🛠️ Hands-On Practice

The steps below walk through a minimal but realistic single-cell QC run using Python/Scanpy — from a cell-called 10x Genomics matrix to a clean, doublet-filtered AnnData object ready for clustering. All code runs in a standard conda environment and is designed to be self-contained.

Environment & packages

Install the core Python QC stack into a fresh environment. Scrublet handles doublet detection; miQC (via rpy2) or adaptive MAD thresholds replace hard cutoffs for cell filtering.

# conda / mamba recommended
conda create -n scqc python=3.10 -y
conda activate scqc

# scikit-image is used by sc.pp.scrublet (scanpy >= 1.10) for automatic thresholding
pip install scanpy anndata scrublet scikit-image scvi-tools doubletdetection
# optional ambient-RNA correction (R bridge)
# pip install rpy2
# R: install.packages("SoupX"); BiocManager::install("celda")   # celda (Bioconductor) provides decontX

Hardware. A standard laptop (8 GB RAM) handles datasets up to ~50k cells; for 100k+ cells use a compute node with 32–64 GB RAM. A GPU is not needed for the walkthrough below; it mainly matters if you run scVI-SOLO doublet detection, which trains a neural network and is slow on CPU.

Data structures & formats

AnnData — central object: adata.X (count matrix), adata.obs (per-cell metadata), adata.var (per-gene metadata)
Per-cell QC metrics computed by sc.pp.calculate_qc_metrics(): n_genes_by_counts, total_counts, pct_counts_mt
Raw vs filtered matrix — 10x output contains raw_feature_bc_matrix/ (all barcodes) and filtered_feature_bc_matrix/ (cell-called barcodes); load raw for ambient correction, filtered for standard QC
Doublet scores — stored in adata.obs["doublet_score"] and adata.obs["predicted_doublet"] after Scrublet
h5ad — HDF5-backed AnnData format; save with adata.write_h5ad() for downstream sharing

Minimal code walkthrough

Load the cell-called (filtered) 10x matrix, apply a basic floor filter, annotate mitochondrial genes, compute QC metrics, visualise distributions, apply log-space MAD adaptive thresholds, run Scrublet, and write a clean h5ad.

import scanpy as sc
import numpy as np
from scipy.stats import median_abs_deviation

# 1. Load the CELL-CALLED matrix for standard QC.
#    Use filtered_feature_bc_matrix (CellRanger's called cells). The
#    raw_feature_bc_matrix holds ALL barcodes (mostly empty droplets) and is
#    only needed for ambient-RNA correction (SoupX / decontX), not cell QC.
#    read_10x_mtx loads genes x barcodes and returns it as cells x genes.
adata = sc.read_10x_mtx(
    "filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True,
)
adata.var_names_make_unique()

# 2. Basic floor filter: drop near-empty cells and never-detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# 3. Annotate mitochondrial genes (human "MT-"; mouse uses lowercase "mt-")
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# 4. Compute per-cell QC metrics (log1p=True adds the log-scaled columns used below)
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=True, inplace=True
)

# 5. Visualise distributions (spot outlier thresholds)
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
             jitter=0.4, multi_panel=True)

# 6. MAD-based adaptive filtering. Count metrics are right-skewed, so threshold
#    on the log1p values; 5 MADs from the median is the common sc convention.
def is_outlier(metric, n_mads=5):
    med = np.median(metric)
    mad = median_abs_deviation(metric)
    return (metric < med - n_mads * mad) | (metric > med + n_mads * mad)

outlier = (
    is_outlier(adata.obs["log1p_total_counts"]) |
    is_outlier(adata.obs["log1p_n_genes_by_counts"])
)
keep = ~outlier & (adata.obs["pct_counts_mt"] < 20)   # hard upper cap for mito
adata = adata[keep].copy()
print(f"Cells after QC filter: {adata.n_obs}")

# 7. Doublet detection with Scrublet (scanpy wrapper). Needs RAW counts —
#    adata.X has not been normalised yet. Set expected_doublet_rate from the
#    10x loading chart (roughly 0.8% per 1,000 cells recovered).
sc.pp.scrublet(adata, expected_doublet_rate=0.06)
adata = adata[~adata.obs["predicted_doublet"]].copy()
print(f"Cells after doublet removal: {adata.n_obs}")

# 8. Save QC-filtered object
adata.write_h5ad("adata_qc.h5ad")
print("Saved adata_qc.h5ad")
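The value 0.06 passed to Scrublet in step 7 follows from the rule of thumb quoted in the comment (roughly 0.8% doublets per 1,000 cells recovered). A small helper makes that arithmetic explicit; the function name is hypothetical and the linear relationship is only an approximation of the 10x loading chart, so check the chart for your chemistry.

```python
def expected_doublet_rate(n_cells_recovered, rate_per_1000=0.008):
    """Linear rule of thumb: ~0.8% doublets per 1,000 cells recovered (assumed)."""
    return rate_per_1000 * n_cells_recovered / 1000.0

for n in (1_000, 5_000, 7_500, 10_000):
    print(f"{n:>6} cells recovered -> expected doublet rate {expected_doublet_rate(n):.3f}")
```

In the walkthrough this could replace the hard-coded value, for example by passing expected_doublet_rate(adata.n_obs) to sc.pp.scrublet for each sample.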

Common pitfalls & tips

One-size-fits-all thresholds hurt. A hard cutoff of >200 genes / <5% mito fails for hepatocytes (high mito) or RBCs (few genes). Use MAD-based adaptive thresholds per sample or per cell type.
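Ambient correction needs the raw matrix. SoupX estimates the soup profile from the empty droplets in the raw (unfiltered) matrix and corrects the cell-called barcodes, so supply both matrices; decontX can likewise take the raw matrix as background. Estimating the ambient profile from already-filtered data alone discards the empty droplets that define it.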
Doublet rate scales with cell loading. The scanpy wrapper's default expected_doublet_rate (0.05) underestimates the doublet fraction for high-density 10x runs (>10k cells loaded); set expected_doublet_rate explicitly based on the 10x loading chart.
Batch-wise QC is essential. Pool-level thresholds mask lane-specific technical artifacts; run QC independently per sample/batch before integration.
Do not remove biologically high-mito cells blindly. Cardiomyocytes, platelets, and cells under oxidative stress have genuinely high mitochondrial fractions; inspect marker genes before discarding.
Scrublet needs unnormalised counts. Always pass raw integer counts (adata.X before sc.pp.normalize_total) to Scrublet or SOLO; normalised or log-transformed input distorts the simulated doublet distribution.
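The batch-wise QC pitfall is easy to demonstrate. The sketch below applies the same 5-MAD rule used in the walkthrough, once per batch and once on the pooled values, to a toy vector of log-scaled total counts. The helper names and the numbers are made up for illustration; in practice the inputs would be a column of adata.obs and a sample label.

```python
import numpy as np

def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > n_mads * mad

def per_batch_outliers(values, batches, n_mads=5.0):
    """Apply the MAD rule separately within each batch label."""
    values = np.asarray(values, dtype=float)
    batches = np.asarray(batches)
    flags = np.zeros(values.shape[0], dtype=bool)
    for b in np.unique(batches):
        mask = batches == b
        flags[mask] = mad_outliers(values[mask], n_mads)
    return flags

# Batch A was sequenced deeper than batch B; the last cell of A is a low-depth outlier.
log_counts = np.array([8.0, 8.1, 7.9, 8.2, 7.8, 8.05, 7.95, 6.0,
                       6.0, 6.1, 5.9, 6.2, 5.8, 6.05, 5.95, 6.0])
batches = np.array(["A"] * 8 + ["B"] * 8)

per_batch = per_batch_outliers(log_counts, batches)
pooled = mad_outliers(log_counts)

print("per-batch flags:", np.flatnonzero(per_batch))
print("pooled flags:   ", np.flatnonzero(pooled))
```

Run per batch, the rule flags only the genuinely shallow cell in batch A. Pooled, the median sits between the two batches, so the rule discards the seven good cells from the deeper batch and misses the real outlier, which blends in with batch B.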