CellVoteR: Ensemble Cell Type Annotation for Single-Cell RNA-seq

Overview

CellVoteR is an ensemble-based pipeline for robust cell type annotation in single-cell RNA-seq (scRNA-seq) data. Rather than relying on a single classification strategy, CellVoteR integrates four complementary annotation methods across two feature spaces, then resolves disagreements through a principled consensus voting step.

The core design philosophy is:

Divide and conquer — broadly triage cells into lineages before applying fine-resolution annotation, preventing dominant populations from masking rare cell types
Redundancy — four methods running in parallel reduces sensitivity to the failure modes of any single approach
Separation of concerns — annotation (slow, compute-intensive) and consensus resolution (fast, parameter-sensitive) are decoupled, so the user can re-tune voting without repeating the pipeline

To run the complete workflow, CellVoteR requires two inputs:

A raw gene-by-cell counts matrix (sparse dgCMatrix, RDS, or MTX triplet file).
A marker configuration — a structured list of broad and fine cell type marker genes.

Installation

Currently, the package can be installed directly from GitHub:

# install.packages("devtools")
devtools::install_github("ajxa/CellVoteR")

library(CellVoteR)

Detailed Pipeline Steps

Step 1: Preparing Marker Inputs

Markers are the backbone of CellVoteR’s annotation strategy. They are organised into two tiers:

Broad markers (lineage-specific)

Broad markers define coarse cell lineages (e.g. Immune, Vasculature, Other). To separate lineages reliably, they must be:

Small — typically 2–5 genes per category.
Mutually exclusive — no gene should appear in more than one broad category.
Biologically diagnostic — genes that robustly delineate lineages even in heterogeneous datasets.
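
The first two requirements can be checked mechanically before any configuration step. The following is a minimal base R sketch, assuming the broad markers are held as a named list of character vectors; the "other" category and its genes are made up for illustration, and VWF is duplicated on purpose to show what a conflict looks like.

```r
# Hypothetical broad marker list (named list of character vectors).
# VWF is deliberately listed twice to trigger the exclusivity check.
broad <- list(
  immune      = c("PTPRC"),
  vasculature = c("CDH5", "VWF"),
  other       = c("VWF", "GFAP")
)

# Stack the list into a gene/category table and find genes that occur
# in more than one category.
stacked <- stack(broad)   # columns: values (gene), ind (category)
shared  <- unique(stacked$values[duplicated(stacked$values)])

# For each shared gene, list the categories that claim it
conflicts <- lapply(
  setNames(shared, shared),
  function(g) as.character(stacked$ind[stacked$values == g])
)
print(conflicts)

# Flag categories outside the suggested range of 2 to 5 genes
set_sizes <- lengths(broad)
print(set_sizes[set_sizes < 2 | set_sizes > 5])
```

Here the check reports VWF as shared between vasculature and other, and flags immune for having a single gene. Whether a single-gene category is acceptable is a judgement call; the size range is described above as typical rather than enforced.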

These broad markers are loaded and then configured with build_broad_marker_config(), which assigns expression thresholds and priority rankings used for tie-breaking when a cell passes multiple broad categories.

Fine markers (cell-type specific)

Fine markers define sub-populations within each broad lineage (e.g. B cell, T cell, NK cell within Immune). They can be larger gene sets and are used for Fisher’s Exact Test scoring during fine annotation. These markers do not need to be mutually exclusive, but should sufficiently distinguish between cell types from a common lineage, e.g. T cells vs B cells and Mural cells vs Endothelial cells.

Loading markers

User-supplied markers can be loaded from Excel, CSV, or TXT files. Each file must contain four columns: type (broad/fine), category, label, and marker:

type	category	label	marker
broad	immune		PTPRC
broad	vasculature		CDH5
broad	vasculature		VWF
fine	immune	T cell	CD2
fine	immune	T cell	CD3D
fine	immune	T cell	IL32
fine	immune	B cell	CD79A
fine	immune	B cell	CD79B
fine	vasculature	Mural cell	IGFBP7
fine	vasculature	Mural cell	FN1
fine	vasculature	Endothelial	A2M
fine	vasculature	Endothelial	IGFBP7

When defining broad category markers, leave the label field blank. For fine cell type markers within that broad category, assign a label.

markers <- load_markers(file_path = "path/to/input_markers.xlsx")

# Inspect the structure
str(markers$broad)   # named list of character vectors
str(markers$fine)    # nested named list: broad category > fine cell type > genes
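
To make the two structures concrete, the sketch below rebuilds them from a trimmed copy of the table above using base R only. It is not the implementation of load_markers(); it only illustrates how a four-column table maps onto a named list (broad) and a nested named list (fine).

```r
# A trimmed copy of the marker table, written inline so the example
# is self-contained (tab-separated, blank label for broad rows)
marker_tbl <- read.delim(text = paste(
  "type\tcategory\tlabel\tmarker",
  "broad\timmune\t\tPTPRC",
  "broad\tvasculature\t\tCDH5",
  "broad\tvasculature\t\tVWF",
  "fine\timmune\tT cell\tCD2",
  "fine\timmune\tT cell\tCD3D",
  "fine\timmune\tB cell\tCD79A",
  "fine\tvasculature\tMural cell\tIGFBP7",
  "fine\tvasculature\tEndothelial\tA2M",
  sep = "\n"
), stringsAsFactors = FALSE)

is_broad <- marker_tbl$type == "broad"

# Broad tier: one character vector of genes per category
broad <- split(marker_tbl$marker[is_broad], marker_tbl$category[is_broad])

# Fine tier: broad category > fine cell type > genes
fine_tbl <- marker_tbl[!is_broad, ]
fine <- lapply(
  split(fine_tbl, fine_tbl$category),
  function(d) split(d$marker, d$label)
)

str(broad)
str(fine)
```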

Configuring broad markers

build_broad_marker_config() processes the raw broad marker list, attaching expression thresholds and priority ranks used during the enrichment-based annotation step.

markers$broad <- build_broad_marker_config(
  marker_list       = markers$broad,
  priority_order    = c("vasculature", "immune"),  # higher priority listed first
  default_threshold = 0.25                          # default logcounts threshold
)

# Each broad category now has markers, expr_threshold, coexp_min, and priority
str(markers$broad$immune)

The priority_order argument controls tie-breaking when a cell passes expression thresholds for more than one broad category: categories listed earlier receive a lower numeric rank, which means a higher priority.
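
The interaction between thresholds and priority can be illustrated with a toy example. This is a sketch under stated assumptions: expression is aggregated as the mean logcounts across a category's markers and compared with a single threshold using a strict "greater than". The package's actual aggregation and its coexp_min setting are not modelled, and all expression values are made up.

```r
# Toy logcounts: 3 marker genes x 4 cells (values are invented)
logcounts <- matrix(
  c(1.2, 0.0, 0.9, 0.0,    # PTPRC
    0.0, 0.8, 0.7, 0.1,    # CDH5
    0.0, 0.6, 0.5, 0.0),   # VWF
  nrow = 3, byrow = TRUE,
  dimnames = list(c("PTPRC", "CDH5", "VWF"), paste0("cell", 1:4))
)

broad_markers  <- list(immune = "PTPRC", vasculature = c("CDH5", "VWF"))
priority_order <- c("vasculature", "immune")
threshold      <- 0.25

# Rank 1 = highest priority, following the order of priority_order
priority <- setNames(seq_along(priority_order), priority_order)

# Mean marker expression per category and cell (categories x cells)
scores <- t(sapply(
  broad_markers,
  function(g) colMeans(logcounts[g, , drop = FALSE])
))

# Keep categories above the threshold, then pick the best-ranked one
assign_broad <- function(cell_scores) {
  passing <- names(cell_scores)[cell_scores > threshold]
  if (length(passing) == 0L) return(NA_character_)
  passing[which.min(priority[passing])]
}

broad_label <- apply(scores, 2, assign_broad)
print(broad_label)
```

In this toy data cell3 passes both categories and is resolved to vasculature because it is listed first in priority_order, while cell4 passes neither and stays unlabelled.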

Step 2: Object Creation

CellVoteR works natively with SingleCellExperiment objects. Use create_sce() to construct one from your raw data.

From (in-memory) sparse matrix

sce <- create_sce(
  counts        = my_sparse_matrix,  # dgCMatrix, genes x cells
  cell_metadata = my_metadata_df     # data.frame, one row per cell (optional)
)

From file path

create_sce() also accepts file paths, which is useful for large datasets where the matrix is stored on disk:

# From RDS files
sce <- create_sce(
  counts        = "path/to/counts.rds",
  cell_metadata = "path/to/metadata.rds"   # also accepts .csv or .tsv
)

# From MTX triplet files
sce <- create_sce(
  mtx_file   = "path/to/matrix.mtx.gz",
  cells_file = "path/to/barcodes.tsv",
  genes_file = "path/to/features.tsv"
)

Step 3: Quality Control

assess_cell_quality() calculates per-cell QC metrics and optionally removes low-quality cells before downstream analysis.

sce <- assess_cell_quality(sce, remove_failed_cells = TRUE)

Cells that fail QC are flagged in colData(sce)$QC_PASS. Setting remove_failed_cells = TRUE subsets the object to passing cells only.
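
This document does not state which metrics or cut-offs assess_cell_quality() uses. As a rough illustration of what a per-cell QC flag usually involves, the base R sketch below computes three commonly used metrics (library size, detected genes, mitochondrial percentage) on simulated counts and flags outliers at 3 median absolute deviations. The metrics, the "MT-" gene prefix, and the 3-MAD rule are all assumptions made for this example.

```r
set.seed(1)

# Simulated counts: 50 genes x 20 cells, first 5 genes mitochondrial
counts <- matrix(
  rpois(50 * 20, lambda = 5), nrow = 50,
  dimnames = list(c(paste0("MT-G", 1:5), paste0("GENE", 1:45)),
                  paste0("cell", 1:20))
)
# Make one cell obviously poor: no counts outside mitochondrial genes
counts[6:50, "cell20"] <- 0

# Per-cell metrics
lib_size   <- colSums(counts)
n_detected <- colSums(counts > 0)
is_mito    <- grepl("^MT-", rownames(counts))
pct_mito   <- 100 * colSums(counts[is_mito, , drop = FALSE]) / lib_size

# Flag cells more than 3 MADs from the median in the "bad" direction
low_outlier  <- function(x) x < median(x) - 3 * mad(x)
high_outlier <- function(x) x > median(x) + 3 * mad(x)

qc_pass <- !(low_outlier(log10(lib_size)) |
             low_outlier(log10(n_detected)) |
             high_outlier(pct_mito))

table(qc_pass)
counts_filtered <- counts[, qc_pass, drop = FALSE]
```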

Step 4: Normalisation

CellVoteR uses a pooling-based normalisation strategy (Lun et al. 2016) via scran::computePooledFactors, followed by log-normalisation with scuttle::logNormCounts.

sce <- normalize_counts(sce)

This adds a logcounts assay to the SCE and sets sizeFactors(). The logcounts assay is required by all downstream steps.
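
For readers unfamiliar with size-factor normalisation, the sketch below shows the arithmetic on a toy matrix. It deliberately swaps in plain library-size factors for the pooling-based factors that scran estimates, so it illustrates only the scaling and log-transform step (log2 with a pseudocount of 1), not the deconvolution method itself.

```r
# Toy counts: 4 genes x 3 cells; cell2 and cell3 have twice the depth of cell1
counts <- matrix(
  c(10, 20, 20,
     0,  5,  0,
     5, 10, 10,
     5,  5, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Library-size factors, centred so that their mean is 1
lib_size     <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell by its size factor, then log2-transform with pseudocount 1
norm_counts <- sweep(counts, 2, size_factors, "/")
logcounts   <- log2(norm_counts + 1)

print(round(size_factors, 2))
print(round(logcounts, 2))
```

After scaling, gene1 has the same normalised value in cell1 and cell3 even though the raw counts differ by a factor of two, which is the effect the size factors are meant to achieve.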

Step 5: Building Analysis Tracks

The prepare_sce() function co-ordinates the key preprocessing steps. It performs the following calculations and logic:

Validates broad and fine marker configurations against the expression matrix.
Builds two independent feature spaces — the full HVG space and the reduced marker-defined space.
Runs PCA and unsupervised clustering (Leiden via SNN graph) on each space.
Attaches the marker configuration and filtered fine markers to the SCE metadata.
Stores the reduced feature space as an altExp named "user_panel".

sce <- prepare_sce(sce, markers)

After this step, the SingleCellExperiment object is structured as follows:

sce
├── assays: counts, logcounts
├── rowSubset("broad_hvgs")           ←    HVGs used for broad clustering
├── reducedDim("PCA_broad_hvg")       ←    PCA on full HVG space
├── colData$cluster_broad_hvg         ←    Leiden clusters (full space)
├── colData$cluster_broad_reduced     ←    Leiden clusters (reduced space)
├── metadata$marker_config            ←    full marker configuration
├── metadata$filtered_fine_markers    ←    fine markers present in data
├── metadata$missing_by_label         ←    per-label missing marker report
└── altExp("user_panel")              ←    reduced feature SCE
    ├── assays: counts, logcounts
    ├── reducedDim("PCA")
    ├── colData$cluster
    ├── metadata$marker_config
    ├── metadata$filtered_fine
    └── metadata$params

Automatic parameter estimation

Clustering parameters (number of PCs, SNN neighbourhood size $k$, Leiden resolution) are estimated automatically from the cell count using bounded log/square-root scaling via estimate_cluster_params(). These estimates can be overridden if required:

sce <- prepare_sce(
  sce,
  markers,
  n_hvgs     = 3000L,
  n_pcs      = 30L,
  k          = 20L,
  resolution = 0.8
)
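
The exact rules inside estimate_cluster_params() are not given here. The stand-in below only illustrates what "bounded log/square-root scaling" can look like: each parameter grows slowly with the number of cells and is clamped to a fixed range. The function name sketch_cluster_params() and every constant in it are hypothetical.

```r
# Clamp a value into the interval [lo, hi]
clamp <- function(x, lo, hi) max(lo, min(hi, x))

# HYPOTHETICAL scaling rules; all constants are illustrative only
sketch_cluster_params <- function(n_cells) {
  list(
    # PCs grow with log10 of the cell count, bounded to 10..50
    n_pcs      = as.integer(clamp(round(5 * log10(n_cells)), 10, 50)),
    # Neighbourhood size grows with the square root, bounded to 10..50
    k          = as.integer(clamp(round(sqrt(n_cells) / 2), 10, 50)),
    # Resolution grows with log10 of the cell count, bounded to 0.4..1.5
    resolution = clamp(0.2 * log10(n_cells), 0.4, 1.5)
  )
}

str(sketch_cluster_params(500))
str(sketch_cluster_params(100000))
```

The bounds matter at both ends: without them a very small dataset would get too few PCs, and a very large one would get an impractically large neighbourhood size.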

Marker overlap reporting

In some cases, fine markers may be partially missing from your dataset. The prepare_sce() function records this and reports which labels are most affected. The report can be inspected as follows:

# Inspect missing marker report after prepare_sce()
metadata(sce)$missing_by_label

Markers that are missing from the data are automatically removed from the fine marker sets used for scoring, so annotation still proceeds with the genes that are present. This only occurs when fewer than 50% of the distinct markers supplied are missing; the cut-off can be relaxed or tightened by altering the overlap_feat_percent argument:


sce <- prepare_sce(
  sce,
  markers,
  overlap_feat_percent = 75       # more stringent
)
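
The overlap bookkeeping itself is simple set arithmetic. The sketch below reproduces it in base R for one lineage, assuming (from the "more stringent" example above) that overlap_feat_percent is the minimum percentage of supplied markers that must be present in the data. The gene list standing in for the dataset is made up.

```r
# Fine markers for one lineage and the genes found in a hypothetical dataset
fine_immune <- list(
  "T cell" = c("CD2", "CD3D", "IL32"),
  "B cell" = c("CD79A", "CD79B")
)
genes_in_data <- c("CD2", "CD3D", "CD79A", "PTPRC")

# Per-label report of markers that are absent from the data
missing_by_label <- lapply(fine_immune, setdiff, y = genes_in_data)
missing_by_label <- missing_by_label[lengths(missing_by_label) > 0]

# Overall overlap across distinct markers, as a percentage
all_markers <- unique(unlist(fine_immune))
overlap_pct <- 100 * mean(all_markers %in% genes_in_data)

# Keep only the markers that are present
filtered <- lapply(fine_immune, intersect, y = genes_in_data)

print(missing_by_label)
print(overlap_pct)

# Assumed interpretation of the cut-off: proceed only if overlap is high enough
overlap_feat_percent <- 50
overlap_pct >= overlap_feat_percent
```

With 3 of 5 distinct markers present the overlap is 60%, which passes a 50% cut-off but would fail the stricter value of 75 used above.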

Step 6: Annotation Methods

The run_cellvoter() function orchestrates all of the individual annotation pipelines and returns a named list of per-cell label factors, one per method:

results <- run_cellvoter(sce)

What this does internally

Four primary methods and two global tie-breakers are run:

Method	Feature space	Broad strategy	Subcluster feature mode
Method 1	Full (HVG)	Cluster-based	HVG
Method 2	Reduced (panel)	Cluster-based	All
Method 3	Full (HVG)	Enrichment-based	HVG
Method 4	Reduced (panel)	Enrichment-based	All
Tie-breaker 1	Full (HVG)	None (global)	—
Tie-breaker 2	Reduced (panel)	None (global)	—

Each primary method follows the same pipeline:

broad annotation
      ↓
subcluster_labels()
      ↓
rank_cluster_markers()          ←   DE testing per sub-cluster
      ↓
extract_top_markers()           ←   select top N genes per cluster
      ↓
score_markers_against_panel()   ←   Fisher's Exact Test + overlap similarity
      ↓
assign_fine_labels()            ←   best label per cluster    →    mapped to cells
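
The scoring step in this pipeline can be illustrated with base R's fisher.test(). The sketch below is not the code of score_markers_against_panel(): it assumes a one-sided test for enrichment, uses the Jaccard index as the overlap similarity, and works on invented gene names. It shows how a cluster's top markers are compared with each fine marker set and how the best label is picked.

```r
# Hypothetical inputs: the gene universe, the top DE genes for one
# sub-cluster, and two fine marker sets
universe    <- paste0("G", 1:200)
top_markers <- paste0("G", 1:20)
panel <- list(
  "T cell" = paste0("G", c(1:8, 150:151)),    # 8 of 10 genes overlap
  "B cell" = paste0("G", c(19:20, 100:107))   # 2 of 10 genes overlap
)

score_one <- function(set, top, universe) {
  in_top <- universe %in% top
  in_set <- universe %in% set
  # 2x2 contingency table with the overlap count in the top-left cell
  tab <- table(factor(in_top, c(TRUE, FALSE)), factor(in_set, c(TRUE, FALSE)))
  ft  <- fisher.test(tab, alternative = "greater")
  c(p_value = ft$p.value,
    jaccard = sum(in_top & in_set) / sum(in_top | in_set))
}

scores <- t(sapply(panel, score_one, top = top_markers, universe = universe))
scores <- cbind(scores, fdr = p.adjust(scores[, "p_value"], method = "BH"))

# Best label for this cluster: the most significant enrichment
best_label <- rownames(scores)[which.min(scores[, "p_value"])]
print(scores)
print(best_label)
```

Using both a p-value and a similarity value is useful because very large marker sets can reach significance with a proportionally small overlap.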

Broad annotation strategies

Cluster-based (annotate_broad_clusters): Runs DE testing on pre-existing unsupervised clusters. Each cluster is assigned the broad category whose curated markers have the lowest median rank among significantly up-regulated genes (FDR ≤ 0.05, AUC ≥ 0.6 by default).
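
The "lowest median rank" rule can be shown on a single cluster. In the sketch below the ranked gene list is invented and is assumed to already contain only genes that passed the FDR and AUC filters; how annotate_broad_clusters() handles categories with no detected marker is not stated, so assigning them an infinite rank is an assumption.

```r
# Hypothetical ranked list of significantly up-regulated genes for one
# cluster (position 1 = strongest marker)
up_genes <- c("CDH5", "CLDN5", "VWF", "PECAM1", "PTPRC", "ACTB")

broad_markers <- list(
  immune      = c("PTPRC"),
  vasculature = c("CDH5", "VWF")
)

# Median rank of each category's markers among the up-regulated genes.
# Categories with no marker in the list get Inf so they cannot win.
median_rank <- sapply(broad_markers, function(g) {
  r <- match(g, up_genes)
  r <- r[!is.na(r)]
  if (length(r) == 0L) Inf else median(r)
})

cluster_label <- names(median_rank)[which.min(median_rank)]
print(median_rank)
print(cluster_label)
```

Here the vasculature markers sit at ranks 1 and 3 (median 2) and the immune marker at rank 5, so the cluster is labelled vasculature.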

Enrichment-based (annotate_broad_cells): Assigns labels directly to individual cells by aggregating expression across each broad category’s marker genes and comparing against category-specific thresholds. Does not depend on clustering.

In some datasets (e.g. a highly homogeneous tumour sample), all clusters may receive the same broad label. CellVoteR detects this and retains the original cluster structure for subclustering rather than collapsing to a single group, preserving the fine-resolution information for downstream processes.

Annotation Method Results

After running run_cellvoter(), the returned SingleCellExperiment object is populated with all of the intermediate cluster labels, which can be accessed through the colData() columns of the object:

Column	Description
cluster_broad_hvg	Pre-existing HVG clusters from prepare_sce()
cluster_broad_reduced	Pre-existing reduced clusters from prepare_sce()
broad_cluster_m1	Broad labels, method 1
broad_cluster_sub_m1	Sub-cluster labels, method 1
broad_cluster_m2	Broad labels, method 2
broad_cluster_sub_m2	Sub-cluster labels, method 2
broad_enrichment_m3	Broad labels, method 3
broad_enrichment_sub_m3	Sub-cluster labels, method 3
broad_enrichment_m4	Broad labels, method 4
broad_enrichment_sub_m4	Sub-cluster labels, method 4

Customising parameters

The main run_cellvoter() function accepts an annotation_args list, which can be used to customise parameters of the underlying internal functions if required. The following parameter lists can be specified, each controlling a specific aspect of the underlying logic:

rank_args Controls parameters associated with rank_cluster_markers():
- assay_type
- test_type
- direction
- pval_type
- min_prop
- BPPARAM
broad_args accepts the same parameters as rank_args, but only controls the ranking inside the annotate_broad_clusters function. This is useful for tuning how the broad cell lineage labels (e.g. immune, vasculature) are defined: broad marker sets are small and highly specific, so you may wish to use a more lenient min_prop than the one used for the fine labels.
extract_args Controls parameters associated with extract_top_markers():
- fdr_threshold
- effect_threshold
- target_n

The following is an example of how this can be used to independently alter the broad and fine labelling logic:

results <- run_cellvoter(
  sce,
  return_full_output = TRUE,
  annotation_args = list(

    # Controls ranking inside annotate_broad_clusters (methods 1 and 2)
    # Lenient min_prop appropriate for small, specific broad marker sets
    broad_args = list(
      test_type = "wilcox",
      min_prop  = 0.1
    ),

    # Controls ranking inside .run_fine_annotation (all six methods)
    # Stricter settings for fine sub-cluster marker extraction
    rank_args = list(
      test_type = "wilcox",
      min_prop  = 0.25
    ),

    # Controls top marker extraction
    extract_args = list(
      fdr_threshold    = 0.05,
      effect_threshold = 0.6,
      target_n         = 100L
    )
  )
)

broad_args vs rank_args: these both control rank_cluster_markers() but at different pipeline stages. broad_args affects how broad lineage labels are assigned (methods 1 and 2 only); rank_args affects how sub-cluster marker genes are extracted for Fisher scoring (all six methods). Methods 3 and 4 use rank_args only, as the enrichment-based broad step does not call rank_cluster_markers().

Accessing full outputs

Setting return_full_output = TRUE returns the per-cluster Fisher scores and similarity values for every method (the default is FALSE).

# Per-cluster score table for method 1
results$full_output$method_1$scores

# Per-cluster score table for tie-breaker 2
results$full_output$global_2$scores

Step 7: Resolving Consensus Labels

Consensus resolution is intentionally a separate step. This means you can adjust voting parameters and re-run resolve_consensus_labels() as many times as needed without re-running the annotation pipeline.

consensus <- resolve_consensus_labels(
  label_list        = results$labels,
  method_names      = results$method_names,
  tie_breaker_names = results$tie_breaker_names,
  unassigned_label  = "unknown",
  allow_even_split  = FALSE,
  ordered_tiebreak  = TRUE
)

Decision hierarchy

For each cell, the following logic is applied in order:

Strong majority — if one label receives strictly more than 50% of method votes (3 of 4 by default) it is assigned immediately.
Tie-breaker agreement — if the two tie-breakers agree with each other and their shared label matches the leading candidate, that label is assigned.
Ordered tie-breaking (ordered_tiebreak = TRUE, default) — tie-breaker 1 is tried first; if it matches the leading candidate the label is assigned. Otherwise tie-breaker 2 is tried.
Either tie-breaking (ordered_tiebreak = FALSE) — either tie-breaker agreeing with the leading candidate is sufficient, with no priority between them.
Unresolved — if every method disagrees (no leading candidate exists) or no tie-breaker resolves the split, unassigned_label is assigned.
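
The hierarchy above can be sketched for a single cell in a few lines of base R. This is an illustration rather than the implementation of resolve_consensus_labels(): the function name vote_cell(), the strings returned in the method field, and the handling of ambiguous cases (a 2-2 split under allow_even_split, or two tie-breakers backing different leaders when ordered_tiebreak = FALSE) are assumptions.

```r
vote_cell <- function(methods, tie_breakers, unassigned_label = "unknown",
                      allow_even_split = FALSE, ordered_tiebreak = TRUE) {
  tab     <- table(methods)
  top     <- max(tab)
  leaders <- names(tab)[tab == top]   # leading candidate(s)
  n       <- length(methods)

  # 1. Strong majority, or an accepted even split with a single leader
  if (top > n / 2) return(c(label = leaders, method = "majority"))
  if (allow_even_split && top == n / 2 && length(leaders) == 1L)
    return(c(label = leaders, method = "even_split"))

  # No leading candidate: every method voted differently
  if (top == 1L) return(c(label = unassigned_label, method = "unresolved"))

  # Tie-breaker votes that match a leading candidate, in tie-breaker order
  hits <- tie_breakers[tie_breakers %in% leaders]

  # 2. Tie-breakers agree with each other and with a leading candidate
  if (length(unique(tie_breakers)) == 1L && length(hits) > 0L)
    return(c(label = hits[[1]], method = "tiebreak_agreement"))

  # 3. Ordered: the first matching tie-breaker wins
  if (ordered_tiebreak && length(hits) > 0L)
    return(c(label = hits[[1]], method = "tiebreak_ordered"))

  # 4. Either: accepted only if the matches point to one label
  if (!ordered_tiebreak && length(unique(hits)) == 1L)
    return(c(label = hits[[1]], method = "tiebreak_either"))

  # 5. Unresolved
  c(label = unassigned_label, method = "unresolved")
}

vote_cell(c("T", "T", "T", "B"),  c("B", "B"))   # majority: T
vote_cell(c("T", "T", "B", "NK"), c("T", "T"))   # tie-breaker agreement: T
vote_cell(c("T", "T", "B", "B"),  c("B", "T"))   # ordered tie-break: B
vote_cell(c("T", "B", "NK", "M"), c("T", "T"))   # unresolved: unknown
```

Applying such a function row by row over the method and tie-breaker labels gives one label and one decision route per cell, which is the information that the consensus output exposes as label and method.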

Voting parameters

Parameter	Default	Effect
allow_even_split	FALSE	When TRUE, a 2-of-4 plurality is accepted as majority
ordered_tiebreak	TRUE	When FALSE, either tie-breaker can resolve a split
unassigned_label	"Unknown"	Label for unresolved cells

Attaching labels to the SCE

sce$cellVoteR_label  <- consensus$label
sce$cellVoteR_method <- consensus$method

Step 8: Inspecting Results

Summary tables

# Final label distribution
table(sce$cellVoteR_label)

# Decision method breakdown
table(sce$cellVoteR_method)

# Cross-tabulate label vs decision method
table(sce$cellVoteR_label, sce$cellVoteR_method)

Per-method label distributions

Comparing individual method outputs before consensus is useful for understanding where methods agree and disagree:

table(results$labels$method_1)   # cluster-based, full
table(results$labels$method_2)   # cluster-based, reduced
table(results$labels$method_3)   # enrichment-based, full
table(results$labels$method_4)   # enrichment-based, reduced
table(results$labels$global_1)   # tie-breaker 1
table(results$labels$global_2)   # tie-breaker 2

Method agreement

A useful diagnostic is to look at how often all four primary methods agree:

label_df <- as.data.frame(results$labels[1:4])

# Proportion of cells where all four methods agree
mean(apply(label_df, 1, function(x) length(unique(x)) == 1L))

# Per-cell method agreement count
label_df$n_agree <- apply(label_df, 1, function(x) max(table(x)))
table(label_df$n_agree)
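
A pairwise view can complement the all-methods agreement rate, since it shows which specific methods tend to diverge. The sketch below uses a small simulated label list in place of results$labels[1:4]; with real output stored as factors, converting each element with as.character() first avoids errors when factor levels differ between methods.

```r
# Simulated per-cell labels standing in for the four primary methods
labels <- list(
  method_1 = c("T", "T", "B", "NK", "B"),
  method_2 = c("T", "B", "B", "NK", "B"),
  method_3 = c("T", "T", "B", "T",  "B"),
  method_4 = c("T", "T", "B", "NK", "T")
)

# Fraction of cells on which each pair of methods assigns the same label
agreement <- outer(
  seq_along(labels), seq_along(labels),
  Vectorize(function(i, j) mean(labels[[i]] == labels[[j]]))
)
dimnames(agreement) <- list(names(labels), names(labels))

print(round(agreement, 2))
```

In this toy example methods 2 and 3 agree on only 60% of cells, while every other pair involving method 1 agrees on 80%.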

Re-running consensus with different parameters

# allowing even splits
consensus_liberal <- resolve_consensus_labels(
  label_list        = results$labels,
  method_names      = results$method_names,
  tie_breaker_names = results$tie_breaker_names,
  unassigned_label  = "unknown",
  allow_even_split  = TRUE
)

table(consensus_liberal$label)

# Compare unresolved rate between settings
mean(consensus$label == "unknown")
mean(consensus_liberal$label == "unknown")

Complete Workflow

The full pipeline from raw data to annotated SCE:

library(CellVoteR)

# ── 1. Load and configure markers ──────────────────────────────────────────────
markers <- load_markers(file_path = "path/to/input_markers.xlsx")

markers$broad <- build_broad_marker_config(
  marker_list       = markers$broad,
  priority_order    = c("vasculature", "immune"),
  default_threshold = 0.25
)

# ── 2. Create SCE ──────────────────────────────────────────────────────────────
sce <- create_sce(
  counts        = "path/to/counts.rds",
  cell_metadata = "path/to/metadata.rds"
)

# ── 3. QC ──────────────────────────────────────────────────────────────────────
sce <- assess_cell_quality(sce, remove_failed_cells = TRUE)

# ── 4. Normalise ───────────────────────────────────────────────────────────────
sce <- normalize_counts(sce)

# ── 5. Build analysis tracks ───────────────────────────────────────────────────
sce <- prepare_sce(sce, markers)

# ── 6. Run ensemble annotation ─────────────────────────────────────────────────
results <- run_cellvoter(sce)

# ── 7. Resolve consensus ───────────────────────────────────────────────────────
consensus <- resolve_consensus_labels(
  label_list        = results$labels,
  method_names      = results$method_names,
  tie_breaker_names = results$tie_breaker_names,
  unassigned_label  = "unknown"
)

# ── 8. Attach labels ───────────────────────────────────────────────────────────
sce$cellVoteR_label  <- consensus$label
sce$cellVoteR_method <- consensus$method

# ── 9. Inspect ─────────────────────────────────────────────────────────────────
table(sce$cellVoteR_label)
table(sce$cellVoteR_method)

Tips and Troubleshooting

Low marker overlap

If prepare_sce() warns about low fine marker overlap, inspect which labels are most affected:

metadata(sce)$missing_by_label

Consider whether the missing genes are platform-specific (e.g. not captured by your assay technology), or whether alternative gene symbols should be used.

High unresolved rate

If many cells are labelled "unknown" after consensus, try:

# 1. Allow even splits
consensus <- resolve_consensus_labels(
  ...,
  allow_even_split = TRUE
)

# 2. Disable ordered tie-breaking so either tie-breaker can resolve
consensus <- resolve_consensus_labels(
  ...,
  ordered_tiebreak = FALSE
)

# 3. Inspect which method combinations are causing disagreements
table(results$labels$method_1, results$labels$method_2)

Collapsed broad labels

If all clusters receive the same broad label, CellVoteR retains the original cluster structure automatically. This is expected behaviour for highly homogeneous datasets (e.g. a sample consisting entirely of tumour cells). In this case, the numeric cluster prefixes (e.g. 1_sc1, 2_sc1) trigger testing against the full fine marker panel rather than a lineage-specific subset.
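
The prefix rule described above can be pictured as a simple lookup. The sketch below is hypothetical: the helper panel_for() and the assumption that sub-cluster labels follow a "prefix_scN" pattern are introduced for illustration only. A lineage prefix selects that lineage's fine markers, while a numeric prefix falls back to the full fine panel.

```r
# Small fine marker configuration (broad category > fine label > genes)
fine <- list(
  immune      = list("T cell" = c("CD2", "CD3D"),
                     "B cell" = c("CD79A", "CD79B")),
  vasculature = list("Mural cell"  = c("IGFBP7", "FN1"),
                     "Endothelial" = c("A2M", "IGFBP7"))
)

# Hypothetical sub-cluster labels: two lineage-prefixed, two numeric
sub_labels <- c("immune_sc1", "vasculature_sc2", "1_sc1", "2_sc1")
prefix     <- sub("_sc[0-9]+$", "", sub_labels)

# Numeric prefix: test against every fine label; otherwise use the lineage subset
panel_for <- function(p) {
  if (grepl("^[0-9]+$", p)) {
    unlist(unname(fine), recursive = FALSE)
  } else {
    fine[[p]]
  }
}

panels <- lapply(setNames(prefix, sub_labels), panel_for)
lengths(panels)   # number of candidate fine labels per sub-cluster
```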

Large datasets

For datasets exceeding available RAM, convert the SCE to HDF5-backed storage after create_sce():

HDF5Array::saveHDF5SummarizedExperiment(sce, dir = "my_hdf5_sce")
sce <- HDF5Array::loadHDF5SummarizedExperiment("my_hdf5_sce")

Parallelisation

Key functions accept a BPPARAM argument for parallelisation via BiocParallel:

library(BiocParallel)

results <- run_cellvoter(
  sce,
  annotation_args = list(
    broad_args = list(BPPARAM = MulticoreParam(4)),
    rank_args  = list(BPPARAM = MulticoreParam(4))
  )
)

Session Info

#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices datasets  utils     methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39       desc_1.4.3          R6_2.6.1           
#>  [4] bookdown_0.46       fastmap_1.2.0       xfun_0.57          
#>  [7] cachem_1.1.0        knitr_1.51          htmltools_0.5.9    
#> [10] rmarkdown_2.31      lifecycle_1.0.5     cli_3.6.6          
#> [13] sass_0.4.10         pkgdown_2.2.0       textshaping_1.0.5  
#> [16] jquerylib_0.1.4     renv_1.0.7          systemfonts_1.3.2  
#> [19] compiler_4.4.3      tools_4.4.3         ragg_1.5.2         
#> [22] bslib_0.10.0        evaluate_1.0.5      rmdformats_1.0.4   
#> [25] yaml_2.3.12         BiocManager_1.30.27 jsonlite_2.0.0     
#> [28] rlang_1.2.0         fs_2.1.0

Shoaib Ajaib

2026-05-13