This vignette covers a minimal example for how a single cell RNA seq method benchmark should be done using CellBench. We will use the single cell data packaged with CellBench and external wrappers from GitHub packages. For 2 datasets, 3 normalisation methods, 2 imputation methods and 3 clustering methods will be tested, and the final clustering results will be evaluated using the adjusted Rand index metric.
library(CellBench)
# These need to be installed from github using install_github from devtools or
# remotes packages
library(NormCBM) # install_github("shians/NormCBM")
library(ClusterCBM) # install_github("shians/ClusterCBM")
library(ImputeCBM) # install_github("shians/ImputeCBM")
# Tidyverse packages
library(dplyr)
library(purrr)
library(forcats)
library(ggplot2)
We load in datasets using load_sc_data()
, this returns a list
of SingleCellExperiments
objetcs ready for use with CellBench. We pick out the CelSeq and DropSeq datasets because they are relatively small and can be run in a reasonable time.
# Get the smaller CelSeq and DropSeq single cell datasets from mixology data
datasets <- load_sc_data()[c("sc_celseq", "sc_dropseq")]
We create lists of methods from select wrappers in from the NormCBM
, ImputeCBM
and ClusterCBM
GitHub packages. This set of methods were chosen for runtime and not performance, but note greater runtime does not guarantee performance.
Each of these methods take a single SingleCellExperiment
argument and return a SingleCellExperiment
result with new data added to the appropriate slots. This approach makes the wrappers more general and retains maximum data for downstream use at the cost of greater memory usage.
# For normalisation and imputation we also have a "none" method which uses the
# identity function to simply return the object without any change
normalisation <- list(
"none" = identity,
"linnorm" = norm_linnorm,
"scran" = norm_scran,
"tmmwzp" = norm_tmmwzp
)
imputation <- list(
"none" = identity,
"drimpute" = impute_drimpute,
"knn_smooth" = impute_knn_smooth
)
cluster <- list(
"race_id" = cluster_race_id,
"seurat" = cluster_seurat,
"tscan" = cluster_tscan
)
# Wrapper for adjusted rand index written for general case
rand_index <- function(sce, cluster_col, truth_col) {
cluster <- colData(sce)[, cluster_col]
truth <- colData(sce)[, truth_col]
mclust::adjustedRandIndex(cluster, truth)
}
# Create the metric for the specific case
metric <- list(
ARI = function(x) { rand_index(x, "cluster_id", "cell_line") }
)
We can have a look at an example of a very simple wrapper
normalisation$tmmwzp
## function(sce) {
## sizeFactors(sce) <- edgeR::calcNormFactors(counts(sce), method = "TMMwzp")
## sce <- scater::normalize(sce)
##
## return(sce)
## }
## <bytecode: 0x7fed6923c718>
## <environment: namespace:NormCBM>
or a more slightly more complex wrapper
cluster$tscan
## function (sce)
## {
## expr <- logcounts(sce)
## res <- TSCAN::exprmclust(expr)
## res <- res$clusterid
## colData(sce)$cluster_id <- factor(res)
## return(sce)
## }
## <bytecode: 0x7fed69336d28>
## <environment: namespace:ClusterCBM>
With the wrappers set up we can run our pipeline with very minimal code.
res_metric <- datasets %>%
apply_methods(normalisation) %>%
apply_methods(imputation) %>%
apply_methods(cluster) %>%
apply_methods(metric)
Looking at the results, we will see columns corresponding to the methods applied as well as a final column containing a list of the results of the pipelines.
res_metric
## # A tibble: 72 x 6
## data normalisation imputation cluster metric result
## <fct> <fct> <fct> <fct> <fct> <list>
## 1 sc_cels… none none race_id ARI <dbl [1…
## 2 sc_cels… none none seurat ARI <dbl [1…
## 3 sc_cels… none none tscan ARI <task_r…
## 4 sc_cels… none drimpute race_id ARI <dbl [1…
## 5 sc_cels… none drimpute seurat ARI <dbl [1…
## 6 sc_cels… none drimpute tscan ARI <dbl [1…
## 7 sc_cels… none knn_smooth race_id ARI <dbl [1…
## 8 sc_cels… none knn_smooth seurat ARI <dbl [1…
## 9 sc_cels… none knn_smooth tscan ARI <task_r…
## 10 sc_cels… linnorm none race_id ARI <dbl [1…
## # … with 62 more rows
We will see some elements returned as “task_error” objects, indicating that the computation has failed, this is commonly due to the abundance of zero values of single cell data causing issues with mathematical methods. But the majority of results returned successfully, so we filter out the errors and clean up our results into a nicer form.
# Helper function for use with dplyr::filter, since result column is a list that
# is not compatible with the usual vectorised functions
is_task_error <- function(x) {
purrr::map_lgl(x, function(obj) is(obj, "task_error"))
}
# Filter out errored entries, convert results to a numeric column and sort based
# on the ARI
res_metric %>%
filter(!is_task_error(result)) %>%
mutate(result = as.numeric(result)) %>%
arrange(desc(result)) %>%
print(n = 20)
## # A tibble: 64 x 6
## data normalisation imputation cluster metric result
## <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 sc_celseq none none seurat ARI 0.977
## 2 sc_celseq none drimpute seurat ARI 0.977
## 3 sc_celseq none knn_smooth seurat ARI 0.977
## 4 sc_celseq linnorm none seurat ARI 0.977
## 5 sc_celseq linnorm drimpute seurat ARI 0.977
## 6 sc_celseq linnorm knn_smooth seurat ARI 0.977
## 7 sc_celseq scran none seurat ARI 0.977
## 8 sc_celseq scran drimpute seurat ARI 0.977
## 9 sc_celseq scran knn_smooth seurat ARI 0.977
## 10 sc_celseq tmmwzp none seurat ARI 0.977
## 11 sc_celseq tmmwzp drimpute seurat ARI 0.977
## 12 sc_celseq tmmwzp knn_smooth seurat ARI 0.977
## 13 sc_celseq tmmwzp drimpute tscan ARI 0.833
## 14 sc_dropseq none none tscan ARI 0.818
## 15 sc_dropseq scran none tscan ARI 0.818
## 16 sc_dropseq none knn_smooth tscan ARI 0.774
## 17 sc_dropseq none none seurat ARI 0.763
## 18 sc_dropseq none knn_smooth seurat ARI 0.763
## 19 sc_dropseq linnorm none seurat ARI 0.763
## 20 sc_dropseq linnorm drimpute seurat ARI 0.763
## # … with 44 more rows
We perform some table manipulation to get the data into a form that is appropriate for plotting
# Filter out errors, convert results to numeric and remove the metric column
# since we only have one
res_metric_filtered <- res_metric %>%
filter(!is_task_error(result)) %>%
mutate(result = as.numeric(result)) %>%
select(-metric)
# pipeline_collapse generates a new column that summarises the pipeline into a
# single string
res_metric_summarised <- res_metric_filtered %>%
pipeline_collapse(drop.steps = FALSE, sep = " > ")
res_metric_summarised
## # A tibble: 64 x 6
## data normalisation imputation cluster pipeline result
## <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 sc_ce… none none race_id sc_celseq… 0.372
## 2 sc_ce… none none seurat sc_celseq… 0.977
## 3 sc_ce… none drimpute race_id sc_celseq… 0.372
## 4 sc_ce… none drimpute seurat sc_celseq… 0.977
## 5 sc_ce… none drimpute tscan sc_celseq… 0.531
## 6 sc_ce… none knn_smooth race_id sc_celseq… 0.372
## 7 sc_ce… none knn_smooth seurat sc_celseq… 0.977
## 8 sc_ce… linnorm none race_id sc_celseq… 0.372
## 9 sc_ce… linnorm none seurat sc_celseq… 0.977
## 10 sc_ce… linnorm drimpute race_id sc_celseq… 0.372
## # … with 54 more rows
Now we can plot all the results. We will see that seurat and race_id are invariant to the upstream normalisation or imputation. This is actually because both methods perform their own normalisation and ignore imputed expression values.
# Reorder the factor levels so that bar plot is ranked properly
plot_data <- res_metric_summarised %>%
arrange(result) %>%
mutate(
pipeline = fct_inorder(as.character(pipeline)),
cluster = fct_inorder(as.character(cluster))
)
plot_data %>%
ggplot(aes(x = pipeline, y = result, fill = cluster)) +
geom_bar(stat = "identity") +
theme_classic() +
xlab("Pipeline (normalization > imputation > clustering)") +
ylab("Adjusted Rand Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15))
Plotting all the data could be overwhelming, we are usually only interested in the best (and maybe worst) performing pipelines. Because our data is in a tidy format, we can use the dplyr functions to filter down to the data of interest.
# Keep only CelSeq data
plot_data <- res_metric_summarised %>%
filter(data == "sc_celseq")
# Keep only the 4 best results for each clustring method
plot_data <- plot_data %>%
group_by(cluster) %>%
arrange(desc(result)) %>%
slice(1:4) %>%
ungroup()
plot_data <- plot_data%>%
arrange(result) %>%
mutate(
pipeline = fct_inorder(as.character(pipeline)),
cluster = fct_inorder(as.character(cluster))
)
plot_data %>%
ggplot(aes(x = pipeline, y = result, fill = cluster)) +
theme_classic() +
geom_bar(stat = "identity") +
xlab("Pipeline (normalization > imputation > clustering)") +
ylab("Adjusted Rand Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
guides(fill = guide_legend(title = "Clustering Method")) +
ggtitle("Four highest ranked pipelines for each clustering method (CelSeq)")
# Keep only DropSeq data
plot_data <- res_metric_summarised %>%
filter(data == "sc_dropseq")
# Keep only the 4 best results for each clustring method
plot_data <- plot_data %>%
group_by(cluster) %>%
arrange(desc(result)) %>%
slice(1:4) %>%
ungroup()
plot_data <- plot_data %>%
arrange(result) %>%
mutate(
pipeline = fct_inorder(as.character(pipeline)),
cluster = fct_inorder(as.character(cluster))
)
plot_data %>%
ggplot(aes(x = pipeline, y = result, fill = cluster)) +
theme_classic() +
geom_bar(stat = "identity") +
xlab("Pipeline (normalization > imputation > clustering)") +
ylab("Adjusted Rand Index") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
guides(fill = guide_legend(title = "Clustering Method")) +
ggtitle("Four highest ranked pipelines for each clustering method (DropSeq)")
The structure of CellBench allows the core benchmarking code to be presented very cleanly as in the “Running Pipeline” section. The complexity is elsewhere in the wrappers. By returning results in a tidy format, they can be easily manipulated using the rich set of tools provided by the tidyverse to investigate more specific details.