1 Introduction

This vignette covers a minimal example for how a single cell RNA seq method benchmark should be done using CellBench. We will use the single cell data packaged with CellBench and external wrappers from GitHub packages. For 2 datasets, 3 normalisation methods, 2 imputation methods and 3 clustering methods will be tested, and the final clustering results will be evaluated using the adjusted Rand index metric.

library(CellBench)

# These need to be installed from github using install_github from devtools or
# remotes packages
library(NormCBM)     # install_github("shians/NormCBM")
library(ClusterCBM)  # install_github("shians/ClusterCBM")
library(ImputeCBM)   # install_github("shians/ImputeCBM")

# Tidyverse packages
library(dplyr)
library(purrr)
library(forcats)
library(ggplot2)

We load in datasets using load_sc_data(), this returns a list of SingleCellExperiments objetcs ready for use with CellBench. We pick out the CelSeq and DropSeq datasets because they are relatively small and can be run in a reasonable time.

2 Setting Up

# Get the smaller CelSeq and DropSeq single cell datasets from mixology data
datasets <- load_sc_data()[c("sc_celseq", "sc_dropseq")]

We create lists of methods from select wrappers in from the NormCBM, ImputeCBM and ClusterCBM GitHub packages. This set of methods were chosen for runtime and not performance, but note greater runtime does not guarantee performance.

Each of these methods take a single SingleCellExperiment argument and return a SingleCellExperiment result with new data added to the appropriate slots. This approach makes the wrappers more general and retains maximum data for downstream use at the cost of greater memory usage.

# For normalisation and imputation we also have a "none" method which uses the 
# identity function to simply return the object without any change

normalisation <- list(
    "none" = identity,
    "linnorm" = norm_linnorm,
    "scran" = norm_scran,
    "tmmwzp" = norm_tmmwzp
)

imputation <- list(
    "none" = identity,
    "drimpute" = impute_drimpute,
    "knn_smooth" = impute_knn_smooth
)

cluster <- list(
    "race_id" = cluster_race_id,
    "seurat" = cluster_seurat,
    "tscan" = cluster_tscan
)

# Wrapper for adjusted rand index written for general case
rand_index <- function(sce, cluster_col, truth_col) {
    cluster <- colData(sce)[, cluster_col]
    truth <- colData(sce)[, truth_col]
    mclust::adjustedRandIndex(cluster, truth)
}

# Create the metric for the specific case
metric <- list(
    ARI = function(x) { rand_index(x, "cluster_id", "cell_line") }
)

We can have a look at an example of a very simple wrapper

normalisation$tmmwzp

## function(sce) {
##     sizeFactors(sce) <- edgeR::calcNormFactors(counts(sce), method = "TMMwzp")
##     sce <- scater::normalize(sce)
## 
##     return(sce)
## }
## <bytecode: 0x7fed6923c718>
## <environment: namespace:NormCBM>

or a more slightly more complex wrapper

cluster$tscan

## function (sce) 
## {
##     expr <- logcounts(sce)
##     res <- TSCAN::exprmclust(expr)
##     res <- res$clusterid
##     colData(sce)$cluster_id <- factor(res)
##     return(sce)
## }
## <bytecode: 0x7fed69336d28>
## <environment: namespace:ClusterCBM>

3 Running Pipeline

With the wrappers set up we can run our pipeline with very minimal code.

res_metric <- datasets %>%
    apply_methods(normalisation) %>%
    apply_methods(imputation) %>%
    apply_methods(cluster) %>%
    apply_methods(metric)

4 Results

Looking at the results, we will see columns corresponding to the methods applied as well as a final column containing a list of the results of the pipelines.

res_metric

## # A tibble: 72 x 6
##    data     normalisation imputation cluster metric result  
##    <fct>    <fct>         <fct>      <fct>   <fct>  <list>  
##  1 sc_cels… none          none       race_id ARI    <dbl [1…
##  2 sc_cels… none          none       seurat  ARI    <dbl [1…
##  3 sc_cels… none          none       tscan   ARI    <task_r…
##  4 sc_cels… none          drimpute   race_id ARI    <dbl [1…
##  5 sc_cels… none          drimpute   seurat  ARI    <dbl [1…
##  6 sc_cels… none          drimpute   tscan   ARI    <dbl [1…
##  7 sc_cels… none          knn_smooth race_id ARI    <dbl [1…
##  8 sc_cels… none          knn_smooth seurat  ARI    <dbl [1…
##  9 sc_cels… none          knn_smooth tscan   ARI    <task_r…
## 10 sc_cels… linnorm       none       race_id ARI    <dbl [1…
## # … with 62 more rows

We will see some elements returned as “task_error” objects, indicating that the computation has failed, this is commonly due to the abundance of zero values of single cell data causing issues with mathematical methods. But the majority of results returned successfully, so we filter out the errors and clean up our results into a nicer form.

# Helper function for use with dplyr::filter, since result column is a list that
# is not compatible with the usual vectorised functions
is_task_error <- function(x) {
    purrr::map_lgl(x, function(obj) is(obj, "task_error"))
}

# Filter out errored entries, convert results to a numeric column and sort based
# on the ARI
res_metric %>%
    filter(!is_task_error(result)) %>%
    mutate(result = as.numeric(result)) %>%
    arrange(desc(result)) %>%
    print(n = 20)

## # A tibble: 64 x 6
##    data       normalisation imputation cluster metric result
##    <fct>      <fct>         <fct>      <fct>   <fct>   <dbl>
##  1 sc_celseq  none          none       seurat  ARI     0.977
##  2 sc_celseq  none          drimpute   seurat  ARI     0.977
##  3 sc_celseq  none          knn_smooth seurat  ARI     0.977
##  4 sc_celseq  linnorm       none       seurat  ARI     0.977
##  5 sc_celseq  linnorm       drimpute   seurat  ARI     0.977
##  6 sc_celseq  linnorm       knn_smooth seurat  ARI     0.977
##  7 sc_celseq  scran         none       seurat  ARI     0.977
##  8 sc_celseq  scran         drimpute   seurat  ARI     0.977
##  9 sc_celseq  scran         knn_smooth seurat  ARI     0.977
## 10 sc_celseq  tmmwzp        none       seurat  ARI     0.977
## 11 sc_celseq  tmmwzp        drimpute   seurat  ARI     0.977
## 12 sc_celseq  tmmwzp        knn_smooth seurat  ARI     0.977
## 13 sc_celseq  tmmwzp        drimpute   tscan   ARI     0.833
## 14 sc_dropseq none          none       tscan   ARI     0.818
## 15 sc_dropseq scran         none       tscan   ARI     0.818
## 16 sc_dropseq none          knn_smooth tscan   ARI     0.774
## 17 sc_dropseq none          none       seurat  ARI     0.763
## 18 sc_dropseq none          knn_smooth seurat  ARI     0.763
## 19 sc_dropseq linnorm       none       seurat  ARI     0.763
## 20 sc_dropseq linnorm       drimpute   seurat  ARI     0.763
## # … with 44 more rows

5 Plotting

We perform some table manipulation to get the data into a form that is appropriate for plotting

# Filter out errors, convert results to numeric and remove the metric column
# since we only have one
res_metric_filtered <- res_metric %>% 
    filter(!is_task_error(result)) %>%
    mutate(result = as.numeric(result)) %>%
    select(-metric)

# pipeline_collapse generates a new column that summarises the pipeline into a 
# single string
res_metric_summarised <- res_metric_filtered %>%
    pipeline_collapse(drop.steps = FALSE, sep = " > ")

res_metric_summarised

## # A tibble: 64 x 6
##    data   normalisation imputation cluster pipeline   result
##    <fct>  <fct>         <fct>      <fct>   <fct>       <dbl>
##  1 sc_ce… none          none       race_id sc_celseq…  0.372
##  2 sc_ce… none          none       seurat  sc_celseq…  0.977
##  3 sc_ce… none          drimpute   race_id sc_celseq…  0.372
##  4 sc_ce… none          drimpute   seurat  sc_celseq…  0.977
##  5 sc_ce… none          drimpute   tscan   sc_celseq…  0.531
##  6 sc_ce… none          knn_smooth race_id sc_celseq…  0.372
##  7 sc_ce… none          knn_smooth seurat  sc_celseq…  0.977
##  8 sc_ce… linnorm       none       race_id sc_celseq…  0.372
##  9 sc_ce… linnorm       none       seurat  sc_celseq…  0.977
## 10 sc_ce… linnorm       drimpute   race_id sc_celseq…  0.372
## # … with 54 more rows

Now we can plot all the results. We will see that seurat and race_id are invariant to the upstream normalisation or imputation. This is actually because both methods perform their own normalisation and ignore imputed expression values.

# Reorder the factor levels so that bar plot is ranked properly
plot_data <- res_metric_summarised %>%
    arrange(result) %>%
    mutate(
        pipeline = fct_inorder(as.character(pipeline)),
        cluster = fct_inorder(as.character(cluster))
    )

plot_data %>%
    ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        geom_bar(stat = "identity") +
        theme_classic() +
        xlab("Pipeline (normalization > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15))

Plotting all the data could be overwhelming, we are usually only interested in the best (and maybe worst) performing pipelines. Because our data is in a tidy format, we can use the dplyr functions to filter down to the data of interest.

# Keep only CelSeq data
plot_data <- res_metric_summarised %>%
    filter(data == "sc_celseq")

# Keep only the 4 best results for each clustring method
plot_data <- plot_data %>%
    group_by(cluster) %>%
    arrange(desc(result)) %>%
    slice(1:4) %>%
    ungroup() 

plot_data <- plot_data%>%
    arrange(result) %>%
    mutate(
        pipeline = fct_inorder(as.character(pipeline)),
        cluster = fct_inorder(as.character(cluster))
    )
    
plot_data %>%
    ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        theme_classic() +
        geom_bar(stat = "identity") +
        xlab("Pipeline (normalization > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
        guides(fill = guide_legend(title = "Clustering Method")) +
        ggtitle("Four highest ranked pipelines for each clustering method (CelSeq)")

# Keep only DropSeq data
plot_data <- res_metric_summarised %>%
    filter(data == "sc_dropseq")

# Keep only the 4 best results for each clustring method
plot_data <- plot_data %>%
    group_by(cluster) %>%
    arrange(desc(result)) %>%
    slice(1:4) %>%
    ungroup() 

plot_data <- plot_data %>%
    arrange(result) %>%
    mutate(
        pipeline = fct_inorder(as.character(pipeline)),
        cluster = fct_inorder(as.character(cluster))
    )
    
plot_data %>%
    ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        theme_classic() +
        geom_bar(stat = "identity") +
        xlab("Pipeline (normalization > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
        guides(fill = guide_legend(title = "Clustering Method")) +
        ggtitle("Four highest ranked pipelines for each clustering method (DropSeq)")

6 Summary

The structure of CellBench allows the core benchmarking code to be presented very cleanly as in the “Running Pipeline” section. The complexity is elsewhere in the wrappers. By returning results in a tidy format, they can be easily manipulated using the rich set of tools provided by the tidyverse to investigate more specific details.

CellBench Case Study

2019-09-30

Contents