The MicrobiomeBenchmarkData package provides access to a collection of
datasets with biological ground truth for benchmarking differential
abundance methods. The datasets are deposited on Zenodo:
https://doi.org/10.5281/zenodo.6911026
## Install BiocManager (the Bioconductor installer) if not installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## Release version (not yet in Bioc, so it doesn't work yet)
BiocManager::install("MicrobiomeBenchmarkData")
## Development version
BiocManager::install("waldronlab/MicrobiomeBenchmarkData")
library(MicrobiomeBenchmarkData)
library(purrr)
All sample metadata is merged into a single data frame and provided as a data object:
data('sampleMetadata', package = 'MicrobiomeBenchmarkData')
## Get columns present in all samples
sample_metadata <- sampleMetadata |>
    discard(~any(is.na(.x))) |>
    head()
knitr::kable(sample_metadata)
dataset | sample_id | body_site | library_size | pmid | study_condition | sequencing_method |
---|---|---|---|---|---|---|
HMP_2012_16S_gingival_V13 | 700103497 | oral_cavity | 5356 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700106940 | oral_cavity | 4489 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097304 | oral_cavity | 3043 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700099015 | oral_cavity | 2832 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097644 | oral_cavity | 2815 | 22699609 | control | 16S |
HMP_2012_16S_gingival_V13 | 700097247 | oral_cavity | 6333 | 22699609 | control | 16S |
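The column filter above keeps only the metadata columns that have no missing values in any sample. As a minimal sketch of the same idea in base R, the example below uses a small toy data frame; the values and the `age` column are hypothetical and are not taken from sampleMetadata.

```r
# Toy stand-in for sampleMetadata (hypothetical values, not from the package)
toy_metadata <- data.frame(
    dataset = c("study_A", "study_A", "study_B"),
    sample_id = c("s1", "s2", "s3"),
    body_site = c("oral_cavity", "oral_cavity", "stool"),
    age = c(34, NA, 51),  # missing for one sample, so this column is dropped
    stringsAsFactors = FALSE
)

# Flag the columns that contain no missing values
complete_cols <- colSums(is.na(toy_metadata)) == 0

# Keep only those columns; drop = FALSE preserves the data frame structure
toy_complete <- toy_metadata[, complete_cols, drop = FALSE]

names(toy_complete)
#> [1] "dataset"   "sample_id" "body_site"
```

Columns that are only recorded for some studies disappear under this filter, which is why the table shows just the fields shared by every dataset.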
Currently, there are 6 datasets available through the MicrobiomeBenchmarkData
package. These datasets are accessed through the getBenchmarkData function.
If no arguments are provided, the list of available datasets is printed on screen and a data.frame is returned with the description of the datasets:
dats <- getBenchmarkData()
#> 1 HMP_2012_16S_gingival_V13
#> 2 HMP_2012_16S_gingival_V35
#> 3 HMP_2012_16S_gingival_V35_subset
#> 4 HMP_2012_WMS_gingival
#> 5 Stammler_2016_16S_spikein
#> 6 Ravel_2011_16S_BV
#>
#> Use vignette('datasets', package = 'MicrobiomeBenchmarkData') for a detailed description of the datasets.
#>
#> Use getBenchmarkData(dryrun = FALSE) to import all of the datasets.
dats
#> Dataset Dimensions Body.site
#> 1 HMP_2012_16S_gingival_V13 33127 x 311 Gingiva
#> 2 HMP_2012_16S_gingival_V35 17949 x 311 Gingiva
#> 3 HMP_2012_16S_gingival_V35_subset 892 x 76 Gingiva
#> 4 HMP_2012_WMS_gingival 235 x 16 Gingiva
#> 5 Stammler_2016_16S_spikein 247 x 394 Stool
#> 6 Ravel_2011_16S_BV 4036 x 17 Vagina
#> Contrasts
#> 1 Subgingival vs Supragingival plaque.
#> 2 Subgingival vs Supragingival plaque.
#> 3 Subgingival vs Supragingival plaque.
#> 4 Subgingival vs Supragingival plaque.
#> 5 Pre-ASCT (allogeneic stem cell transplantation) vs 14 days after treatment.
#> 6 Healthy vs bacterial vaginosis
#> Biological.ground.truth
#> 1 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 2 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 3 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 4 Enrichment of aerobic taxa in the supragingival plaque and enrichment of anaerobic taxa in the subgingival plaque.
#> 5 Same bacterial loads of the spike-in bacteria across all samples: Salinibacter ruber (extreme halophilic), Rhizobium radiobacter (found in soils and plants), and Alicyclobacillus acidiphilu (thermo-acidophilic).
#> 6 Decrease of Lactobacillus and increase of bacteria isolated during bacterial vaginosis in samples with high Nugent scores (bacterial vaginosis).
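Because the returned description is an ordinary data.frame, it can be filtered to choose which datasets to import. The sketch below rebuilds three of the printed columns by hand as `dats_toy` (a hypothetical name) so that it runs without the package; with the real object, the same subsetting should apply to `dats` directly.

```r
# Hand-built copy of three columns from the printed description above
dats_toy <- data.frame(
    Dataset = c(
        "HMP_2012_16S_gingival_V13", "HMP_2012_16S_gingival_V35",
        "HMP_2012_16S_gingival_V35_subset", "HMP_2012_WMS_gingival",
        "Stammler_2016_16S_spikein", "Ravel_2011_16S_BV"
    ),
    Dimensions = c(
        "33127 x 311", "17949 x 311", "892 x 76",
        "235 x 16", "247 x 394", "4036 x 17"
    ),
    Body.site = c("Gingiva", "Gingiva", "Gingiva", "Gingiva", "Stool", "Vagina"),
    stringsAsFactors = FALSE
)

# Select the gingival datasets that were sequenced with 16S
is_gingival <- dats_toy$Body.site == "Gingiva"
is_16s <- grepl("16S", dats_toy$Dataset, fixed = TRUE)
gingival_16s <- dats_toy$Dataset[is_gingival & is_16s]

gingival_16s
#> [1] "HMP_2012_16S_gingival_V13"        "HMP_2012_16S_gingival_V35"
#> [3] "HMP_2012_16S_gingival_V35_subset"
```

The resulting character vector has the form that getBenchmarkData accepts as its first argument.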
To import a dataset, call the getBenchmarkData function with the name of the
dataset as the first argument (x) and the dryrun argument set to FALSE. The
output is a list containing the imported dataset as a
TreeSummarizedExperiment object.
tse <- getBenchmarkData('HMP_2012_16S_gingival_V35_subset', dryrun = FALSE)[[1]]
#> Finished HMP_2012_16S_gingival_V35_subset.
tse
#> class: TreeSummarizedExperiment
#> dim: 892 76
#> metadata(0):
#> assays(1): counts
#> rownames(892): OTU_97.31247 OTU_97.44487 ... OTU_97.45365 OTU_97.45307
#> rowData names(7): kingdom phylum ... genus taxon_annotation
#> colnames(76): 700023057 700023179 ... 700114009 700114338
#> colData names(13): dataset subject_id ... sequencing_method
#> variable_region_16s
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (892 rows)
#> rowTree: 1 phylo tree(s) (892 leaves)
#> colLinks: NULL
#> colTree: NULL
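A TreeSummarizedExperiment extends SummarizedExperiment, so the standard accessors assay() and colData() can be used to pull out the count matrix and the sample metadata. The sketch below demonstrates them on a small toy SummarizedExperiment built by hand, so it runs without downloading anything; the object `toy_se`, its counts, and the `group` column are hypothetical. It assumes the SummarizedExperiment package is installed.

```r
library(SummarizedExperiment)

# Toy count matrix: 3 taxa (rows) by 2 samples (columns), hypothetical values
toy_counts <- matrix(
    c(10L, 0L, 3L, 5L, 2L, 8L),
    nrow = 3,
    dimnames = list(
        c("taxon_a", "taxon_b", "taxon_c"),
        c("sample_1", "sample_2")
    )
)

# Wrap the counts and some sample metadata in a SummarizedExperiment
toy_se <- SummarizedExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(
        group = c("case", "control"),
        row.names = colnames(toy_counts)
    )
)

# Extract the count matrix and the sample metadata
count_matrix <- assay(toy_se, "counts")
sample_info <- colData(toy_se)

# Total counts per sample (library size)
library_sizes <- colSums(count_matrix)
library_sizes
#> sample_1 sample_2 
#>       13       15 
```

With an imported dataset, the same calls would be made on `tse` instead of `toy_se`.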
Several datasets can be imported simultaneously by giving the names of the different datasets in a character vector:
list_tse <- getBenchmarkData(dats$Dataset[2:4], dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
str(list_tse, max.level = 1)
#> List of 3
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
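Since the result is a named list, each element can be summarised in a single pass with vapply(). The sketch below uses plain matrices as stand-ins for the imported objects, with hypothetical names and sizes; dim() works the same way on a TreeSummarizedExperiment, so the pattern should carry over to `list_tse`.

```r
# Toy named list standing in for the imported datasets (hypothetical sizes)
toy_list <- list(
    dataset_A = matrix(0L, nrow = 5, ncol = 3),
    dataset_B = matrix(0L, nrow = 8, ncol = 2)
)

# Collect the dimensions of every element into one table
dims <- t(vapply(toy_list, dim, integer(2)))
colnames(dims) <- c("features", "samples")

dims
#>           features samples
#> dataset_A        5       3
#> dataset_B        8       2
```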
To import all of the datasets at once, provide the dryrun = FALSE
argument alone.
mbd <- getBenchmarkData(dryrun = FALSE)
#> Finished HMP_2012_16S_gingival_V13.
#> Finished HMP_2012_16S_gingival_V35.
#> Finished HMP_2012_16S_gingival_V35_subset.
#> Finished HMP_2012_WMS_gingival.
#> Warning: No taxonomy_tree available for Ravel_2011_16S_BV.
#> Finished Ravel_2011_16S_BV.
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
str(mbd, max.level = 1)
#> List of 6
#> $ HMP_2012_16S_gingival_V13 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35 :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_16S_gingival_V35_subset:Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ HMP_2012_WMS_gingival :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Ravel_2011_16S_BV :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
#> $ Stammler_2016_16S_spikein :Formal class 'TreeSummarizedExperiment' [package "TreeSummarizedExperiment"] with 14 slots
The biological annotation of each taxon is provided as a column in the
rowData slot of the TreeSummarizedExperiment.
## In this case, the column is named taxon_annotation
tse <- mbd$HMP_2012_16S_gingival_V35_subset
rowData(tse)
#> DataFrame with 892 rows and 7 columns
#> kingdom phylum class order
#> <character> <character> <character> <character>
#> OTU_97.31247 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44487 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34979 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.34572 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.42259 Bacteria Firmicutes Bacilli Lactobacillales
#> ... ... ... ... ...
#> OTU_97.44294 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45429 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.44375 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45365 Bacteria Firmicutes Bacilli Lactobacillales
#> OTU_97.45307 Bacteria Firmicutes Bacilli Lactobacillales
#> family genus taxon_annotation
#> <character> <character> <character>
#> OTU_97.31247 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44487 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34979 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.34572 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.42259 Streptococcaceae Streptococcus facultative_anaerobic
#> ... ... ... ...
#> OTU_97.44294 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45429 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.44375 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45365 Streptococcaceae Streptococcus facultative_anaerobic
#> OTU_97.45307 Streptococcaceae Streptococcus facultative_anaerobic
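One way to use the annotation column in a benchmark is to aggregate counts by annotation group and compare the totals between conditions. The sketch below illustrates this with base R's rowsum() on toy data; the matrix `toy_taxa_counts` and the vector `toy_annotation` are hypothetical stand-ins for what would come from assay(tse, "counts") and rowData(tse)$taxon_annotation.

```r
# Toy count matrix: 3 taxa (rows) by 2 samples (columns), hypothetical values
toy_taxa_counts <- matrix(
    c(10, 0, 3, 5, 2, 8),
    nrow = 3,
    dimnames = list(
        c("taxon_a", "taxon_b", "taxon_c"),
        c("sample_1", "sample_2")
    )
)

# One annotation per taxon, in the same order as the matrix rows
toy_annotation <- c("aerobic", "anaerobic", "anaerobic")

# Sum the counts of all taxa that share an annotation, per sample
counts_by_annotation <- rowsum(toy_taxa_counts, group = toy_annotation)

counts_by_annotation
#>           sample_1 sample_2
#> aerobic         10        5
#> anaerobic        3       10
```

This is only an illustration of the data layout, not a differential abundance method.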
The datasets are cached so they’re only downloaded once. The cache and all of
the files contained in it can be removed with the removeCache function.
removeCache()
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] purrr_0.3.5 MicrobiomeBenchmarkData_1.0.0
#> [3] TreeSummarizedExperiment_2.6.0 Biostrings_2.66.0
#> [5] XVector_0.38.0 SingleCellExperiment_1.20.0
#> [7] SummarizedExperiment_1.28.0 Biobase_2.58.0
#> [9] GenomicRanges_1.50.0 GenomeInfoDb_1.34.0
#> [11] IRanges_2.32.0 S4Vectors_0.36.0
#> [13] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
#> [15] matrixStats_0.62.0 BiocStyle_2.26.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.4 sass_0.4.2 tidyr_1.2.1
#> [4] bit64_4.0.5 jsonlite_1.8.3 bslib_0.4.0
#> [7] assertthat_0.2.1 highr_0.9 BiocManager_1.30.19
#> [10] BiocFileCache_2.6.0 yulab.utils_0.0.5 blob_1.2.3
#> [13] GenomeInfoDbData_1.2.9 yaml_2.3.6 pillar_1.8.1
#> [16] RSQLite_2.2.18 lattice_0.20-45 glue_1.6.2
#> [19] digest_0.6.30 htmltools_0.5.3 Matrix_1.5-1
#> [22] pkgconfig_2.0.3 bookdown_0.29 zlibbioc_1.44.0
#> [25] tidytree_0.4.1 BiocParallel_1.32.0 tibble_3.1.8
#> [28] generics_0.1.3 withr_2.5.0 cachem_1.0.6
#> [31] lazyeval_0.2.2 cli_3.4.1 magrittr_2.0.3
#> [34] crayon_1.5.2 memoise_2.0.1 evaluate_0.17
#> [37] fansi_1.0.3 nlme_3.1-160 tools_4.2.1
#> [40] lifecycle_1.0.3 stringr_1.4.1 DelayedArray_0.24.0
#> [43] compiler_4.2.1 jquerylib_0.1.4 rlang_1.0.6
#> [46] grid_4.2.1 RCurl_1.98-1.9 rappdirs_0.3.3
#> [49] bitops_1.0-7 rmarkdown_2.17 codetools_0.2-18
#> [52] DBI_1.1.3 curl_4.3.3 R6_2.5.1
#> [55] knitr_1.40 dplyr_1.0.10 fastmap_1.1.0
#> [58] bit_4.0.4 utf8_1.2.2 filelock_1.0.2
#> [61] treeio_1.22.0 ape_5.6-2 stringi_1.7.8
#> [64] parallel_4.2.1 Rcpp_1.0.9 vctrs_0.5.0
#> [67] dbplyr_2.2.1 tidyselect_1.2.0 xfun_0.34