BgeeDB is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment tests based on the expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from human and many other animal species.
The package retrieves the annotation of RNA-seq, single cell full length RNA-seq and Affymetrix experiments integrated into the Bgee database, and downloads into R the quantitative data and expression calls produced by the Bgee pipeline. The package can also run GO-like enrichment analyses based on anatomical terms, where genes are mapped to anatomical terms by their expression patterns. These analyses rely on the topGO package and are equivalent to the TopAnat web service available at http://bgee.org/?page=top_anat#/, but with more flexibility in the choice of parameters and developmental stages.
In summary, the BgeeDB package allows you to:
1. List the annotation of RNA-seq, single cell full length RNA-seq and microarray data available in the Bgee database
2. Download the processed gene expression data available in the Bgee database
3. Download the gene expression calls and use them to perform TopAnat analyses
The pipeline used to generate Bgee expression data is documented and publicly available at https://github.com/BgeeDB/bgee_pipeline.
If you find a bug or have any issue using BgeeDB, please file a bug report in our GitHub issue tracker at https://github.com/BgeeDB/BgeeDB_R/issues.
In R:
#if (!requireNamespace("BiocManager", quietly=TRUE))
#install.packages("BiocManager")
#BiocManager::install("BgeeDB")
If BgeeDB is run on Windows, make sure curl is installed. It ships by default with Windows 10, version 1803 or later, and it is also installed together with Git for Windows. If curl is missing, install it before running BgeeDB to avoid potential timeout errors when downloading files.
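One way to check for curl from within R is sketched below. It only uses base R and looks for a `curl` executable on the PATH; the variable names and messages are illustrative and not part of BgeeDB.

```r
# Look up the curl executable on the PATH; Sys.which() returns "" when nothing is found
curl_path <- Sys.which("curl")
curl_available <- unname(nzchar(curl_path))

if (curl_available) {
  message("curl found at: ", curl_path)
} else {
  message("curl was not found on the PATH; BgeeDB downloads may time out.")
}
```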
The listBgeeSpecies() function retrieves the species available in the Bgee database, and the data types available for each species.
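A call of the following form would produce this kind of listing. It is a minimal sketch that assumes the BgeeDB package is installed and that the Bgee web service is reachable; `bgee_species` is just an illustrative variable name.

```r
library(BgeeDB)

# Query the current Bgee release for all species and their available data types
bgee_species <- listBgeeSpecies()
bgee_species
```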
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 15...
##
## Submitting URL to Bgee webservice... (https://bgee.org/bgee15_0/api/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.24.0)
##
## Query to Bgee webservice successful!
## ID GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX
## 1 6239 Caenorhabditis elegans nematode TRUE
## 2 7227 Drosophila melanogaster fruit fly TRUE
## 3 7237 Drosophila pseudoobscura FALSE
## 4 7240 Drosophila simulans FALSE
## 5 7740 Branchiostoma lanceolatum Common lancelet FALSE
## 6 7897 Latimeria chalumnae Coelacanth FALSE
## 7 7918 Lepisosteus oculatus Spotted gar FALSE
## 8 7936 Anguilla anguilla European freshwater eel FALSE
## 9 7955 Danio rerio zebrafish TRUE
## 10 7994 Astyanax mexicanus Blind cave fish FALSE
## 11 8010 Esox lucius Northern pike FALSE
## 12 8030 Salmo salar Atlantic salmon FALSE
## 13 8049 Gadus morhua Atlantic cod FALSE
## 14 8081 Poecilia reticulata Guppy FALSE
## 15 8090 Oryzias latipes Japanese rice fish FALSE
## 16 8154 Astatotilapia calliptera Eastern happy FALSE
## 17 8355 Xenopus laevis African clawed frog FALSE
## 18 8364 Xenopus tropicalis western clawed frog FALSE
## 19 9031 Gallus gallus chicken FALSE
## 20 9103 Meleagris gallopavo Wild turkey FALSE
## 21 9258 Ornithorhynchus anatinus platypus FALSE
## 22 9483 Callithrix jacchus White-tufted-ear marmoset FALSE
## 23 9531 Cercocebus atys Sooty mangabey FALSE
## 24 9541 Macaca fascicularis Crab-eating macaque FALSE
## 25 9544 Macaca mulatta macaque TRUE
## 26 9545 Macaca nemestrina Pig-tailed macaque FALSE
## 27 9555 Papio anubis Olive baboon FALSE
## 28 9593 Gorilla gorilla gorilla FALSE
## 29 9597 Pan paniscus bonobo FALSE
## 30 9598 Pan troglodytes chimpanzee FALSE
## 31 9606 Homo sapiens human TRUE
## 32 9615 Canis lupus familiaris dog FALSE
## 33 9685 Felis catus cat FALSE
## 34 9796 Equus caballus horse FALSE
## 35 9823 Sus scrofa pig FALSE
## 36 9913 Bos taurus cattle FALSE
## 37 9925 Capra hircus Goat FALSE
## 38 9940 Ovis aries Sheep FALSE
## 39 9974 Manis javanica Malayan pangolin FALSE
## 40 9986 Oryctolagus cuniculus rabbit FALSE
## 41 10090 Mus musculus mouse TRUE
## 42 10116 Rattus norvegicus rat TRUE
## 43 10141 Cavia porcellus guinea pig FALSE
## 44 10181 Heterocephalus glaber Naked mole rat FALSE
## 45 13616 Monodelphis domestica opossum FALSE
## 46 28377 Anolis carolinensis green anole FALSE
## 47 30608 Microcebus murinus Gray mouse lemur FALSE
## 48 32507 Neolamprologus brichardi Fairy cichlid FALSE
## 49 52904 Scophthalmus maximus Turbot FALSE
## 50 60711 Chlorocebus sabaeus Green monkey FALSE
## 51 69293 Gasterosteus aculeatus Three-spined stickleback FALSE
## 52 105023 Nothobranchius furzeri Turquoise killifish FALSE
## EST IN_SITU RNA_SEQ FULL_LENGTH
## 1 FALSE TRUE TRUE FALSE
## 2 TRUE TRUE TRUE FALSE
## 3 FALSE FALSE TRUE FALSE
## 4 FALSE FALSE TRUE FALSE
## 5 FALSE FALSE TRUE FALSE
## 6 FALSE FALSE TRUE FALSE
## 7 FALSE FALSE TRUE FALSE
## 8 FALSE FALSE TRUE FALSE
## 9 TRUE TRUE TRUE FALSE
## 10 FALSE FALSE TRUE FALSE
## 11 FALSE FALSE TRUE FALSE
## 12 FALSE FALSE TRUE FALSE
## 13 FALSE FALSE TRUE FALSE
## 14 FALSE FALSE TRUE FALSE
## 15 FALSE FALSE TRUE FALSE
## 16 FALSE FALSE TRUE FALSE
## 17 FALSE FALSE TRUE FALSE
## 18 TRUE TRUE TRUE FALSE
## 19 FALSE FALSE TRUE FALSE
## 20 FALSE FALSE TRUE FALSE
## 21 FALSE FALSE TRUE FALSE
## 22 FALSE FALSE TRUE FALSE
## 23 FALSE FALSE TRUE FALSE
## 24 FALSE FALSE TRUE FALSE
## 25 FALSE FALSE TRUE FALSE
## 26 FALSE FALSE TRUE FALSE
## 27 FALSE FALSE TRUE FALSE
## 28 FALSE FALSE TRUE FALSE
## 29 FALSE FALSE TRUE FALSE
## 30 FALSE FALSE TRUE FALSE
## 31 TRUE FALSE TRUE TRUE
## 32 FALSE FALSE TRUE FALSE
## 33 FALSE FALSE TRUE FALSE
## 34 FALSE FALSE TRUE FALSE
## 35 FALSE FALSE TRUE FALSE
## 36 FALSE FALSE TRUE FALSE
## 37 FALSE FALSE TRUE FALSE
## 38 FALSE FALSE TRUE FALSE
## 39 FALSE FALSE TRUE FALSE
## 40 FALSE FALSE TRUE FALSE
## 41 TRUE TRUE TRUE TRUE
## 42 FALSE FALSE TRUE FALSE
## 43 FALSE FALSE TRUE FALSE
## 44 FALSE FALSE TRUE FALSE
## 45 FALSE FALSE TRUE FALSE
## 46 FALSE FALSE TRUE FALSE
## 47 FALSE FALSE TRUE FALSE
## 48 FALSE FALSE TRUE FALSE
## 49 FALSE FALSE TRUE FALSE
## 50 FALSE FALSE TRUE FALSE
## 51 FALSE FALSE TRUE FALSE
## 52 FALSE FALSE TRUE FALSE
It is possible to list all species from a specific release of Bgee with the release argument (see the listBgeeRelease() function), and to order the species by a specific column with the ordering argument. For example:
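The call below sketches how both arguments can be combined. It assumes that the release is given as the string "13.2" and that ordering takes a column index (here 2, the GENUS column); check listBgeeRelease() for the release identifiers that are actually accepted. `species_13_2` is an illustrative variable name.

```r
library(BgeeDB)

# List species from Bgee release 13.2, ordered by the second column (GENUS)
species_13_2 <- listBgeeSpecies(release = "13.2", ordering = 2)
species_13_2
```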
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 13_2...
##
## Submitting URL to Bgee webservice... (https://r.bgee.org/bgee13/?page=species&display_type=tsv&source=BgeeDB_R_package&source_version=2.24.0)
##
## Query to Bgee webservice successful!
## ID GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX EST IN_SITU
## 17 28377 Anolis carolinensis anolis FALSE FALSE FALSE
## 13 9913 Bos taurus cow FALSE FALSE FALSE
## 1 6239 Caenorhabditis elegans c.elegans TRUE FALSE TRUE
## 3 7955 Danio rerio zebrafish TRUE TRUE TRUE
## 2 7227 Drosophila melanogaster fruitfly TRUE TRUE TRUE
## 5 9031 Gallus gallus chicken FALSE FALSE FALSE
## 8 9593 Gorilla gorilla gorilla FALSE FALSE FALSE
## 11 9606 Homo sapiens human TRUE TRUE FALSE
## 7 9544 Macaca mulatta macaque FALSE FALSE FALSE
## 16 13616 Monodelphis domestica opossum FALSE FALSE FALSE
## 14 10090 Mus musculus mouse TRUE TRUE TRUE
## 6 9258 Ornithorhynchus anatinus platypus FALSE FALSE FALSE
## 9 9597 Pan paniscus bonobo FALSE FALSE FALSE
## 10 9598 Pan troglodytes chimpanzee FALSE FALSE FALSE
## 15 10116 Rattus norvegicus rat FALSE FALSE FALSE
## 12 9823 Sus scrofa pig FALSE FALSE FALSE
## 4 8364 Xenopus tropicalis xenopus FALSE TRUE TRUE
## RNA_SEQ
## 17 TRUE
## 13 TRUE
## 1 TRUE
## 3 FALSE
## 2 FALSE
## 5 TRUE
## 8 TRUE
## 11 TRUE
## 7 TRUE
## 16 TRUE
## 14 TRUE
## 6 TRUE
## 9 TRUE
## 10 TRUE
## 15 TRUE
## 12 TRUE
## 4 TRUE
In the following example we focus on mouse (“Mus_musculus”) RNA-seq data. Species can be specified by name or by NCBI taxonomic ID. To download RNA-seq data, set the dataType argument to “rna_seq”. To download Affymetrix microarray data, set this argument to “affymetrix”. To download single cell full length RNA-seq data, set this argument to “sc_full_length”.
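A Bgee object for mouse RNA-seq data can be created as sketched below. The constructor form `Bgee$new(species = ..., dataType = ...)` is the one used elsewhere in this document; the call needs network access to Bgee.

```r
library(BgeeDB)

# Create a Bgee object targeting mouse RNA-seq data from the current release
bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")
```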
##
## Querying Bgee to get release information...
##
## Building URL to query species in Bgee release 15_0...
##
## Submitting URL to Bgee webservice... (https://bgee.org/bgee15_0/api/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.24.0)
##
## Query to Bgee webservice successful!
##
## API key built: b5105aa0a53ffca0a535714ae0a0c8daf3f0509ce249e445a2c95ed1a885669de5aa85fa0e961298dc576716804eff15689501fe19f6e7cb7824a07addfdde74
Note 1: It is possible to work with data from a specific release of Bgee by specifying the release argument; see the listBgeeRelease() function.
Note 2: The functions of the package store the downloaded files in a versioned folder, created by default in the working directory. These cached files allow faster re-access to the data. The directory where data are stored can be changed with the pathToData argument.
The getAnnotation() function outputs the list of RNA-seq experiments and libraries available in Bgee for mouse.
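A sketch of the corresponding call follows. It assumes a Bgee object created for mouse RNA-seq data; `annotation_bgee_mouse` is an illustrative variable name, and `head()` is applied only to keep the printed output short.

```r
library(BgeeDB)

# Bgee object for mouse RNA-seq data (requires network access)
bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")

# Retrieve the annotation of RNA-seq libraries and experiments
annotation_bgee_mouse <- getAnnotation(bgee)

# Show the first rows of the sample and experiment annotation tables
lapply(annotation_bgee_mouse, head)
```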
##
## Saved annotation files in /tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes/Mus_musculus_Bgee_15_0 folder.
## $sample.annotation
## Experiment.ID Library.ID Anatomical.entity.ID Anatomical.entity.name
## 1 SRP013027 SRX1603101 UBERON:0000160 intestine
## 2 SRP013027 SRX1603102 UBERON:0000160 intestine
## 3 SRP013027 SRX1603271 UBERON:0000160 intestine
## 4 SRP013027 SRX1603272 UBERON:0000160 intestine
## 5 SRP013027 SRX1603302 UBERON:0000945 stomach
## 6 SRP013027 SRX1603303 UBERON:0000945 stomach
## Stage.ID Stage.name Sex Strain
## 1 MmusDv:0000032 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 2 MmusDv:0000032 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 3 MmusDv:0000033 Theiler stage 24 (mouse) mixed C57BL/6Cr
## 4 MmusDv:0000033 Theiler stage 24 (mouse) mixed C57BL/6Cr
## 5 MmusDv:0000032 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 6 MmusDv:0000032 Theiler stage 23 (mouse) mixed C57BL/6Cr
## Expression.mapped.anatomical.entity.ID
## 1 UBERON:0000160
## 2 UBERON:0000160
## 3 UBERON:0000160
## 4 UBERON:0000160
## 5 UBERON:0000945
## 6 UBERON:0000945
## Expression.mapped.anatomical.entity.name Expression.mapped.stage.ID
## 1 intestine MmusDv:0000032
## 2 intestine MmusDv:0000032
## 3 intestine MmusDv:0000033
## 4 intestine MmusDv:0000033
## 5 stomach MmusDv:0000032
## 6 stomach MmusDv:0000032
## Expression.mapped.stage.name Expression.mapped.sex Expression.mapped.strain
## 1 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 2 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 3 Theiler stage 24 (mouse) mixed C57BL/6Cr
## 4 Theiler stage 24 (mouse) mixed C57BL/6Cr
## 5 Theiler stage 23 (mouse) mixed C57BL/6Cr
## 6 Theiler stage 23 (mouse) mixed C57BL/6Cr
## Platform.ID Protocol Library.type Library.orientation
## 1 Illumina HiSeq 2500 polyA single NA
## 2 Illumina HiSeq 2500 polyA single NA
## 3 Illumina HiSeq 2500 polyA single NA
## 4 Illumina HiSeq 2500 polyA single NA
## 5 Illumina HiSeq 2500 polyA single NA
## 6 Illumina HiSeq 2500 polyA single NA
## TMM.normalization.factor TPM.expression.threshold Read.count
## 1 1.044514 0.152218 50765810
## 2 1.015358 0.137860 60172356
## 3 0.937326 0.117432 54500508
## 4 0.891839 0.117303 53195360
## 5 1.121846 0.185698 101346283
## 6 1.062458 0.178238 66084918
## Mapped.read.count Min.read.length Max.read.length All.genes.percent.present
## 1 40058638 100 100 43.07
## 2 48356326 100 100 43.02
## 3 46002531 100 100 42.96
## 4 45340369 100 100 42.16
## 5 54759626 100 100 45.49
## 6 51138301 100 100 42.90
## Protein.coding.genes.percent.present Intergenic.regions.percent.present
## 1 73.52 4.88
## 2 73.78 4.83
## 3 74.11 4.75
## 4 73.70 4.67
## 5 74.12 5.20
## 6 73.84 4.89
## Distinct.rank.count Max.rank.in.the.expression.mapped.condition
## 1 27230 NA
## 2 27211 NA
## 3 26810 NA
## 4 26239 NA
## 5 29490 NA
## 6 27754 NA
## Run.IDs Data.source
## 1 SRR3191967 SRA
## 2 SRR3191968 SRA
## 3 SRR3192253 SRA
## 4 SRR3192254 SRA
## 5 SRR3192290|SRR3192291|SRR3192292 SRA
## 6 SRR3192293 SRA
## Data.source.URL
## 1 https://www.ncbi.nlm.nih.gov/sra/SRX1603101
## 2 https://www.ncbi.nlm.nih.gov/sra/SRX1603102
## 3 https://www.ncbi.nlm.nih.gov/sra/SRX1603271
## 4 https://www.ncbi.nlm.nih.gov/sra/SRX1603272
## 5 https://www.ncbi.nlm.nih.gov/sra/SRX1603302
## 6 https://www.ncbi.nlm.nih.gov/sra/SRX1603303
## Bgee.normalized.data.URL
## 1 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 2 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 3 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 4 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 5 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 6 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## Raw.file.URL
## 1 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603101
## 2 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603102
## 3 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603271
## 4 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603272
## 5 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603302
## 6 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX1603303
##
## $experiment.annotation
## Experiment.ID
## 1 SRP013027
## 2 SRP020526
## 3 ERP104395
## 4 GSE30617
## 5 SRP028336
## 6 GSE41637
## Experiment.name
## 1 GSE37909: RNA-seq from ENCODE/Caltech (Mouse)
## 2 Mus musculus strain:black6 x castaneus Transcriptome or Gene expression
## 3 An RNASeq normal tissue atlas for mouse and rat
## 4 [E-MTAB-599] Mouse Transcriptome
## 5 Large-scale multi-species survey of metabolome and lipidome
## 6 Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## Library.count Condition.count Organ.stage.count Organ.count Stage.count
## 1 97 47 47 12 6
## 2 61 51 26 26 4
## 3 38 13 13 13 1
## 4 36 6 6 6 1
## 5 30 10 5 5 1
## 6 26 26 9 9 1
## Sex.count Strain.count Data.source
## 1 1 1 SRA
## 2 2 2 SRA
## 3 1 1 SRA
## 4 1 1 GEO
## 5 2 1 SRA
## 6 2 3 GEO
## Data.source.URL
## 1 https://www.ncbi.nlm.nih.gov/sra/SRP013027
## 2 https://www.ncbi.nlm.nih.gov/sra/SRP020526
## 3 https://www.ncbi.nlm.nih.gov/sra/ERP104395
## 4 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 5 https://www.ncbi.nlm.nih.gov/sra/SRP028336
## 6 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## Bgee.normalized.data.URL
## 1 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP013027.tsv.gz
## 2 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP020526.tsv.gz
## 3 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.gz
## 4 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.gz
## 5 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP028336.tsv.gz
## 6 https://bgee.org/ftp/bgee_v15_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE41637.tsv.gz
## Experiment.description
## 1 Overall Design: Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell/mouse). Cells were lysed in RLT buffer (Qiagen RNEasy kit), and processed on RNEasy midi columns according to the manufacturer's protocol, with the inclusion of the *on-column* DNAse digestion step to remove residual genomic DNA. A quantity of 75 µgs of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. A quantity of 100 ngs of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Illumina GAIIx or HiSeq platforms according to the protocol for the ChIP-Seq DNA genomic DNA kit (Illumina). Paired-end libraries were size-selected around 200 bp (fragment length). Libraries were sequenced with the Illumina HiSeq according to the manufacturer's recommendations. Paired-end reads of 100 bp length were obtained.Reads were mapped to the reference mouse genome (version mm9 with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes in all cases) using TopHat (version 1.3.1) (http://tophat.cbcb.umd.edu/). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance and supplying known ENSEMBL version 63 splice junctions.
## 2 RNA-Seq of reciprocally crossed Black6 x CAST hybrid mouse tissues.
## 3 The function of a gene is closely connected to its expression specificity across tissues and cell types. RNA-Seq is a powerful quantitative tool to explore genome wide expression. The aim of the present study is to provide a comprehensive RNA-Seq dataset across the same 13 tissues for mouse and rat, two of the most relevant species for biomedical research. The dataset provides the transcriptome across tissues from three male C57BL6 mice and three male Han Wistar rats. We also describe our bioinformatics pipeline to process and technically validate the data. Principal component analysis shows that tissue samples from both species cluster similarly. By comparative genomics we show that many genes with high sequence identity with respect to their human orthologues have also a highly correlated tissue distribution profile and are in agreement with manually curated literature data for human. These results make us confident that the present study provides a unique resource for comparative genomics and will facilitate the analysis of tissue specificity and cross-species conservation in higher organisms.
## 4 Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 5 This dataset was generated with the goal of comparative study of gene expression in three brain regions and two non-neural tissues of humans, chimpanzees, macaque monkeys and mice. Using this dataset, we performed studies of gene expression and gene splicing evolution across species and search of tissue-specific gene expression and splicing patterns. We also used the gene expression information of genes encoding metabolic enzymes in this dataset to support a larger comparative study of metabolome evolution in the same set of tissues and species. Overall design: 120 tissue samples of prefrontal cortex (PFC), primary visual cortex (VC), cerebellar cortex (CBC), kidney and skeletal muscle of humans, chimpanzees, macaques and mice. The data accompanies a large set of metabolite measurements of the same tissue samples. Enzyme expression was used to validate metabolite measurement variation among species.
## 6 Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
The getData() function downloads processed RNA-seq data from all mouse experiments in Bgee as a data frame.
## downloading data from Bgee FTP...
## You tried to download more than 15 experiments, because of that all the Bgee data for this species will be downloaded.
## Downloading all expression data for species Mus_musculus
## Saved expression data file in /tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes/Mus_musculus_Bgee_15_0 folder. Now untar file...
## Finished uncompress tar files
## Save data in local sqlite database
## Load queried data. The query is : SELECT * from rna_seq
## [1] 17
## $Experiment.ID
## [1] "GSE43520" "GSE43520" "GSE43520" "GSE43520" "GSE43520" "GSE43520"
##
## $Library.ID
## [1] "SRX217693" "SRX217693" "SRX217693" "SRX217693" "SRX217693" "SRX217693"
##
## $Library.type
## [1] "single" "single" "single" "single" "single" "single"
##
## $Gene.ID
## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
##
## $Anatomical.entity.ID
## [1] "UBERON:0000473" "UBERON:0000473" "UBERON:0000473" "UBERON:0000473"
## [5] "UBERON:0000473" "UBERON:0000473"
##
## $Anatomical.entity.name
## [1] "\"testis\"" "\"testis\"" "\"testis\"" "\"testis\"" "\"testis\""
## [6] "\"testis\""
##
## $Stage.ID
## [1] "UBERON:0000113" "UBERON:0000113" "UBERON:0000113" "UBERON:0000113"
## [5] "UBERON:0000113" "UBERON:0000113"
##
## $Stage.name
## [1] "\"post-juvenile\"" "\"post-juvenile\"" "\"post-juvenile\""
## [4] "\"post-juvenile\"" "\"post-juvenile\"" "\"post-juvenile\""
##
## $Sex
## [1] "male" "male" "male" "male" "male" "male"
##
## $Strain
## [1] "\"C57BL/6J\"" "\"C57BL/6J\"" "\"C57BL/6J\"" "\"C57BL/6J\"" "\"C57BL/6J\""
## [6] "\"C57BL/6J\""
##
## $Read.count
## [1] 1240.0000 0.0000 589.9095 71.0000 719.0002 755.0000
##
## $TPM
## [1] 14.139628 0.000000 11.807864 2.623488 9.469418 31.708227
##
## $FPKM
## [1] 11.434714 0.000000 9.549016 2.121614 7.657916 25.642435
##
## $Rank
## [1] "6907.00" NA "7546.00" "13224.00" "8320.00" "4414.00"
##
## $Detection.flag
## [1] "present" "absent" "present" "present" "present" "present"
##
## $pValue
## [1] "0.000017570883466797000000000000" NA
## [3] "0.000030680266567301400000000000" "0.001726530042602290000000000000"
## [5] "0.000059401101809800200000000000" "0.000001185680628175260000000000"
##
## $State.in.Bgee
## [1] "Part of a call"
## [2] "Result excluded, reason: pre-filtering"
## [3] "Part of a call"
## [4] "Part of a call"
## [5] "Part of a call"
## [6] "Part of a call"
The result of the getData() function is a data frame. Each row corresponds to a gene in a given library, and expression levels are reported as raw read counts, RPKMs (up to Bgee 13.2), TPMs (from Bgee 14.0), or FPKMs (from Bgee 14.0). A detection flag indicates whether the gene is significantly expressed above the background level of expression. From Bgee 15.0, a pValue column quantifies how significantly the gene is expressed above background (the detection flag is still available, and a gene is considered present if pValue < 0.05).
Note: If microarray data are downloaded, rows correspond to probesets, and log2 expression intensities are available instead of read counts/RPKMs/TPMs/FPKMs.
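The relation between pValue and the detection flag can be illustrated on a toy data frame. This is a base R sketch, not BgeeDB code: the table is built by hand with three values copied from the output above, and treating a missing pValue as "absent" is an assumption made for the example.

```r
# Toy data frame mimicking two getData() columns (values copied from the output above)
toy <- data.frame(
  Gene.ID = c("ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028"),
  pValue  = c("0.000017570883466797", NA, "0.0000306802665673014"),
  stringsAsFactors = FALSE
)

# pValue is printed as text in the output above, so convert it before comparing
toy$pValue_num <- as.numeric(toy$pValue)

# Assumption: a missing pValue is treated as "absent"
toy$flag <- ifelse(!is.na(toy$pValue_num) & toy$pValue_num < 0.05, "present", "absent")
toy
```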
Alternatively, you can download data from a single RNA-seq experiment with the experimentId parameter:
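A sketch of such a call, using the experiment GSE30617 that appears in the query logged below, is given here. It assumes a Bgee object for mouse RNA-seq data and network access to Bgee.

```r
library(BgeeDB)

# Bgee object for mouse RNA-seq data (requires network access)
bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")

# Download processed data for a single experiment
data_bgee_mouse_gse30617 <- getData(bgee, experimentId = "GSE30617")
head(data_bgee_mouse_gse30617)
```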
## Load queried data. The query is : SELECT * from rna_seq WHERE [Experiment.ID] = "GSE30617"
It is possible to download data by combining filters:
* experimentId: one or more experiment IDs,
* sampleId: one or more sample IDs (i.e. libraryId for RNA-Seq and ChipId for Affymetrix),
* anatEntityId: one or more anatomical entity IDs from the UBERON ontology (https://uberon.github.io/),
* stageId: one or more developmental stage IDs from the developmental stage ontologies (https://github.com/obophenotype/developmental-stage-ontologies),
* cellTypeId: one or more cell types, only for the single cell data type (from Bgee 15.0),
* sex: one or more sexes (from Bgee 15.0),
* strain: one or more strains (from Bgee 15.0).
# Examples of data download using different filter combinations
# retrieve mouse RNA-Seq data for brain (UBERON:0000955) or heart (UBERON:0000948)
data_bgee_mouse_filters <- getData(bgee, anatEntityId = c("UBERON:0000955","UBERON:0000948"))
# retrieve mouse RNA-Seq data for brain (UBERON:0000955) or heart (UBERON:0000948) that are part of the experiment GSE30617
data_bgee_mouse_filters <- getData(bgee, experimentId = "GSE30617", anatEntityId = c("UBERON:0000955","UBERON:0000948"))
# retrieve mouse RNA-Seq data for brain (UBERON:0000955) or heart (UBERON:0000948) from post-embryonic stage (UBERON:0000092)
data_bgee_mouse_filters <- getData(bgee, stageId = "UBERON:0000092", anatEntityId = c("UBERON:0000955","UBERON:0000948"))
It is sometimes easier to work with data organized as a matrix, where rows represent genes or probesets and columns represent samples. The formatData() function reformats the data into an ExpressionSet object including:
* An expression data matrix, with genes or probesets as rows and samples as columns (assayData slot). The stats argument selects whether the matrix is filled with read counts, RPKMs (up to Bgee 13.2), FPKMs (from Bgee 14.0), or TPMs (from Bgee 14.0) for RNA-seq data. For microarray data the matrix is filled with log2 expression intensities.
* A data frame listing the samples and their anatomical structure and developmental stage annotation (phenoData slot)
* For microarray data, the mapping from probesets to Ensembl genes (featureData slot)
The callType argument retains only actively expressed genes or probesets when set to “present” or “present high quality”. Genes or probesets that are absent in a given sample are given NA values.
# use only present calls and fill expression matrix with FPKM values
gene.expression.mouse.fpkm <- formatData(bgee, data_bgee_mouse_gse30617, callType = "present", stats = "fpkm")
##
## Extracting expression data matrix...
## Keeping only present genes.
## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `spread()` instead.
## ℹ The deprecated feature was likely used in the BgeeDB package.
## Please report the issue at <https://github.com/BgeeDB/BgeeDB_R/issues>.
##
## Extracting features information...
##
## Extracting samples information...
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 55487 features, 36 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: ERX012344 ERX012345 ... ERX012379 (36 total)
## varLabels: Library.ID Anatomical.entity.ID ... Stage.name (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000118659 (55487 total)
## fvarLabels: Gene.ID
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
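To make the reshaping more concrete, the base R sketch below mimics the idea on a toy table: values from absent calls are masked with NA, then the long table is pivoted into a genes-by-samples matrix. This only illustrates the concept and is not the implementation used by formatData(); all gene IDs, library IDs and values are hypothetical.

```r
# Toy long-format table: one row per gene and library (hypothetical IDs and values)
long <- data.frame(
  Gene.ID        = c("geneA", "geneB", "geneA", "geneB"),
  Library.ID     = c("lib1", "lib1", "lib2", "lib2"),
  FPKM           = c(11.4, 0.0, 9.5, 2.1),
  Detection.flag = c("present", "absent", "present", "present"),
  stringsAsFactors = FALSE
)

# Mask values from absent calls, mirroring the effect of callType = "present"
long$value <- ifelse(long$Detection.flag == "present", long$FPKM, NA)

# Build an empty genes x samples matrix, then fill it by (gene, library) name pairs
genes   <- sort(unique(long$Gene.ID))
samples <- sort(unique(long$Library.ID))
expr <- matrix(NA_real_, nrow = length(genes), ncol = length(samples),
               dimnames = list(genes, samples))
expr[cbind(long$Gene.ID, long$Library.ID)] <- long$value
expr
```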
For documentation on the TopAnat analysis, please refer to our publications or to the web-tool page (http://bgee.org/?page=top_anat#/).
As in the quantitative data download example above, the first step of a topAnat analysis is to build an object of the Bgee class. For this example, we focus on zebrafish:
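A sketch of the corresponding call is given here. No dataType is passed, which matches the NOTE printed in the log below; the call requires network access to Bgee.

```r
library(BgeeDB)

# Create a Bgee object for zebrafish; with no dataType given, all data types are used
bgee <- Bgee$new(species = "Danio_rerio")
```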
##
## NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ","sc_full_length") for the next steps.
##
## Querying Bgee to get release information...
##
## NOTE: the file describing Bgee species information for release 15_0 was found in the download directory /tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes. Data will not be redownloaded.
##
## API key built: b5105aa0a53ffca0a535714ae0a0c8daf3f0509ce249e445a2c95ed1a885669de5aa85fa0e961298dc576716804eff15689501fe19f6e7cb7824a07addfdde74
Note: We are free to specify any data type of interest using the dataType argument, among rna_seq, sc_full_length, affymetrix, est or in_situ, or even a combination of data types. If nothing is specified, as in the above example, all data types available for the targeted species are used. This is equivalent to specifying dataType=c("rna_seq","sc_full_length","affymetrix","est","in_situ").
The loadTopAnatData() function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.
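The loading step can be sketched as follows, assuming a Bgee object for zebrafish and network access to Bgee. The variable name `myTopAnatData` follows the example given later in this document; `str()` is used only to inspect the returned object.

```r
library(BgeeDB)

# Bgee object for zebrafish, all data types (requires network access)
bgee <- Bgee$new(species = "Danio_rerio")

# Load the gene-to-anatomy mapping, the ontology structure and the term names
myTopAnatData <- loadTopAnatData(bgee)

# Inspect the structure of the returned object
str(myTopAnatData)
```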
##
## Building URLs to retrieve organ relationships from Bgee.........
## URL successfully built (https://bgee.org/bgee15_0/api/?page=r_package&action=get_anat_entity_relations&display_type=tsv&species_list=7955&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=b5105aa0a53ffca0a535714ae0a0c8daf3f0509ce249e445a2c95ed1a885669de5aa85fa0e961298dc576716804eff15689501fe19f6e7cb7824a07addfdde74&source=BgeeDB_R_package&source_version=2.24.0)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes/Danio_rerio_Bgee_15_0"
##
## Building URLs to retrieve organ names from Bgee.................
## URL successfully built (https://bgee.org/bgee15_0/api/?page=r_package&action=get_anat_entities&display_type=tsv&species_list=7955&attr_list=ID&attr_list=NAME&api_key=b5105aa0a53ffca0a535714ae0a0c8daf3f0509ce249e445a2c95ed1a885669de5aa85fa0e961298dc576716804eff15689501fe19f6e7cb7824a07addfdde74&source=BgeeDB_R_package&source_version=2.24.0)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes/Danio_rerio_Bgee_15_0"
##
## Building URLs to retrieve mapping of gene to organs from Bgee...
## URL successfully built (https://bgee.org/bgee15_0/api/?page=r_package&action=get_expression_calls&display_type=tsv&species_list=7955&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=b5105aa0a53ffca0a535714ae0a0c8daf3f0509ce249e445a2c95ed1a885669de5aa85fa0e961298dc576716804eff15689501fe19f6e7cb7824a07addfdde74&source=BgeeDB_R_package&source_version=2.24.0&data_qual=SILVER)
## Submitting URL to Bgee webservice (can be long)
## Got results from Bgee webservice. Files are written in "/tmp/Rtmpqaykhr/Rbuild3cd71933e7897e/BgeeDB/vignettes/Danio_rerio_Bgee_15_0"
##
## Parsing the results.............................................
##
## Adding BGEE:0 as unique root of all terms of the ontology.......
##
## Done.
The stringency on the quality of expression calls can be changed with the confidence argument. Finally, if you are interested in expression data coming from a particular developmental stage or a group of stages, specify a Uberon stage ID in the stage argument.
## Loading silver and gold expression calls from affymetrix data made on embryonic samples only
## This is just given as an example, but is not run in this vignette because only few data are returned
bgee <- Bgee$new(species = "Danio_rerio", dataType="affymetrix")
myTopAnatData <- loadTopAnatData(bgee, stage="UBERON:0000068", confidence="silver")
Note 1: As mentioned above, the downloaded data files are stored in a versioned folder that can be set with the pathToData argument when creating the Bgee class object (default is the working directory). If you query Bgee again with the exact same parameters, these cached files will be read instead of querying the web-service again. If you plan to reuse the same data for multiple parallel topAnat analyses, it is thus important to rely on these cached files instead of redownloading them for each analysis. The cached files also make it possible to repeat analyses offline.
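The caching behaviour described in Note 1 can be pictured with a small base R sketch. This is not BgeeDB's implementation: `cached_fetch`, `slow_query`, the folder name and the key are all hypothetical, and the "query" is a stand-in function rather than a call to the web-service.

```r
# Toy parameter-keyed file cache (illustration only, not BgeeDB's code).
# `fetch_fun` stands in for a slow web-service query; all names are hypothetical.
cached_fetch <- function(cache_dir, key, fetch_fun) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  cache_file <- file.path(cache_dir, paste0(key, ".tsv"))
  if (!file.exists(cache_file)) {
    # First call with this key: run the query and store the result on disk
    write.table(fetch_fun(), cache_file, sep = "\t", quote = FALSE, row.names = FALSE)
  }
  # Every call reads the local file
  read.delim(cache_file, stringsAsFactors = FALSE)
}

n_queries <- 0
slow_query <- function() {
  n_queries <<- n_queries + 1  # count how often the "web-service" is hit
  data.frame(GENE_ID = c("gene1", "gene2"),
             ANAT_ENTITY_ID = c("UBERON:0000151", "UBERON:0005419"),
             stringsAsFactors = FALSE)
}

cache_dir <- file.path(tempdir(), "Danio_rerio_Bgee_toy")
unlink(cache_dir, recursive = TRUE)  # start from an empty cache

first  <- cached_fetch(cache_dir, "calls_silver", slow_query)
second <- cached_fetch(cache_dir, "calls_silver", slow_query)
n_queries
```

The second call with the same key never runs the query function, which is the property that makes repeated or offline analyses cheap.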
Note 2: In releases up to Bgee 13.2, allowed `confidence` values were `low_quality` or `high_quality`. From Bgee 14.0, `confidence` values are `gold` or `silver`.
First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata object in the topGO package: a vector with background genes as names, and 0 or 1 values depending on whether a gene is in the foreground or not. In this example we will look at genes with an annotated phenotype related to “pectoral fin”. We use the biomaRt package to retrieve this list of genes. We expect them to be enriched for expression in pectoral fins and related structures. The background list of genes is set to all genes with at least one phenotype annotated in ZFIN, which accounts for biases in which types of genes are more likely to receive phenotype annotations.
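Before running the real query, the expected input format can be illustrated with a toy example in base R. The gene identifiers below are made up, and `toyGeneList` is a hypothetical name used only here.

```r
# Made-up gene identifiers, for illustration only
background <- c("geneA", "geneB", "geneC", "geneD", "geneE")
foreground <- c("geneB", "geneE")

# 1 if the background gene is in the foreground, 0 otherwise
toyGeneList <- factor(as.integer(background %in% foreground))
names(toyGeneList) <- background

toyGeneList
table(toyGeneList)
```

Every foreground gene must also be part of the background, since the names of the vector define the universe against which enrichment is tested.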
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("biomaRt")
library(biomaRt)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="drerio_gene_ensembl", host="mar2016.archive.ensembl.org")
# Get the mapping of Ensembl genes to phenotypes. These genes form the background
universe <- getBM(filters=c("phenotype_source"), values=c("ZFIN"), attributes=c("ensembl_gene_id","phenotype_description"), mart=ensembl)
# select phenotypes related to pectoral fin
phenotypes <- grep("pectoral fin", unique(universe$phenotype_description), value=TRUE)
# Foreground genes are those with an annotated phenotype related to "pectoral fin"
myGenes <- unique(universe$ensembl_gene_id[universe$phenotype_description %in% phenotypes])
# Prepare the gene list vector
geneList <- factor(as.integer(unique(universe$ensembl_gene_id) %in% myGenes))
names(geneList) <- unique(universe$ensembl_gene_id)
summary(geneList)
# Prepare the topGO object
myTopAnatObject <- topAnat(myTopAnatData, geneList)
The above code using the biomaRt package is not executed in this vignette, to avoid build failures of the package in case of biomaRt downtime. Instead we use a geneList object saved in the data/ folder that we built using pre-downloaded data.
##
## Checking topAnatData object.............
##
## Checking gene list......................
##
## WARNING: Some genes in your gene list have no expression data in Bgee, and will not be included in the analysis. 2959 genes in background will be kept.
##
## Building most specific Ontology terms... ( 1316 Ontology terms found. )
##
## Building DAG topology................... ( 2262 Ontology terms and 4630 relations. )
##
## Annotating nodes (Can be long).......... ( 2959 genes annotated to the Ontology terms. )
Warning: This can be long, especially if the gene list is large, since the Uberon anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.
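The propagation mentioned in this warning can be sketched with a toy hierarchy. This is a simplification under stated assumptions: each term has a single parent here, whereas the real ontology is a much larger graph, and the relations, gene names and helper `with_ancestors` are made up for illustration.

```r
# Toy child -> parent relations (simplified: one parent per term)
parent_of <- c(forebrain = "brain", brain = "nervous system")

# Return a term together with all of its ancestors
with_ancestors <- function(term) {
  out <- term
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    out <- c(out, term)
  }
  out
}

# Direct expression calls: gene -> most specific structure (made-up data)
direct_calls <- list(gene1 = "forebrain", gene2 = "brain")

# Propagate each call to every parent structure
propagated <- lapply(direct_calls, function(terms) {
  unique(unlist(lapply(terms, with_ancestors)))
})
propagated
```

Each gene ends up annotated to its own structure and to every structure above it, which is why the annotation step grows with both the gene list and the depth of the ontology.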
For this step, see the vignette of the topGO package for more details, since you directly use the tests implemented in topGO, as shown in this example:
results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')
##
## -- Weight Algorithm --
##
## The algorithm is scoring 1019 nontrivial nodes
## parameters:
## test statistic: fisher : ratio
##
## Level 29: 1 nodes to be scored.
##
## Level 28: 1 nodes to be scored.
##
## Level 27: 1 nodes to be scored.
##
## Level 26: 3 nodes to be scored.
##
## Level 25: 5 nodes to be scored.
##
## Level 24: 6 nodes to be scored.
##
## Level 23: 8 nodes to be scored.
##
## Level 22: 22 nodes to be scored.
##
## Level 21: 22 nodes to be scored.
##
## Level 20: 29 nodes to be scored.
##
## Level 19: 40 nodes to be scored.
##
## Level 18: 71 nodes to be scored.
##
## Level 17: 69 nodes to be scored.
##
## Level 16: 88 nodes to be scored.
##
## Level 15: 103 nodes to be scored.
##
## Level 14: 115 nodes to be scored.
##
## Level 13: 106 nodes to be scored.
##
## Level 12: 91 nodes to be scored.
##
## Level 11: 76 nodes to be scored.
##
## Level 10: 55 nodes to be scored.
##
## Level 9: 31 nodes to be scored.
##
## Level 8: 26 nodes to be scored.
##
## Level 7: 20 nodes to be scored.
##
## Level 6: 17 nodes to be scored.
##
## Level 5: 6 nodes to be scored.
##
## Level 4: 3 nodes to be scored.
##
## Level 3: 2 nodes to be scored.
##
## Level 2: 1 nodes to be scored.
##
## Level 1: 1 nodes to be scored.
Warning: This can be long because of the size of the ontology. Consider running scripts in batch mode if you have multiple analyses to do.
The makeTable function filters and formats the test results, and calculates FDR values.
# Display results significant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, cutoff = 0.01)
##
## Building the results table for the 9 significant terms at FDR threshold of 0.01...
## Ordering results by pValue column in increasing order...
## Done
## organId organName annotated significant
## UBERON:0000151 UBERON:0000151 pectoral fin 410 60
## UBERON:0005419 UBERON:0005419 pectoral appendage bud 133 31
## UBERON:2000040 UBERON:2000040 median fin fold 53 15
## UBERON:0005729 UBERON:0005729 pectoral appendage field 14 8
## UBERON:0006598 UBERON:0006598 presumptive structure 1803 122
## UBERON:0004376 UBERON:0004376 fin bone 27 9
## expected foldEnrichment pValue FDR
## UBERON:0000151 20.23 2.965892 3.105066e-15 3.533565e-12
## UBERON:0005419 6.56 4.725610 3.953367e-14 2.249466e-11
## UBERON:2000040 2.62 5.725191 1.517744e-08 5.757308e-06
## UBERON:0005729 0.69 11.594203 6.792283e-08 1.932405e-05
## UBERON:0006598 88.96 1.371403 1.928360e-06 4.388947e-04
## UBERON:0004376 1.33 6.766917 2.968197e-06 5.629680e-04
When this text was written (June 2018), there were 27 significant anatomical structures; the run shown above, against Bgee 15.0, reports 9. The first term is “pectoral fin”, followed here by “pectoral appendage bud” (“paired limb/fin bud” in the June 2018 results). Other terms in the list, especially those with high enrichment folds, are clearly related to pectoral fins or substructures of fins. This analysis shows that genes with phenotypic effects on pectoral fins are specifically expressed in or next to these structures.
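In the table above, the foldEnrichment column appears to be the ratio of the significant count to the expected count. The sketch below recomputes it from three of the displayed rows; how makeTable computes it internally is not stated here, so treat this as a consistency check rather than a description of the function.

```r
# Values copied from the first three rows of the results table
res <- data.frame(
  organId     = c("UBERON:0000151", "UBERON:0005419", "UBERON:2000040"),
  significant = c(60, 31, 15),
  expected    = c(20.23, 6.56, 2.62),
  stringsAsFactors = FALSE
)

# Observed number of foreground genes divided by the number expected by chance
res$foldEnrichment <- res$significant / res$expected
res
```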
By default results are sorted by p-value, but this can be changed with the ordering parameter by specifying which column should be used to order the results (preceded by a “-” sign to indicate that ordering should be made in decreasing order). For example, it is often convenient to sort significant structures by decreasing enrichment fold, using ordering = -6. The full table of results can be obtained using cutoff = 1.
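The effect of ordering = -6 can be mimicked on a small data frame with base R. This sketch does not call makeTable; it only shows what sorting by decreasing values of the sixth column (foldEnrichment) means, using three rows copied from the table above.

```r
toy <- data.frame(
  organId        = c("UBERON:0000151", "UBERON:0005419", "UBERON:0005729"),
  organName      = c("pectoral fin", "pectoral appendage bud", "pectoral appendage field"),
  annotated      = c(410, 133, 14),
  significant    = c(60, 31, 8),
  expected       = c(20.23, 6.56, 0.69),
  foldEnrichment = c(2.965892, 4.725610, 11.594203),
  stringsAsFactors = FALSE
)

# Column 6 is foldEnrichment; negating it sorts in decreasing order
sorted <- toy[order(-toy[[6]]), ]
sorted$organName
```

Small structures with few annotated genes tend to move to the top under this ordering, so it is worth reading the fold alongside the annotated and significant counts.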
Of note, it is possible to retrieve for a particular tissue the significant genes that were mapped to it.
# In order to retrieve significant genes mapped to the term "paired limb/fin bud"
term <- "UBERON:0004357"
termStat(myTopAnatObject, term)
## Annotated Significant Expected
## UBERON:0004357 172 37 8.49
genesInTerm(myTopAnatObject, term)
## $`UBERON:0004357`
## [1] "ENSDARG00000001057" "ENSDARG00000001785" "ENSDARG00000002445"
## [4] "ENSDARG00000002487" "ENSDARG00000002795" "ENSDARG00000002952"
## [7] "ENSDARG00000003293" "ENSDARG00000003399" "ENSDARG00000004954"
## [10] "ENSDARG00000005479" "ENSDARG00000005645" "ENSDARG00000005762"
## [13] "ENSDARG00000006921" "ENSDARG00000007407" "ENSDARG00000007438"
## [16] "ENSDARG00000007918" "ENSDARG00000008305" "ENSDARG00000008886"
## [19] "ENSDARG00000009438" "ENSDARG00000009534" "ENSDARG00000010192"
## [22] "ENSDARG00000011027" "ENSDARG00000011247" "ENSDARG00000011407"
## [25] "ENSDARG00000011618" "ENSDARG00000012078" "ENSDARG00000012422"
## [28] "ENSDARG00000012824" "ENSDARG00000013144" "ENSDARG00000013409"
## [31] "ENSDARG00000013881" "ENSDARG00000014091" "ENSDARG00000014626"
## [34] "ENSDARG00000014796" "ENSDARG00000015554" "ENSDARG00000015674"
## [37] "ENSDARG00000016022" "ENSDARG00000016454" "ENSDARG00000016858"
## [40] "ENSDARG00000017219" "ENSDARG00000017369" "ENSDARG00000018426"
## [43] "ENSDARG00000018460" "ENSDARG00000018492" "ENSDARG00000018902"
## [46] "ENSDARG00000019260" "ENSDARG00000019353" "ENSDARG00000019579"
## [49] "ENSDARG00000019995" "ENSDARG00000020143" "ENSDARG00000021442"
## [52] "ENSDARG00000021938" "ENSDARG00000022280" "ENSDARG00000024561"
## [55] "ENSDARG00000024894" "ENSDARG00000025081" "ENSDARG00000025641"
## [58] "ENSDARG00000025891" "ENSDARG00000028071" "ENSDARG00000029764"
## [61] "ENSDARG00000030110" "ENSDARG00000030756" "ENSDARG00000031222"
## [64] "ENSDARG00000031894" "ENSDARG00000031952" "ENSDARG00000033327"
## [67] "ENSDARG00000034375" "ENSDARG00000035648" "ENSDARG00000036254"
## [70] "ENSDARG00000036558" "ENSDARG00000037109" "ENSDARG00000037675"
## [73] "ENSDARG00000037677" "ENSDARG00000038006" "ENSDARG00000038428"
## [76] "ENSDARG00000038672" "ENSDARG00000038879" "ENSDARG00000040764"
## [79] "ENSDARG00000041609" "ENSDARG00000041706" "ENSDARG00000041799"
## [82] "ENSDARG00000042296" "ENSDARG00000043130" "ENSDARG00000043559"
## [85] "ENSDARG00000043923" "ENSDARG00000044574" "ENSDARG00000052131"
## [88] "ENSDARG00000052139" "ENSDARG00000052344" "ENSDARG00000052652"
## [91] "ENSDARG00000053479" "ENSDARG00000054026" "ENSDARG00000055026"
## [94] "ENSDARG00000055027" "ENSDARG00000055381" "ENSDARG00000055398"
## [97] "ENSDARG00000056427" "ENSDARG00000056995" "ENSDARG00000058115"
## [100] "ENSDARG00000058543" "ENSDARG00000058822" "ENSDARG00000058996"
## [103] "ENSDARG00000059233" "ENSDARG00000059276" "ENSDARG00000059279"
## [106] "ENSDARG00000060808" "ENSDARG00000061328" "ENSDARG00000061345"
## [109] "ENSDARG00000061600" "ENSDARG00000068365" "ENSDARG00000068732"
## [112] "ENSDARG00000069105" "ENSDARG00000069473" "ENSDARG00000070069"
## [115] "ENSDARG00000070670" "ENSDARG00000070903" "ENSDARG00000071336"
## [118] "ENSDARG00000071560" "ENSDARG00000071699" "ENSDARG00000073814"
## [121] "ENSDARG00000074378" "ENSDARG00000075559" "ENSDARG00000075713"
## [124] "ENSDARG00000076010" "ENSDARG00000076554" "ENSDARG00000076566"
## [127] "ENSDARG00000076856" "ENSDARG00000077121" "ENSDARG00000077353"
## [130] "ENSDARG00000077473" "ENSDARG00000078696" "ENSDARG00000078784"
## [133] "ENSDARG00000079027" "ENSDARG00000079201" "ENSDARG00000079922"
## [136] "ENSDARG00000079964" "ENSDARG00000089805" "ENSDARG00000090820"
## [139] "ENSDARG00000091161" "ENSDARG00000092136" "ENSDARG00000092809"
## [142] "ENSDARG00000095743" "ENSDARG00000096546" "ENSDARG00000098359"
## [145] "ENSDARG00000099088" "ENSDARG00000099458" "ENSDARG00000099996"
## [148] "ENSDARG00000100236" "ENSDARG00000100252" "ENSDARG00000100312"
## [151] "ENSDARG00000100558" "ENSDARG00000100725" "ENSDARG00000101076"
## [154] "ENSDARG00000101199" "ENSDARG00000101209" "ENSDARG00000101244"
## [157] "ENSDARG00000101701" "ENSDARG00000101766" "ENSDARG00000101831"
## [160] "ENSDARG00000102153" "ENSDARG00000102470" "ENSDARG00000102750"
## [163] "ENSDARG00000102824" "ENSDARG00000102995" "ENSDARG00000103432"
## [166] "ENSDARG00000103515" "ENSDARG00000103754" "ENSDARG00000103799"
## [169] "ENSDARG00000104353" "ENSDARG00000104808" "ENSDARG00000104815"
## [172] "ENSDARG00000105230"
# 48 significant genes mapped to this term for Bgee 14.0 and Ensembl 84 (37 with the Bgee 15.0 data shown here)
annotated <- genesInTerm(myTopAnatObject, term)[["UBERON:0004357"]]
annotated[annotated %in% sigGenes(myTopAnatObject)]
## [1] "ENSDARG00000002445" "ENSDARG00000002952" "ENSDARG00000003293"
## [4] "ENSDARG00000008305" "ENSDARG00000011407" "ENSDARG00000012824"
## [7] "ENSDARG00000013881" "ENSDARG00000014091" "ENSDARG00000018426"
## [10] "ENSDARG00000018902" "ENSDARG00000019260" "ENSDARG00000019353"
## [13] "ENSDARG00000024894" "ENSDARG00000028071" "ENSDARG00000031894"
## [16] "ENSDARG00000036254" "ENSDARG00000037677" "ENSDARG00000038006"
## [19] "ENSDARG00000038672" "ENSDARG00000041799" "ENSDARG00000042296"
## [22] "ENSDARG00000043559" "ENSDARG00000043923" "ENSDARG00000056427"
## [25] "ENSDARG00000058543" "ENSDARG00000069473" "ENSDARG00000070903"
## [28] "ENSDARG00000071336" "ENSDARG00000073814" "ENSDARG00000076856"
## [31] "ENSDARG00000077121" "ENSDARG00000077353" "ENSDARG00000079027"
## [34] "ENSDARG00000099088" "ENSDARG00000100252" "ENSDARG00000100312"
## [37] "ENSDARG00000101831"
Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO
package.
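To make the point of this warning concrete, here is a minimal sketch of an FDR adjustment on made-up p-values, assuming the Benjamini-Hochberg procedure as implemented by `p.adjust` in base R. Whether makeTable uses exactly this method is an assumption, not something stated in this text.

```r
# Made-up p-values for five terms
pvals <- c(0.001, 0.008, 0.02, 0.03, 0.30)

# Benjamini-Hochberg adjustment: uses only the ranked p-values
fdr <- p.adjust(pvals, method = "BH")

data.frame(pValue = pvals, FDR = fdr, keptAt5pct = fdr <= 0.05)
```

The adjustment only sees the list of p-values and knows nothing about parent-child relations between terms, which is why its use on ontology-based tests is debated.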
Since version 2.14.0 (Bioconductor 3.11), BgeeDB stores downloaded expression data in a local RSQLite database. The advantages of this approach compared to the one used in previous BgeeDB versions are:
* a server with a lot of memory is no longer needed to access a subset of a huge dataset (e.g. the GTEx dataset)
* more granular filtering using arguments of the getData() function
* the same data are not downloaded twice
* fast access to data once integrated in the local database
This approach comes with some drawbacks:
* the local SQLite database uses more disk space than the previous compressed .rds approach
* first access to a dataset takes more time (integration into the local SQLite database is time consuming)
It is possible to remove the .rds files generated by previous versions of BgeeDB, which are no longer used since version 2.14.0. The function below deletes all .rds files for the selected species, for all data types.
bgee <- Bgee$new(species="Mus_musculus", release = "14.1")
# delete all old .rds files of species Mus musculus
deleteOldData(bgee)
As the new SQLite approach uses more disk space, it is now possible to delete all local data of one species from one release of Bgee.