The nullranges package contains functions for generating sets of features (genomic regions) that can be used to explore the null hypothesis of overlap or colocalization of two observed feature sets.

The package has two branches of functionality for generating null feature sets: matching and bootstrapping. The decision about which approach to use is ultimately up to the bioinformatics analyst. Here we briefly describe the two approaches. To list all the vignettes in the package, type:

vignette(package="nullranges")

Brief description of methods

Suppose we want to examine the significance of overlaps between two sets of genomic features, \(x\) and \(y\). To test the significance of this overlap, we calculate the overlap expected under the null by generating a null feature set \(y'\) (potentially many times). The null features in \(y'\) can be generated in one of two ways:

  1. Drawing from a larger pool \(z\) (\(y' \subset z\)), such that \(y\) and \(y'\) have a similar distribution over one or more covariates. This is the “matching” case. Note that the features in \(y'\) are original features, just drawn from a different pool than \(y\). The matchRanges method is described in Davis et al. (2022) doi: 10.1101/2022.08.05.502985.
  2. Generating a new set of genomic features \(y'\), constructing them from the original set \(y\) by selecting blocks of the genome with replacement, i.e. such that features can be sampled more than once. This is the “bootstrapping” case. Note that, in this case, \(y'\) is an artificial feature set, although the re-sampled features can retain covariates such as score from the original feature set \(y\). The bootRanges method is described in Mu et al. (2022) doi: 10.1101/2022.09.02.506382.

In other words

  1. Matching – drawing from a pool of features but controlling for certain characteristics
  2. Bootstrapping – placing a number of artificial features in the genome but controlling for their spatial distribution
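
The matching idea can be illustrated with a small base R sketch. This is a toy example under stated assumptions, not the matchRanges implementation: the data frames `focal` (standing in for \(y\)) and `pool` (standing in for \(z\)), and the single covariate `covar`, are hypothetical, and features are reduced to an identifier plus one covariate value.

```r
# Toy illustration of covariate matching (base R only).
# NOT the matchRanges implementation; all names here are hypothetical.
set.seed(1)

# Pool z: 1000 features with one covariate (e.g. GC content).
pool <- data.frame(id = seq_len(1000), covar = runif(1000))

# Observed set y: 50 features whose covariate is skewed toward high values.
focal <- data.frame(id = 1001:1050, covar = rbeta(50, 5, 2))

# Nearest-neighbor matching with replacement: for each feature in y,
# select the pool feature with the closest covariate value.
nearest_idx <- vapply(
  focal$covar,
  function(v) which.min(abs(pool$covar - v)),
  integer(1)
)
matched <- pool[nearest_idx, ]  # this plays the role of y'

# Mean covariate of each set: y' should track y more closely than z does.
covar_means <- c(
  focal   = mean(focal$covar),
  matched = mean(matched$covar),
  pool    = mean(pool$covar)
)
print(covar_means)
```

Every row of `matched` is an original feature from the pool, which is the defining property of the matching case described above.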

Options and features

We provide a number of vignettes to describe the different matching and bootstrapping use cases. In the matching case, we have implemented a number of options, including nearest-neighbor matching and rejection-sampling-based matching. In the bootstrapping case, we have implemented options for bootstrapping across or within chromosomes, and for bootstrapping only within states of a segmented genome. We also provide a function to segment the genome by density of features. For example, supposing that \(x\) is a subset of genes, we may want to generate \(y'\) from \(y\) such that features are re-sampled in blocks from segments across the genome with similar gene density. In both cases, we provide a number of functions for performing quality control via visual inspection of diagnostic plots.
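
The difference between bootstrapping within and across chromosomes can be sketched in base R. This is a simplified illustration, not the bootRanges implementation: the function `block_bootstrap`, its arguments, and the toy genome are all hypothetical, features are reduced to single positions, and chromosome lengths are assumed to be exact multiples of the block length so that every block is complete.

```r
# Toy block bootstrap (base R only).
# NOT the bootRanges implementation; all names here are hypothetical.
set.seed(1)

# Toy genome: two chromosomes whose lengths are multiples of the block length.
chrom_len <- c(chr1 = 1000L, chr2 = 500L)

# Feature set y: chromosome, position, and a score covariate.
y <- data.frame(
  chrom = rep(names(chrom_len), times = c(40, 20)),
  start = c(sample.int(1000, 40), sample.int(500, 20)),
  score = rnorm(60),
  stringsAsFactors = FALSE
)

block_bootstrap <- function(y, chrom_len, block_len, within_chrom = TRUE) {
  # Tile every chromosome into blocks of length block_len.
  blocks <- do.call(rbind, lapply(names(chrom_len), function(ch) {
    data.frame(
      chrom = ch,
      start = seq(1L, chrom_len[[ch]], by = block_len),
      stringsAsFactors = FALSE
    )
  }))
  out <- lapply(seq_len(nrow(blocks)), function(i) {
    # Candidate source blocks: same chromosome only, or the whole genome.
    candidates <- if (within_chrom) {
      which(blocks$chrom == blocks$chrom[i])
    } else {
      seq_len(nrow(blocks))
    }
    # Sample one source block with replacement.
    src <- blocks[candidates[sample.int(length(candidates), 1)], ]
    hit <- y[y$chrom == src$chrom &
               y$start >= src$start &
               y$start < src$start + block_len, ]
    if (nrow(hit) == 0) return(NULL)
    # Move the features into destination block i, keeping their
    # within-block offsets and their covariates (here, score).
    hit$start <- hit$start - src$start + blocks$start[i]
    hit$chrom <- blocks$chrom[i]
    hit
  })
  do.call(rbind, out)
}

y_boot <- block_bootstrap(y, chrom_len, block_len = 100L, within_chrom = TRUE)
```

Calling the same function with `within_chrom = FALSE` allows a block from one chromosome to be placed on another. In either mode, features that fall in the same block move together, so local clustering within a block is preserved in the artificial set.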

Consideration of excluded regions

Finally, we recommend incorporating a list of regions where artificial features should not be placed, such as the ENCODE Exclusion List (Amemiya, Kundaje, and Boyle 2019). This and other excluded ranges are made available in the excluderanges Bioconductor package by Mikhail Dozmorov. Use of excluded ranges is demonstrated in the segmented block bootstrap vignette.

References

Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports 9 (1): 9354. https://doi.org/10.1038/s41598-019-45839-z.

Bickel, Peter J., Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang. 2010. “Subsampling Methods for Genomic Inference.” The Annals of Applied Statistics 4 (4): 1660–97. https://doi.org/10.1214/10-AOAS363.

Davis, Eric S., Wancen Mu, Stuart Lee, Mikhail G. Dozmorov, Michael I. Love, and Douglas H. Phanstiel. 2022. “MatchRanges: Generating Null Hypothesis Genomic Ranges via Covariate-Matched Sampling.” bioRxiv. https://doi.org/10.1101/2022.08.05.502985.

De, Subhajyoti, Brent S Pedersen, and Katerina Kechris. 2014. “The Dilemma of Choosing the Ideal Permutation Strategy While Estimating Statistical Significance of Genome-Wide Enrichment.” Briefings in Bioinformatics 15 (6): 919–28. https://doi.org/10.1093/bib/bbt053.

Favorov, Alexander, Loris Mularoni, Leslie M Cope, Yulia Medvedeva, Andrey A Mironov, Vsevolod J Makeev, and Sarah J Wheelan. 2012. “Exploring Massive, Genome Scale Datasets with the GenometriCorr Package.” PLoS Computational Biology 8 (5): e1002529. https://doi.org/10.1371/journal.pcbi.1002529.

Gel, Bernat, Anna Díez-Villanueva, Eduard Serra, Marcus Buschbeck, Miguel A Peinado, and Roberto Malinverni. 2016. “regioneR: An R/Bioconductor Package for the Association Analysis of Genomic Regions Based on Permutation Tests.” Bioinformatics 32 (2): 289–91. https://doi.org/10.1093/bioinformatics/btv562.

Haiminen, Niina, Heikki Mannila, and Evimaria Terzi. 2007. “Comparing Segmentations by Applying Randomization Techniques.” BMC Bioinformatics 8 (May): 171. https://doi.org/10.1186/1471-2105-8-171.

Heger, Andreas, Caleb Webber, Martin Goodson, Chris P Ponting, and Gerton Lunter. 2013. “GAT: A Simulation Framework for Testing the Association of Genomic Intervals.” Bioinformatics 29 (16): 2046–8. https://doi.org/10.1093/bioinformatics/btt343.

Huen, David S, and Steven Russell. 2010. “On the Use of Resampling Tests for Evaluating Statistical Significance of Binding-Site Co-Occurrence.” BMC Bioinformatics 11 (June): 359. https://doi.org/10.1186/1471-2105-11-359.

Kanduri, Chakravarthi, Christoph Bock, Sveinung Gundersen, Eivind Hovig, and Geir Kjetil Sandve. 2019. “Colocalization Analyses of Genomic Elements: Approaches, Recommendations and Challenges.” Bioinformatics 35 (9): 1615–24. https://doi.org/10.1093/bioinformatics/bty835.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T. Morgan, and Vincent J. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8): e1003118. https://doi.org/10.1371/journal.pcbi.1003118.

McLean, Cory Y., Dave Bristor, Michael Hiller, Shoa L. Clarke, Bruce T. Schaar, Craig B. Lowe, Aaron M. Wenger, and Gill Bejerano. 2010. “GREAT Improves Functional Interpretation of Cis-Regulatory Regions.” Nature Biotechnology 28 (5): 495–501. https://doi.org/10.1038/nbt.1630.

Mu, Wancen, Eric S. Davis, Stuart Lee, Mikhail G. Dozmorov, Douglas H. Phanstiel, and Michael I. Love. 2022. “BootRanges: Flexible Generation of Null Sets of Genomic Ranges for Hypothesis Testing.” bioRxiv. https://doi.org/10.1101/2022.09.02.506382.

Sheffield, Nathan C, and Christoph Bock. 2016. “LOLA: Enrichment Analysis for Genomic Region Sets and Regulatory Elements in R and Bioconductor.” Bioinformatics 32 (4): 587–89. https://doi.org/10.1093/bioinformatics/btv612.

Welch, Ryan P., Chee Lee, Paul M. Imbriano, Snehal Patil, Terry E. Weymouth, R. Alex Smith, Laura J. Scott, and Maureen A. Sartor. 2014. “ChIP-Enrich: Gene Set Enrichment Testing for ChIP-Seq Data.” Nucleic Acids Research 42 (13): e105. https://doi.org/10.1093/nar/gku463.