Hidden heterogeneity (such as ancestry) within genetic summary data can lead to confounding in association testing or inaccurate prioritization of putative variants. Here, we provide Summix, a method to estimate and adjust for reference ancestry groups within genetic allele frequency data. This method was developed by the Summix team at the University of Colorado Denver and is headed by Dr Audrey Hendricks.
References
Arriaga-MacKenzie IS, Matesi G, Chen S, Ronco A, Marker KM, Hall JR, Scherenberg R, Khajeh-Sharafabadi M, Wu Y, Gignoux CR, Null M, Hendricks AE (2021). Summix: A method for detecting and adjusting for population structure in genetic summary data. Am J Hum Genet 108, 1270-1282. https://doi.org/10.1016/j.ajhg.2021.05.016
Function to estimate reference ancestry proportions in heterogeneous genetic data.
The Summix function uses the slsqp() function in the nloptr package to run Sequential Quadratic Programming. https://www.rdocumentation.org/packages/nloptr/versions/1.2.2.2/topics/slsqp
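The optimisation step can be sketched in a few lines. This is a minimal illustration rather than the package's internal code: it assumes a least-squares objective between the observed allele frequencies and a mixture of the reference allele frequencies, with proportions bounded in [0, 1] and constrained to sum to one. The simulated data and the names `ref`, `obs`, `true_pi`, `objective`, `gradient` and `fit` are hypothetical; only `nloptr::slsqp()` comes from the description above.

```r
library(nloptr)

set.seed(1)
n_snps  <- 2000
true_pi <- c(0.8, 0.15, 0.05)   # hypothetical mixing proportions
K       <- length(true_pi)

# Simulated reference allele frequencies: N SNPs x K reference groups
ref <- matrix(runif(n_snps * K), nrow = n_snps)

# Observed AF = mixture of the references plus a little noise, kept in [0, 1]
obs <- as.vector(ref %*% true_pi) + rnorm(n_snps, sd = 0.005)
obs <- pmin(pmax(obs, 0), 1)

# Assumed objective: sum of squared differences between mixture and observed AF
objective <- function(p) sum((ref %*% p - obs)^2)
gradient  <- function(p) as.vector(2 * t(ref) %*% (ref %*% p - obs))

fit <- slsqp(
  x0    = rep(1 / K, K),            # equal starting proportions
  fn    = objective,
  gr    = gradient,
  lower = rep(0, K),                # each proportion >= 0
  upper = rep(1, K),                # each proportion <= 1
  heq   = function(p) sum(p) - 1    # proportions sum to one
)

round(fit$par, 3)   # estimated proportions
fit$value           # objective value at the solution
```

Under these assumptions the estimated proportions should land close to `true_pi`, since the simulated noise is small relative to the number of SNPs.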
summix(data, reference, observed, pi.start)
data Data frame of the observed and reference allele frequencies for N genetic variants. See the data formatting document at https://github.com/hendriau/Summix for more information.
reference Character vector of the column names for the reference ancestries.
observed Column name of the heterogeneous observed ancestry as a character string.
pi.start Numeric vector of length K of the starting guess for the ancestry proportions. If not specified, this defaults to 1/K where K is the number of reference ancestry groups.
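As a small illustration of the default described for pi.start, the starting vector can be built and sanity-checked before it is passed on. The helper name `make_pi_start` is hypothetical and not part of Summix, and the checks (length K, non-negative, summing to one) are assumptions about what a sensible starting guess looks like rather than documented requirements.

```r
# Hypothetical helper, not part of Summix
make_pi_start <- function(reference, pi.start = NULL) {
  K <- length(reference)
  if (is.null(pi.start)) {
    pi.start <- rep(1 / K, K)   # default: equal proportions, 1/K each
  }
  stopifnot(
    is.numeric(pi.start),
    length(pi.start) == K,      # one value per reference group
    all(pi.start >= 0),
    abs(sum(pi.start) - 1) < 1e-8
  )
  names(pi.start) <- reference  # label each guess with its reference column
  pi.start
}

refs <- c("ref_AF_afr_1000G", "ref_AF_eur_1000G", "ref_AF_sas_1000G",
          "ref_AF_iam_1000G", "ref_AF_eas_1000G")

make_pi_start(refs)                                   # default guess
make_pi_start(refs, c(0.8, 0.1, 0.05, 0.02, 0.03))    # user-supplied guess
```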
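Estimates the proportion of each reference ancestry within the chosen observed group.
Data frame with components: objective, iterations, time, filtered, and one column per reference ancestry group containing its estimated proportion (see the example output below).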
library("Summix")
# load the data
data("ancestryData")
# Estimate 5 reference ancestry proportion values for the gnomAD African/African American ancestry group
summix(data = ancestryData,
       reference = c("ref_AF_afr_1000G",
                     "ref_AF_eur_1000G",
                     "ref_AF_sas_1000G",
                     "ref_AF_iam_1000G",
                     "ref_AF_eas_1000G"),
       observed = "gnomad_AF_afr")
##   objective iterations           time filtered ref_AF_afr_1000G
## 1  2.135725         23 0.3586423 secs        0        0.8250554
##   ref_AF_eur_1000G ref_AF_sas_1000G ref_AF_iam_1000G ref_AF_eas_1000G
## 1        0.1576768      0.003285205      0.006308355      0.007674304
library("Summix")
# load the data
data("ancestryData")
# Estimate the same 5 reference ancestry proportion values for the gnomAD African/African American ancestry group,
# this time supplying a starting guess through pi.start
summix(data = ancestryData,
       reference = c("ref_AF_afr_1000G",
                     "ref_AF_eur_1000G",
                     "ref_AF_sas_1000G",
                     "ref_AF_iam_1000G",
                     "ref_AF_eas_1000G"),
       observed = "gnomad_AF_afr",
       pi.start = c(0.8, 0.1, 0.05, 0.02, 0.03))
##   objective iterations           time filtered ref_AF_afr_1000G
## 1  2.135725         27 0.3139136 secs        0        0.8250554
##   ref_AF_eur_1000G ref_AF_sas_1000G ref_AF_iam_1000G ref_AF_eas_1000G
## 1        0.1576768      0.003285201      0.006308353      0.007674306
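The two runs above start from different guesses but report nearly identical proportions. The short check below makes that concrete. The estimates are typed in by hand from the printed output, so precision is limited to the digits shown, and the object names `run_default` and `run_custom` are illustrative only.

```r
# Estimates copied from the printed output of the two summix() calls
run_default <- c(ref_AF_afr_1000G = 0.8250554,
                 ref_AF_eur_1000G = 0.1576768,
                 ref_AF_sas_1000G = 0.003285205,
                 ref_AF_iam_1000G = 0.006308355,
                 ref_AF_eas_1000G = 0.007674304)

run_custom  <- c(ref_AF_afr_1000G = 0.8250554,
                 ref_AF_eur_1000G = 0.1576768,
                 ref_AF_sas_1000G = 0.003285201,
                 ref_AF_iam_1000G = 0.006308353,
                 ref_AF_eas_1000G = 0.007674306)

# The proportions should sum to (approximately) one
sum(run_default)
sum(run_custom)

# Largest difference between the two solutions
max(abs(run_default - run_custom))

# Reference group with the largest estimated proportion
names(which.max(run_default))
```

The differences sit in the ninth decimal place, which suggests that for this data set the solution does not depend materially on the starting values; only the iteration count and run time change.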
Ancestry Adjusted Allele Frequency
Function to estimate ancestry-adjusted allele frequencies given the proportions of reference ancestry groups.
adjAF(data, reference, observed, pi.target, pi.observed)
data Data frame containing the unadjusted allele frequencies for the observed group and the allele frequencies for K-1 reference ancestry groups, for N SNPs.
reference Character vector of the column names for K-1 reference ancestry groups. The name of the last reference ancestry group is not included as that group is not used to estimate the adjusted allele frequencies.
observed Column name for the observed ancestry.
pi.observed Numeric vector of the mixture proportions for K reference ancestry groups for the observed group. The order must match the order of the specified reference character vector, with the last entry matching the missing ancestry reference group.
pi.target Numeric vector of the mixture proportions for K reference ancestry groups in the target sample or subject. Order must match the order of the specified reference character vector with the last entry matching the missing ancestry reference group.
Estimates ancestry adjusted allele frequencies in an observed sample of allele frequencies given estimated reference ancestry proportions and the observed AFs for K-1 reference ancestry groups.
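The adjustment can be illustrated with a short base-R sketch. The formula used here is an assumption based on the published Summix method, not the package source: subtract the contribution of the K-1 supplied reference groups from the observed allele frequency, rescale the remainder from the observed to the target proportion of the last (unsupplied) group, then add the supplied references back at their target proportions. The function name `adjust_af` is hypothetical. The input values are the first five rows of the ref_AF_eur_1000G and gnomad_AF_afr columns as printed in this document.

```r
# Hypothetical re-implementation for illustration; not the Summix source
adjust_af <- function(obs_af, ref_af, pi.observed, pi.target) {
  ref_af <- as.matrix(ref_af)          # N SNPs x (K-1) supplied references
  K <- length(pi.observed)
  stopifnot(ncol(ref_af) == K - 1, length(pi.target) == K)
  known <- seq_len(K - 1)

  # Observed AF with the supplied reference groups' contribution removed
  residual <- obs_af - as.vector(ref_af %*% pi.observed[known])

  # Rescale the remainder to the target proportion of the last group,
  # then add the supplied references back at their target proportions
  (pi.target[K] / pi.observed[K]) * residual +
    as.vector(ref_af %*% pi.target[known])
}

eur     <- c(0.173275495, 0.001237745, 0.168316089, 0.428212624, 0.204214851)
afr_obs <- c(0.4886100, 0.0459137, 0.1359770, 0.8548790, 0.7241780)

adjusted <- adjust_af(afr_obs, eur,
                      pi.observed = c(0.15, 0.85),
                      pi.target   = c(0, 1))
round(adjusted, 8)
```

With these inputs the result agrees with the adjustedAF column printed in the adjAF example in this document. Because the target proportion for the supplied reference is zero there, that example only exercises the rescaling term; the add-back term remains an assumption.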
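List with components: pi (the observed and target proportions for each reference group), observed.data (the name of the observed column that was adjusted), Nsnps (the number of SNPs), and adjusted.AF (the input data with an added adjustedAF column).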
##   CHR       RSID     POS A1 A2 ref_AF_eur_1000G ref_AF_afr_1000G
## 1   1  rs2887286 1156131  C  T      0.173275495       0.54166349
## 2   1 rs41477744 2329564  A  G      0.001237745       0.03571448
## 3   1  rs9661525 2952840  G  T      0.168316089       0.12004821
## 4   1  rs2817174 3044181  C  T      0.428212624       0.95932526
## 5   1 rs12139206 3504073  T  C      0.204214851       0.80156548
## 6   1  rs7514979 3654595  T  C      0.004950604       0.41865218
##   ref_AF_sas_1000G ref_AF_eas_1000G ref_AF_iam_1000G gnomad_AF_afr
## 1       0.53171227        0.8462232           0.7093     0.4886100
## 2       0.00000000        0.0000000           0.0000     0.0459137
## 3       0.09918029        0.3938534           0.2442     0.1359770
## 4       0.63907198        0.5704540           0.5000     0.8548790
## 5       0.39367076        0.3898812           0.3372     0.7241780
## 6       0.00000000        0.0000000           0.0000     0.3362490
##   gnomad_AF_amr gnomad_AF_oth
## 1    0.52594300    0.22970500
## 2    0.00117925    0.00827206
## 3    0.28605200    0.15561700
## 4    0.48818000    0.47042500
## 5    0.29550800    0.25874800
## 6    0.01650940    0.02481620
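# Adjust gnomad_AF_afr using one supplied reference (European).
# The last entry of pi.observed and pi.target refers to the ancestry group not listed in reference.
tmp.aa <- adjAF(data = ancestryData,
                reference = c("ref_AF_eur_1000G"),
                observed = "gnomad_AF_afr",
                pi.target = c(0, 1),
                pi.observed = c(0.15, 0.85))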
## $pi
##          ref.group pi.observed pi.target
## 1 ref_AF_eur_1000G        0.15         0
## 2             NONE        0.85         1
##
## $observed.data
## [1] "observed data to update AF: 'gnomad_AF_afr'"
##
## $Nsnps
## [1] 10000
##
## [[4]]
## [1] "use $adjusted.AF to see adjusted AF data"
##   CHR       RSID     POS A1 A2 ref_AF_eur_1000G ref_AF_afr_1000G
## 1   1  rs2887286 1156131  C  T      0.173275495       0.54166349
## 2   1 rs41477744 2329564  A  G      0.001237745       0.03571448
## 3   1  rs9661525 2952840  G  T      0.168316089       0.12004821
## 4   1  rs2817174 3044181  C  T      0.428212624       0.95932526
## 5   1 rs12139206 3504073  T  C      0.204214851       0.80156548
##   ref_AF_sas_1000G ref_AF_eas_1000G ref_AF_iam_1000G gnomad_AF_afr
## 1       0.53171227        0.8462232           0.7093     0.4886100
## 2       0.00000000        0.0000000           0.0000     0.0459137
## 3       0.09918029        0.3938534           0.2442     0.1359770
## 4       0.63907198        0.5704540           0.5000     0.8548790
## 5       0.39367076        0.3898812           0.3372     0.7241780
##   gnomad_AF_amr gnomad_AF_oth adjustedAF
## 1    0.52594300    0.22970500 0.54425727
## 2    0.00117925    0.00827206 0.05379769
## 3    0.28605200    0.15561700 0.13027010
## 4    0.48818000    0.47042500 0.93017307
## 5    0.29550800    0.25874800 0.81593620
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Summix_2.4.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.30 R6_2.5.1 jsonlite_1.8.3 magrittr_2.0.3
## [5] evaluate_0.17 stringi_1.7.8 rlang_1.0.6 cachem_1.0.6
## [9] cli_3.4.1 nloptr_2.0.3 jquerylib_0.1.4 bslib_0.4.0
## [13] rmarkdown_2.17 tools_4.2.1 stringr_1.4.1 xfun_0.34
## [17] yaml_2.3.6 fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.3
## [21] knitr_1.40 sass_0.4.2