GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm


Genome-wide association studies (GWAS) have yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS, where the detection and interpretation of SNP(s)-trait(s) associations are complicated by intricate linkage disequilibrium patterns between SNPs and by correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and to study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175) compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with the TG-APOB phenotypic group and of LIPC with the TG-HDL phenotypic group. These associations were overlooked in the larger meta-GWAS and not revealed by competing approaches, and we replicated them in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multiple phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. This provides a powerful tool for the analysis of diverse genomic features, including, for instance, gene expression and exome sequencing data, where complex dependencies are present in the predictor space.

Published in: GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm. PLoS Genet 9(8): e1003657. doi:10.1371/journal.pgen.1003657
Category: Research Article
DOI: https://doi.org/10.1371/journal.pgen.1003657
