Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis

We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
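The matrix-factorization view described above can be illustrated with a small sketch. It assumes NumPy, a toy two-population simulation, and hypothetical function names (`simulate_genotypes`, `pca_factorization`) that do not come from the paper. It builds a genotype matrix whose expected values are a product of admixture-style loadings (nonnegative rows summing to one) and allele-frequency factors, then approximates the centered matrix by a rank-1 product using a truncated SVD as a stand-in for PCA. It does not implement SFA or an admixture likelihood.

```python
import numpy as np


def simulate_genotypes(n=60, p=200, seed=0):
    """Simulate an n x p genotype matrix (entries 0, 1, 2) from a toy
    two-population admixture model (illustrative assumption only)."""
    rng = np.random.default_rng(seed)
    # Loadings: admixture proportions, nonnegative rows that sum to one.
    q = rng.uniform(0.0, 1.0, size=n)
    loadings = np.column_stack([q, 1.0 - q])        # n x 2
    # Factors: allele frequencies in each ancestral population.
    freqs = rng.uniform(0.05, 0.95, size=(2, p))    # 2 x p
    # Individual allele frequencies are the low-rank product.
    indiv_freq = loadings @ freqs                   # n x p
    genotypes = rng.binomial(2, indiv_freq).astype(float)
    return genotypes, loadings, freqs


def pca_factorization(G, k):
    """Rank-k factorization of the column-centered matrix via truncated SVD.
    The factor rows are orthonormal, which is the PCA-style constraint."""
    centered = G - G.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    loadings = U[:, :k] * s[:k]   # n x k
    factors = Vt[:k, :]           # k x p
    return loadings, factors, centered


G, true_loadings, true_freqs = simulate_genotypes()
L, F, Gc = pca_factorization(G, k=1)
# Frobenius norm of what the rank-1 product leaves unexplained.
residual = np.linalg.norm(Gc - L @ F)
```

Both factorizations have the same shape (individuals by factors, factors by SNPs); what differs is the constraint placed on the two matrices, which is the point the abstract makes about admixture models, PCA, and SFA.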

Published in: Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis. PLoS Genet 6(9): e1001117. doi:10.1371/journal.pgen.1001117
Category: Research Article
DOI: https://doi.org/10.1371/journal.pgen.1001117
