Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study

Authors: Linda Vidman aff001; David Källberg aff001; Patrik Rydén aff001
Author affiliations: Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden aff001; Department of Statistics, USBE, Umeå University, Umeå, Sweden aff002
Published in: PLoS ONE 14(12)
Category: Research Article
DOI: 10.1371/journal.pone.0219102



Clustering of gene expression data is widely used to identify novel cancer subtypes. Many clustering approaches have been proposed, but little is known about their relative merits or about how data characteristics influence performance. We evaluate how choices made in a cluster analysis affect performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how sample size, the distribution of subtypes and sample heterogeneity affect performance.


In general, increasing the sample size had a limited effect on clustering performance; for the breast cancer data, for example, performance was similar for n = 40 and n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes proved difficult to cluster. Both the clustering method and the selection method affected the ability to identify the subtypes, but their relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, performance was substantially higher when clustering was based on data from only one sex than when it was based on data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data, and that clustering males and females separately may be beneficial and increase the chance of detecting novel subtypes. Performance also often differed substantially between females and males.


The number of samples seems to have a limited effect on performance, whereas heterogeneity, at least with respect to sex, is important. Hence, when the genders are analyzed separately, the loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
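As a rough illustration of what "performance" can mean when clusters are compared with known subtypes, the sketch below computes the adjusted Rand index, a standard measure of agreement between two partitions. The abstract does not say which measure the study used, so the choice of index is an assumption, and the function names `adjusted_rand_index` and `ari_by_group` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between known subtypes and cluster labels."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Count samples per (subtype, cluster) cell and per margin.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs that fall together in each cell or margin.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def ari_by_group(labels_true, labels_pred, groups):
    """Adjusted Rand index computed separately within each group (e.g. sex).

    The cluster labels may come from separate per-group clusterings,
    since only agreement within a group is measured.
    """
    scores = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        scores[group] = adjusted_rand_index(
            [labels_true[i] for i in idx],
            [labels_pred[i] for i in idx],
        )
    return scores


# Toy example with made-up labels (not data from the study).
subtypes = ["A", "A", "B", "A", "B", "B"]
clusters = [0, 0, 1, 0, 0, 1]
sexes = ["F", "F", "F", "M", "M", "M"]
print(adjusted_rand_index(subtypes, clusters))
print(ari_by_group(subtypes, clusters, sexes))
```

A value of 1 indicates that the clusters reproduce the subtypes exactly, while values near 0 indicate agreement no better than chance. Computing the index per group, as in `ari_by_group`, is one simple way to compare how well subtypes are recovered in females and in males.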

Keywords:

Algorithms – Breast cancer – Clustering algorithms – Gene expression – Hierarchical clustering – k-means clustering – Kidneys – Stomach



The article was published in the journal


2019, Issue 12