Leukemia is one of the most harmful cancers worldwide. A large number of genes are implicated in cancer, so it is necessary to identify the genes that are most informative for leukemia. The main objectives of this study are to: (i) identify the most informative genes using five feature selection techniques (FST) and (ii) apply six classifiers to classify the disease and compare their performance. The leukemia data were taken from the Kent Ridge Biomedical Data Repository, USA. The dataset contains 7129 genes and 72 patients, of whom 47 are cancer patients and 25 are controls. The five FST are the t-test, the Wilcoxon sign rank sum (WCSRS) test, random forest (RF), Boruta, and the least absolute shrinkage and selection operator (LASSO). The six classifiers are Adaboost (AB), classification and regression tree (CART), artificial neural network (ANN), random forest (RF), linear discriminant analysis (LDA), and naive Bayes (NB). The performance of these classifiers is evaluated by accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and F-measure (FM). We used a simulated dataset to check the validity of the proposed method. The results indicate that the combination of LASSO-based FST and the NB classifier gives the highest classification accuracy, 99.95%. On the basis of these results, we conclude that the combination of LASSO-based FST and the NB classifier predicts leukemia more accurately than any other combination of FST and classifier used in this study.
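The six evaluation measures named in the abstract can all be derived from the four cells of a binary confusion matrix. The sketch below is a minimal illustration in Python, not the authors' code. The function name `classification_metrics` and the example counts are hypothetical: the counts were chosen only so that they add up to 47 cancer and 25 control samples, and they are not results from the study. The F-measure is taken here as the harmonic mean of PPV and sensitivity, which is an assumption about the definition used.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp, fn, fp, tn):
    """Compute ACC, SE, SP, PPV, NPV and FM from confusion-matrix counts.

    tp: cancer samples predicted as cancer
    fn: cancer samples predicted as control
    fp: control samples predicted as cancer
    tn: control samples predicted as control
    """
    acc = _ratio(tp + tn, tp + fn + fp + tn)  # overall accuracy
    se = _ratio(tp, tp + fn)                  # sensitivity (recall)
    sp = _ratio(tn, tn + fp)                  # specificity
    ppv = _ratio(tp, tp + fp)                 # positive predictive value
    npv = _ratio(tn, tn + fn)                 # negative predictive value
    # Assumed definition: harmonic mean of PPV and sensitivity.
    fm = _ratio(2 * ppv * se, ppv + se)
    return {"ACC": acc, "SE": se, "SP": sp, "PPV": ppv, "NPV": npv, "FM": fm}


# Hypothetical counts for illustration only (47 cancer, 25 control).
example = classification_metrics(tp=45, fn=2, fp=1, tn=24)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Reporting all six measures matters for an unbalanced sample such as this one, because accuracy alone can look high even when one class is classified poorly.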
Published in | Machine Learning Research (Volume 5, Issue 2) |
DOI | 10.11648/j.mlr.20200502.11 |
Page(s) | 18-27 |
Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright | Copyright © The Author(s), 2020. Published by Science Publishing Group |
Leukemia, Cancer, Feature Selection, Machine Learning, Classification
APA Style
Md. Alamgir Sarder, Md. Maniruzzaman, Benojir Ahammed. (2020). Feature Selection and Classification of Leukemia Cancer Using Machine Learning Techniques. Machine Learning Research, 5(2), 18-27. https://doi.org/10.11648/j.mlr.20200502.11
@article{10.11648/j.mlr.20200502.11,
  author = {Md. Alamgir Sarder and Md. Maniruzzaman and Benojir Ahammed},
  title = {Feature Selection and Classification of Leukemia Cancer Using Machine Learning Techniques},
  journal = {Machine Learning Research},
  volume = {5},
  number = {2},
  pages = {18-27},
  doi = {10.11648/j.mlr.20200502.11},
  url = {https://doi.org/10.11648/j.mlr.20200502.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.mlr.20200502.11},
  year = {2020}
}
TY - JOUR
T1 - Feature Selection and Classification of Leukemia Cancer Using Machine Learning Techniques
AU - Md. Alamgir Sarder
AU - Md. Maniruzzaman
AU - Benojir Ahammed
Y1 - 2020/07/04
PY - 2020
N1 - https://doi.org/10.11648/j.mlr.20200502.11
DO - 10.11648/j.mlr.20200502.11
T2 - Machine Learning Research
JF - Machine Learning Research
JO - Machine Learning Research
SP - 18
EP - 27
PB - Science Publishing Group
SN - 2637-5680
UR - https://doi.org/10.11648/j.mlr.20200502.11
VL - 5
IS - 2
ER -