
Relevancy contemplation in medical data analytics and ranking of feature selection algorithms

  • P. Antony Seba (Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam) ;
  • J. V. Bibal Benifa (Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam)
  • Received : 2022.01.17
  • Accepted : 2022.05.23
  • Published : 2023.06.20

Abstract

This article performs a detailed data scrutiny on a chronic kidney disease (CKD) dataset to select efficient instances and relevant features. Data relevancy is investigated using feature extraction, hybrid outlier detection, and handling of missing values. Data instances that do not influence the target are removed using data envelopment analysis to enable row reduction. Column reduction is achieved by ranking the attributes through feature selection methodologies, namely, the extra-trees classifier (ETC), recursive feature elimination, the chi-squared test, analysis of variance, and mutual information. These methodologies are ranked via the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) using weight optimization to identify the optimal features for model building from the CKD dataset, facilitating better prediction when diagnosing the severity of the disease. An efficient hybrid ensemble classifier and novel similarity-based classifiers are built using the pruned dataset, and the results are then compared with random forest, AdaBoost, naive Bayes, k-nearest neighbors, and support vector machine classifiers. The hybrid ensemble classifier yields the best prediction accuracy of 98.31% on the features selected by ETC, which TOPSIS ranks as the best feature selection methodology.
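
The following is a minimal sketch of the two stages outlined above: scoring candidate CKD attributes with several of the listed selection methods and ranking alternatives by TOPSIS closeness to the ideal solution. It assumes Python with scikit-learn, NumPy, and pandas; the helper names (score_features, topsis), the hyperparameters, and any weights or criteria are illustrative placeholders, not the authors' actual configuration.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def score_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # X: preprocessed CKD attributes (non-negative, as chi2 requires); y: binary target.
    etc = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
    scores = pd.DataFrame({
        "extra_trees": etc.feature_importances_,           # impurity-based importances
        "chi2": chi2(X, y)[0],                             # chi-squared statistics
        "anova_f": f_classif(X, y)[0],                     # ANOVA F-values
        "mutual_info": mutual_info_classif(X, y, random_state=0),
    }, index=X.columns)
    return scores.rank(ascending=False)                    # per-method feature ranking (1 = most relevant)

def topsis(decision: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    # Rank alternatives (rows) against criteria (columns); benefit[j] is True when larger is better.
    norm = decision / np.linalg.norm(decision, axis=0)     # vector-normalize each criterion
    v = norm * weights                                     # apply (optimized) criterion weights
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti_ideal = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti_ideal, axis=1)
    return d_neg / (d_pos + d_neg)                         # closeness coefficient; higher is better

In the setting described in the abstract, the rows of the decision matrix would correspond to the five feature selection methodologies and the columns to evaluation criteria (for example, classification metrics obtained with each method's selected features), with the weights supplied by the weight-optimization step; the resulting closeness coefficients then order the methods, with ETC reported as the top-ranked one.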

