
Relevancy contemplation in medical data analytics and ranking of feature selection algorithms

  • P. Antony Seba (Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam) ;
  • J. V. Bibal Benifa (Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam)
  • Received : 2022.01.17
  • Accepted : 2022.05.23
  • Published : 2023.06.20

Abstract

This article performs a detailed data scrutiny on a chronic kidney disease (CKD) dataset to select efficient instances and relevant features. Data relevancy is investigated using feature extraction, hybrid outlier detection, and handling of missing values. Data instances that do not influence the target are removed using data envelopment analysis to enable row reduction. Column reduction is achieved by ranking the attributes through feature selection methodologies, namely, the extra-trees classifier (ETC), recursive feature elimination, the chi-squared test, analysis of variance, and mutual information. These methodologies are ranked via the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) using weight optimization to identify the optimal features for model building from the CKD dataset, facilitating better prediction when diagnosing the severity of the disease. An efficient hybrid ensemble classifier and novel similarity-based classifiers are built using the pruned dataset, and the results are then compared with random forest, AdaBoost, naive Bayes, k-nearest neighbors, and support vector machine classifiers. The hybrid ensemble classifier yields the best prediction accuracy of 98.31% on the features selected by ETC, which TOPSIS ranks as the best feature selection methodology.
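
The following is a minimal sketch of the two stages outlined above: scoring candidate CKD attributes with several of the listed selection methods and ranking alternatives by TOPSIS closeness to the ideal solution. It assumes Python with scikit-learn, NumPy, and pandas; the helper names (score_features, topsis), the hyperparameters, and any weights or criteria are illustrative placeholders, not the authors' actual configuration.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def score_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # X: preprocessed CKD attributes (non-negative, as chi2 requires); y: binary target.
    etc = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
    scores = pd.DataFrame({
        "extra_trees": etc.feature_importances_,           # impurity-based importances
        "chi2": chi2(X, y)[0],                             # chi-squared statistics
        "anova_f": f_classif(X, y)[0],                     # ANOVA F-values
        "mutual_info": mutual_info_classif(X, y, random_state=0),
    }, index=X.columns)
    return scores.rank(ascending=False)                    # per-method feature ranking (1 = most relevant)

def topsis(decision: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    # Rank alternatives (rows) against criteria (columns); benefit[j] is True when larger is better.
    norm = decision / np.linalg.norm(decision, axis=0)     # vector-normalize each criterion
    v = norm * weights                                     # apply (optimized) criterion weights
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti_ideal = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti_ideal, axis=1)
    return d_neg / (d_pos + d_neg)                         # closeness coefficient; higher is better

In the setting described in the abstract, the rows of the decision matrix would correspond to the five feature selection methodologies and the columns to evaluation criteria (for example, classification metrics obtained with each method's selected features), with the weights supplied by the weight-optimization step; the resulting closeness coefficients then order the methods, with ETC reported as the top-ranked one.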

