Journal of Data Analysis and Information Processing 2023

A Hybrid Ensemble Learning Approach Utilizing Light Gradient Boosting Machine and Category Boosting Model for Lifestyle-Based Prediction of Type-II Diabetes Mellitus

DOI: 10.4236/jdaip.2023.114025, PP. 480-511

Mahadi Nagassou, Ronald Waweru Mwangi, Euna Nyarige

Keywords: Boosting Ensemble Learning, Category Boosting, Light Gradient Boosting Machine

Abstract:

Tree ensemble models have become important tools for classification and prediction, and boosting ensemble techniques are commonly used to predict Type-II diabetes mellitus. Light Gradient Boosting Machine (LightGBM) is a widely used algorithm known for its leaf-wise growth strategy, loss reduction, and enhanced training precision. However, LightGBM is prone to overfitting. In contrast, CatBoost uses balanced base predictors known as decision tables, which mitigate the risk of overfitting and significantly improve testing-time efficiency. CatBoost’s algorithm structure counteracts gradient boosting biases and incorporates an overfitting detector that stops training early. This study develops a hybrid model that combines LightGBM and CatBoost to minimize overfitting and improve accuracy by reducing variance. Bayesian hyperparameter optimization is used to find the best hyperparameters for the underlying learners. By fine-tuning the regularization parameter values, the hybrid model effectively reduces variance (overfitting). A comparative evaluation against LightGBM, CatBoost, XGBoost, Decision Tree, Random Forest, AdaBoost, and GBM shows that the hybrid model achieves the best F1-score (99.37%), recall (99.25%), and accuracy (99.37%). The proposed framework therefore holds promise for early diabetes prediction in the healthcare industry and may be applicable to other datasets that resemble the diabetes data.
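
The abstract does not say how the two learners are combined, so the following is only a minimal sketch under a stated assumption: the hybrid averages the positive-class probabilities of its two base models (soft voting) and thresholds the mean at 0.5. The probability vectors `p_lgbm` and `p_cat` are hypothetical stand-ins for LightGBM and CatBoost outputs, and the labels are made-up toy data, not values from the study. The sketch also shows how the three reported metrics (accuracy, recall, F1-score) are computed from binary predictions.

```python
from typing import List, Sequence, Tuple


def soft_vote(p_a: Sequence[float], p_b: Sequence[float],
              threshold: float = 0.5) -> List[int]:
    """Average two models' positive-class probabilities, then threshold.

    Assumption: equal-weight averaging; the paper may combine its
    learners differently.
    """
    if len(p_a) != len(p_b):
        raise ValueError("probability vectors must have the same length")
    return [1 if (a + b) / 2.0 >= threshold else 0 for a, b in zip(p_a, p_b)]


def binary_metrics(y_true: Sequence[int],
                   y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, recall, F1-score) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, recall, f1


# Toy data (hypothetical): true labels and each base model's probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_lgbm = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]  # stand-in for LightGBM
p_cat = [0.8, 0.3, 0.7, 0.9, 0.3, 0.2, 0.2, 0.4]   # stand-in for CatBoost

y_pred = soft_vote(p_lgbm, p_cat)
accuracy, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} recall={recall:.4f} f1={f1:.4f}")
```

On this toy data the averaged prediction corrects one error made by each base model on its own (the third sample for the LightGBM stand-in, the fifth for both at once), which is the intuition behind combining learners to reduce variance. In practice the probability vectors would come from fitted models evaluated on a held-out test set.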
