Smart Approaches to Efficient Text Mining for Categorizing Sexual Reproductive Health Short Messages into Key Themes

DOI: 10.4236/ojapps.2024.142037, PP. 511-532

Keywords: Knowledge Discovery in Text (KDT), Sexual Reproductive Health (SRH), Text Categorization, Text Classification, Text Extraction, Text Mining, Feature Extraction, Automated Classification Process, Performance, Stemming and Lemmatization, Natural Language Processing (NLP)

Abstract:

To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. The platform gives young people improved access to information on Sexual Reproductive Health (SRH) topics through Short Message Service (SMS) messages. Over the years it has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas so that SRH knowledge gaps among young people can be tracked more effectively. The current manual categorization of these messages is inefficient and time-consuming, and this study aims to automate the process using text-mining techniques. First, the study investigates the current categorization process and identifies the list of categories adopted by counselors over time, which is then used to build and train a categorization model. Second, it presents a proof-of-concept tool that uses the developed model to automate the categorization of U-Report messages into key thematic areas. Finally, it compares the performance and effectiveness of the proof-of-concept tool against the manual system. The study used a dataset of 206,625 text messages. The current process would take roughly 2.82 years to categorize this dataset, whereas the trained Support Vector Machine (SVM) model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is far faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient tool for categorizing large unstructured text datasets. The results and the proof-of-concept tool demonstrate the potential for improving the efficiency and accuracy of message categorization on the Zambia U-Report platform and on similar text-message-based platforms.
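
To make the idea of SVM-based message categorization concrete, the following is a minimal sketch in plain Python, not the study's actual tool. It trains one bias-free linear SVM per theme with hinge-loss subgradient steps over bag-of-words counts. The example messages, the three theme labels, and all function names and hyperparameters (`epochs`, `lr`, `lam`) are assumptions made for illustration; the abstract does not list the real categories or the preprocessing used.

```python
from collections import Counter

# Hypothetical toy messages and theme labels (NOT taken from the study's data).
TRAIN = [
    ("how can i prevent hiv", "HIV"),
    ("where can i get an hiv test", "HIV"),
    ("what are signs of pregnancy", "PREGNANCY"),
    ("can i get pregnant during my period", "PREGNANCY"),
    ("where can i find condoms", "CONTRACEPTION"),
    ("are contraceptive pills safe", "CONTRACEPTION"),
]


def featurize(message):
    """Return bag-of-words counts over lowercase whitespace tokens."""
    return Counter(message.lower().split())


def score(weights, feats):
    """Linear score w . x for one classifier (no bias term)."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train_one_vs_rest(samples, epochs=20, lr=0.1, lam=0.01):
    """Train one linear SVM per label using hinge-loss subgradient descent."""
    labels = sorted({label for _, label in samples})
    models = {label: {} for label in labels}
    data = [(featurize(text), label) for text, label in samples]
    for _ in range(epochs):
        for feats, gold in data:
            for label, w in models.items():
                y = 1.0 if label == gold else -1.0
                margin = y * score(w, feats)
                # L2 regularisation: shrink every existing weight slightly.
                for tok in w:
                    w[tok] *= 1.0 - lr * lam
                # The hinge loss contributes a gradient only when margin < 1.
                if margin < 1.0:
                    for tok, cnt in feats.items():
                        w[tok] = w.get(tok, 0.0) + lr * y * cnt
    return models


def predict(models, message):
    """Assign the theme whose classifier gives the highest score."""
    feats = featurize(message)
    return max(models, key=lambda label: score(models[label], feats))


if __name__ == "__main__":
    models = train_one_vs_rest(TRAIN)
    for msg in ["tell me about hiv", "pregnancy signs", "condoms"]:
        print(msg, "->", predict(models, msg))
```

Once trained, categorizing a message is a handful of multiply-adds per theme, which is the kind of cost that lets an automated classifier process a large message archive in minutes. In practice, a library implementation such as scikit-learn's `LinearSVC` over TF-IDF features, combined with stemming or lemmatization, would be a more typical choice than this hand-rolled version.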
