Urdu News Classification using Application of Machine Learning Algorithms on News Headline

  • Khan, Muhammad Badruddin (Information Systems Department, College of Computer and Information Sciences, Imam Mohammad ibn Saud Islamic University (IMSIU))
  • Submitted : 2021.02.05
  • Published : 2021.02.28


Our modern 'information-hungry' age demands delivery of information at unprecedentedly fast rates. Timely delivery of noteworthy information about recent events can help people from different walks of life in a number of ways. As the world has become a global village, the volume and speed of news flow call for machines to help humans handle the enormous amount of data. News is presented to the public as video, audio, images, and text. News text available on the internet is a source of knowledge for billions of internet users. Urdu is spoken and understood by millions of people from the Indian subcontinent. The availability of online Urdu news enables these readers to improve their understanding of the world and to make their decisions. This paper uses available online Urdu news data to train machines to categorize news automatically. Various machine learning algorithms were trained on news headlines, and the results show that the Bernoulli Naïve Bayes (Bernoulli NB) and Multinomial Naïve Bayes (Multinomial NB) algorithms outperformed the other algorithms on all performance measures. The highest accuracy achieved on the dataset was 94.278%, by the Multinomial NB classifier, followed by the Bernoulli NB classifier at 94.274%, both obtained when Urdu stop words were removed from the dataset. The results suggest that the short text of news headlines can serve as input for text categorization.
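To make the pipeline described above concrete, the following is a minimal sketch of headline classification with stop-word removal and a Multinomial Naïve Bayes model using Laplace smoothing. It is not the paper's implementation: the romanized toy headlines, the category labels, the stop-word set, and the names `tokenize` and `TinyMultinomialNB` are all hypothetical and chosen only for illustration. The code uses only the Python standard library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word set (romanized for readability); the paper uses
# a real Urdu stop-word list that is not reproduced here.
STOP_WORDS = {"ne", "ka", "ki", "ke", "mein", "se", "liya", "kiya"}


def tokenize(headline, stop_words=STOP_WORDS):
    """Split a headline on whitespace and drop stop words."""
    return [tok for tok in headline.lower().split() if tok not in stop_words]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)      # headlines per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)   # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue                     # ignore unseen tokens
                count = self.word_counts[label][tok]
                score += math.log(
                    (count + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented placeholders, not taken from the dataset).
train_headlines = [
    ("pakistan ne cricket match jeet liya", "sports"),
    ("hockey team ki final mein shikast", "sports"),
    ("hukumat ne budget pesh kiya", "business"),
    ("stock market mein tezi", "business"),
]

model = TinyMultinomialNB().fit(
    [tokenize(text) for text, _ in train_headlines],
    [label for _, label in train_headlines],
)

print(model.predict(tokenize("cricket team ne final jeet liya")))
print(model.predict(tokenize("budget mein stock market")))
```

A Bernoulli variant would instead model the presence or absence of each vocabulary token in a headline rather than its count; for short headlines, where a word rarely repeats, the two models see nearly the same evidence, which is consistent with the nearly identical accuracies reported above.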


