Building a text collection for Urdu information retrieval

  • Rasheed, Imran (Department of Computer Science and Engineering, Indian Institute of Technology (ISM))
  • Banka, Haider (Department of Computer Science and Engineering, Indian Institute of Technology (ISM))
  • Khan, Hamaid M. (Aluteam, Fatih Sultan Mehmet Vakif University)
  • Received : 2019.10.16
  • Accepted : 2020.10.05
  • Published : 2021.10.01

Abstract

Urdu is widely spoken in the Indian subcontinent and has over 300 million speakers worldwide. However, advances in Urdu language processing are rare compared with those for European and other Asian languages. Following Text Retrieval Conference (TREC) standards, we constructed an extensive text collection of 85,304 documents from diverse categories, covering 52 topics, with relevance judgment sets built at a pool depth of 100. We also present several applications to demonstrate the effectiveness of the collection. Although the collection is primarily intended for text retrieval, with suitable modifications it can also be used for named entity recognition, text summarization, and other linguistic applications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
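The abstract does not describe how the judgment pool was assembled. As a rough illustration of what "pool depth" means in TREC-style evaluation, the following minimal sketch merges the top-ranked documents from several retrieval runs into one pool per topic, which assessors would then judge for relevance. The function name `build_pool`, the run format, and the topic and document identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_pool(runs, depth=100):
    """Union the top-`depth` documents of every run, per topic.

    `runs` is a list of dicts mapping a topic id to a ranked list of
    document ids (best first). This format is an assumption made for
    the sketch, not the authors' actual data layout.
    """
    pool = defaultdict(set)
    for run in runs:
        for topic_id, ranked_docs in run.items():
            # Only the first `depth` documents of each ranking are pooled.
            pool[topic_id].update(ranked_docs[:depth])
    return dict(pool)


# Hypothetical rankings from two retrieval systems.
run_a = {"T1": ["d3", "d1", "d7"], "T2": ["d2", "d9"]}
run_b = {"T1": ["d1", "d4", "d3"], "T2": ["d9", "d5"]}

# A depth of 2 keeps the toy example small; the collection uses 100.
pool = build_pool([run_a, run_b], depth=2)
for topic_id in sorted(pool):
    print(topic_id, sorted(pool[topic_id]))
```

Documents outside the pool are conventionally treated as non-relevant, so a larger depth trades additional assessor effort for more complete judgments.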

Acknowledgement

We sincerely thank Mr Zahoor Ahmad Shora, Chief Editor of "Daily Roshni," for generously sharing raw data for our collection free of charge. We also thank the students and scholars of the Department of Urdu and Linguistics, Aligarh Muslim University, Aligarh, who helped generate topics and assess relevance for our text collection.
