Building a text collection for Urdu information retrieval

Rasheed, Imran; Banka, Haider; Khan, Hamaid M.

doi:10.4218/etrij.2019-0458

ETRI Journal

Volume 43, Issue 5, Pages 856-868, 2021; pISSN 1225-6463; eISSN 2233-7326

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Rasheed, Imran (Department of Computer Science and Engineering, Indian Institute of Technology (ISM)) ;
Banka, Haider (Department of Computer Science and Engineering, Indian Institute of Technology (ISM)) ;
Khan, Hamaid M. (Aluteam, Fatih Sultan Mehmet Vakif University)

Received: 2019.10.16
Accepted: 2020.10.05
Published: 2021.10.01

https://doi.org/10.4218/etrij.2019-0458

Abstract

Urdu is widely spoken in the Indian subcontinent and has over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared with those in European and other Asian languages. Following Text Retrieval Conference (TREC) standards, we therefore constructed an extensive text collection of 85 304 documents from diverse categories, covering over 52 topics, with relevance judgment sets built at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although the collection is primarily intended for text retrieval, with suitable modifications it can also be used for named entity recognition, text summarization, and other linguistic applications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.
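
The abstract does not spell out how the relevance judgment sets were pooled, so the following is only a minimal sketch of the depth-k pooling commonly used in Text Retrieval Conference style evaluations: for each topic, the top k documents from every participating system run are merged into one set that assessors then judge. The function name `build_pools`, the run names `bm25` and `lm`, and the toy document identifiers are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_pools(runs, depth=100):
    """Merge the top-`depth` documents of every run into one pool per topic.

    runs: {run_name: {topic_id: [doc_id, ...]}} with each list ranked best-first.
    Returns {topic_id: set of doc_ids to be judged}.
    """
    pools = defaultdict(set)
    for rankings_by_topic in runs.values():
        for topic_id, ranked_docs in rankings_by_topic.items():
            # Only the first `depth` documents of each ranking enter the pool.
            pools[topic_id].update(ranked_docs[:depth])
    return dict(pools)


# Hypothetical toy runs (illustrative only, not data from the collection).
runs = {
    "bm25": {"T1": ["d3", "d1", "d7"], "T2": ["d2", "d9"]},
    "lm": {"T1": ["d1", "d4", "d3"], "T2": ["d9", "d5"]},
}

pools = build_pools(runs, depth=2)
print(sorted(pools["T1"]))  # ['d1', 'd3', 'd4']
print(sorted(pools["T2"]))  # ['d2', 'd5', 'd9']
```

With a pool depth of 100, as stated in the abstract, the same call would be `build_pools(runs, depth=100)`; documents that never reach any run's top 100 for a topic are conventionally left unjudged and treated as non-relevant.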

Acknowledgement

We sincerely thank Mr Zahoor Ahmad Shora, Chief Editor of "Daily Roshni," for generously sharing raw data for our collection. We are also thankful to the students and scholars of the Department of Urdu and Linguistics, Aligarh Muslim University, Aligarh, who helped generate topics and evaluate relevance for our text collection.
