2024 : 11 : 21
Ebrahim Badakhshan

Academic rank: Associate Professor
ORCID:
Education: PhD.
ScopusId: 57105501200
HIndex:
Faculty: Faculty of Language and Literature
Address: Department of English Literature and Linguistics, Faculty of Literature and Languages, University of Kurdistan
Phone:

Research

Title: Kurdish Corpus Project
Type: Presentation
Keywords: Kurdish language, Sorani, Kurdish corpus, Kurdish phonetics, Kurdish vocabulary
Year: 2017
Researchers: Ebrahim Badakhshan

Abstract

This paper introduces the Kurdish Corpus Project, currently under construction at the University of Kurdistan, Sanandaj, Iran, and the problems associated with it. The project was initiated two years ago, and its first phase has been completed. It is the first Kurdish corpus available online. The Sorani dialect was chosen because of its centrality and because it is spoken by the majority of the Kurdish population in Iran and Iraq. The texts used in creating this website are mostly taken from news websites such as Kurdpress. At this stage the corpus consists of 69,000 news documents from a variety of genres, totalling 14,898,062 words, which comprise 436,655 tokens. Forty documents have been syntactically tagged with great care. No stemmer has yet been applied to the corpus, so words like “کوردی” and “کورد” are counted as two words. Kurdish has a plethora of dialects and writing systems, which makes building the corpus difficult. The lack of standardization, the diversity of Kurdish dialects, and some structural properties of the language also make syntactic tagging hard. The language also has phonetic peculiarities that make phonetic tagging problematic. Another major problem is the absence of reliable OCR software for Kurdish that could convert PDF files into machine-readable text: all such texts have to be typed by hand, which is very time-consuming and not economically viable. Despite these problems, the lack of resources, and more, the corpus team at the University of Kurdistan is determined to continue the project.
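
The effect of counting words without a stemmer can be illustrated with a minimal sketch. It assumes plain whitespace tokenization, since the abstract does not say how the project splits words. The function name `count_words` and the sample text are hypothetical and are not taken from the project.

```python
from collections import Counter

def count_words(documents):
    """Count running words and distinct word forms across documents.

    Assumption: words are separated by whitespace. No stemming is applied,
    so inflected or derived forms are counted as separate entries.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    running_words = sum(counts.values())   # every occurrence
    distinct_forms = len(counts)           # unique surface forms
    return running_words, distinct_forms, counts

# Hypothetical one-document sample using the two forms cited in the abstract.
sample = ["کورد کوردی کورد"]
total, distinct, freq = count_words(sample)
print(total, distinct)
```

Without stemming, “کوردی” and “کورد” remain separate entries in `freq`. A stemmer applied before counting would be expected to merge such forms and lower the number of distinct entries.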