Kurdish Corpus Project, By Ebrahim Badakhshan

Research

Title	Kurdish Corpus Project
Type	Presentation
Keywords	Kurdish language, Sorani, Kurdish corpus, Kurdish phonetics, Kurdish vocabulary
Year	2017
Researchers	Ebrahim Badakhshan

Abstract

This paper introduces the Kurdish Corpus Project, currently under construction at the University of Kurdistan, Sanandaj, Iran, and the problems associated with it. The project was initiated two years ago, and its first phase has been completed. It is the first Kurdish corpus available online. The Sorani dialect was chosen because of its centrality and because it is spoken by the majority of the Kurdish population in Iran and Iraq. The texts used are mostly from news websites such as Kurdpress. At this stage the corpus consists of 69,000 news documents from a variety of genres, containing 14,898,062 running words, of which 436,655 are distinct word forms. Forty documents have been tagged syntactically with great care. No stemmer has yet been applied to the corpus, so words like “کوردی” and “کورد” are counted as two separate words. Kurdish has a plethora of dialects and writing systems, which makes building the corpus difficult. The lack of standardization, the diversity of Kurdish dialects, and some structural properties of the language also make syntactic tagging hard. The language also has phonetic peculiarities that make phonetic tagging problematic. Another major problem is the absence of reliable OCR software for Kurdish that can convert PDF files into machine-readable text; all such texts have to be typed manually, which is very time-consuming and not economically viable. Despite these problems, the lack of resources, and more, the corpus team at the University of Kurdistan is determined to continue the project.
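
The effect of counting without a stemmer can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function name `count_tokens_and_types`, the whitespace tokenization, and the two sample sentences are assumptions made only for illustration. It counts running words and distinct word forms, and shows that “کوردی” and “کورد” remain two separate entries.

```python
from collections import Counter


def count_tokens_and_types(documents):
    """Return (running words, distinct word forms) for a list of texts.

    Assumption: words are separated by whitespace and no stemming or
    normalization is applied, so inflected forms are counted separately.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    total_words = sum(counts.values())
    distinct_forms = len(counts)
    return total_words, distinct_forms


# Hypothetical sample data, not taken from the corpus.
sample_docs = ["کورد کوردی", "کوردی زمان"]
total, distinct = count_tokens_and_types(sample_docs)
print(total, distinct)  # 4 running words, 3 distinct forms
```

With a stemmer in place, “کوردی” would presumably be reduced to “کورد”, and the number of distinct forms in this sample would drop from three to two.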