Designing and Collecting a Corpus and Syntactic Parser for Central Kurdish Language

Research

Title	Designing and Collecting a Corpus and Syntactic Parser for Central Kurdish Language
Type	Thesis
Keywords	Treebank, Syntactic parsing, Constituency parsing, Context-free grammar, Central Kurdish, POS tagging, NLP, Chart Parsing, LLMs
Year	2024
Researchers	Umed Sideeq Ahmed (Student), Fateme Daneshfar (Primary Advisor), Hadi Veisi (Primary Advisor), Sherwan Hussein (Advisor)

Abstract

Central Kurdish, widely spoken in Iraq and Iran, lacks sufficient NLP resources. This study addresses the gap by developing the first comprehensive syntactically annotated corpus for the language, advancing Kurdish language technologies and computational linguistics research. Such a resource supports machine translation, information extraction, sentiment analysis, grammar checking, and text summarization, among other applications, and holds potential for low-resource language processing. The work follows a systematic, multi-stage methodology. First, a diverse corpus of 3,000 carefully curated sentences is manually annotated with fine-grained POS tags, using a custom tagset of 74 tags that captures intricate grammatical distinctions in Kurdish. The corpus is then syntactically annotated according to a context-free grammar (CFG) designed for Central Kurdish, comprising 249 production rules. The corpus spans various domains to ensure broad coverage of syntactic phenomena. For parsing, the study implements a deterministic, rule-based dynamic programming algorithm based on top-down chart parsing and driven by the developed CFG rules. This approach proves robust in handling the intricacies of Central Kurdish morphology and flexible word order. The research then explores fine-tuned large language models (LLMs), specifically GPT-3.5, for constituency parsing. The LLMs are fine-tuned on the annotated corpus to improve parsing performance, particularly for complex and ambiguous syntactic structures. The POS tagging and rule-based parsing approaches are evaluated manually under the PARSEVAL framework. This manual evaluation yields a POS tagging accuracy of 98.7% and a parsing accuracy of 98% for the rule-based approach on a set of 150 sentences, as verified through expert review and inter-annotator agreement. The LLM-based method is assessed with the EVALB tool, an implementation of the PARSEVAL evaluation scheme and a standard means of scoring constituency parses. Under this evaluation, 84.92% of sentences were parsed with a complete match, and the overall bracketing F-measure reached 96.41%.
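
To make the bracketing F-measure reported above concrete, the following is a minimal sketch of PARSEVAL-style labeled bracket scoring. It is not the EVALB tool and not the thesis code: the function names (parse_tree, brackets, parseval), the placeholder tokens w1 to w3, and the generic labels S, NP, VP, N, V are all assumptions made for illustration, and EVALB options such as parameter files and punctuation handling are left out. As in common PARSEVAL practice, POS-tag (preterminal) brackets are assumed not to be counted.

```python
from collections import Counter


def parse_tree(text):
    """Read a bracketed tree such as '(S (NP (N w1)) (VP (V w2)))'
    into nested (label, children) tuples; leaves are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def brackets(tree):
    """Collect labeled spans (label, start, end) of phrase-level constituents.
    Preterminals (a POS tag over one word) are skipped."""
    spans = Counter()

    def walk(node, start):
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            return start + 1  # preterminal: covers one word, not counted
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def parseval(gold, predicted):
    """Return labeled bracketing (precision, recall, F1) for one sentence."""
    g = brackets(parse_tree(gold))
    p = brackets(parse_tree(predicted))
    matched = sum((g & p).values())  # multiset intersection of brackets
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the predicted tree attaches the object NP outside the VP.
gold = "(S (NP (N w1)) (VP (NP (N w2)) (V w3)))"
pred = "(S (NP (N w1)) (NP (N w2)) (VP (V w3)))"
print(parseval(gold, pred))
```

In this toy pair, three of the four predicted brackets also appear in the gold tree, so precision, recall, and F1 all come out at 0.75. A "complete match" in the sense used above corresponds to a sentence whose predicted bracket set is identical to the gold one.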
