2024 : 12 : 22
Fateme Daneshfar

Academic rank: Assistant Professor
ORCID:
Education: PhD.
ScopusId: 35078447100
HIndex:
Faculty: Faculty of Engineering
Address: Department of Computer Engineering, Faculty of Engineering, University of Kurdistan
Phone:

Research

Title
Designing and Collecting a Corpus and Syntactic Parser for Central Kurdish Language
Type
Thesis
Keywords
Treebank, Syntactic parsing, Constituency parsing, Context-free grammar, Central Kurdish, POS tagging, NLP, Chart Parsing, LLMs
Year
2024
Researchers
Umed Sideeq Ahmed (Student), Fateme Daneshfar (Primary Advisor), Hadi Veisi (Primary Advisor), Sherwan Hussein (Advisor)

Abstract

Central Kurdish, widely spoken in Iraq and Iran, lacks sufficient NLP resources. This study addresses the gap by developing the first comprehensive syntactically annotated corpus for the language, advancing Kurdish language technologies and computational linguistics research. Such a corpus supports downstream tasks such as machine translation, information extraction, sentiment analysis, grammar checking, and text summarization, and shows potential for low-resource language processing more broadly. The work follows a systematic, multi-stage methodology. First, a diverse corpus of 3,000 carefully curated sentences is manually annotated with fine-grained part-of-speech (POS) tags, using a custom tagset of 74 tags that captures intricate grammatical distinctions in Kurdish. The corpus is then syntactically annotated according to a context-free grammar (CFG) designed specifically for Central Kurdish, comprising 249 production rules. The corpus spans various domains to ensure broad coverage of syntactic phenomena. For parsing, the study implements a deterministic, rule-based dynamic programming algorithm, top-down chart parsing, that applies the developed CFG rules. This approach proves robust in handling the intricacies of Central Kurdish morphology and flexible word order. The research then explores fine-tuned large language models (LLMs), specifically GPT-3.5, for constituency parsing. The LLMs are fine-tuned on the annotated corpus to improve parsing performance, particularly for complex and ambiguous syntactic structures. The POS tagging and rule-based parsing approaches are evaluated manually using the PARSEVAL framework. This manual evaluation, verified through expert review and inter-annotator agreement, yields a POS tagging accuracy of 98.7% and a parsing accuracy of 98% for the rule-based approach on a set of 150 sentences.
The LLM-based method is assessed with the EVALB tool, an implementation of the PARSEVAL evaluation scheme and a standard tool for scoring constituency parses. Under this evaluation, 84.92% of sentences are parsed with a complete match, and the overall bracketing F-measure reaches 96.41%.
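
For readers unfamiliar with the bracketing F-measure and complete-match figures reported above, the following is a minimal sketch of how PARSEVAL-style labeled bracket scoring is commonly computed. It is not the EVALB tool itself and not the thesis code: the function names (`parse_brackets`, `labeled_spans`, `parseval`) are hypothetical, the example trees use generic English labels rather than the thesis's 74-tag Kurdish tagset, and EVALB's configurable details (such as punctuation handling) are left out.

```python
from collections import Counter


def parse_brackets(text):
    """Read a bracketed tree such as '(S (NP (NN dog)) (VP (VB runs)))'
    into nested (label, children) tuples. Assumes well-formed input."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def labeled_spans(tree):
    """Collect (label, start, end) for every phrasal constituent.
    Preterminals (POS tag over a single word) are not counted as brackets."""
    spans = Counter()

    def walk(node, start):
        label, children = node
        if len(children) == 1 and isinstance(children[0], str):
            return start + 1          # preterminal covers exactly one word
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def parseval(gold, predicted):
    """Return (precision, recall, f_measure, complete_match) for one sentence."""
    g = labeled_spans(parse_brackets(gold))
    p = labeled_spans(parse_brackets(predicted))
    matched = sum((g & p).values())   # multiset intersection of brackets
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return precision, recall, f_measure, g == p


if __name__ == "__main__":
    # Toy example (assumed data, not from the thesis corpus).
    gold = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a) (NN cat))))"
    pred = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT a)) (NP (NN cat))))"
    print(parseval(gold, pred))
```

In this sketch, a sentence counts as a complete match only when the predicted and gold bracket sets are identical. A corpus-level F-measure is usually obtained by summing matched, gold, and predicted bracket counts over all sentences before computing precision and recall, rather than by averaging per-sentence scores.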