DarBERT: A Moroccan Arabic Language Model
Disciplines
Computer Sciences
Abstract (300 words maximum)
The growth of unstructured text data across many languages presents both challenges and opportunities for natural language processing. However, underrepresented languages such as the Moroccan Arabic dialect (Darija) often lack comprehensive tools and resources for efficient information extraction, which leaves valuable textual data sources unused. This project is motivated by the need to bridge this gap: making textual data in Darija more accessible and thereby contributing to diversity and inclusivity in the global digital information landscape.
The objective of this study is to develop a Named Entity Recognition (NER) model tailored to the Moroccan dialect. Because no standardized NER models exist for Darija, extracting meaningful information such as person names, locations, and organizations from unstructured text remains a significant challenge.
This project uses DarNERcorp, a manually annotated corpus of more than 65K tokens in which named entities are tagged according to the BIO (Beginning, Inside, Outside) tagging scheme. It addresses entity recognition and classification in Darija texts, enabling efficient information extraction and text analytics.
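To make the BIO scheme concrete, the following minimal Python sketch groups token-level tags into entity spans. The helper name `bio_to_spans`, the transliterated sentence, and the `PER`/`LOC` labels are illustrative assumptions and are not taken from DarNERcorp.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration; not part of the project code.
    """
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-X" closes any open entity and begins a new one of type X.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # "I-X" continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an "I-" tag with no matching open entity) ends the entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Illustrative sentence (an assumption, not a corpus example).
tokens = ["Youssef", "mcha", "l", "Dar", "Beida"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In this sketch, a multi-token entity such as "Dar Beida" is recovered as a single span because its second token carries an `I-` tag of the same type as the preceding `B-` tag.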
A Bi-directional Long Short-Term Memory (BiLSTM) model is used because it learns and represents sequential data well, capturing the contextual dependencies prevalent in natural language. The DarNERcorp dataset is divided with an 80-20 split, with 80% used to train the model and the remaining 20% held out to assess its generalizability and performance.
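An 80-20 split of this kind could be sketched as follows. The abstract does not say whether the split is made at the sentence or token level, nor whether it is shuffled, so the sentence-level shuffling, the fixed seed, and the helper name `split_sentences` are all assumptions made for illustration.

```python
import random


def split_sentences(sentences, train_fraction=0.8, seed=0):
    """Shuffle sentences reproducibly and split them into train and test parts.

    Hypothetical helper; the sentence-level split and the seed are assumptions.
    """
    shuffled = list(sentences)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder data: ten one-token "sentences" of (token, tag) pairs.
sentences = [[("tok%d" % i, "O")] for i in range(10)]
train, test = split_sentences(sentences)
print(len(train), len(test))
```

Splitting by sentence rather than by token keeps each entity span intact within a single partition, which is one reason this sketch makes that choice.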
Academic department under which the project should be listed
CCSE - Computer Science
Primary Investigator (PI) Name
Md Abdullah Al Hafiz Khan