
Dataset Information


DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect.


ABSTRACT: DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic.
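
For readers unfamiliar with the BIO scheme mentioned in the abstract, the sketch below shows one way to group BIO-tagged tokens into entity spans and to compute the share of tokens that belong to an entity. It is illustrative only: the tag names (B-PER, I-PER, B-LOC, and so on), the helper names bio_to_spans and entity_token_ratio, and the assumption that tokens and tags are already loaded as parallel lists are not taken from the dataset description, and the actual file layout of DarNERcorp may differ.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans.

    Assumes tags look like "B-PER", "I-PER", or "O" (hypothetical label
    names; check the dataset files for the labels actually used).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        # An I- tag extends the open span only if the label matches.
        if tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
            continue

        # Any other tag closes the span that is currently open, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [], None

        # B- opens a new span; a stray I- is leniently treated the same way.
        if tag.startswith("B-") or tag.startswith("I-"):
            current_label = tag[2:]
            current_tokens = [token]

    # Flush a span that runs to the end of the sequence.
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


def entity_token_ratio(tags: List[str]) -> float:
    """Fraction of tokens whose tag is not "O" (i.e. inside an entity)."""
    if not tags:
        return 0.0
    return sum(1 for tag in tags if tag != "O") / len(tags)
```

Applied to the full tag column, entity_token_ratio would be expected to land near the 13.8% reported above, provided the files use "O" for non-entity tokens.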

SUBMITTER: Moussa HN 

PROVIDER: S-EPMC10293988 | biostudies-literature | 2023 Jun

REPOSITORIES: biostudies-literature


Publications

DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect.

Moussa HN, Mourhir A

Data in Brief, 2023-05-12



Similar Datasets

| S-EPMC10293979 | biostudies-literature
| S-EPMC10884741 | biostudies-literature
| S-EPMC8528865 | biostudies-literature
| S-EPMC6956779 | biostudies-literature
| S-EPMC11373323 | biostudies-literature
| S-EPMC3066171 | biostudies-literature
| S-EPMC11622873 | biostudies-literature
| S-EPMC8242017 | biostudies-literature
| S-EPMC6247938 | biostudies-literature
| S-EPMC7485218 | biostudies-literature