elner-dz

ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic Dialect (Darija)

https://github.com/hanine-bgt/elner-dz

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

algerian-dialect arabic arabizi arabizi-arabic darija darija-nlp entity-linking named-entity-recognition named-entity-recognition-dataset nlp-dataset wikidata

Repository


Basic Info
  • Host: GitHub
  • Owner: hanine-bgt
  • License: other
  • Default Branch: main
  • Homepage:
  • Size: 22.7 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
algerian-dialect arabic arabizi arabizi-arabic darija darija-nlp entity-linking named-entity-recognition named-entity-recognition-dataset nlp-dataset wikidata
Created 8 months ago · Last pushed 8 months ago

README.md

ELNER-DZ: Algerian Arabic Dataset for Named Entity Recognition and Entity Linking

📌 Official DOI: 10.5281/zenodo.15798592
🤗 Also available on Hugging Face
📥 Download Dataset (.rar from Zenodo)


We, Bouguettoucha Hadjer Hanine and Djouablia Ilhem, created ELNER-DZ as part of our Master’s thesis. It is the first large-scale dataset designed for Named Entity Recognition (NER) and Entity Linking (EL) in the Algerian Arabic dialect (Darija), covering text in both Arabic script and Arabizi (Latin script).

The dataset contains over 2 million dialectal sentences annotated with more than 1.9 million named entities, which are linked to Wikidata QIDs.


🧾 Dataset Summary

  • Name: ELNER-DZ
  • Languages: Arabic (ar for MSA, arq for dialectal), Arabizi (Latin), French (fr), English (en)
  • Script: Arabic and Latin (Arabizi)
  • Format: JSON (compressed in data.rar)
  • Annotations:
    • Named Entity spans (start, end)
    • NER labels (PER, LOC, ORG, etc.)
    • Normalized forms
    • Wikidata QIDs

📁 File Structure

  • data/data.rar — Compressed archive containing data.json
  • examples/loading_example.py — Script to extract and load the dataset
  • LICENSE — CC-BY-4.0
  • dataset_card.md — Hugging Face dataset summary

✨ Example Format

```json
{
  "id": 188,
  "text": "3reft wa7ed lperson khadem f Yassir",
  "entities": [
    {
      "start": 29,
      "end": 35,
      "label": "ORG",
      "wikidata_id": "Q117156470",
      "normalized": "Yassir"
    }
  ]
}
```
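In this example the offsets behave like Python slice indices: they count characters, `start` is inclusive, and `end` is exclusive, so `text[29:35]` is `Yassir`. The sketch below recovers the surface form of each entity under the assumption that the same convention holds for the rest of the dataset, which the README does not state explicitly. `entity_surfaces` is a hypothetical helper name, not part of the repository.

```python
# Record copied from the example format above.
record = {
    "id": 188,
    "text": "3reft wa7ed lperson khadem f Yassir",
    "entities": [
        {
            "start": 29,
            "end": 35,
            "label": "ORG",
            "wikidata_id": "Q117156470",
            "normalized": "Yassir",
        }
    ],
}


def entity_surfaces(rec):
    """Return (surface, label, wikidata_id) for each entity in a record.

    Assumes character offsets with an inclusive start and exclusive end.
    """
    return [
        (rec["text"][ent["start"]:ent["end"]], ent["label"], ent["wikidata_id"])
        for ent in rec["entities"]
    ]


surfaces = entity_surfaces(record)
print(surfaces)
```

Comparing each recovered surface form with the `normalized` field is a quick sanity check when loading a new slice of the data.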


🏷️ Entity Types

  • PER: Person
  • LOC: Location
  • ORG: Organization
  • PROD: Product
  • LAW: Legal texts or rules
  • LANG: Language
  • EVENT: Events
  • DATE: Temporal expressions
  • NORP: Nationality/Religious/Political groups
  • SPORT: Sports & Competitions
  • SYMPTOM, DISEASE: Medical categories
  • MISC: Miscellaneous
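Many NER models expect one tag per token rather than character spans. As a rough sketch, the labels above can be projected onto tokens with the common BIO scheme. The whitespace tokenization, the exclusive `end` offset, and the helper name `spans_to_bio` are assumptions made for illustration and are not part of the dataset or its loading script.

```python
import re


def spans_to_bio(text, entities):
    """Convert character-level entity spans to token-level BIO tags.

    Assumptions (not from the README): tokens are whitespace-separated
    and `end` offsets are exclusive.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                # The first token of an entity gets B-, later tokens get I-.
                prefix = "B" if tok_start <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = spans_to_bio(
    "3reft wa7ed lperson khadem f Yassir",
    [{"start": 29, "end": 35, "label": "ORG"}],
)
print(list(zip(tokens, tags)))
```

A subword tokenizer would need its own offset alignment; whitespace splitting is used here only to keep the sketch short.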

🧪 Tasks Supported

  • Named Entity Recognition (NER)
  • Entity Linking (EL) with Wikidata
  • Dialectal NLP in Algerian Arabic
  • Code-switching and multiscript modeling
  • Low-resource transfer learning

🧰 How to Use

▶️ Requirements

```bash
pip install datasets rarfile
sudo apt-get install unrar  # For Linux
```

▶️ Run the loading script

```bash
python examples/loading_example.py
```

Or manually extract and load:

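```python
import rarfile
from datasets import load_dataset

# Extract data.json from the archive (needs the unrar tool installed above)
rf = rarfile.RarFile("data/data.rar")
rf.extractall("data/")

# Load the extracted JSON file as a Hugging Face dataset
dataset = load_dataset("json", data_files="data/data.json", split="train")
print(dataset[0])
```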
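Once records are loaded, a first step is often to look at the label distribution. The sketch below counts entity labels over any iterable of records shaped like the example format. It runs on a single in-memory record so that it works without the archive; `count_labels` is a hypothetical helper, and applying it to the loaded `dataset` object is untested here.

```python
from collections import Counter


def count_labels(records):
    """Count entity labels over an iterable of records.

    Each record is assumed to be a dict with an "entities" list whose
    items carry a "label" key, as in the example format.
    """
    counts = Counter()
    for rec in records:
        for ent in rec["entities"]:
            counts[ent["label"]] += 1
    return counts


# One record copied from the example format, standing in for the dataset.
sample = [
    {
        "id": 188,
        "text": "3reft wa7ed lperson khadem f Yassir",
        "entities": [
            {
                "start": 29,
                "end": 35,
                "label": "ORG",
                "wikidata_id": "Q117156470",
                "normalized": "Yassir",
            }
        ],
    }
]

label_counts = count_labels(sample)
print(label_counts.most_common())
```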


🔍 Dataset Details

  • Source: Social media, dialogues, e-commerce, Wikidata SPARQL
  • Annotation:

    • Semi-automated and rule-based extraction
    • Manual normalization of entity surface forms
    • Wikidata QID linking via SPARQL and fallback search
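The linking pipeline itself is not described beyond the bullets above, so the following is only an illustration of how a QID and a text search relate to public Wikidata endpoints, not the authors' code. It builds the entity page URL for a QID and a request URL for the `wbsearchentities` action of the Wikidata API, without sending any request. The helper names and the default `language="en"` are assumptions.

```python
from urllib.parse import urlencode

WIKIDATA_ENTITY = "https://www.wikidata.org/wiki/"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def entity_url(qid):
    """Return the Wikidata page URL for a QID such as 'Q117156470'."""
    return WIKIDATA_ENTITY + qid


def search_url(surface, language="en"):
    """Build a wbsearchentities request URL for a surface form.

    Only the URL is constructed; no network request is made.
    """
    params = {
        "action": "wbsearchentities",
        "search": surface,
        "language": language,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


page = entity_url("Q117156470")
query = search_url("Yassir")
print(page)
print(query)
```

Searching on the `normalized` form rather than the raw dialectal surface form is likely to return better candidates, since Arabizi spellings vary widely.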

👩‍💻 Authors

  • Bouguettoucha Hadjer Hanine
  • Djouablia Ilhem

📄 License

This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

🔗 View Full License


📚 Citation

```bibtex
@dataset{bouguettoucha_djouablia_2025,
  author    = {Bouguettoucha, Hadjer Hanine and Djouablia, Ilhem},
  title     = {ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic},
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15798592},
  url       = {https://doi.org/10.5281/zenodo.15798592}
}
```

Owner

  • Login: hanine-bgt
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite us using the metadata below:"
title: "ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic"
version: "1.0"
doi: 10.5281/zenodo.15798592
authors:
  - family-names: Bouguettoucha
    given-names: Hadjer Hanine
   
  - family-names: Djouablia
    given-names: Ilhem
   
date-released: 2025-07-01
keywords:
  - named entity recognition
  - entity linking
  - Algerian Arabic
  - Arabizi
  - dialectal Arabic
  - NLP
  - Wikidata
repository-code: https://github.com/hanine-bgt/elner-dz
license: CC-BY-4.0

GitHub Events

Total
  • Push event: 5
Last Year
  • Push event: 5