elner-dz

ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic Dialect (Darija)

Keywords

algerian-dialect arabic arabizi arabizi-arabic darija darija-nlp entity-linking named-entity-recognition named-entity-recognition-dataset nlp-dataset wikidata

Repository

Basic Info

Host: GitHub
Owner: hanine-bgt
License: other
Default Branch: main
Homepage:
Size: 22.7 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

ELNER-DZ: Algerian Arabic Dataset for Named Entity Recognition and Entity Linking

📌 Official DOI: 10.5281/zenodo.15798592
🤗 Also available on Hugging Face
📥 Download Dataset (.rar from Zenodo)

This dataset, titled ELNER-DZ, was created by Bouguettoucha Hadjer Hanine and Djouablia Ilhem as part of our Master’s thesis. It is the first large-scale dataset designed for Named Entity Recognition (NER) and Entity Linking (EL) in the Algerian Arabic dialect (Darija), covering both Arabic script and Arabizi (Latin script).

This dataset contains over 2 million dialectal sentences labeled with more than 1.9 million named entities and linked to Wikidata QIDs.

🧾 Dataset Summary

Name: ELNER-DZ
Languages: Arabic (ar for MSA, arq for dialectal), Arabizi (Latin), French (fr), English (en)
Script: Arabic and Latin (Arabizi)
Format: JSON (compressed in data.rar)
Annotations:
- Named Entity spans (start, end)
- NER labels (PER, LOC, ORG, etc.)
- Normalized forms
- Wikidata QIDs

📁 File Structure

data/data.rar — Compressed archive containing data.json
examples/loading_example.py — Script to extract and load the dataset
LICENSE — CC-BY-4.0
dataset_card.md — Hugging Face dataset summary

✨ Example Format

```json
{
  "id": 188,
  "text": "3reft wa7ed lperson khadem f Yassir",
  "entities": [
    {
      "start": 29,
      "end": 35,
      "label": "ORG",
      "wikidata_id": "Q117156470",
      "normalized": "Yassir"
    }
  ]
}
```
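As a minimal sketch of how the span offsets can be read, the snippet below slices the entity text out of the example record. It assumes `start` and `end` are character offsets with an exclusive `end`, which matches the example above; the helper name `entity_surfaces` is hypothetical and not part of the repository.

```python
# Example record copied from the README.
record = {
    "id": 188,
    "text": "3reft wa7ed lperson khadem f Yassir",
    "entities": [
        {
            "start": 29,
            "end": 35,
            "label": "ORG",
            "wikidata_id": "Q117156470",
            "normalized": "Yassir",
        }
    ],
}


def entity_surfaces(rec):
    """Return (surface form, label, QID) for each annotated entity.

    Assumes character offsets with an exclusive end (hypothetical helper).
    """
    text = rec["text"]
    return [
        (text[ent["start"]:ent["end"]], ent["label"], ent["wikidata_id"])
        for ent in rec["entities"]
    ]


print(entity_surfaces(record))
```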

🏷️ Entity Types

PER: Person
LOC: Location
ORG: Organization
PROD: Product
LAW: Legal texts or rules
LANG: Language
EVENT: Events
DATE: Temporal expressions
NORP: Nationality/Religious/Political groups
SPORT: Sports & Competitions
SYMPTOM, DISEASE: Medical categories
MISC: Miscellaneous
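The label set above can be used for a quick sanity check of loaded records. The following is a sketch only: the `ENTITY_TYPES` set is transcribed from the list above, and `label_counts` is a hypothetical helper, not something shipped with the dataset.

```python
from collections import Counter

# Labels transcribed from the "Entity Types" list in this README.
ENTITY_TYPES = {
    "PER", "LOC", "ORG", "PROD", "LAW", "LANG", "EVENT",
    "DATE", "NORP", "SPORT", "SYMPTOM", "DISEASE", "MISC",
}


def label_counts(records):
    """Count entities per label and collect labels outside ENTITY_TYPES."""
    counts = Counter()
    unknown = set()
    for rec in records:
        for ent in rec.get("entities", []):
            counts[ent["label"]] += 1
            if ent["label"] not in ENTITY_TYPES:
                unknown.add(ent["label"])
    return counts, unknown


# Usage with the example record from this README.
sample = [{
    "id": 188,
    "text": "3reft wa7ed lperson khadem f Yassir",
    "entities": [{"start": 29, "end": 35, "label": "ORG",
                  "wikidata_id": "Q117156470", "normalized": "Yassir"}],
}]
counts, unknown = label_counts(sample)
print(counts, unknown)
```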

🧪 Tasks Supported

Named Entity Recognition (NER)
Entity Linking (EL) with Wikidata
Dialectal NLP in Algerian Arabic
Code-switching and multiscript modeling
Low-resource transfer learning
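For token-level NER training, the character spans usually need to be converted to per-token tags. The sketch below shows one possible conversion to BIO tags. It assumes simple whitespace tokenization and exclusive `end` offsets; `spans_to_bio` is a hypothetical helper, and a real pipeline would more likely align spans with a model tokenizer.

```python
import re


def spans_to_bio(text, entities):
    """Whitespace-tokenize `text` and derive BIO tags from character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for ent in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                prefix = "B" if tok_start <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Usage with the example sentence from this README.
tokens, tags = spans_to_bio(
    "3reft wa7ed lperson khadem f Yassir",
    [{"start": 29, "end": 35, "label": "ORG"}],
)
print(list(zip(tokens, tags)))
```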

🧰 How to Use

▶️ Requirements

```bash
pip install datasets rarfile
sudo apt-get install unrar  # For Linux
```

▶️ Run the loading script

```bash
python examples/loading_example.py
```

Or manually extract and load:

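```python
import rarfile
from datasets import load_dataset

# Extract data.json from the archive (needs the unrar tool installed above)
rf = rarfile.RarFile("data/data.rar")
rf.extractall("data/")

# Load the extracted JSON file as a Hugging Face dataset
dataset = load_dataset("json", data_files="data/data.json", split="train")
print(dataset[0])
```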

🔍 Dataset Details

Source: Social media, dialogues, e-commerce, Wikidata SPARQL
Annotation:
- Semi-automated and rule-based extraction
- Manual normalization of entity surface forms
- Wikidata QID linking via SPARQL and fallback search
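Because every entity carries a Wikidata QID, the annotations can be regrouped into a simple alias table (QID to observed surface forms), which is a common starting point for entity linking baselines. This is a sketch under stated assumptions and not the authors' linking pipeline: `build_alias_table` is a hypothetical helper, and the second record is an invented illustration rather than a row from the dataset.

```python
from collections import defaultdict


def build_alias_table(records):
    """Map each Wikidata QID to the sorted surface forms seen in the text."""
    aliases = defaultdict(set)
    for rec in records:
        for ent in rec["entities"]:
            qid = ent.get("wikidata_id")
            if not qid:
                continue  # skip entities without a linked QID
            aliases[qid].add(rec["text"][ent["start"]:ent["end"]])
    return {qid: sorted(forms) for qid, forms in aliases.items()}


records = [
    # Example record from this README.
    {"text": "3reft wa7ed lperson khadem f Yassir",
     "entities": [{"start": 29, "end": 35, "label": "ORG",
                   "wikidata_id": "Q117156470"}]},
    # Invented record, for illustration only.
    {"text": "yassir app mli7a",
     "entities": [{"start": 0, "end": 6, "label": "ORG",
                   "wikidata_id": "Q117156470"}]},
]

table = build_alias_table(records)
print(table)
```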

👩‍💻 Authors

Bouguettoucha Hadjer Hanine
Djouablia Ilhem

📄 License

This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

🔗 View Full License

📚 Citation

```bibtex
@dataset{bouguettoucha_djouablia_2025,
  author    = {Bouguettoucha, Hadjer Hanine and Djouablia, Ilhem},
  title     = {ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic},
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15798592},
  url       = {https://doi.org/10.5281/zenodo.15798592}
}
```

Owner

Login: hanine-bgt
Kind: user

Repositories: 1
Profile: https://github.com/hanine-bgt

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite us using the metadata below:"
title: "ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic"
version: "1.0"
doi: 10.5281/zenodo.15798592
authors:
  - family-names: Bouguettoucha
    given-names: Hadjer Hanine
   
  - family-names: Djouablia
    given-names: Ilhem
   
date-released: 2025-07-01
keywords:
  - named entity recognition
  - entity linking
  - Algerian Arabic
  - Arabizi
  - dialectal Arabic
  - NLP
  - Wikidata
repository-code: https://github.com/hanine-bgt/elner-dz
license: CC-BY-4.0
