elner-dz
ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic Dialect (Darija)
ELNER-DZ: Algerian Arabic Dataset for Named Entity Recognition and Entity Linking
📌 Official DOI: 10.5281/zenodo.15798592
🤗 Also available on Hugging Face
📥 Download Dataset (.rar from Zenodo)
ELNER-DZ was created by Bouguettoucha Hadjer Hanine and Djouablia Ilhem as part of our Master's thesis. It is the first large-scale dataset designed for Named Entity Recognition (NER) and Entity Linking (EL) in the Algerian Arabic dialect (Darija), covering both Arabic script and Arabizi (Latin script).
It contains over 2 million dialectal sentences annotated with more than 1.9 million named entities linked to Wikidata QIDs.
🧾 Dataset Summary
- Name: ELNER-DZ
- Languages: Arabic (`ar` for MSA, `arq` for dialectal), Arabizi (Latin), French (`fr`), English (`en`)
- Script: Arabic and Latin (Arabizi)
- Format: JSON (compressed in `data.rar`)
- Annotations:
  - Named Entity spans (start, end)
  - NER labels (PER, LOC, ORG, etc.)
  - Normalized forms
  - Wikidata QIDs
📁 File Structure
- `data/data.rar`: Compressed archive containing `data.json`
- `examples/loading_example.py`: Script to extract and load the dataset
- `LICENSE`: CC-BY-4.0
- `dataset_card.md`: Hugging Face dataset summary
✨ Example Format
```json
{
  "id": 188,
  "text": "3reft wa7ed lperson khadem f Yassir",
  "entities": [
    {
      "start": 29,
      "end": 35,
      "label": "ORG",
      "wikidata_id": "Q117156470",
      "normalized": "Yassir"
    }
  ]
}
```
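As a quick sanity check on the offsets, the sketch below slices the example record's text with its `start` and `end` values. It assumes the offsets are 0-based character indices with an exclusive `end`, which is what the example suggests but is not stated explicitly. `span_text` is a hypothetical helper, not part of the dataset's tooling.

```python
# Hypothetical helper: recover an entity's surface form from its offsets.
# Assumption: offsets are 0-based character indices and `end` is exclusive.
def span_text(record, entity):
    return record["text"][entity["start"]:entity["end"]]


# The example record from this README.
record = {
    "id": 188,
    "text": "3reft wa7ed lperson khadem f Yassir",
    "entities": [
        {
            "start": 29,
            "end": 35,
            "label": "ORG",
            "wikidata_id": "Q117156470",
            "normalized": "Yassir",
        }
    ],
}

for entity in record["entities"]:
    print(entity["label"], span_text(record, entity))  # prints: ORG Yassir
```

If the slice does not match the expected surface form on other records, the offset convention may differ from the one assumed here.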
🏷️ Entity Types
- `PER`: Person
- `LOC`: Location
- `ORG`: Organization
- `PROD`: Product
- `LAW`: Legal texts or rules
- `LANG`: Language
- `EVENT`: Events
- `DATE`: Temporal expressions
- `NORP`: Nationality/Religious/Political groups
- `SPORT`: Sports & Competitions
- `SYMPTOM`, `DISEASE`: Medical categories
- `MISC`: Miscellaneous
🧪 Tasks Supported
- Named Entity Recognition (NER)
- Entity Linking (EL) with Wikidata
- Dialectal NLP in Algerian Arabic
- Code-switching and multiscript modeling
- Low-resource transfer learning
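Many NER training pipelines expect token-level tags rather than character spans. The sketch below shows one possible conversion to BIO tags. It assumes whitespace tokenization and 0-based offsets with an exclusive `end`; `spans_to_bio` is a hypothetical helper, and a real setup would more likely align spans to a model's subword tokenizer.

```python
# Hypothetical helper: convert character-level entity spans to BIO tags.
# Assumptions: whitespace tokenization, 0-based offsets, exclusive `end`.
def spans_to_bio(text, entities):
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        # Locate the token's character range in the original text.
        start = text.index(token, pos)
        end = start + len(token)
        pos = end

        tag = "O"
        for ent in entities:
            # The token belongs to the entity if their ranges overlap.
            if start < ent["end"] and end > ent["start"]:
                # The token covering the entity's first character gets "B".
                prefix = "B" if start <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


text = "3reft wa7ed lperson khadem f Yassir"
entities = [{"start": 29, "end": 35, "label": "ORG"}]
tokens, tags = spans_to_bio(text, entities)
print(list(zip(tokens, tags)))
```

Overlapping or nested entities are not handled: the first matching entity wins for each token.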
🧰 How to Use
▶️ Requirements
```bash
pip install datasets rarfile
sudo apt-get install unrar  # For Linux
```
▶️ Run the loading script
```bash
python examples/loading_example.py
```
Or manually extract and load:
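```python
import rarfile
from datasets import load_dataset

# Extract data.json from the archive (requires the unrar tool)
rf = rarfile.RarFile("data/data.rar")
rf.extractall("data/")

# Load the extracted JSON file as a Hugging Face dataset
dataset = load_dataset("json", data_files="data/data.json", split="train")
print(dataset[0])
```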
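Once the records are loaded, a first inspection step might be counting entities per label. The sketch below is illustrative: `label_counts` is a hypothetical helper, and it assumes each record exposes `entities` as a list of dicts with a `label` key, as in the example format. It runs on a small in-memory sample, but any iterable of such records should work.

```python
from collections import Counter


# Hypothetical helper: count entity labels over an iterable of records.
# Assumption: each record has an "entities" list of dicts with a "label" key.
def label_counts(records):
    counts = Counter()
    for record in records:
        for entity in record.get("entities", []):
            counts[entity["label"]] += 1
    return counts


# Tiny in-memory stand-in for the loaded dataset (the README's example record
# plus one record without entities).
sample = [
    {
        "id": 188,
        "text": "3reft wa7ed lperson khadem f Yassir",
        "entities": [{"start": 29, "end": 35, "label": "ORG"}],
    },
    {"id": 0, "text": "", "entities": []},
]

print(label_counts(sample).most_common())  # prints: [('ORG', 1)]
```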
🔍 Dataset Details
- Source: Social media, dialogues, e-commerce, Wikidata SPARQL
- Annotation:
  - Semi-automated and rule-based extraction
  - Manual normalization of entity surface forms
  - Wikidata QID linking via SPARQL and fallback search
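When working with the linked QIDs, it can help to check their shape and turn them into browsable Wikidata links. The sketch below is not the authors' linking pipeline; `wikidata_url` and `QID_PATTERN` are hypothetical names, and the only assumption is that `wikidata_id` holds a standard item identifier (`Q` followed by a positive integer).

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer without leading zeros.
QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")


# Hypothetical helper: build the Wikidata item page URL for a QID.
def wikidata_url(qid):
    if not QID_PATTERN.match(qid):
        raise ValueError(f"not a valid Wikidata QID: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_url("Q117156470"))
# prints: https://www.wikidata.org/wiki/Q117156470
```

Records whose `wikidata_id` is missing or malformed would raise `ValueError` here, which makes unlinked entities easy to spot during a validation pass.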
👩‍💻 Authors
- Bouguettoucha Hadjer Hanine
- Djouablia Ilhem
📄 License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
📚 Citation
```bibtex
@dataset{bouguettoucha_djouablia_2025,
  author    = {Bouguettoucha, Hadjer Hanine and Djouablia, Ilhem},
  title     = {ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic},
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15798592},
  url       = {https://doi.org/10.5281/zenodo.15798592}
}
```
Owner
- Login: hanine-bgt
- Kind: user
- Repositories: 1
- Profile: https://github.com/hanine-bgt
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this dataset, please cite us using the metadata below:"
title: "ELNER-DZ: A Dataset for Named Entity Recognition and Linking in Algerian Arabic"
version: "1.0"
doi: 10.5281/zenodo.15798592
authors:
  - family-names: Bouguettoucha
    given-names: Hadjer Hanine
  - family-names: Djouablia
    given-names: Ilhem
date-released: 2025-07-01
keywords:
  - named entity recognition
  - entity linking
  - Algerian Arabic
  - Arabizi
  - dialectal Arabic
  - NLP
  - Wikidata
repository-code: https://github.com/hanine-bgt/elner-dz
license: CC-BY-4.0