https://github.com/camel-lab/wild_diacritics
Wild Diacritics paper repo.
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Wild Diacritics
About
This repo contains the code and data for the paper 'Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization', published in the proceedings of ACL 2024.
Data
All files for the Wild2Max and WikiNewsMax datasets are in the data directory.
If you only need the datasets, zipped versions are available on the releases page.
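For readers who take the zipped route, the following is a minimal sketch of unpacking a downloaded archive and listing its contents with the Python standard library. The archive name `wild2max.zip` and the output directory `data_unzipped` are placeholders chosen for illustration, not confirmed release asset names; substitute whatever file you actually download from the releases page.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir):
    """Extract a zip archive and return the sorted relative paths of all files under target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # List every regular file below the target directory, relative to it.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # Placeholder name: not a confirmed release asset of this repo.
    archive = Path("wild2max.zip")
    if archive.exists():
        for name in extract_dataset(archive, "data_unzipped"):
            print(name)
    else:
        print(f"{archive} not found; download it from the releases page first.")
```

Listing the extracted files first is a cheap way to check the archive layout before writing any loading code, since the file formats inside are not described here.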
Code
The helper scripts used to generate all the numbers in the paper are in the wilddiacs_utils directory.
The evaluation scripts for the 'Exploiting Diacritics in the Wild' section of the paper are in the exploiting_wilddiacs directory.
A fork of CAMeL Tools with the Wild Diacritics edits outlined in the paper is available in the ct_wilddiac repo.
License
The Wild2Max and WikiNewsMax datasets are available under the Creative Commons Attribution-ShareAlike License. See LICENSECCBY_SA for more info.
All scripts and code in this repo are available under the MIT license. See LICENSE_MIT for more info.
Citing
If you find any of our work useful or publish work using the Wild2Max or WikiNewsMax datasets, please cite our paper:
```bibtex
@misc{elgamal2024arabicdiacriticswildexploiting,
  title={Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization},
  author={Salman Elgamal and Ossama Obeid and Tameem Kabbani and Go Inoue and Nizar Habash},
  year={2024},
  eprint={2406.05760},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.05760},
}
```
If you publish work using the WikiNewsMax dataset, please additionally cite the paper describing the original WikiNews dataset:
```bibtex
@inproceedings{darwish-etal-2017-arabic,
  title = "{A}rabic Diacritization: Stats, Rules, and Hacks",
  author = "Darwish, Kareem and
    Mubarak, Hamdy and
    Abdelali, Ahmed",
  editor = "Habash, Nizar and
    Diab, Mona and
    Darwish, Kareem and
    El-Hajj, Wassim and
    Al-Khalifa, Hend and
    Bouamor, Houda and
    Tomeh, Nadi and
    El-Haj, Mahmoud and
    Zaghouani, Wajdi",
  booktitle = "Proceedings of the Third {A}rabic Natural Language Processing Workshop",
  month = apr,
  year = "2017",
  address = "Valencia, Spain",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/W17-1302",
  doi = "10.18653/v1/W17-1302",
  pages = "9--17",
  abstract = "In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings. We achieve a low word level diacritization error of 3.29{\%} and 12.77{\%} without and with case endings respectively on a new multi-genre free of copyright test set. We are making the diacritizer available for free for research purposes.",
}
```
Owner
- Name: CAMeL Lab
- Login: CAMeL-Lab
- Kind: organization
- Location: Abu Dhabi, UAE
- Website: http://camel-lab.com
- Repositories: 22
- Profile: https://github.com/CAMeL-Lab
The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi