https://github.com/camel-lab/wild_diacritics

Wild Diacritics paper repo.


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: CAMeL-Lab
  • License: cc-by-sa-4.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 84.3 MB
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

Wild Diacritics

About

This repo contains code and data relating to the 'Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization' paper published in the proceedings of ACL 2024.

Data

The files for the Wild2Max and WikiNewsMax datasets can all be found in the data directory.

If you only need the datasets, zipped versions are available on the releases page.
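When inspecting Arabic text such as the dataset files, it can help to measure how much of it carries diacritics at all. The following is a minimal sketch, not code from this repo: the function names are hypothetical, the diacritic inventory (U+064B to U+0652 plus U+0670) is an assumption rather than the paper's definition, and the on-disk format of the dataset files is not assumed, so the functions take plain strings.

```python
import re

# Assumed inventory of Arabic diacritical marks: tanween, short vowels,
# shadda, and sukun (U+064B..U+0652), plus the dagger alif (U+0670).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with all marks in the assumed inventory removed."""
    return DIACRITICS_RE.sub("", text)


def diacritized_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words carrying at least one diacritic."""
    words = text.split()
    if not words:
        return 0.0
    marked = sum(1 for word in words if DIACRITICS_RE.search(word))
    return marked / len(words)


if __name__ == "__main__":
    # Three words; only the first one carries diacritics.
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u0644\u062F \u0627\u0644\u062F\u0631\u0633"
    print(strip_diacritics(sample))
    print(diacritized_word_ratio(sample))
```

Splitting on whitespace keeps the sketch simple; punctuation attached to a word is counted as part of that word.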

Code

The helper scripts used to generate all the numbers in the paper are in the wilddiacs_utils directory.

You can find all the evaluation scripts relating to the 'Exploiting Diacritics in the Wild' section of the paper in the exploiting_wilddiacs directory.
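For orientation, diacritization output is commonly scored with a word-level error rate, as in the WikiNews paper cited below. The following is a minimal sketch of that idea under stated assumptions, not the evaluation code in exploiting_wilddiacs: the function name is hypothetical, both inputs are assumed to be tokenized identically on whitespace, and NFC normalization is applied so that the same marks typed in a different order (for example shadda before or after a vowel) compare as equal.

```python
import unicodedata


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Assumes both strings split into the same number of whitespace-separated
    words; raises ValueError otherwise.
    """
    # NFC puts combining marks into canonical order, so equivalent
    # sequences of diacritics compare equal.
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    if not ref_words:
        return 0.0
    errors = sum(1 for ref, hyp in zip(ref_words, hyp_words) if ref != hyp)
    return errors / len(ref_words)


if __name__ == "__main__":
    # Two words; the hypothesis gets the final vowel of the second word wrong.
    reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
    hypothesis = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
    print(word_error_rate(reference, hypothesis))
```

A variant that ignores case endings would compare words after dropping the marks on the final letter; that is left out here because it depends on conventions the README does not specify.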

A fork of CAMeL Tools with the Wild Diacritics edits outlined in the paper can be found in the ct_wilddiac repo.

License

The Wild2Max and WikiNewsMax datasets are available under the Creative Commons Attribution-ShareAlike License. See LICENSE_CC_BY_SA for more info.

All scripts and code in this repo are available under the MIT license. See LICENSE_MIT for more info.

Citing

If you find any of our work useful or publish work using the Wild2Max or WikiNewsMax datasets, please cite our paper:

@misc{elgamal2024arabicdiacriticswildexploiting,
  title={Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization},
  author={Salman Elgamal and Ossama Obeid and Tameem Kabbani and Go Inoue and Nizar Habash},
  year={2024},
  eprint={2406.05760},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.05760},
}

If you publish work using the WikiNewsMax dataset, please additionally cite the paper describing the original WikiNews dataset:

@inproceedings{darwish-etal-2017-arabic,
  title = "{A}rabic Diacritization: Stats, Rules, and Hacks",
  author = "Darwish, Kareem and Mubarak, Hamdy and Abdelali, Ahmed",
  editor = "Habash, Nizar and Diab, Mona and Darwish, Kareem and El-Hajj, Wassim and Al-Khalifa, Hend and Bouamor, Houda and Tomeh, Nadi and El-Haj, Mahmoud and Zaghouani, Wajdi",
  booktitle = "Proceedings of the Third {A}rabic Natural Language Processing Workshop",
  month = apr,
  year = "2017",
  address = "Valencia, Spain",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/W17-1302",
  doi = "10.18653/v1/W17-1302",
  pages = "9--17",
  abstract = "In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings. We achieve a low word level diacritization error of 3.29{\%} and 12.77{\%} without and with case endings respectively on a new multi-genre free of copyright test set. We are making the diacritizer available for free for research purposes.",
}

Owner

  • Name: CAMeL Lab
  • Login: CAMeL-Lab
  • Kind: organization
  • Location: Abu Dhabi, UAE

The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2