https://github.com/camel-lab/arabic-gec

Code, models, and data for "Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation". EMNLP 2023.


Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

arabic arabic-nlp deep-learning gec ged grammatical-error-correction grammatical-error-detection nlp
Last synced: 10 months ago · JSON representation

Repository


Basic Info
Statistics
  • Stars: 13
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 1
Topics
arabic arabic-nlp deep-learning gec ged grammatical-error-correction grammatical-error-detection nlp
Created over 3 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

Arabic Grammatical Error Detection and Correction

This repo contains the code and pretrained models needed to reproduce the results in our paper, "Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation".

Requirements:

The code was written for Python >= 3.9, PyTorch 1.11.1, and a modified version of Transformers 4.22.2, plus a few additional packages. To set up the environment with conda (assuming conda and CUDA are already installed):

```bash
git clone https://github.com/CAMeL-Lab/arabic-gec.git
cd arabic-gec

conda create -n gec python=3.9
conda activate gec

pip install -r requirements.txt
```

To obtain multi-class grammatical error detection (GED) labels, we use an enhanced version of ARETA. To avoid dependency conflicts and ensure reproducibility, create a separate environment to run ARETA:

```bash
cd areta

conda create -n areta python=3.7
conda activate areta

pip install -r requirements.txt
```

Experiments and Reproducibility:

This repo is organized as follows:

1. data: includes all the data we used throughout our paper to train and test various systems. This includes alignments, m2edits, GED, GEC, and all the utilities we used.
2. ged: includes the scripts needed to train and evaluate our GED models.
3. gec: includes the scripts needed to train and evaluate our GEC models.
4. alignment: a stand-alone script for the alignment algorithm we introduced in our paper.
5. areta: a slightly improved version of ARETA, an error type annotation tool for Modern Standard Arabic (MSA).
6. transformers: our extended version of Hugging Face's transformers that allows incorporating GED information with seq2seq models.

Hugging Face Integration:

We make our GED and GEC models publicly available on Hugging Face.

GED:

```python
from transformers import pipeline

# Load the 13-class GED tagger from the Hugging Face Hub
ged = pipeline('token-classification', model='CAMeL-Lab/camelbert-msa-qalb14-ged-13')

text = ' '
predictions = ged(text)
print(predictions)

"""
[{'entity': 'MERGE-B', 'score': 0.99943775, 'index': 1, 'word': '', 'start': 0, 'end': 1},
 {'entity': 'MERGE-I', 'score': 0.99959165, 'index': 2, 'word': '', 'start': 2, 'end': 5},
 {'entity': 'UC', 'score': 0.9985884, 'index': 3, 'word': '', 'start': 6, 'end': 8},
 {'entity': 'REPLACE_O', 'score': 0.8346316, 'index': 4, 'word': '', 'start': 9, 'end': 12},
 {'entity': 'UC', 'score': 0.99985325, 'index': 5, 'word': '', 'start': 13, 'end': 16},
 {'entity': 'REPLACE_O', 'score': 0.6836415, 'index': 6, 'word': '', 'start': 17, 'end': 20},
 {'entity': 'UC', 'score': 0.99763715, 'index': 7, 'word': '', 'start': 21, 'end': 27},
 {'entity': 'REPLACE_O', 'score': 0.993848, 'index': 8, 'word': '', 'start': 28, 'end': 33}]
"""
```

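The pipeline returns one dictionary per token, so a natural next step is to keep only the tokens flagged as erroneous. The following is a minimal sketch, not part of the repo: it assumes that the `UC` tag marks a token that needs no change (an interpretation of the label set, not something stated above), and `detected_errors` is a hypothetical helper name. The sample entries are copied from the output shown above.

```python
def detected_errors(predictions, unchanged_label="UC"):
    """Keep only the predictions whose tag differs from the unchanged label.

    Assumption: `unchanged_label` ("UC") means the token needs no correction.
    """
    errors = []
    for pred in predictions:
        if pred["entity"] != unchanged_label:
            errors.append({
                "index": pred["index"],
                "label": pred["entity"],
                "score": float(pred["score"]),
                "span": (pred["start"], pred["end"]),  # character offsets
            })
    return errors


# A few entries in the format returned by the token-classification pipeline
sample = [
    {'entity': 'MERGE-B', 'score': 0.99943775, 'index': 1, 'word': '', 'start': 0, 'end': 1},
    {'entity': 'UC', 'score': 0.9985884, 'index': 3, 'word': '', 'start': 6, 'end': 8},
    {'entity': 'REPLACE_O', 'score': 0.993848, 'index': 8, 'word': '', 'start': 28, 'end': 33},
]

for err in detected_errors(sample):
    print(err)
```

The `span` offsets index into the input string, so they can be used to highlight the flagged tokens in the original text.
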
GEC:

```python
from transformers import AutoTokenizer, BertForTokenClassification, MBartForConditionalGeneration
from camel_tools.disambig.bert import BERTUnfactoredDisambiguator
from camel_tools.utils.dediac import dediac_ar
import torch.nn.functional as F
import torch

bert_disambig = BERTUnfactoredDisambiguator.pretrained()

ged_tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/camelbert-msa-qalb14-ged-13')
ged_model = BertForTokenClassification.from_pretrained('CAMeL-Lab/camelbert-msa-qalb14-ged-13')

gec_tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/arabart-qalb14-gec-ged-13')
gec_model = MBartForConditionalGeneration.from_pretrained('CAMeL-Lab/arabart-qalb14-gec-ged-13')

text = ' .'

# Morphological preprocessing of the input text
text_disambig = bert_disambig.disambiguate(text.split())
morph_pp_text = [dediac_ar(w_disambig.analyses[0].analysis['diac']) for w_disambig in text_disambig]
morph_pp_text = ' '.join(morph_pp_text)

# GED tagging (drop the predictions for the special tokens at both ends)
inputs = ged_tokenizer([morph_pp_text], return_tensors='pt')
logits = ged_model(**inputs).logits
preds = F.softmax(logits, dim=-1).squeeze()[1:-1]
pred_ged_labels = [ged_model.config.id2label[p.item()] for p in torch.argmax(preds, -1)]

# Extending GED labels to the GEC-tokenized input
ged_label2ids = gec_model.config.ged_label2id
tokens, ged_labels = [], []

for word, label in zip(morph_pp_text.split(), pred_ged_labels):
    word_tokens = gec_tokenizer.tokenize(word)
    if len(word_tokens) > 0:
        tokens.extend(word_tokens)
        ged_labels.extend([label for _ in range(len(word_tokens))])

input_ids = gec_tokenizer.convert_tokens_to_ids(tokens)
input_ids = [gec_tokenizer.bos_token_id] + input_ids + [gec_tokenizer.eos_token_id]

# Unknown labels fall back to the '<unk>' id; BOS/EOS positions are tagged 'UC'
label_ids = [ged_label2ids.get(label, ged_label2ids['<unk>']) for label in ged_labels]
label_ids = [ged_label2ids['UC']] + label_ids + [ged_label2ids['UC']]
attention_mask = [1 for _ in range(len(input_ids))]

# 'ged_tags' relies on the extended version of transformers included in this repo
gen_kwargs = {'num_beams': 5, 'max_length': 100,
              'num_return_sequences': 1,
              'no_repeat_ngram_size': 0, 'early_stopping': False,
              'ged_tags': torch.tensor([label_ids]),
              'attention_mask': torch.tensor([attention_mask])
              }

# GEC generation
generated = gec_model.generate(torch.tensor([input_ids]), **gen_kwargs)

generated_text = gec_tokenizer.batch_decode(generated,
                                            skip_special_tokens=True,
                                            clean_up_tokenization_spaces=False
                                            )[0]

print(generated_text)
""" ."""
```
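
The label-extension step in the GEC example can be looked at in isolation: the GED tagger predicts one label per word, while the GEC model works on subword tokens, so each word's label is repeated for every subword it is split into, and the two special-token positions are tagged `UC`. The sketch below illustrates this without loading any model. The names `expand_labels` and `toy_tokenize` are hypothetical, the fixed-width chunking only stands in for a real subword tokenizer, and the `<s>` / `</s>` strings are placeholder special tokens.

```python
def expand_labels(words, word_labels, tokenize,
                  bos="<s>", eos="</s>", pad_label="UC"):
    """Repeat each word-level label for every subword of that word.

    Words that tokenize to nothing are skipped, and the BOS/EOS positions
    receive `pad_label`, mirroring the GEC example above.
    """
    tokens, token_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if pieces:
            tokens.extend(pieces)
            token_labels.extend([label] * len(pieces))
    return [bos] + tokens + [eos], [pad_label] + token_labels + [pad_label]


def toy_tokenize(word, size=3):
    # Hypothetical stand-in for a subword tokenizer: fixed-width chunks
    return [word[i:i + size] for i in range(0, len(word), size)]


tokens, labels = expand_labels(["abcdef", "gh"], ["REPLACE_O", "UC"], toy_tokenize)
print(tokens)  # ['<s>', 'abc', 'def', 'gh', '</s>']
print(labels)  # ['UC', 'REPLACE_O', 'REPLACE_O', 'UC', 'UC']
```

Keeping the token and label sequences the same length is what allows the label ids to be passed to the model alongside the input ids, one tag per position.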

License:

This repo is available under the MIT license. See the LICENSE for more info.

Citation:

If you find the code or data in this repo helpful, please cite our paper:

```bibtex
@inproceedings{alhafni-etal-2023-advancements,
    title = "Advancements in {A}rabic Grammatical Error Detection and Correction: An Empirical Investigation",
    author = "Alhafni, Bashar and
      Inoue, Go and
      Khairallah, Christian and
      Habash, Nizar",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.396",
    pages = "6430--6448",
    abstract = "Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models. We also define the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve SOTA results on two Arabic GEC shared task datasets and establish a strong benchmark on a recently created dataset. We make our code, data, and pretrained models publicly available.",
}
```

Owner

  • Name: CAMeL Lab
  • Login: CAMeL-Lab
  • Kind: organization
  • Location: Abu Dhabi, UAE

The Computational Approaches to Modeling Language (CAMeL) Lab at New York University Abu Dhabi

GitHub Events

Total
  • Issues event: 2
  • Watch event: 5
  • Issue comment event: 4
  • Fork event: 1
Last Year
  • Issues event: 2
  • Watch event: 5
  • Issue comment event: 4
  • Fork event: 1

Dependencies

alignment/evaluate/ced_word_alignment/requirements.txt pypi
  • docopt *
  • rapidfuzz *
areta/aligner/requirements.txt pypi
  • docopt ==0.6.2
  • editdistance ==0.5.3
areta/requirements.txt pypi
  • PyArabic ==0.6.10
  • cachetools ==4.1.1
  • chardet ==3.0.4
  • convert-numbers ==0.4
  • docopt ==0.6.2
  • editdistance ==0.5.3
  • edlib ==1.3.8.post2
  • joblib ==0.17.0
  • matplotlib ==3.3.3
  • nltk ==3.6.6
  • numpy ==1.21.0
  • openpyxl ==3.0.5
  • pandas ==1.1.4
  • pandas-ml ==0.6.1
  • prettytable ==0.7.2
  • pytest ==6.1.1
  • python-Levenshtein ==0.12.0
  • python-editor ==1.0.4
  • regex ==2022.3.2
  • scikit-learn ==0.23.2
  • scipy ==1.5.2
  • tqdm ==4.50.2
  • xlrd ==2.0.1