https://github.com/ai-forever/sage

SAGE: Spelling correction, corruption and evaluation for multiple languages


SAGE

Spelling correction, corruption and evaluation for multiple languages

Install | Models | Evaluation | SBSC | Augmentex | Papers

SAGE (Spell checking via Augmentation and Generative distribution Emulation) is a complete solution for working on a spelling problem:

  • Correct spelling with the pre-trained models described below; you can test them out in Colab.
  • 🧩 Augment your data with spelling corruption algorithms; take a look at a quick demo in Colab.
  • 📊 Evaluate the performance of spelling correction tools.

News

🔥 [2024-01-18]: Our paper "A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages" has been accepted to the EACL 2024 conference!

💥 [2024-04-11]: SAGE v1.1.0 is finally out: a comprehensive note about the details of release can be found here.

Installation

Regular install

```commandline
git clone https://github.com/ai-forever/sage.git
cd sage
pip install .
```

To install the extra requirements needed for the ERRANT-based metric, run

```commandline
pip install -e ".[errant]"
```

or just

```commandline
pip install -e .[errant]
```

Editable install

```commandline
git clone https://github.com/ai-forever/sage.git
cd sage
pip install -e .
```

and proceed with the extra requirements install as above.

Quick demo

Let's spoil some text:

```python
import sage
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor
from sage.utils import DatasetsAvailable

# Russian for "Mind you, it was not me who suggested this!"
text = "Заметьте, не я это предложил!"

# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis
config = SBSCConfig(
    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,
    reference_dataset_split="test"
)
corruptor = SBSCCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'Заиетьте, не я эт о пред ложил!'
```

... now with Augmentex:

```python
import sage
from sage.spelling_corruption import WordAugConfig, WordAugCorruptor

text = "Заметьте, не я это предложил!"

# Instantiate WordAugCorruptor corruptor with a custom set of parameters
config = WordAugConfig(
    min_aug=1,
    max_aug=5,
    unit_prob=0.4,
)
corruptor = WordAugCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'это не предложил! Заметьте, я'
```

... or for the English language:

```python
import os
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor

text = "Screw you guys, I am going home. (c)"

# Instantiate SBSC corruptor from a JFLEG dataset
config = SBSCConfig(
    lang="en",
    reference_dataset_name_or_path=os.path.join("data", "example_data", "jfleg"),
)
corruptor = SBSCCorruptor.from_config(config)

corruptor.corrupt(text, seed=1)
# 'Screw you kuys, I am going home. (c)'
```

Now we can use our models to restore the initial text:

```python
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import RuM2M100ModelForSpellingCorrection, T5ModelForSpellingCorruption

text_ru = "Замтьте не я это предложил"
text_en = "Screw you kuys, I am going home. (c)"

corrector_fred = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_large.value)
corrector_m2m = RuM2M100ModelForSpellingCorrection.from_pretrained(AvailableCorrectors.m2m100_1B.value)
corrector_en = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.ent5_large.value)

print(corrector_fred.correct(text_ru))
# ['Заметьте, не я это предложил.']

print(corrector_m2m.correct(text_ru))
# ['Заметьте не я это предложил']

print(corrector_en.correct(text_en, prefix="grammar: "))
# ['Screw you guys, I am going home. (c)']
```

Evaluate the performance of the models on open benchmarks for spelling correction:

```python
import os
import torch
from sage.utils import DatasetsAvailable
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import T5ModelForSpellingCorruption

corrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)
corrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)

# Move both models to the first GPU
corrector_fred_95m.model.to(torch.device("cuda:0"))
corrector_mt5.model.to(torch.device("cuda:0"))

metrics = corrector_fred_95m.evaluate("RUSpellRU", metrics=["errant", "ruspelleval"], batch_size=32)
print(metrics)
# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}

metrics = corrector_mt5.evaluate("/content/sage/data/example_data/jfleg", metrics=["ruspelleval"], batch_size=16)
print(metrics)
# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}
```

NOTE: if you launch this code snippet in Colab, you will probably run into a memory error, so plan the evaluation so that it fits within the limits of the available device. As a feasible workaround, you can execute `del corrector_fred_95m.model` to free some space.

Spelling Corruption

We implemented two methods for spelling corruption. Statistic-based Spelling Corruption (SBSC) aims to mimic human behaviour when making an error, while Augmentex relies on rule-based heuristics and on common errors and mistypings, especially those made while typing on a keyboard.

🚀 Both methods proved effective for spelling correction systems and yielded substantial performance gains, which are fully reported in our Paper.

Statistic-based Spelling Corruption (SBSC)

This method is thoroughly described in another Paper of ours and in this 🗣️Talk.

Briefly, SBSC follows two simple steps:

  • 🧠 Analyze errors, their types and positions in a source text;
  • ✏️ Reproduce errors from the source text in a new sentence;

🧠 To analyze errors in a source sentence, we need its corresponding correction in order to build the Levenshtein matrix, traverse it back starting from the bottom-right entry and determine the exact position and type of each error. We then aggregate all the obtained statistics and normalize them into valid discrete distributions.
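The analysis step can be illustrated with a minimal, self-contained sketch. It is not the library's labeler: the function names `levenshtein_matrix` and `trace_errors` are hypothetical, and only three character-level error types (insertion, deletion, substitution) are assumed, whereas the actual implementation may distinguish more.

```python
from collections import Counter


def levenshtein_matrix(source: str, correction: str) -> list[list[int]]:
    """d[i][j] is the edit distance between correction[:i] and source[:j]."""
    rows, cols = len(correction) + 1, len(source) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if correction[i - 1] == source[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # correct character missing in the source
                d[i][j - 1] + 1,         # extra character in the source
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d


def trace_errors(source: str, correction: str) -> list[tuple[str, int]]:
    """Walk back from the bottom-right entry and collect
    (error_type, position_in_correction) pairs in left-to-right order."""
    d = levenshtein_matrix(source, correction)
    i, j = len(correction), len(source)
    errors = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and correction[i - 1] == source[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1                 # characters agree, no error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append(("substitution", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            errors.append(("deletion", i - 1))  # writer dropped correction[i - 1]
            i -= 1
        else:
            errors.append(("insertion", i))     # writer added source[j - 1]
            j -= 1
    return errors[::-1]


# Aggregate error types over a toy parallel corpus and normalize the counts.
pairs = [("helo", "hello"), ("kuys", "guys")]
type_counts = Counter(kind for src, cor in pairs for kind, _ in trace_errors(src, cor))
total = sum(type_counts.values())
type_dist = {kind: n / total for kind, n in type_counts.items()}
print(type_dist)
```

When several alignments have the same cost (as with the doubled "l" in "hello"), the order of the branches decides which position is reported; a real labeler would need a documented tie-breaking rule.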

✏️ The "Reproduce" step is even less complicated: we just sample the number of errors per sentence, their types and relative positions from the corresponding distributions and apply them to a correct sentence.
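The "Reproduce" step could be sketched roughly as follows. This is an illustration under stated assumptions rather than the SBSC implementation: `reproduce_errors`, the three distribution dictionaries and the `alphabet` argument are hypothetical names, and the distributions are toy values instead of statistics fitted on a dataset.

```python
import random


def sample(rng: random.Random, dist: dict):
    """Draw one key from a discrete distribution given as {value: probability}."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def reproduce_errors(sentence, count_dist, type_dist, position_dist, alphabet, seed=None):
    """Apply sampled errors to a correct sentence.

    count_dist:    {number_of_errors: probability}
    type_dist:     {"insertion" | "deletion" | "substitution": probability}
    position_dist: {relative_position_in_[0, 1): probability}
    """
    rng = random.Random(seed)
    chars = list(sentence)
    for _ in range(sample(rng, count_dist)):
        if not chars:
            break
        kind = sample(rng, type_dist)
        rel = sample(rng, position_dist)
        pos = min(int(rel * len(chars)), len(chars) - 1)  # relative -> absolute index
        if kind == "deletion":
            del chars[pos]
        elif kind == "insertion":
            chars.insert(pos, rng.choice(alphabet))
        else:  # substitution
            chars[pos] = rng.choice(alphabet)
    return "".join(chars)


# Toy distributions: one or two errors, mostly substitutions, spread over the sentence.
corrupted = reproduce_errors(
    "Screw you guys, I am going home.",
    count_dist={1: 0.6, 2: 0.4},
    type_dist={"substitution": 0.5, "deletion": 0.25, "insertion": 0.25},
    position_dist={0.1: 0.3, 0.5: 0.4, 0.9: 0.3},
    alphabet="abcdefghijklmnopqrstuvwxyz",
    seed=1,
)
print(corrupted)
```

Passing a `seed` makes the corruption repeatable, which mirrors the `seed` argument of `corrupt` in the demo above.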

As stated, you need a parallel dataset to "fit" SBSC. We provide a set of four datasets with natural errors covering a wide range of domains:

  • RUSpellRU: texts collected from LiveJournal, with manually corrected typos and errors;
  • MultidomainGold: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;
  • MedSpellChecker: texts with errors from medical anamnesis;
  • GitHubTypoCorpusRu: spelling errors and typos in commits from GitHub;

You can use them as simply as this:

```python
import sage
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor
from sage.utils import DatasetsAvailable

# Instantiate SBSC corruptor from a dataset with errors in medical anamnesis
config = SBSCConfig(
    reference_dataset_name_or_path=DatasetsAvailable.MedSpellchecker.name,
    reference_dataset_split="test"
)
corruptor = SBSCCorruptor.from_config(config)
```

... or you can initialize your SBSC from a locally stored dataset:

```python
import os
from sage.spelling_corruption import SBSCConfig, SBSCCorruptor

# Instantiate SBSC corruptor from a JFLEG dataset
config = SBSCConfig(
    lang="en",
    reference_dataset_name_or_path=os.path.join("data", "example_data", "jfleg"),
)
corruptor = SBSCCorruptor.from_config(config)
```

✅ To check how well SBSC actually approximates the original errors, you can plot side-by-side graphs of the original and synthetically generated distributions:



To produce these graphs, you can simply run:

```python
from sage.utils import load_available_dataset_from_hf, draw_and_save_errors_distributions_comparison_charts
from sage.spelling_corruption.sbsc.labeler import process_mistypings
from sage.spelling_corruption import SBSCCorruptor

# Statistics of natural errors in the reference dataset
sources, corrections = load_available_dataset_from_hf("RUSpellRU", for_labeler=True, split="train")
ruspellru_stats, ruspellru_confusion_matrix, ruspellru_typos_cnt = process_mistypings(sources, corrections)

# Corrupt the correct sentences with SBSC and collect the same statistics
corruptor = SBSCCorruptor.from_default_config()
spoiled_sentences = corruptor.batch_corrupt(corrections)

sbsc_stats, sbsc_confusion_matrix, sbsc_typos_cnt = process_mistypings(spoiled_sentences, corrections)

draw_and_save_errors_distributions_comparison_charts(
    actual_typos_cnt=sbsc_typos_cnt,
    reference_typos_cnt=ruspellru_typos_cnt,
    actual_stats=sbsc_stats,
    reference_stats=ruspellru_stats,
    path_to_save="ruspellru_sbsc.jpg"
)
```

Augmentex

Augmentex combines rule-based heuristics with common-error statistics (powered by the KartaSlov project) to insert errors into text. It is also fully described in the Paper and in this 🗣️Talk.

🖇️ Augmentex allows you to operate on two levels of granularity when it comes to text corruption and offers you sets of specific methods suited for each level:

  • Word level:
      • replace - replace a random word with its incorrect counterpart;
      • delete - delete random word;
      • swap - swap two random words;
      • stopword - add random words from stop-list;
      • reverse - change a case of the first letter of a random word;
  • Character level:
      • shift - randomly swaps upper / lower case in a string;
      • orfo - substitute correct characters with their common incorrect counterparts;
      • typo - substitute correct characters as if they are mistyped on a keyboard;
      • delete - delete random character;
      • multiply - multiply random character;
      • swap - swap two adjacent characters;
      • insert - insert random character;
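To make a few of the character-level operations concrete, here is a small plain-Python sketch of swap, delete and multiply. These helpers are hypothetical illustrations and are not part of the Augmentex API; in particular, the `times` parameter is an assumption about what "multiply" means.

```python
import random


def swap_adjacent(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def delete_char(text: str, rng: random.Random) -> str:
    """Delete one random character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def multiply_char(text: str, rng: random.Random, times: int = 3) -> str:
    """Repeat one random character `times` times in place."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] * times + text[i + 1:]


rng = random.Random(1)
word = "spelling"
for operation in (swap_adjacent, delete_char, multiply_char):
    print(operation.__name__, "->", operation(word, rng))
```

Each helper takes an explicit `random.Random` instance so that a fixed seed reproduces the same corruption.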

To access Augmentex, you only need a few lines:

```python
from sage.spelling_corruption import CharAugConfig, CharAugCorruptor

config = CharAugConfig(
    unit_prob=0.3,  # proportion of characters that is going to undergo edits
    min_aug=1,      # minimum number of edits
    max_aug=5,      # maximum number of edits
    mult_num=3      # multiply edit
)
corruptor = CharAugCorruptor.from_config(config)
```

... or like this:

```python
from sage.spelling_corruption import WordAugConfig, WordAugCorruptor

config = WordAugConfig(
    unit_prob=0.4,  # proportion of words that is going to undergo edits
    min_aug=1,      # minimum number of edits
    max_aug=5,      # maximum number of edits
)
corruptor = WordAugCorruptor.from_config(config)
```

Augmentex was created by our fellow team. The project has its own repo, so do not forget to take a look!

Spelling Correction

Our methodology for obtaining a model with optimal performance on the spellchecking task is thoroughly described in our Paper. The algorithm is simple and generally consists of two steps:

  • Pre-train model on extensive parallel corpus with synthetically generated errors;
  • Fine-tune on combinations of available datasets for spelling correction with "human-made" errors;

We use Augmentex and SBSC both for generating large synthetic corpora and for augmenting datasets with natural errors. The family of pre-trained correctors now amounts to 8 models.

We have 6 🤗Transformer models for Russian 🇷🇺:

  • sage-fredt5-large
  • sage-fredt5-distilled-95m
  • sage-m2m100-1.2B
  • M2M100-1.2B [Earlier release]
  • M2M100-418M [Earlier release]
  • FredT5-large [Earlier release]

And two models for English 🇬🇧:

  • T5-large
  • sage-mt5-large

Models for the Russian language were pre-trained on a combination of Russian Wikipedia and video transcriptions with artificial errors generated by SBSC from statistics gathered on the train split of RUSpellRU. Correctors for English were trained on a mixture of English Wikipedia articles and news posts with synthetic errors inserted by SBSC fitted on statistics from a 5k subsample of BEA60K.

📚 We also validate our solutions on available datasets with "human-made" errors:

  • RUSpellRU: texts collected from LiveJournal, with manually corrected typos and errors;
  • MultidomainGold: examples from 7 text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works;
  • MedSpellChecker: texts with errors from medical anamnesis;
  • GitHubTypoCorpusRu: spelling errors and typos in commits from GitHub;
  • BEA60K: English spelling errors collected from several domains;
  • JFLEG: 1601 sentences in English, which contain about 2 thousand spelling errors;

📈 Here we report the evaluation of some setups:

  • Zero-shot evaluation of pre-trained checkpoints;
  • Additional fine-tuning (ft.) on the target dataset;

The full list of setups and the corresponding performance figures are in the Paper.

RUSpellRU, MultidomainGold, MedSpellChecker and GitHubTypoCorpusRu come from spellcheck_punctuation_benchmark. The benchmark accounts for both punctuation and spelling errors. For simplicity and better representativeness, we report results only for the models (sage-fredt5-large, sage-fredt5-distilled-95m) that handle both types of errors for the Russian language. Detailed metrics for the other checkpoints can be found in the Paper, the post or the corresponding model card.

NOTE: MedSpellChecker and GitHubTypoCorpusRu do not have a train split, so their performance in the pre-train + fine-tune setup is reported after fine-tuning on a combination of the RUSpellRU and MultidomainGold datasets.

RUSpellRU Evaluation

| Model | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sage-ai-service | 90.3 | 86.3 | 88.2 | 90.3 | 86.6 | 88.4 | 95.2 | 95.9 | 95.6 |
| sage-fredt5-large | 57.3 | 68.0 | 62.2 | 86.7 | 46.1 | 60.2 | 92.1 | 67.8 | 78.1 |
| sage-fredt5-large (ft.) | 88.4 | 80.9 | 84.5 | 88.2 | 85.3 | 86.8 | 95.5 | 94.0 | 94.7 |
| sage-fredt5-distilled-95m (ft.) | 83.5 | 74.8 | 78.9 | 86.8 | 80.6 | 83.6 | 94.4 | 92.5 | 93.5 |
| gpt-3.5-turbo | 33.6 | 58.5 | 42.7 | 85.9 | 64.6 | 73.7 | 84.9 | 73.9 | 79.0 |
| gpt-4 | 54.9 | 76.7 | 64.0 | 84.0 | 82.3 | 83.2 | 91.5 | 90.2 | 90.9 |

MultidomainGold Evaluation

| Model | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sage-ai-service | 81.6 | 77.7 | 79.6 | 70.2 | 67.5 | 68.8 | 80.5 | 80.5 | 80.5 |
| sage-fredt5-large | 43.4 | 49.7 | 46.3 | 21.8 | 21.3 | 21.6 | 58.8 | 23.9 | 34.0 |
| sage-fredt5-large (ft.) | 80.3 | 75.1 | 77.6 | 69.0 | 66.5 | 67.7 | 78.6 | 80.0 | 79.3 |
| sage-fredt5-distilled-95m (ft.) | 77.2 | 69.9 | 73.4 | 66.8 | 63.4 | 65.0 | 76.8 | 79.1 | 77.9 |
| gpt-3.5-turbo | 18.8 | 48.1 | 27.1 | 42.0 | 31.8 | 36.2 | 47.1 | 51.3 | 49.1 |
| gpt-4 | 25.4 | 68.0 | 37.0 | 57.8 | 54.3 | 56.0 | 54.0 | 67.5 | 60.0 |

MedSpellchecker Evaluation

| Model | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sage-ai-service | 71.3 | 73.5 | 72.4 | 75.1 | 69.2 | 72.0 | 80.9 | 72.8 | 76.6 |
| sage-fredt5-large | 35.2 | 54.5 | 42.8 | 19.2 | 13.2 | 15.7 | 48.7 | 36.8 | 41.9 |
| sage-fredt5-large (ft.) | 72.5 | 72.2 | 72.3 | 74.6 | 66.4 | 70.3 | 79.3 | 85.1 | 82.1 |
| sage-fredt5-distilled-95m (ft.) | 65.1 | 64.8 | 64.9 | 78.6 | 63.1 | 70.0 | 63.5 | 74.7 | 68.7 |
| gpt-3.5-turbo | 14.7 | 45.9 | 22.3 | 69.9 | 52.3 | 59.8 | 26.4 | 41.8 | 32.3 |
| gpt-4 | 37.8 | 72.3 | 49.6 | 81.4 | 64.3 | 71.9 | 73.0 | 62.1 | 67.1 |

GitHubTypoCorpusRu Evaluation

| Model | Pr. (spell) | Rec. (spell) | F1 (spell) | Pr. (punc) | Rec. (punc) | F1 (punc) | Pr. (case) | Rec. (case) | F1 (case) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sage-ai-service | 70.8 | 56.3 | 62.7 | 48.9 | 35.8 | 41.4 | 32.9 | 45.3 | 38.1 |
| sage-fredt5-large | 46.0 | 46.6 | 46.3 | 22.7 | 18.3 | 20.2 | 12.0 | 13.2 | 12.6 |
| sage-fredt5-large (ft.) | 67.5 | 53.2 | 59.5 | 48.5 | 38.0 | 42.6 | 37.3 | 50.0 | 42.7 |
| sage-fredt5-distilled-95m (ft.) | 57.8 | 48.5 | 52.7 | 45.2 | 39.5 | 42.1 | 29.9 | 46.2 | 36.3 |
| gpt-3.5-turbo | 23.7 | 38.7 | 29.4 | 37.6 | 23.3 | 28.7 | 19.6 | 35.9 | 25.3 |
| gpt-4 | 27.0 | 52.8 | 35.7 | 45.9 | 32.6 | 38.2 | 25.7 | 36.8 | 30.2 |

BEA60K Evaluation

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| sage-mt5-large | 64.7 | 83.8 | 73.0 |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| gpt-3.5-turbo | 66.9 | 84.1 | 74.5 |
| gpt-4 | 68.6 | 85.2 | 76.0 |
| Bert | 65.8 | 79.6 | 72.0 |
| SC-LSTM | 62.2 | 80.3 | 72.0 |

JFLEG Evaluation

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| sage-mt5-large | 74.9 | 88.4 | 81.1 |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| gpt-3.5-turbo | 77.8 | 88.6 | 82.9 |
| gpt-4 | 77.9 | 88.3 | 82.8 |
| Bert | 78.5 | 85.4 | 81.8 |
| SC-LSTM | 80.6 | 86.1 | 83.2 |

RUSpellRU, MultidomainGold, MedSpellChecker and GitHubTypoCorpusRu are available as HuggingFace datasets here and through the API of our library:

```python
from sage.utils import load_available_dataset_from_hf, DatasetsAvailable

print([dataset.name for dataset in DatasetsAvailable])
# ['MultidomainGold', 'RUSpellRU', 'MedSpellchecker', 'GitHubTypoCorpusRu', 'MultidomainGold_orth', 'RUSpellRU_orth', 'MedSpellchecker_orth', 'GitHubTypoCorpusRu_orth']

gold_dataset = load_available_dataset_from_hf(DatasetsAvailable.MultidomainGold.name, for_labeler=False)
print(len(gold_dataset))
# 7675

sources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split="train")
print(len(sources), len(corrections))
# 2000 2000
```

Evaluation

We also provide functionality to evaluate the performance of spelling correction systems and rank them.

🎯 Currently two options are available:

  • ruspelleval;
  • ERRANT-based metric adapted for the Russian language;

Both algorithms output Precision, Recall and F1 scores, which can be interpreted as follows:

  • Precision: one minus the share of unnecessary amendments;
  • Recall: the proportion of expected corrections that were made;
  • F1: the harmonic mean of the two;
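As a worked illustration of how the three scores relate, the sketch below computes them from raw counts. The function name and the counting scheme (correct amendments, unnecessary amendments, missed corrections) are assumptions made for this example; the exact counting rules of ruspelleval and the ERRANT-based metric may differ.

```python
def precision_recall_f1(correct: int, unnecessary: int, missed: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) in percent.

    correct:     amendments made by the system that were expected
    unnecessary: amendments made by the system that were not expected
    missed:      expected corrections the system did not make
    """
    made = correct + unnecessary
    expected = correct + missed
    precision = 100.0 * correct / made if made else 0.0
    recall = 100.0 * correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# 3 correct amendments, 1 unnecessary amendment, 3 missed corrections
print(precision_recall_f1(correct=3, unnecessary=1, missed=3))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores, so a system cannot obtain a high F1 by over-correcting (high recall, low precision) or by barely correcting at all.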

You can obtain these metrics simply by running:

```python
from sage.evaluation import Scorer
from sage.utils import DatasetsAvailable, load_available_dataset_from_hf

sources, corrections = load_available_dataset_from_hf(DatasetsAvailable.RUSpellRU.name, for_labeler=True, split="test")

# The gold corrections are passed as the system answers here, hence the perfect scores
scorer = Scorer()
metrics = scorer.score(sources, corrections, corrections, metrics=["ruspelleval", "errant"])
print(metrics)
# {'Precision': 100.0, 'Recall': 100.0, 'F1': 100.0, 'CASE_Precision': 100.0, 'CASE_Recall': 100.0, 'CASE_F1': 100.0, 'SPELL_Precision': 100.0, 'SPELL_Recall': 100.0, 'SPELL_F1': 100.0, 'PUNCT_Precision': 100.0, 'PUNCT_Recall': 100.0, 'PUNCT_F1': 100.0, 'YO_Precision': 100.0, 'YO_Recall': 100.0, 'YO_F1': 100.0}
```

... or by directly assessing the model:

```python
import os
import torch
from sage.utils import DatasetsAvailable
from sage.spelling_correction import AvailableCorrectors
from sage.spelling_correction import T5ModelForSpellingCorruption

corrector_fred_95m = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_fredt5_distilled_95m.value)
corrector_mt5 = T5ModelForSpellingCorruption.from_pretrained(AvailableCorrectors.sage_mt5_large.value)

# Move both models to the first GPU
corrector_fred_95m.model.to(torch.device("cuda:0"))
corrector_mt5.model.to(torch.device("cuda:0"))

metrics = corrector_fred_95m.evaluate("RUSpellRU", metrics=["errant", "ruspelleval"], batch_size=32)
print(metrics)
# {'CASE_Precision': 94.41, 'CASE_Recall': 92.55, 'CASE_F1': 93.47, 'SPELL_Precision': 77.52, 'SPELL_Recall': 64.09, 'SPELL_F1': 70.17, 'PUNCT_Precision': 86.77, 'PUNCT_Recall': 80.59, 'PUNCT_F1': 83.56, 'YO_Precision': 46.21, 'YO_Recall': 73.83, 'YO_F1': 56.84, 'Precision': 83.48, 'Recall': 74.75, 'F1': 78.87}

metrics = corrector_mt5.evaluate("/content/sage/data/example_data/jfleg", metrics=["ruspelleval"], batch_size=16)
print(metrics)
# {'Precision': 75.94, 'Recall': 88.15, 'F1': 81.59}
```

The metrics output by the ERRANT-based algorithm are indicated by the corresponding prefix, which refers to the specific type of errors:

  • CASE: erroneously used case;
  • SPELL: spelling and grammar errors;
  • PUNCT: punctuation errors;
  • YO: unnecessary replacement of "YO" (ё) letter;
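Since the prefixed and unprefixed scores arrive in one flat dictionary, it can be convenient to regroup them by error type. The helper below is a hypothetical convenience function, not part of the library; it assumes only that per-type keys start with one of the four prefixes listed above, with or without an underscore after the prefix.

```python
PREFIXES = ("CASE", "SPELL", "PUNCT", "YO")


def group_metrics(metrics: dict) -> dict:
    """Regroup a flat metrics dict into {error_type: {score_name: value}}.

    Keys without a known prefix are collected under "overall".
    """
    grouped = {"overall": {}}
    for key, value in metrics.items():
        for prefix in PREFIXES:
            if key.startswith(prefix):
                name = key[len(prefix):].lstrip("_")  # drop the prefix and separator
                grouped.setdefault(prefix, {})[name] = value
                break
        else:
            grouped["overall"][key] = value
    return grouped


flat = {"CASE_F1": 93.47, "SPELL_F1": 70.17, "PUNCT_F1": 83.56, "YO_F1": 56.84, "F1": 78.87}
for error_type, scores in group_metrics(flat).items():
    print(error_type, scores)
```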

📌 Credit for evaluation script of ruspelleval metric goes to Aleksei Sorokin and his notable work in proceedings of SpellRueval.

Citation

If you want to know more about our work take a look at these publications:

💥 Our EACL 2024 Paper provides a thorough description of the methodology used to obtain SOTA models for spelling correction, as well as comprehensive reports of all the experiments that have been carried out.

💫 Our Dialogue-2023 Paper focuses on exploiting resources for the task of spelling correction and on procedures for obtaining high-quality parallel corpora.

```
@inproceedings{martynov-etal-2024-methodology,
    title = "A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages",
    author = "Martynov, Nikita and Baushenko, Mark and Kozlova, Anastasia and Kolomeytseva, Katerina and Abramov, Aleksandr and Fenogenova, Alena",
    editor = "Graham, Yvette and Purver, Matthew",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.10",
    pages = "138--155",
    abstract = "Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping. In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models{'} pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts. We conducted experiments employing various corruption strategies, models{'} architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).",
}

@inproceedings{martynov2023augmentation,
    title={Augmentation methods for spelling corruptions},
    author={Martynov, Nikita and Baushenko, Mark and Abramov, Alexander and Fenogenova, Alena},
    booktitle={Proceedings of the International Conference “Dialogue”},
    volume={2023},
    year={2023}
}
```

📌 Feel free to send any questions about our work to the corresponding point of contact:

nikita.martynov.98@list.ru
