corrupted-text
A Python library to generate out-of-distribution text datasets. Specifically, the library applies model-independent, commonplace corruptions (not model-specific, worst-case adversarial corruptions). We thus aim to enable benchmark studies of robustness against realistic outliers.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Repository
A Python library to generate out-of-distribution text datasets. Specifically, the library applies model-independent, commonplace corruptions (not model-specific, worst-case adversarial corruptions). We thus aim to enable benchmark studies of robustness against realistic outliers.
Basic Info
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Corrupted-Text: Realistic Out-of-Distribution Texts
A Python library to generate out-of-distribution text datasets. Specifically, the library applies model-independent, commonplace corruptions (not model-specific, worst-case adversarial corruptions). We thus aim to enable benchmark studies of robustness against realistic outliers.
Implemented Corruptions
Most corruptions are based on a set of common words to which a corruptor is fitted. Since these common words may be domain-specific, the corruptor is fitted on a base dataset from which the most common words are extracted.
The following corruptions are then randomly applied on a per-word basis:
- **Bad Autocorrection**: Words are replaced with another common word at a small Levenshtein distance. This mimics wrong autocorrection, as done, for example, by "intelligent" mobile-phone keyboards.
- **Bad Autocompletion**: Words are replaced with another common word that starts with the same letters. This mimics wrong autocompletion. If no common word sharing at least the first 3 letters is found, a bad autocorrection is attempted instead.
- **Bad Synonym**: Words are replaced with a synonym according to a naive, flat mapping extracted from WordNet, ignoring context. This mimics dictionary-based translations, which are often wrong. It assumes an English-language dataset.
- **Typo**: A single letter is replaced with another, randomly chosen letter.
At most one corruption is applied to any given word, i.e., corruptions are not stacked on top of each other.
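To make the per-word mechanics concrete, here is a minimal, self-contained sketch of the bad-autocorrection and typo corruptions. It is an illustration only, not the library's internal code; the toy word list and function names are invented for this example, and the Levenshtein distance comes from polyleven, which the library itself declares as a dependency.

```python
# Illustrative sketch only -- not corrupted-text's actual implementation.
import random
import string

from polyleven import levenshtein  # fast Levenshtein distance (a declared dependency)

# Toy stand-in for the common-word list a fitted corruptor would hold.
COMMON_WORDS = ["movie", "music", "house", "mouse", "horse", "great"]

def bad_autocorrection(word: str, max_distance: int = 2) -> str:
    """Replace `word` with a common word at a small Levenshtein distance, if any."""
    candidates = [w for w in COMMON_WORDS
                  if w != word and levenshtein(w, word, max_distance) <= max_distance]
    return random.choice(candidates) if candidates else word

def typo(word: str) -> str:
    """Replace one randomly chosen letter with a different random letter."""
    if not word:
        return word
    i = random.randrange(len(word))
    replacement = random.choice([c for c in string.ascii_lowercase if c != word[i]])
    return word[:i] + replacement + word[i + 1:]

random.seed(1)
print(bad_autocorrection("house"))  # e.g. "mouse" or "horse"
print(typo("movie"))                # e.g. "mozie"
```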
The severity, a parameter in ]0, 1], steers how many corruptions are applied. It roughly corresponds to the percentage of words to be corrupted (only roughly, as not all bad-autocompletion attempts are successful, and as bad synonyms sometimes consist of multiple words, thus increasing the number of words in the text).
Optionally, users can define weights for each corruption type, steering how often each is applied.
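The following sketch shows one plausible way severity and per-type weights could interact at the word level. This is assumed behavior for illustration; the actual selection logic inside corrupted-text may differ.

```python
# Sketch of severity/weight-driven corruption selection (assumed behavior).
import random

def corrupt_words(words, severity, weights):
    """Corrupt roughly `severity` * len(words) words, picking the corruption
    type with probability proportional to its weight; at most one per word."""
    kinds = list(weights)
    probs = [weights[k] for k in kinds]
    corrupted = []
    for word in words:
        if random.random() < severity:  # roughly a `severity` fraction of words
            kind = random.choices(kinds, weights=probs, k=1)[0]
            corrupted.append(f"<{kind}:{word}>")  # placeholder for the real corruption
        else:
            corrupted.append(word)
    return corrupted

random.seed(1)
print(corrupt_words("this movie was a great surprise".split(), severity=0.5,
                    weights={"autocorrect": 2, "autocomplete": 1, "synonym": 1, "typo": 1}))
```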
Accuracies
The following table shows the accuracy of a regular, simple transformer model on the IMDB sentiment classification dataset. Clearly, the higher the chosen corruption severity, the lower the model accuracy.
| Severity | 0 (*) | 0.1  | 0.3  | 0.5  | 0.7  | 0.9  | 1 (max) |
|----------|-------|------|------|------|------|------|---------|
| Accuracy | 0.87  | 0.81 | 0.78 | 0.75 | 0.71 | 0.66 | 0.64    |

(*) No corruption, original test set.
Installation
It's as simple as `pip install corrupted-text`.
You'll need Python >= 3.7.
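For copy-paste convenience, the same install command as a shell snippet:

```bash
pip install corrupted-text
```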
Usage
Usage is very straightforward. The following example shows how to corrupt the IMDB sentiment classification dataset.
You can also run the example in Colab.
```python
import logging

import corrupted_text  # pip install corrupted-text
from datasets import load_dataset  # pip install datasets

# Enable detailed logging.
logging.basicConfig(level=logging.INFO)

# Load the dataset (we use huggingface-datasets, but any list of strings is fine).
nominal_train = load_dataset("imdb", split="train")["text"]
nominal_test = load_dataset("imdb", split="test")["text"]

# Fit a corruptor (we fit on the training and test set, but as this takes a
# while, you'd want to choose a smaller subset for larger datasets).
corruptor = corrupted_text.TextCorruptor(base_dataset=nominal_test + nominal_train,
                                         cache_dir=".mycache")

# Corrupt the test set with severity 0.5. The result is again a list of corrupted strings.
imdb_corrupted = corruptor.corrupt(nominal_test, severity=0.5, seed=1)
```
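Since the result is a list of strings parallel to the input, a quick way to eyeball the effect is a side-by-side print (hypothetical follow-up snippet, reusing the variables defined above):

```python
# Compare an original review to its corrupted counterpart (illustration only).
print(nominal_test[0][:200])    # original text
print(imdb_corrupted[0][:200])  # corrupted version of the same review
```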
Citation
```bibtex
@inproceedings{Weiss2022SimpleTip,
  title={Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replication Paper)},
  author={Weiss, Michael and Tonella, Paolo},
  booktitle={Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2022}
}
```
Other Corrupted Datasets
- MNIST-C by Mu and Gilmer
- CIFAR-10-C by Hendrycks and Dietterich
- ImageNet-C by Hendrycks and Dietterich
- Fashion-MNIST-C by Weiss and Tonella (i.e., the same authors as corrupted-text)
Owner
- Name: testingautomated-usi
- Login: testingautomated-usi
- Kind: organization
- Repositories: 11
- Profile: https://github.com/testingautomated-usi
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Simple Techniques Work Surprisingly Well for Neural
  Network Test Prioritization and Active Learning
  (Replication Paper)
message: >-
  When using this software, please cite the paper from
  which this software is an artifact.
type: software
authors:
  - given-names: Michael
    family-names: Weiss
    email: michael.weiss@usi.ch
    affiliation: Università della Svizzera italiana
    orcid: 'https://orcid.org/0000-0002-8944-389X'
  - given-names: Paolo
    family-names: Tonella
    email: paolo.tonella@usi.ch
    affiliation: Università della Svizzera italiana
    orcid: 'https://orcid.org/0000-0003-3088-0339'
preferred-citation:
  type: article
  authors:
    - given-names: Michael
      family-names: Weiss
      email: michael.weiss@usi.ch
      affiliation: Università della Svizzera italiana
      orcid: 'https://orcid.org/0000-0002-8944-389X'
    - given-names: Paolo
      family-names: Tonella
      email: paolo.tonella@usi.ch
      affiliation: Università della Svizzera italiana
      orcid: 'https://orcid.org/0000-0003-3088-0339'
  journal: "Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis"
  title: "Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replication Paper)"
  year: 2022
```
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Michael Weiss | c****e@m****h | 6 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: 21 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- MiWeiss (4)
Packages
- Total packages: 1
- Total downloads: 49 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 1
- Total maintainers: 1
pypi.org: corrupted-text
Corruption of text datasets; model-independent and inspired by real-world corruption causes.
- Homepage: https://github.com/testingautomated-usi/corrupted-text
- Documentation: https://corrupted-text.readthedocs.io/
- License: MIT
- Latest release: 0.2.0 (published almost 4 years ago)
Rankings
Maintainers (1)
Dependencies
- numpy >=1.16.4
- polyleven ==0.7