s_jsd-multilingual-bias

Code and data for the paper "An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models" (Findings of NAACL 2022)

https://github.com/vsteinborn/s_jsd-multilingual-bias

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary

Keywords

dataset gender-bias information-theory metrics nlp translation

Keywords from Contributors

interactive mesh interpretability profiles sequences generic projection standardization optim embedded

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: VSteinborn
Language: Python
Default Branch: main
Homepage:
Size: 289 KB

Statistics

Stars: 5
Watchers: 1
Forks: 0
Open Issues: 5
Releases: 0

Topics

dataset gender-bias information-theory metrics nlp translation

Created about 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Citation

S_JSD Multilingual Gender Bias

This repository contains the code, data, and supplementary material for the paper "An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models" (Findings of NAACL 2022).

Dataset

The dataset consists of edited and translated CrowS-Pairs sentence pairs. The sentences have been modified according to the suggestions of Blodgett et al. (2021) prior to translation. Translators were supplied translation instructions in the corresponding instruction sheet.

The dataset consists of five CSV files, one for each language. The language of each CSV file is indicated by the language code in its file name:

English (en), German (de), Thai (th), Indonesian (id) and Finnish (fi)

The columns of the csv files have the following meanings:

ID: The row in the CrowS-Pairs dataset where the original version of the sentence pair may be found.
A_en: The edited English version of the more stereotypical CrowS-Pairs sentence.
B_en: The edited English version of the less stereotypical CrowS-Pairs sentence (a swapped variant of A_en).
A_x: The translation of A_en into the target language.
B_x: The translation of B_en into the target language.
stereo_antistereo: The bias direction from the CrowS-Pairs study
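
As a rough illustration of the column layout described above, the following sketch reads sentence pairs with Python's standard csv module. It assumes the header row uses the column names exactly as listed (ID, A_en, B_en, A_x, B_x, stereo_antistereo); the actual headers in a given language file may differ. The sample rows, the stereo/antistereo values shown, and the helper name read_pairs are invented for the example and are not taken from the dataset or the repository code.

```python
import csv
import io

# Tiny in-memory stand-in for one language file. The rows are invented
# placeholders, NOT real dataset entries; the header names are assumed
# to match the column list in the README.
SAMPLE_CSV = """ID,A_en,B_en,A_x,B_x,stereo_antistereo
12,She sentence.,He sentence.,Sie-Satz.,Er-Satz.,stereo
34,He sentence.,She sentence.,Er-Satz.,Sie-Satz.,antistereo
"""


def read_pairs(handle, col_a="A_x", col_b="B_x"):
    """Return (ID, sentence A, sentence B, bias direction) tuples.

    read_pairs is a hypothetical helper; col_a and col_b select which
    pair of columns (translated or English) to read.
    """
    reader = csv.DictReader(handle)
    return [
        (row["ID"], row[col_a], row[col_b], row["stereo_antistereo"])
        for row in reader
    ]


# Read the translated pairs, then the edited English pairs.
pairs = read_pairs(io.StringIO(SAMPLE_CSV))
english_pairs = read_pairs(io.StringIO(SAMPLE_CSV), col_a="A_en", col_b="B_en")

for pair in pairs:
    print(pair)
```

To read one of the real files, the io.StringIO object could be replaced with a handle opened via open(path, newline="", encoding="utf-8"), where path points at the language file of interest.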

Scripting

In this work we used Python 3.8.11 with the packages listed in requirements.txt. The required packages may be installed via:

pip install -r requirements.txt

Subsequently, the script may be run via the following command.

python main.py
    --input INPUT          path to sentence pairs
    --out_dir OUT_DIR      path to output directory for sentence-level data
    --model {              Model to use in analysis
        bert-multi,        mBERT (cased)
        xlm-roberta,       xlmR (base)
        xlm-roberta-L,     xlmR (large)
        bert,              BERT (base-uncased)
        roberta,           RoBERTa (large)
        albert}            ALBERT (xxlarge-v2)
    [--perturb]            Removes the final character of each sentence

Results of the measures are printed to the terminal. They may be redirected to a text file, for example by appending with >>.
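
The command-line options above can be assembled programmatically when running several models or languages in a row. The sketch below only builds and prints the command; it does not run main.py. The helper name build_command and the example paths (data/de.csv, out/de, results.txt) are hypothetical, and the list of model names is copied from the usage text rather than from the script itself.

```python
import shlex

# Model identifiers as listed in the usage text for --model.
MODEL_CHOICES = (
    "bert-multi",     # mBERT (cased)
    "xlm-roberta",    # xlmR (base)
    "xlm-roberta-L",  # xlmR (large)
    "bert",           # BERT (base-uncased)
    "roberta",        # RoBERTa (large)
    "albert",         # ALBERT (xxlarge-v2)
)


def build_command(input_path, out_dir, model, perturb=False):
    """Build the argument list for main.py (hypothetical helper)."""
    if model not in MODEL_CHOICES:
        raise ValueError(f"unknown model {model!r}; expected one of {MODEL_CHOICES}")
    cmd = [
        "python", "main.py",
        "--input", input_path,
        "--out_dir", out_dir,
        "--model", model,
    ]
    if perturb:
        # --perturb removes the final character of each sentence.
        cmd.append("--perturb")
    return cmd


# Example paths are placeholders, not files shipped with the repository.
cmd = build_command("data/de.csv", "out/de", "bert-multi")
print(shlex.join(cmd) + " >> results.txt")
```

The returned list could also be passed to subprocess.run with stdout set to a file opened in append mode, which mirrors the >> redirection without going through a shell.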

License

The dataset associated with this paper is based on the CrowS-Pairs dataset, which has been licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Thus, this dataset falls under the same license. For more information on the construction of the original CrowS-Pairs dataset, please refer to their paper.

Owner

Name: Victor Steinborn
Login: VSteinborn
Kind: user

Website: https://vsteinborn.github.io/
Repositories: 2
Profile: https://github.com/VSteinborn

Citation (CITATION.cff)

# This CITATION.cff file was generated with Zotero.

Committers

Last synced: about 1 year ago

All Time

Total Commits: 22
Total Committers: 2
Avg Commits per committer: 11.0
Development Distribution Score (DDS): 0.5

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
VSteinborn	v**n@g**m	11
dependabot[bot]	4****]	11

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 17
Average time to close issues: N/A
Average time to close pull requests: 9 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.06
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 17

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 3

Top Authors

Issue Authors

Pull Request Authors

dependabot[bot] (24)

Top Labels

Issue Labels

Pull Request Labels

dependencies (24)

Dependencies

requirements.txt pypi

PyYAML ==6.0
certifi ==2023.7.22
charset-normalizer ==2.0.10
click ==8.0.3
filelock ==3.4.2
huggingface-hub ==0.2.1
idna ==3.3
joblib ==1.2.0
numpy ==1.22.0
packaging ==21.3
pandas ==1.3.5
pyparsing ==3.0.6
python-dateutil ==2.8.2
pytz ==2021.3
regex ==2021.11.10
requests ==2.31.0
sacremoses ==0.0.47
scipy ==1.10.0
six ==1.16.0
tokenizers ==0.10.3
torch ==1.13.1
tqdm ==4.62.3
transformers ==4.30.0
typing-extensions ==4.0.1
urllib3 ==1.26.18
