soc_bias

Reproduction for NAACL paper on Socially Aware Bias Measurements for Hindi

https://github.com/iamshnoo/soc_bias

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.2%) to scientific vocabulary

Keywords

bias nlp pytorch

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: iamshnoo
License: gpl-3.0
Language: Python
Default Branch: main
Homepage: https://paperswithcode.com/paper/socially-aware-bias-measurements-for-hindi
Size: 367 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Topics

bias nlp pytorch

Created about 3 years ago · Last pushed about 3 years ago

Socially Aware Bias Measurements for Hindi Language Representations

This repository contains the code for the NAACL 2022 paper:

Socially Aware Bias Measurements for Hindi Language Representations [Link to Paper]

Reproduction steps

```bash
# 0. Clone the repository (you should have git installed)
git clone "https://github.com/iamshnoo/soc_bias"

# 1. Create a virtual environment (any Python version >= 3.6 should work)
cd soc_bias
python3 -m venv socialbias
source socialbias/bin/activate
pip install numpy scipy simple_elmo tensorflow tqdm
pip install -e .
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
cd ..

# 2. Download word embeddings (provided in Release v1.0.1/Assets)
cd src
python downloadelmo.py   # embeddings stored in src/elmomodels
python downloadglove.py  # embeddings stored in src/glovemodels
cd ..

# 3. Run the experiments (ELMo takes 9 hours on all tests, GloVe is very fast)
cd src
python seattest.py  # use the --help flag to see the options
python weattest.py  # use the --help flag to see the options
cd ..

# 4. Dataset (provided)
# data/seat contains SEAT data
# data/weat contains WEAT data
```

For the reproduction of results in Hindi, follow the instructions mentioned in the code block above.

Results

Table 1: (table image)

Blank cells, highlighted in blue, mark results that cannot be reproduced: no English word or sentence lists are directly available for these tests, so there is nothing to translate.

Table 2: (table image)

In both tables, yellow marks a significant difference between the reproduced results and the results reported in the paper.

Note

Dataset (provided)

| Data Type | Folder Path | Description |
|---|---|---|
| SEAT Data | data/seat | Contains SEAT data; subfolders for each language, including hi for Hindi |
| WEAT Data | data/weat | Contains WEAT data; subfolders for each language, including hi for Hindi |
| Hindi Translated Data | hi/trans | Use translated data (located within data/seat/hi and data/weat/hi) |
| Hindi Language Specific Data | hi/lang_spec | Use language-specific data (located within data/seat/hi and data/weat/hi) as mentioned in the paper |

data/seat/hi also contains a file, templates.jsonl, which holds the templates used to generate the SEAT sentences from the WEAT word lists. Generation is done by src/generate_seat_data.py, run with the command python generate_seat_data.py. Use only lang_spec data for this process. Translated SEAT data is obtained by translating the corresponding English SEAT sentences directly with Google Translate.
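The template-filling idea can be sketched as follows. This is only an illustration, not the repository's generation script: the JSONL key `template`, the `{word}` placeholder, the helper name `fill_templates`, and the example sentences are all assumptions, and the real templates.jsonl may be laid out differently.

```python
import json


def fill_templates(templates, words):
    """Return one sentence per (template, word) pair."""
    return [t.format(word=w) for t in templates for w in words]


# Hypothetical JSONL content, one JSON object per line. The key name
# "template" and the "{word}" placeholder are assumptions for this sketch.
jsonl_text = '{"template": "यह {word} है।"}\n{"template": "वह {word} है।"}'
templates = [json.loads(line)["template"] for line in jsonl_text.splitlines()]

# A tiny made-up word list standing in for a WEAT word list.
sentences = fill_templates(templates, ["डॉक्टर", "शिक्षक"])
for sentence in sentences:
    print(sentence)
```

Each word is slotted into every template, so a list of n words and m templates yields n × m sentences.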

So, we have the following data folders for Hindi, for example:

| Data Type | Folder Path | Description |
|---|---|---|
| WEAT Hindi Translated data | data/weat/hi/trans | Translate data/weat/en files using Google Translate |
| WEAT Hindi Language Specific | data/weat/hi/lang_spec | Use manually created word lists defined in the paper appendix |
| SEAT Hindi Translated data | data/seat/hi/trans | Translate data/seat/en files using Google Translate |
| SEAT Hindi Language Specific | data/seat/hi/lang_spec | Use the templates.jsonl file as input to the generate_seat_data.py file to generate SEAT sentences |

Results (provided)

| Results Type | Folder Path | Description |
|---|---|---|
| SEAT Hindi Language Specific | results/seat/hi/lang_spec | Contains results from GloVe and ELMo |
| SEAT Hindi Translated | results/seat/hi/trans | Contains results from GloVe |
| WEAT Hindi Language Specific | results/weat/hi/lang_spec | Contains results from GloVe |
| WEAT Hindi Translated | results/weat/hi/trans | Contains results from GloVe |

These four sets of results are sufficient to reproduce Tables 1 and 2 of the paper.

In the JSON files that we have for results, here is what each of the numbers represents:

| ID | Description |
|---|---|
| 7 | maths, arts vs male, female |
| 8 | science, arts vs male, female |
| 11 | adjectives vs male, female |
| 12 | gendered verbs vs male, female |
| 13 | gendered adjectives vs male, female |
| 14 | gendered entities vs male, female |
| 15 | gendered titles vs male, female |
| 16 | occupations vs caste |
| 17 | adjectives vs caste |
| 18 | adjectives vs religion terms |
| 19 | adjectives vs lastnames |
| 20 | religious entities vs religion |
| 21 | adjectives vs urban, rural occupations |

Translated data are available only for IDs 7 and 8, because English SEAT data exist only for these two IDs. Language-specific data are available for all IDs.

Each entry in Tables 1 and 2 has the form effect size (p-value), one for each of the IDs listed above.
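For readers unfamiliar with the two numbers, here is a minimal sketch of the usual WEAT effect size and a permutation p-value, written with NumPy. It is not the repository's implementation: the function names, the sampling-based permutation test, and the toy 2-D vectors are assumptions made for illustration, and the repository's scripts may compute these quantities differently.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus that to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])


def effect_size(X, Y, A, B):
    """Difference of the mean associations of target sets X and Y, divided by
    the sample standard deviation of the associations over all targets."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)


def p_value(X, Y, A, B, n_samples=1000, seed=0):
    """One-sided permutation p-value: the share of random equal-size re-splits
    of the targets whose test statistic exceeds the observed one."""
    rng = np.random.default_rng(seed)
    s = np.array([association(w, A, B) for w in list(X) + list(Y)])
    n = len(X)
    observed = s[:n].sum() - s[n:].sum()
    hits = 0
    for _ in range(n_samples):
        perm = rng.permutation(len(s))
        hits += int(s[perm[:n]].sum() - s[perm[n:]].sum() > observed)
    return hits / n_samples


# Toy 2-D "embeddings" (made up): X leans towards A, Y leans towards B.
A = np.array([[1.0, 0.0], [0.9, 0.1]])
B = np.array([[0.0, 1.0], [0.1, 0.9]])
X = np.array([[1.0, 0.1], [0.8, 0.2]])
Y = np.array([[0.1, 1.0], [0.2, 0.8]])

d = effect_size(X, Y, A, B)
p = p_value(X, Y, A, B)
print(f"{d:.2f} ({p:.3f})")  # printed in the "effect size (p-value)" form
```

A positive effect size means the first target set is more strongly associated with the first attribute set than the second target set is; swapping the target sets flips the sign.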

Owner

Name: Anjishnu
Login: iamshnoo
Kind: user
Location: Virginia, USA

Website: https://iamshnoo.github.io/
Twitter: iamshnoo
Repositories: 3
Profile: https://github.com/iamshnoo

CS PhD student. Generative models are cool 🥺

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Mukherjee"
  given-names: "Anjishnu"
  orcid: "https://orcid.org/0000-0003-4012-8466"
- family-names: "Raj"
  given-names: "Chahat"
  orcid: "https://orcid.org/0000-0003-0083-6812"
title: "Reproduction Code for the paper 'Socially Aware Bias Measurements for Hindi Language Representations'"
version: 1.0.0
#doi: 10.5281/zenodo.1234
date-released: 2023-04-05
url: "https://github.com/iamshnoo/soc_bias"

Committers

Last synced: about 1 year ago

All Time

Total Commits: 9
Total Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
iamshnoo	m**u@g**m	9

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
