soc_bias

Reproduction for NAACL paper on Socially Aware Bias Measurements for Hindi

https://github.com/iamshnoo/soc_bias

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

bias nlp pytorch

Repository


Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
bias nlp pytorch
Created about 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

Socially Aware Bias Measurements for Hindi Language Representations

This repository contains the code for the NAACL 2022 paper:

Socially Aware Bias Measurements for Hindi Language Representations [Link to Paper]

Reproduction steps

```bash
# 0. Clone the repository (you should have git installed)
git clone "https://github.com/iamshnoo/soc_bias"

# 1. Create a virtual environment (any Python version >= 3.6 should work)
cd soc_bias  # directory created by the clone above
python3 -m venv socialbias
source socialbias/bin/activate
pip install numpy scipy simpleelmo tensorflow tqdm
pip install -e .
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
cd ..

# 2. Download word embeddings (provided in Release v1.0.1/Assets)
cd src
python downloadelmo.py   # embeddings stored in src/elmomodels
python downloadglove.py  # embeddings stored in src/glovemodels
cd ..

# 3. Run the experiments (ELMo takes 9 hours on all tests, GloVe is very fast)
cd src
python seattest.py  # use the --help flag to see the options
python weattest.py  # use the --help flag to see the options
cd ..

# 4. Dataset (provided)
# data/seat contains SEAT data
# data/weat contains WEAT data
```

To reproduce the Hindi results, follow the steps in the code block above.

Results

Table 1:

Table 1

Blank cells mark results that cannot be reproduced: no English word or sentence lists are directly available for these tests, so there is nothing to translate. These cells are highlighted in blue.

Table 2:

Table 2

In both tables, yellow marks a significant difference between the reproduced results and the results reported in the paper.

Note

Dataset (provided)

| Data Type | Folder Path | Description |
|-----------|-------------|-------------|
| SEAT Data | data/seat | Contains SEAT data; subfolders for each language, including hi for Hindi |
| WEAT Data | data/weat | Contains WEAT data; subfolders for each language, including hi for Hindi |
| Hindi Translated Data | hi/trans | Use translated data (located within data/seat/hi and data/weat/hi) |
| Hindi Language Specific Data | hi/lang_spec | Use language-specific data (located within data/seat/hi and data/weat/hi) as mentioned in the paper |

data/seat/hi also contains a file called "templates.jsonl", which holds the templates used to generate the SEAT sentences from the WEAT word lists. Generation is done by "src/generate_seat_data.py", run with the command python generate_seat_data.py. Use only lang_spec data for this process. Translated SEAT data is obtained by directly translating the corresponding English SEAT sentences with Google Translate.
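The template-filling idea can be sketched as follows. This is only an illustration, not the repository's generation script: the `template` key, the `{word}` placeholder, and the English example templates are all assumptions, since the actual schema of templates.jsonl is not described here.

```python
import json


def load_templates(jsonl_text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def fill_templates(templates, words):
    """Insert every word into every template to build SEAT-style sentences."""
    return [t["template"].format(word=w) for w in words for t in templates]


# Hypothetical file content and word list, for illustration only.
example_jsonl = '{"template": "This is {word}."}\n{"template": "{word} is here."}\n'
sentences = fill_templates(load_templates(example_jsonl), ["math", "poetry"])
print(sentences)
```

Each word is paired with each template, so a list of n words and m templates yields n × m sentences.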

So, we have the following data folders for Hindi, for example:

| Data Type | Folder Path | Description |
|-----------|-------------|-------------|
| WEAT Hindi Translated data | data/weat/hi/trans | Translate data/weat/en files using Google Translate |
| WEAT Hindi Language Specific | data/weat/hi/lang_spec | Use manually created word lists defined in the paper appendix |
| SEAT Hindi Translated data | data/seat/hi/trans | Translate data/seat/en files using Google Translate |
| SEAT Hindi Language Specific | data/seat/hi/lang_spec | Use the templates.jsonl file as input to the generate_seat_data.py file to generate SEAT sentences |

Results (provided)

| Results Type | Folder Path | Description |
|--------------|-------------|-------------|
| SEAT Hindi Language Specific | results/seat/hi/lang_spec | Contains results from GloVe and ELMo |
| SEAT Hindi Translated | results/seat/hi/trans | Contains results from GloVe |
| WEAT Hindi Language Specific | results/weat/hi/lang_spec | Contains results from GloVe |
| WEAT Hindi Translated | results/weat/hi/trans | Contains results from GloVe |

These four result files are sufficient to reproduce the results in Tables 1 and 2 of the paper.
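The folder layout listed above follows one pattern, `<kind>/<test>/<lang>/<variant>`, which a small helper can make explicit. The function name `repo_dir` and its parameters are hypothetical and are not part of the repository.

```python
from pathlib import Path

TESTS = ("seat", "weat")
VARIANTS = ("trans", "lang_spec")


def repo_dir(kind, test, variant, lang="hi"):
    """Build a path such as data/weat/hi/trans or results/seat/hi/lang_spec."""
    if kind not in ("data", "results"):
        raise ValueError(f"unknown kind: {kind!r}")
    if test not in TESTS:
        raise ValueError(f"unknown test: {test!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return Path(kind) / test / lang / variant


# List the four Hindi result folders.
for test in TESTS:
    for variant in VARIANTS:
        print(repo_dir("results", test, variant).as_posix())
```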

In the results JSON files, each numeric ID corresponds to the following test:

| ID | Description |
|----|-------------|
| 7 | maths, arts vs male, female |
| 8 | science, arts vs male, female |
| 11 | adjectives vs male, female |
| 12 | gendered verbs vs male, female |
| 13 | gendered adjectives vs male, female |
| 14 | gendered entities vs male, female |
| 15 | gendered titles vs male, female |
| 16 | occupations vs caste |
| 17 | adjectives vs caste |
| 18 | adjectives vs religion terms |
| 19 | adjectives vs lastnames |
| 20 | religious entities vs religion |
| 21 | adjectives vs urban, rural occupations |

Translated data are available only for IDs 7 and 8, because English SEAT data exist only for these two IDs. Language-specific data are available for all IDs.
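For scripting against the results, the ID list and the availability rule can be captured in a lookup table. The names `TEST_DESCRIPTIONS`, `TRANSLATED_IDS`, and `available_ids` are hypothetical; only the IDs, descriptions, and the rule itself come from this README.

```python
# Test IDs and descriptions as listed in this README.
TEST_DESCRIPTIONS = {
    7: "maths, arts vs male, female",
    8: "science, arts vs male, female",
    11: "adjectives vs male, female",
    12: "gendered verbs vs male, female",
    13: "gendered adjectives vs male, female",
    14: "gendered entities vs male, female",
    15: "gendered titles vs male, female",
    16: "occupations vs caste",
    17: "adjectives vs caste",
    18: "adjectives vs religion terms",
    19: "adjectives vs lastnames",
    20: "religious entities vs religion",
    21: "adjectives vs urban, rural occupations",
}

# Only these IDs have English source data that could be translated.
TRANSLATED_IDS = {7, 8}


def available_ids(variant):
    """Return the sorted test IDs available for 'trans' or 'lang_spec' data."""
    if variant == "trans":
        return sorted(TRANSLATED_IDS)
    if variant == "lang_spec":
        return sorted(TEST_DESCRIPTIONS)
    raise ValueError(f"unknown variant: {variant!r}")


for test_id in available_ids("trans"):
    print(test_id, TEST_DESCRIPTIONS[test_id])
```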

The results in Tables 1 and 2 are of the form: effectsize (pvalue) corresponding to each of the ids given here.
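For readers unfamiliar with these two numbers, the sketch below computes them using the standard WEAT definition: a difference in mean association scaled by the pooled standard deviation, with a one-sided permutation test. It is an assumption that the repository's scripts follow this definition exactly. The function names, the sampled (rather than exhaustive) permutation test, and the toy 2-D vectors are all illustrative.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])


def effect_size(X, Y, A, B):
    """Difference of mean associations of targets X and Y, scaled by the pooled std."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)


def p_value(X, Y, A, B, n_samples=1000, seed=0):
    """One-sided permutation test over random equal-size splits of X and Y."""
    rng = np.random.default_rng(seed)
    scores = np.array([association(w, A, B) for w in list(X) + list(Y)])
    n = len(X)
    observed = scores[:n].sum() - scores[n:].sum()
    count = 0
    for _ in range(n_samples):
        perm = rng.permutation(scores)
        if perm[:n].sum() - perm[n:].sum() > observed:
            count += 1
    return count / n_samples


# Toy 2-D vectors (illustrative only): X leans towards A, Y leans towards B.
A = [np.array([1.0, 0.0])]
B = [np.array([0.0, 1.0])]
X = [np.array([1.0, 0.1]), np.array([1.0, 0.2])]
Y = [np.array([0.1, 1.0]), np.array([0.2, 1.0])]

d = effect_size(X, Y, A, B)
p = p_value(X, Y, A, B)
print(f"{d:.2f} ({p:.2f})")  # same "effect size (p-value)" layout as the tables
```

A larger positive effect size means the X targets are more strongly associated with attribute set A than the Y targets are; a small p-value means few random re-partitions of the targets produce a stronger association than the observed one.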

Owner

  • Name: Anjishnu
  • Login: iamshnoo
  • Kind: user
  • Location: Virginia, USA

CS PhD student. Generative models are cool 🥺

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Mukherjee"
  given-names: "Anjishnu"
  orcid: "https://orcid.org/0000-0003-4012-8466"
- family-names: "Raj"
  given-names: "Chahat"
  orcid: "https://orcid.org/0000-0003-0083-6812"
title: "Reproduction Code for the paper 'Socially Aware Bias Measurements for Hindi Language Representations'"
version: 1.0.0
#doi: 10.5281/zenodo.1234
date-released: 2023-04-05
url: "https://github.com/iamshnoo/soc_bias"


Committers

Last synced: 10 months ago

All Time
  • Total Commits: 9
  • Total Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
iamshnoo m****u@g****m 9

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0