genderednews-tools

https://github.com/ange-richard/genderednews-tools

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ange-richard
License: agpl-3.0
Language: Python
Default Branch: main
Size: 292 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

GenderedNews Tools

This repository contains the code for the tools that compute the two metrics used in the GenderedNews project, which looks at gender inequality in French news texts. Please refer to the latest version of this article for details on how these metrics are computed. This repository is divided into two parts, one for each tool:

- Mentions masculinity computing
- Citation masculinity computing

Mentions masculinity

The mentions masculinity tool computes what we call the masculinity rate of the first names mentioned in one or more news texts. The mentions_masc_computing directory contains the code that identifies first names in a text (using Named Entity Recognition) and assigns each of them a masculinity rate (derived from the INSEE first names database).
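
As a rough illustration of the idea, the sketch below assumes that a first name's masculinity rate is the share of male births among all births recorded for that name, and that a text's rate is the mean over the first names it mentions. The `NAME_COUNTS` table, its figures, and both function names are hypothetical; the actual computation is described in the article and implemented in mentions_masc_computing.

```python
from statistics import mean
from typing import List, Optional

# Hypothetical (male_births, female_births) counts per first name.
# Illustrative values only, not taken from the INSEE database.
NAME_COUNTS = {
    "jean": (900, 100),
    "marie": (50, 950),
    "camille": (300, 700),
}


def name_masculinity(first_name: str) -> Optional[float]:
    """Share of male births for a first name, or None if the name is unknown."""
    counts = NAME_COUNTS.get(first_name.lower())
    if counts is None:
        return None
    male, female = counts
    total = male + female
    return male / total if total else None


def text_masculinity(first_names: List[str]) -> Optional[float]:
    """Mean masculinity rate over the first names found in a text.

    Names missing from the table are ignored (an assumption of this sketch).
    """
    rates = [r for r in map(name_masculinity, first_names) if r is not None]
    return mean(rates) if rates else None
```

In this sketch the list of first names would come from a Named Entity Recognition step, which is not shown here.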

Citations masculinity

The citation masculinity tool computes the proportion of men among the people quoted in one or more news texts. The citations_masc_computing directory contains the code that extracts quotes and then genderizes the extracted speakers. It is based on an adaptation of the REBEL framework to French quotation extraction (see our article published at LREC-COLING for more details).
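
Once speakers have been genderized, the metric reduces to a simple proportion. A minimal sketch, assuming each extracted quote carries a speaker gender label among `"male"`, `"female"` and `"unknown"` (hypothetical labels), and assuming that unknown speakers are left out of the denominator, which the description above does not specify:

```python
from typing import Iterable, Optional


def citation_masculinity(speaker_genders: Iterable[str]) -> Optional[float]:
    """Proportion of quotes attributed to men among quotes with a known gender.

    Returns None when no quote has a known speaker gender.
    """
    men = 0
    women = 0
    for gender in speaker_genders:
        if gender == "male":
            men += 1
        elif gender == "female":
            women += 1
        # Any other label (e.g. "unknown") is skipped in this sketch.
    total = men + women
    return men / total if total else None
```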

Setting up environments

For simplicity, we advise creating a separate environment for each tool. See each tool's README.md and requirements.txt for instructions on how to set up and run the code.

License

This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.

Acknowledgments

The first versions of these tools were implemented by Gilles Bastin (for mentions masculinity) and by Laura Alonzo (for citation masculinity).

Owner

Login: ange-richard
Kind: user

Repositories: 1
Profile: https://github.com/ange-richard

Citation (citations_masc_computing/README.md)

# Computing Quotes Masculinity

## Setting up your environment

This repository is a simplified and minimal version of the [REBEL repository](https://github.com/Babelscape/rebel/tree/main), which is the framework we adapted to train our own quote extraction model for French. Please refer to [our article](https://aclanthology.org/2024.lrec-main.654/) published at LREC-COLING 2024 for details on the corpus used.

Follow each step of the setup/running of the code with care.
All commands are run from the `citations_masc_computing/` directory.

### Create conda environment

Be aware that the REBEL framework needs a specific `python` version (<3.11) and a specific `pip` version (<24). Using the latest version of either will result in conflicts with the required `pytorch-lightning` version. The following command uses the versions we run the code with.

```shell
conda create -n "gn_citmasc" python=3.9.12 pip=20.3.1
conda activate gn_citmasc
```
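
To confirm that an environment meets these constraints before installing anything, a small check along the following lines can help. This is only a sketch: the function name `versions_supported` is hypothetical and not part of the repository, and it checks nothing beyond the two bounds stated above.

```python
def versions_supported(python_version: str, pip_version: str) -> bool:
    """Return True if python is below 3.11 and pip is below 24."""
    python_major_minor = tuple(int(part) for part in python_version.split(".")[:2])
    pip_major = int(pip_version.split(".")[0])
    return python_major_minor < (3, 11) and pip_major < 24


# Example use inside the activated environment:
# import platform
# from importlib.metadata import version
# print(versions_supported(platform.python_version(), version("pip")))
```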

### Install required libraries

```shell
pip install -r requirements.txt
```
Double-check that the `pytorch-lightning` installation has not overridden other installed libraries such as `torch`. If necessary, reinstall the right versions individually, for example:
```shell
pip install torch==1.11.0
```

### Prepare the repository

1. Download our citation extraction model from [Zenodo](https://zenodo.org/records/15228575) and place it in `src/checkpoint/`
2. Place your data in `data/`
3. In `conf/root_infer.yaml`, set the inference file, output file and checkpoint path variables to the corresponding absolute paths

## Usage

The quotation extraction and genderization pipeline is run in the steps detailed below.

### Preprocess your data

1. Modify the preprocessing script for your own data

The `preprocessing/data_to_rebel.py` script is a template for transforming your data into the REBEL input format. Modify it to read your input data at the dedicated spot in the script. The minimal required fields are `id` and `text`.
2. Format your data into the desired input for the model
```shell
cd preprocessing/
python data_to_rebel.py ../data/[yourdata.jsonl] ../data/[yourdata_REBELformat].jsonl
```
3. Split the data entries so that each one fits within the 512-token input limit
```shell
cd ../src/utils/
python dataset_utils.py --input-dir ../data/[yourdata_REBELformat].jsonl --output-dir ../data/ [--output-file yourdata_REBELformat_512cuts.jsonl]
```
This script splits entries longer than 512 tokens into several entries by cutting at the newline or punctuation mark closest to the 512-token mark. The default output file name is `[yourdata_REBELformat]_extended_dataset.jsonl`.
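
The splitting logic can be pictured with the following sketch. It is not the code of `dataset_utils.py`: it counts whitespace-separated tokens instead of model tokens, always cuts at the last boundary at or before the limit so that every chunk fits, and uses a hypothetical function name and punctuation set.

```python
import re
from typing import List

# Punctuation marks treated as acceptable cut points (an assumption).
SENTENCE_END = (".", "!", "?", ";", ":")


def split_text(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace tokens."""
    # Each token keeps its trailing whitespace, so joining chunks restores the text.
    tokens = re.findall(r"\S+\s*", text)
    chunks = []
    while len(tokens) > max_tokens:
        cut = max_tokens  # fallback: hard cut when no boundary is found
        # Walk back from the limit to the nearest newline or punctuation mark.
        for i in range(max_tokens - 1, -1, -1):
            token = tokens[i]
            if "\n" in token or token.rstrip().endswith(SENTENCE_END):
                cut = i + 1
                break
        chunks.append("".join(tokens[:cut]))
        tokens = tokens[cut:]
    if tokens:
        chunks.append("".join(tokens))
    return chunks
```

For example, `split_text("a b c. d e f g", max_tokens=4)` cuts after `c.` rather than after `d`, because the full stop is the last boundary within the limit.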

### Run prediction

```shell
python ../predict.py
```

By default, prediction runs on `cpu`. To run on a GPU instead, pass `gpu` to the `accelerator` argument of the `Trainer`.
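
As a sketch of what that change looks like, the helper below builds the device-related keyword arguments for a `pytorch-lightning` `Trainer`. The helper name and the choice of a single device are assumptions; how `predict.py` actually constructs its `Trainer` is not shown in this README.

```python
def trainer_device_kwargs(use_gpu: bool) -> dict:
    """Keyword arguments selecting the device for a pytorch-lightning Trainer."""
    if use_gpu:
        # One GPU is assumed here; adjust "devices" to your hardware.
        return {"accelerator": "gpu", "devices": 1}
    return {"accelerator": "cpu"}


# Example use (requires pytorch-lightning to be installed):
# import pytorch_lightning as pl
# trainer = pl.Trainer(**trainer_device_kwargs(use_gpu=True))
```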

### Postprocess your data

The postprocessing step lets you:
- Re-aggregate the entries that were split into 512-token chunks at the preprocessing step
- Match the predicted output back to the input ids
- Optionally, genderize the quote speakers

```shell
cd ../postprocessing/
python genderize_and_aggregate.py ../data/[prediction_output_filename].jsonl ../data/[yourdata_REBELformat]_extended_dataset.jsonl [--add-gender]
```
Please note that the two positional arguments are the prediction output file and the file used as input for prediction (your data after splitting at 512 tokens).
This will output two files:
- **[prediction_output_filename]_agg.jsonl**: Each entry is an input text. Ids are added back to the re-aggregated entries, and gender is added as a key to each quote if the `--add-gender` option was used
- **[prediction_output_filename]_quote-list.csv**: Each line is a quote (matched with its text id)
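
The re-aggregation idea can be sketched as follows. This is an illustration only: the `chunk_index` and `quotes` field names, the record layout and the function name are hypothetical, and `genderize_and_aggregate.py` may organize its data differently.

```python
from collections import defaultdict
from typing import Dict, List


def reaggregate(chunks: List[dict]) -> List[dict]:
    """Merge chunk-level records that share the same original text id."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["id"]].append(chunk)

    merged = []
    for text_id, parts in grouped.items():
        # Restore the original order of the chunks before concatenating.
        parts.sort(key=lambda part: part["chunk_index"])
        merged.append({
            "id": text_id,
            "text": "".join(part["text"] for part in parts),
            "quotes": [quote for part in parts for quote in part["quotes"]],
        })
    return merged
```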

GitHub Events

Total

Push event: 12
Pull request event: 1
Fork event: 1
Create event: 2

Last Year

Push event: 12
Pull request event: 1
Fork event: 1
Create event: 2

Dependencies

citations_masc_computing/requirements.txt pypi

Genderize ==0.3.1
Requests ==2.32.3
auto_mix_prep ==0.2.0
datasets ==2.14.5
gender_guesser ==0.4.0
hydra-core ==1.1.2
nltk ==3.7
numpy ==1.24.4
omegaconf ==2.1.2
pandas ==2.0.3
pytorch-lightning ==1.6.1
spacy ==2.3.5
torch ==1.11.0
transformers ==4.18.0

mentions_masc_computing/requirements.txt pypi

pandas ==2.0.3
spacy ==2.3.5
