Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.8%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: ange-richard
  • License: agpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 292 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

GenderedNews Tools

This repository contains the code for the tools that compute two metrics used in the GenderedNews project, which looks at gender inequality in French newstexts. Please refer to the latest version of this article for details on how these metrics are computed. This repository is divided into two parts, one for each tool:
  • Mentions masculinity computing
  • Citation masculinity computing

Mentions masculinity

The mentions masculinity tool computes what we call the masculinity rate of the first names mentioned in one or more newstexts. The mentions_masc_computing directory contains the code that identifies first names in a text (based on Named Entity Recognition) and assigns each a masculinity rate (derived from the INSEE first names database).
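
As a rough illustration of the idea, the sketch below assumes that the masculinity rate of a first name is the share of male births recorded for that name, and that a text's rate is the mean over its recognised first names. Both are assumptions made for this example; the exact definition is given in the article. The table `TOY_COUNTS` and the function names are hypothetical, and the counts are made up, not taken from INSEE.

```python
from statistics import mean

# Hypothetical (male, female) birth counts per first name.
# These numbers are invented for illustration; they are NOT INSEE figures.
TOY_COUNTS = {
    "camille": (20, 80),
    "pierre": (100, 0),
    "marie": (5, 95),
}


def name_masculinity(name, counts=TOY_COUNTS):
    """Share of male births for a first name, or None if the name is unknown."""
    male, female = counts.get(name.lower(), (0, 0))
    total = male + female
    return male / total if total else None


def mentions_masculinity(first_names, counts=TOY_COUNTS):
    """Mean masculinity rate over the mentioned first names found in the table."""
    rates = [name_masculinity(n, counts) for n in first_names]
    known = [r for r in rates if r is not None]  # assumption: skip unknown names
    return mean(known) if known else None
```

In the actual tool the list of first names would come from the Named Entity Recognition step rather than being passed in by hand.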

Citations masculinity

The citation masculinity tool computes the proportion of men quoted in one or more newstexts. The citations_masc_computing directory contains the code that extracts quotes and then genderizes the extracted speakers. It is based on an adaptation of the REBEL framework to French quotation extraction (see our article published at LREC-COLING for more details).
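
A minimal sketch of the proportion itself, assuming each extracted quote already carries a speaker gender label. The labels `"male"` and `"female"` and the choice to leave speakers of unknown gender out of the denominator are assumptions for this example, not something the repository confirms.

```python
def citation_masculinity(speaker_genders):
    """Proportion of quotes attributed to men among quotes with a known gender.

    `speaker_genders` holds one label per extracted quote. Labels other than
    "male" or "female" are ignored (an assumption of this sketch).
    """
    known = [g for g in speaker_genders if g in ("male", "female")]
    if not known:
        return None  # no genderized speaker, so the proportion is undefined
    return known.count("male") / len(known)
```

For example, three quotes labelled `"male"`, `"female"`, `"male"` plus one of unknown gender give a rate of two thirds under these assumptions.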

Setting up environments

For simplicity, we advise creating a separate environment for each tool. See each tool's README.md and requirements.txt for instructions on how to set up and run the code.

License

This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.

Acknowledgments

First versions of these tools were implemented by Gilles Bastin (for mentions masculinity) and by Laura Alonzo (for citation masculinity).

Owner

  • Login: ange-richard
  • Kind: user

Citation (citations_masc_computing/README.md)

# Computing Quotes Masculinity

## Setting up your environment

This repository is a simplified and minimal version of the [REBEL repository](https://github.com/Babelscape/rebel/tree/main), which is the framework we adapted to train our own quote extraction model for French. Please refer to [our article](https://aclanthology.org/2024.lrec-main.654/) published at LREC-COLING 2024 for details on the corpus used.

Follow each step of the setup/running of the code with care.
All commands are run from the `citations_masc_computing/` directory.
### Create conda environment

Be aware that the REBEL framework needs a specific `python` version (<3.11) and a specific `pip` version (<24). Using the latest versions of either will result in conflicts with the required `pytorch-lightning` version. The following commands use the versions we run the code with.

```shell
  conda create -n "gn_citmasc" python=3.9.12 pip=20.3.1
  conda activate gn_citmasc
```
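
Once the environment is active, the two bounds quoted above can be checked from Python. This is only a sketch: `environment_ok` and `current_environment_ok` are hypothetical helper names, not part of the repository.

```python
import sys
from importlib.metadata import version


def environment_ok(python_version, pip_version):
    """True when Python is below 3.11 and pip is below 24."""
    python_fits = tuple(python_version[:2]) < (3, 11)
    pip_fits = int(pip_version.split(".")[0]) < 24  # compare the major version
    return python_fits and pip_fits


def current_environment_ok():
    """Apply the check to the interpreter and pip of the active environment."""
    return environment_ok(sys.version_info, version("pip"))
```

Calling `current_environment_ok()` inside `gn_citmasc` should return `True` for the versions created by the command above (Python 3.9.12, pip 20.3.1).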

### Install required libraries

```shell
  pip install -r requirements.txt
```
Double-check that the `pytorch-lightning` installation has not overridden the version of any other installed library, such as `torch`. If necessary, reinstall the pinned versions individually, for example:
```shell
    pip install torch==1.11.0
```
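
One way to do this double-check is to compare the pins in `requirements.txt` with what is actually installed. The sketch below uses the standard `importlib.metadata` module. The helper names are hypothetical, and it assumes every relevant line of the file has the form `name==version`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(lines):
    """Map package name -> pinned version for lines of the form `name==x.y.z`."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins, installed_version=version):
    """Return {name: (pinned, installed)} for packages whose version differs."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu113" when comparing.
        base = installed.split("+")[0] if installed else None
        if base != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Running `find_mismatches(parse_pins(open("requirements.txt")))` from `citations_masc_computing/` returns an empty dictionary when nothing was overridden, and otherwise lists each package to reinstall together with its pinned and installed versions.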

### Prepare the repository

1. Download our citations extraction model on [Zenodo](https://zenodo.org/records/15228575) and place it in `src/checkpoint/`
2. Place your data in `data/`
3. Set the inference file, output file and checkpoint path variables in `conf/root_infer.yaml` to the corresponding absolute paths

## Usage

Below is the detailed usage of the pipeline of quotation extraction and genderization.

### Preprocess your data

1. Modify the preprocessing script for your own data

    The `preprocessing/data_to_rebel.py` script is a template that transforms your data into the REBEL input format. Modify it to read your input data at the dedicated spot in the script. The minimal required fields are `id` and `text`.
2. Format your data into the desired input for the model
```shell
    cd preprocessing/
    python data_to_rebel.py ../data/[yourdata.jsonl] ../data/[yourdata_REBELformat].jsonl
```
3. Split the data entries so that each one fits within the model's 512-token input limit
```shell
cd ../src/utils/
python dataset_utils.py --input-dir ../data/[yourdata_REBELformat].jsonl --output-dir ../data/ [--output-file yourdata_REBELformat_512cuts.jsonl]
```
This script splits entries longer than 512 tokens into several entries by cutting at the newline or punctuation mark closest to the 512-token mark. The default output file name is `[yourdata_REBELformat]_extended_dataset.jsonl`.
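
The cutting rule can be illustrated with a simplified sketch. It is not the code of `dataset_utils.py`: it counts whitespace-separated words instead of model tokens, looks for the last newline or punctuation mark at or before the limit, and falls back to a hard cut when there is none. The function name `split_entry` is hypothetical.

```python
import re

# A word counts as a cut point if it ends with one of these marks.
PUNCTUATION_END = re.compile(r"[.!?;:,]$")


def split_entry(text, max_tokens=512):
    """Split `text` into chunks of at most `max_tokens` whitespace tokens."""
    # Keep each word together with a flag telling whether a newline follows it.
    tokens = [(m.group(1), "\n" in m.group(2))
              for m in re.finditer(r"(\S+)(\s*)", text)]
    chunks, start = [], 0
    while len(tokens) - start > max_tokens:
        limit = start + max_tokens
        cut = limit  # hard cut if no boundary is found
        for end in range(limit, start, -1):
            word, newline_after = tokens[end - 1]
            if newline_after or PUNCTUATION_END.search(word):
                cut = end
                break
        # Words are re-joined with spaces, so original newlines are not kept.
        chunks.append(" ".join(word for word, _ in tokens[start:cut]))
        start = cut
    if start < len(tokens):
        chunks.append(" ".join(word for word, _ in tokens[start:]))
    return chunks
```

With `max_tokens=4`, the text `"one two three. four five six seven"` is cut after `three.` rather than after `four`, so no chunk ends mid-sentence when a boundary is available.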

### Run prediction

```shell
    python ../predict.py
```

By default, prediction runs on `cpu`. To run on `gpu` instead, pass it as the `accelerator` argument of the `Trainer`.
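
How `predict.py` builds its `Trainer` is not shown here, so the following is only a sketch of the arguments involved. `trainer_kwargs` is a hypothetical helper, and `devices=1` (a single device) is an assumption. `accelerator` and `devices` are keyword arguments of `pytorch_lightning.Trainer` in the pinned 1.6 release.

```python
def trainer_kwargs(use_gpu):
    """Keyword arguments selecting the hardware for a pytorch-lightning Trainer."""
    return {
        "accelerator": "gpu" if use_gpu else "cpu",
        "devices": 1,  # assumption: run on a single device
    }


# Intended use inside the prediction script (requires pytorch-lightning):
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**trainer_kwargs(use_gpu=True))
```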

### Post process your data

The postprocessing step lets you:
- Reaggregate the entries that were split into 512-token chunks at the preprocessing step
- Match back the predicted output to input ids
- Optional: Genderize the quote speakers

```shell
cd ../postprocessing/
python genderize_and_aggregate.py ../data/[prediction_output_filename].jsonl ../data/[yourdata_REBELformat]_extended_dataset.jsonl [--add-gender]
```
Please note that the two positional arguments are the predicted output filename and the file used as input for prediction (your data after the split on 512 tokens).
This will output two files:
- **[prediction_output_filename]_agg.jsonl**: Each entry is an input text. ids are added back to re-aggregated entries, and gender is added as a key to each quote if option add-gender was used
- **[prediction_output_filename]_quote-list.csv**: Each line is a quote (matched with text id)
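
The re-aggregation and the per-quote listing can be pictured with the sketch below. It is not the code of `genderize_and_aggregate.py`: the chunk fields `chunk_index` and `quotes` are hypothetical, and it assumes each chunk carries the `id` of the text it was cut from.

```python
from collections import defaultdict


def aggregate_chunks(chunks):
    """Merge chunk-level predictions back into one entry per original text id."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["id"]].append(chunk)
    entries = []
    for text_id, parts in grouped.items():
        parts.sort(key=lambda c: c["chunk_index"])  # restore reading order
        entries.append({
            "id": text_id,
            "text": " ".join(p["text"] for p in parts),
            "quotes": [q for p in parts for q in p["quotes"]],
        })
    return entries


def quote_rows(entries):
    """Flatten aggregated entries into one (text id, quote) row per quote."""
    return [(entry["id"], quote) for entry in entries for quote in entry["quotes"]]
```

The first function corresponds to the aggregated `.jsonl` view (one entry per input text) and the second to the quote list (one row per quote, matched with its text id).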


GitHub Events

Total
  • Push event: 12
  • Pull request event: 1
  • Fork event: 1
  • Create event: 2
Last Year
  • Push event: 12
  • Pull request event: 1
  • Fork event: 1
  • Create event: 2

Dependencies

citations_masc_computing/requirements.txt pypi
  • Genderize ==0.3.1
  • Requests ==2.32.3
  • auto_mix_prep ==0.2.0
  • datasets ==2.14.5
  • gender_guesser ==0.4.0
  • hydra-core ==1.1.2
  • nltk ==3.7
  • numpy ==1.24.4
  • omegaconf ==2.1.2
  • pandas ==2.0.3
  • pytorch-lightning ==1.6.1
  • spacy ==2.3.5
  • torch ==1.11.0
  • transformers ==4.18.0
mentions_masc_computing/requirements.txt pypi
  • pandas ==2.0.3
  • spacy ==2.3.5