https://github.com/crim-ca/wiki-bias


Repository

Basic Info
  • Host: GitHub
  • Owner: crim-ca
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 172 MB
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created almost 7 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

wiki-bias

This repository contains code for the paper Multilingual Sentence-Level Bias Detection in Wikipedia, as well as three datasets for the task of bias detection.

Prerequisites

  • Python 3.6 or later
  • All dependencies, installed with `pip install -r /path/to/requirements.txt`

Usage

1. URLs

Extract the URLs of all parts of the complete page edit history dump. Dumps of small Wikipedias (such as the Bulgarian one) come in a single file, while large ones (English, French, etc.) are split into multiple smaller files.

Select an existing date from the list at https://dumps.wikimedia.org/{lang}wiki/ and verify that a dump called "All pages with complete page edit history" is available for that date.

```bash
python url_extractor.py -o <output_file> -l <lang> -d <date>

# for example
python url_extractor.py -o urls.txt -l fr -d 20191001
```
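
The selection of history dump parts described above can be sketched as follows. This is not the code of `url_extractor.py`: the helper names `dump_index_url` and `history_parts` are hypothetical, the sample file names are made up, and the `pages-meta-history` naming pattern is an assumption about how Wikimedia names these files.

```python
import re

# Assumed naming pattern of "All pages with complete page edit history" files,
# e.g. frwiki-20191001-pages-meta-history1.xml-p1p1500.7z (illustrative name).
HISTORY_RE = re.compile(
    r"^(?P<lang>[a-z_]+)wiki-(?P<date>\d{8})-pages-meta-history\d*\.xml.*\.(7z|bz2)$"
)


def dump_index_url(lang, date):
    """Build the index URL of one dump run, following the README's URL scheme."""
    return f"https://dumps.wikimedia.org/{lang}wiki/{date}/"


def history_parts(file_names, lang, date):
    """Keep only the file names that belong to the full-history dump of lang/date."""
    parts = []
    for name in file_names:
        match = HISTORY_RE.match(name)
        if match and match.group("lang") == lang and match.group("date") == date:
            parts.append(name)
    return sorted(parts)
```

A small Wikipedia would yield a single matching file, while a large one would yield many parts.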

2. Download and extraction of revision pairs

Download all parts of the dump and extract the relevant revision pairs.

Attention! Downloading the highly compressed dump files and processing them on the fly takes time. Consider parallelizing this step if you need to process large Wikipedias that are split into multiple files.

```bash
python filter.py -i <input_file> -o <output_file> -l <lang>

# for example
python filter.py -i urls.txt -o revisions.txt -l fr
```
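
One possible way to parallelize this step is to split the URL list into chunks and launch one `filter.py` process per chunk. The sketch below assumes that the URL file holds one URL per line, which the README does not state. The helper names and the `urls_N.txt` / `revisions_N.txt` file names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, n_chunks):
    """Distribute items over at most n_chunks lists, dropping empty chunks."""
    chunks = [items[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def build_commands(n_parts, lang):
    """One filter.py invocation per chunk (chunk file names are hypothetical)."""
    return [
        ["python", "filter.py", "-i", f"urls_{i}.txt",
         "-o", f"revisions_{i}.txt", "-l", lang]
        for i in range(n_parts)
    ]


def run_parallel(url_file, lang, n_workers=4):
    """Write one URL file per chunk and run the filter.py processes concurrently."""
    with open(url_file, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    chunks = split_round_robin(urls, n_workers)
    if not chunks:
        return []
    for i, chunk in enumerate(chunks):
        with open(f"urls_{i}.txt", "w", encoding="utf-8") as handle:
            handle.write("\n".join(chunk) + "\n")
    commands = build_commands(len(chunks), lang)
    # Threads are enough here: each worker only waits for an external process.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

The per-chunk outputs would still have to be merged before step 3. How to merge them depends on the output format of `filter.py`.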

3. Preprocessing and diff check

This step covers preprocessing, segmentation, cleanup, the diff check, and filtering.

```bash
python diff.py -i <input_file> -o <output_file> -l <lang>

# for example
python diff.py -i revisions.txt -o diffs.pickle -l fr
```
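
To illustrate what a sentence-level diff check can look like, the sketch below compares two already segmented revisions with the standard library's `difflib`. It is not the implementation in `diff.py`. The function name `changed_sentence_pairs` is hypothetical, and keeping only one-for-one replacements is an assumption made for simplicity.

```python
import difflib


def changed_sentence_pairs(before, after):
    """Return (old, new) pairs for sentences replaced one-for-one between revisions.

    `before` and `after` are lists of sentences from two revisions of a page.
    """
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip pure insertions and deletions, and replacements of unequal length.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(before[i1:i2], after[j1:j2]))
    return pairs


old_revision = [
    "Paris is the capital of France.",
    "It is a truly magnificent city.",
    "It lies on the Seine.",
]
new_revision = [
    "Paris is the capital of France.",
    "It is a city.",
    "It lies on the Seine.",
]
pairs = changed_sentence_pairs(old_revision, new_revision)
```

Here `pairs` holds the single rewritten sentence together with its replacement, and the unchanged sentences are ignored.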

4. Sentence extraction

This step covers sentence extraction, duplicate cleanup, classes, and balancing.

```bash
python sents.py -i <input_file> -o <output_file>

# for example
python sents.py -i diffs.pickle -o sents.pickle
```
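
Duplicate cleanup and class balancing could be sketched as below. This is a generic illustration rather than the logic of `sents.py`: the `(sentence, label)` tuple layout, the function names, and balancing by downsampling to the smallest class are all assumptions.

```python
import random


def deduplicate(pairs):
    """Drop repeated (sentence, label) pairs while keeping first-seen order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def balance(pairs, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for sentence, label in pairs:
        by_label.setdefault(label, []).append((sentence, label))
    if not by_label:
        return []
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced
```

Downsampling discards data from the larger class. Other balancing strategies are possible, and the README does not say which one the repository uses.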

5. Labeling and splitting

This step assigns class labels and splits the dataset.

```bash
python dataset.py -i <input_file> -l <lang> (-p <value>)

# for example
python dataset.py -i sents.pickle -l fr -p label
```
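
A dataset split of this kind can be sketched as a seeded shuffle followed by slicing. The function name `split_dataset` and the 80/10/10 ratios are assumptions for illustration. They are not taken from `dataset.py` or from the paper.

```python
import random


def split_dataset(examples, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle examples reproducibly and cut them into train, dev and test parts."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_dataset(range(100))
```

Fixing the seed makes the split reproducible across runs, which matters when results are compared between experiments.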

Datasets

Balanced and split datasets in Bulgarian, French and English (extracted from the 20190401 dumps) can be found in the `/datasets/` directory.

Citation

@InProceedings{aleksandrovamultilingual,
  author = "Aleksandrova, Desislava and Lareau, Fran{\c{c}}ois and M{\'e}nard, Pierre-Andr{\'e}",
  title = "Multilingual Sentence-Level Bias Detection in Wikipedia",
  booktitle = "Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2019)",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  pages = "42--51",
  location = "Varna, Bulgaria"
}

Owner

  • Name: crim-ca
  • Login: crim-ca
  • Kind: organization
