https://github.com/crim-ca/wiki-bias


Repository

Basic Info

Host: GitHub
Owner: crim-ca
License: mit
Language: Python
Default Branch: master
Size: 172 MB

Statistics

Stars: 4
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 0

Created almost 7 years ago · Last pushed about 3 years ago


wiki-bias

This repository contains code for the paper Multilingual Sentence-Level Bias Detection in Wikipedia, as well as three datasets for the task of bias detection.

Prerequisites

Python 3.6 or later
All dependencies, installed with `pip install -r /path/to/requirements.txt`

Usage

1. URLs

Extract the URLs of all parts of the complete page edit history dump. Dumps of small Wikipedias (like the Bulgarian one) come in a single file, while the large ones (English, French, etc.) are split into multiple smaller files.

Select an existing date from the list at https://dumps.wikimedia.org/{lang}wiki/, then verify that a dump called "All pages with complete page edit history" is present for that date.

```bash
# General form (-o: output file, -l: language code, -d: dump date)
python url_extractor.py -o -l -d

# For example
python url_extractor.py -o urls.txt -l fr -d 20191001
```
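To make the date check above more concrete, here is a minimal sketch of how the per-date dump location and the list of history files could be derived. It is not the code in `url_extractor.py`. The helper names, the job name `metahistory7zdump`, and the layout of the status dictionary (modeled on the `dumpstatus.json` file that Wikimedia publishes next to each dump) are assumptions made for illustration.

```python
from urllib.parse import urljoin

DUMPS_ROOT = "https://dumps.wikimedia.org/"


def dump_index_url(lang: str, date: str) -> str:
    """Return the index page for one wiki and dump date, e.g. frwiki/20191001/."""
    return urljoin(DUMPS_ROOT, f"{lang}wiki/{date}/")


def history_file_urls(status: dict, job: str = "metahistory7zdump") -> list:
    """Collect absolute URLs of the files of one finished job.

    `status` is assumed to look like a parsed dumpstatus.json:
    {"jobs": {job: {"status": "done", "files": {name: {"url": "/..."}}}}}
    """
    info = status.get("jobs", {}).get(job, {})
    if info.get("status") != "done":
        # The full-history dump is missing or incomplete for this date.
        return []
    files = info.get("files", {})
    return sorted(urljoin(DUMPS_ROOT, meta["url"]) for meta in files.values())
```

An empty result from `history_file_urls` would mean that the chosen date has no complete edit history dump, in which case another date should be selected.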

2. Download and extraction of revision pairs

Download all parts of the dump and extract the relevant revision pairs.

Attention! Downloading highly compressed dump files and processing them on the fly takes a long time. Consider parallelizing this step if you need to process large Wikipedias that are split into multiple files.

```bash
# General form (-i: file with dump URLs, -o: output file, -l: language code)
python filter.py -i -o -l

# For example
python filter.py -i urls.txt -o revisions.txt -l fr
```
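One possible way to parallelize this step is to split the URL list into several smaller lists and run `filter.py` once per list. The sketch below assumes that `urls.txt` holds one URL per line and uses only the `-i`, `-o` and `-l` flags shown above. The function names, the `chunks` directory, and the per-chunk file names are hypothetical, and whether the per-chunk outputs can simply be concatenated afterwards is not confirmed by the repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_urls(urls, n_parts):
    """Distribute URLs round-robin into at most n_parts non-empty chunks."""
    chunks = [urls[i::n_parts] for i in range(n_parts)]
    return [c for c in chunks if c]


def run_filter(chunk_file, out_file, lang):
    """Run filter.py on one chunk and return its exit code."""
    cmd = ["python", "filter.py", "-i", str(chunk_file), "-o", str(out_file), "-l", lang]
    return subprocess.run(cmd, check=False).returncode


def filter_in_parallel(url_file, lang, n_parts=4, work_dir="chunks"):
    """Write one URL file per chunk, then process the chunks concurrently."""
    urls = [u for u in Path(url_file).read_text().splitlines() if u.strip()]
    Path(work_dir).mkdir(exist_ok=True)
    jobs = []
    for i, chunk in enumerate(split_urls(urls, n_parts)):
        chunk_file = Path(work_dir) / f"urls_{i}.txt"
        chunk_file.write_text("\n".join(chunk) + "\n")
        jobs.append((chunk_file, Path(work_dir) / f"revisions_{i}.txt", lang))
    # Threads are sufficient here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        codes = list(pool.map(lambda args: run_filter(*args), jobs))
    return [job[1] for job in jobs], codes
```

A call such as `filter_in_parallel("urls.txt", "fr", n_parts=4)` would return the per-chunk output paths together with the exit code of each run.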

3. Preprocessing and diff check

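This step covers preprocessing, segmentation, cleanup, the diff check, and filtering.

```bash
# General form (-i: extracted revisions, -o: output pickle file, -l: language code)
python diff.py -i -o -l

# For example
python diff.py -i revisions.txt -o diffs.pickle -l fr
```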
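As an illustration of what a sentence-level diff check between two revisions can look like, the sketch below compares the sentences of a "before" and an "after" text with Python's standard `difflib`. It is not the logic of `diff.py`: the regular-expression splitter is a naive stand-in for a real segmenter, and the function names are hypothetical.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def changed_sentence_pairs(before, after):
    """Return (old, new) pairs for sentences rewritten one-for-one."""
    old, new = split_sentences(before), split_sentences(after)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only replacements that map the same number of sentences,
        # so that each old sentence has exactly one rewritten counterpart.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(old[i1:i2], new[j1:j2]))
    return pairs
```

Pure insertions and deletions are ignored in this sketch, since they yield no before/after pair for a single sentence.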

4. Sentence extraction

This step covers sentence extraction, duplicate cleanup, classes, and balancing.

```bash
# General form (-i: input pickle file, -o: output pickle file)
python sents.py -i -o

# For example
python sents.py -i diffs.pickle -o sents.pickle
```
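Duplicate cleanup and balancing can be pictured with the following sketch, which drops repeated sentences and then downsamples every class to the size of the smallest one. The `(sentence, label)` tuple format, the label strings, and the function name are assumptions for illustration; the actual data structures used by `sents.py` may differ.

```python
import random


def dedupe_and_balance(examples, seed=0):
    """Drop repeated sentences, then downsample each class to the smallest one.

    `examples` is an iterable of (sentence, label) tuples.
    """
    seen, by_label = set(), {}
    for sentence, label in examples:
        # Compare sentences case-insensitively and ignore extra whitespace.
        key = " ".join(sentence.lower().split())
        if key in seen:
            continue
        seen.add(key)
        by_label.setdefault(label, []).append((sentence, label))
    if not by_label:
        return []
    size = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], size))
    rng.shuffle(balanced)
    return balanced
```

Downsampling is only one balancing strategy; it is chosen here because it needs no additional data.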

5. Labeling and splitting

This step assigns class labels and splits the dataset.

```bash
# General form (-i: input pickle file, -l: language code)
python dataset.py -i -l (-p )

# For example
python dataset.py -i sents.pickle -l fr -p label
```
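A dataset split of the kind named above can be sketched as a seeded shuffle followed by two cuts. The 80/10/10 ratios, the seed, and the function name are assumptions chosen for the example and are not taken from `dataset.py`.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle deterministically and cut into train, dev and test parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_dev = int(len(items) * ratios[1])
    # Whatever remains after the first two cuts becomes the test part.
    return (
        items[:n_train],
        items[n_train:n_train + n_dev],
        items[n_train + n_dev:],
    )
```

Using a fixed seed means that repeated runs over the same input produce the same three parts.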

Datasets

Balanced and split datasets in Bulgarian, French and English (extracted from the 20190401 dumps) can be found in /datasets/.

Citation

@InProceedings{aleksandrovamultilingual,
  author = "Aleksandrova, Desislava and Lareau, Fran{\c{c}}ois and M{\'e}nard, Pierre-Andr{\'e}",
  title = "Multilingual Sentence-Level Bias Detection in Wikipedia",
  booktitle = "Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2019)",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  pages = "42--51",
  location = "Varna, Bulgaria"
}
