citationdetective

A system for data export from Citation Need model detecting unsourced sentences on Wikipedia

https://github.com/aikochou/citationdetective

Science Score: 28.0%

This score indicates how likely this project is to be science-related based on various indicators:

* ✓ CITATION.cff file: Found CITATION.cff file
* ○ codemeta.json file
* ○ .zenodo.json file
* ○ DOI references
* ✓ Academic publication links: Links to arxiv.org
* ○ Academic email domains
* ○ Institutional organization owner
* ○ JOSS paper metadata
* ○ Scientific vocabulary similarity: Low similarity (14.0%) to scientific vocabulary

Repository

Basic Info

Host: GitHub
Owner: AikoChou
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 55.7 KB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 10
Releases: 0

Created over 6 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

Citation Detective

Citation Detective is a system that applies the Citation Need model, a machine-learning-based classifier published in WWW'19 by WMF researchers and collaborators, to a large number of articles in English Wikipedia, producing a dataset that contains sentences detected as missing citations with their associated metadata.

The Citation Detective database is now available on Wikimedia Toolforge as the public SQL database s54245__citationdetective_p.

Each time the database is updated, Citation Detective takes a random sample of about 120k English Wikipedia articles and runs the Citation Need model to predict, for each sentence, a score ŷ indicating how likely the sentence is to need a citation. It then extracts the sentences with a score ŷ ≥ 0.5, along with contextual information, resulting in hundreds of thousands of sentences in the database that are classified as needing citations.
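The selection step described above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name `select_uncited`, the `(sentence, score)` tuples, and the example scores are hypothetical. Only the 0.5 threshold comes from the text.

```python
from typing import Iterable, List, Tuple

THRESHOLD = 0.5  # cut-off stated in the README


def select_uncited(scored: Iterable[Tuple[str, float]],
                   threshold: float = THRESHOLD) -> List[Tuple[str, float]]:
    """Keep the sentences whose predicted score meets the threshold."""
    return [(sentence, score) for sentence, score in scored if score >= threshold]


# Made-up sentences and scores, for illustration only.
examples = [
    ("The city was founded in 1820.", 0.91),
    ("See also the list below.", 0.12),
    ("It is widely considered the best.", 0.50),
]
kept = select_uncited(examples)
print(kept)
```

A score of exactly 0.5 is kept here because the comparison is `>=`.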

A design specification for the system can be found in this blog post and more information in our Wiki Workshop submission.

Schema of the Sentences table in Citation Detective database:

| Field | Type | Description |
| --- | --- | --- |
| id | integer | Primary key |
| sentence | string | The text of the sentence |
| paragraph | string | The text of the paragraph which contains the sentence |
| section | string | The section title |
| rev_id | integer | The revision ID of the article |
| score | float | The predicted citation need score |
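To make the schema concrete, here is a minimal sketch that recreates the table in an in-memory SQLite database and runs a typical query. SQLite is only a stand-in for the MariaDB instance on Toolforge. The table name `sentences`, the SQLite column types, and the rows are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the real database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sentences (
        id        INTEGER PRIMARY KEY,
        sentence  TEXT,
        paragraph TEXT,
        section   TEXT,
        rev_id    INTEGER,
        score     REAL
    )
    """
)

# Invented rows, for illustration only.
rows = [
    (1, "Sentence A.", "Sentence A. More text.", "History", 1001, 0.93),
    (2, "Sentence B.", "Sentence B. More text.", "Reception", 1002, 0.61),
    (3, "Sentence C.", "Sentence C. More text.", "History", 1003, 0.78),
]
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?, ?, ?)", rows)

# The two sentences most likely to need a citation.
top = conn.execute(
    "SELECT sentence, section, rev_id, score FROM sentences "
    "ORDER BY score DESC LIMIT 2"
).fetchall()
for row in top:
    print(row)
```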

Access to the database in Toolforge

You need a developer account to access the database. To create one and set up an SSH key, refer to the instructions here.

After logging in to the Toolforge server, connect to tools.db.svc.eqiad.wmflabs with the replica.my.cnf credentials:

    $ mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs

You could also just type:

    $ sql tools

Then select the Citation Detective database:

    MariaDB [(none)]> use s54245__citationdetective_p;

Access to the database from outside the Toolforge environment is not currently possible, but is under investigation for the future.
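For scripted access, the same credentials file can be read from Python. The sketch below parses the `[client]` section of a MySQL option file with the standard library. The helper name `read_client_credentials` is hypothetical, and the demo writes a throwaway file with placeholder values rather than touching a real `replica.my.cnf`.

```python
import configparser
import os
import tempfile


def read_client_credentials(path):
    """Return the [client] section of a MySQL option file as a dict."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    with open(path, encoding="utf-8") as f:
        parser.read_file(f)
    # Option files may quote their values; strip the surrounding quotes.
    return {key: (value or "").strip("'\"")
            for key, value in parser["client"].items()}


# Demo with a temporary file and placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "replica.my.cnf")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("[client]\nuser='example_user'\npassword='example_password'\n")
    creds = read_client_credentials(demo_path)

print(sorted(creds))
```

With mysqlclient (listed in requirements.txt), passing `read_default_file` to `MySQLdb.connect` should let the driver read the option file directly, so manual parsing like this is mainly useful for inspection or for other drivers.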

Deploying in Toolforge

If you want to contribute to this project, please take a look at the following instructions for deploying in Toolforge and running locally for development.

The job that updates the Citation Detective database runs on the grid engine via cron.

After logging in to the Toolforge server, create a virtualenv and activate it:

    $ mkdir www/python/
    $ virtualenv --python python3 www/python/venv/
    $ . www/python/venv/bin/activate

Next, clone this repository and install the dependencies:

    $ git clone https://github.com/AikoChou/citationdetective.git
    $ pip install -r citationdetective/requirements.txt

Then, download the Citation Need models and embeddings following the instructions in the citation-needed/ directory and put them into the corresponding folders. The file structure should look like this:

    .
    └── citation-needed/
        ├── embeddings/
        │   ├── word_dict_en.pck
        │   └── section_dict_en.pck
        └── models/
            └── fa_en_model_rnn_attention_section.h5

Finally, the scripts/update_db_tools_labs.py script automates the generation of the database in Toolforge. It is run regularly as a cron job and needs to run from a virtualenv:

    /usr/bin/jsub -mem 10g -N cd_update_en -once \
        /data/project/citationdetective/www/python/venv/bin/python3 \
        /data/project/citationdetective/citationdetective/scripts/update_db_tools_labs.py en
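Before scheduling the job, it can help to confirm that the model and embedding files are where the file structure above says they should be. The following is a minimal sketch of such a check. The helper `missing_model_files` is hypothetical and not part of the repository, and the demo runs against a temporary directory.

```python
import tempfile
from pathlib import Path

# Paths relative to the citation-needed/ directory, as listed in the README.
EXPECTED_FILES = [
    "embeddings/word_dict_en.pck",
    "embeddings/section_dict_en.pck",
    "models/fa_en_model_rnn_attention_section.h5",
]


def missing_model_files(root):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Demo: a temporary tree that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "citation-needed"
    (base / "embeddings").mkdir(parents=True)
    (base / "embeddings" / "word_dict_en.pck").touch()
    missing = missing_model_files(base)

print(missing)
```

An empty list means all three files are in place.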

Generating the database locally

To generate your own Citation Detective database locally, you need:

* A local installation of MySQL;
* A working Internet connection;
* A list of Page IDs corresponding to the Wikipedia articles whose sentences you want Citation Detective to score for "citation needed".

First, create a MySQL config file that tells the scripts how to find and log in to the databases (like the MySQL credentials in ~/replica.my.cnf in Toolforge):

    $ cat ~/replica.my.cnf
    [client]
    user='root'
    host='localhost'

Citation Need models exist for English, Italian and French, and they can be retrained for any language. The scripts expect the environment variable CD_LANG to be set to a language code taken from config.py.

Since Citation Detective currently supports only English Wikipedia, we set the variable to en:

    $ export CD_LANG=en

Now, let's create all necessary databases and tables:

    $ python -c 'import cddb; cddb.initialize_all_databases()'

Change to the scripts/ directory and run the parse.py script, which reads the Page ID list you provide, queries the Wikipedia API for the actual content of the pages, and runs the Citation Need model to identify sentences lacking citations:

    $ cd scripts
    $ python parse.py sample_pageids

You can use the sample_pageids file provided or generate one with print_pageids_from_wikipedia.py. For the latter option, you need to download the page SQL dump of Wikipedia and import it into your local MySQL in advance.
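The scripts depend on CD_LANG being set, so a small guard can make a missing or unsupported value fail early with a clear message. This is a sketch under assumptions: `SUPPORTED_LANGS` and `get_language` are hypothetical names, and the actual language codes live in config.py, whose contents are not shown here.

```python
import os

# Assumption: only English is supported at the moment, as stated in the README.
SUPPORTED_LANGS = {"en"}


def get_language(environ=os.environ):
    """Read CD_LANG from the environment and validate it."""
    lang = environ.get("CD_LANG")
    if lang is None:
        raise RuntimeError("CD_LANG is not set; try: export CD_LANG=en")
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            "Unsupported CD_LANG {!r}; expected one of {}".format(
                lang, sorted(SUPPORTED_LANGS)
            )
        )
    return lang


# Demo with an explicit mapping instead of the real environment.
print(get_language({"CD_LANG": "en"}))
```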

Lastly, your MySQL installation should contain a database named root__scratch_en with a sentences table. The install_new_database.py script will atomically move the table to a new database named root__citationdetective_p, which serves as the final database:

    $ python install_new_database.py
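One common way to move a table between databases atomically in MySQL and MariaDB is a `RENAME TABLE` statement. Whether install_new_database.py does exactly this is not stated here, so treat the following as an illustrative sketch. The helper `rename_table_sql` is hypothetical and only builds the statement text; it does not connect to any server.

```python
def rename_table_sql(src_db, dst_db, table):
    """Build a RENAME TABLE statement that moves `table` from src_db to dst_db."""
    def quote(name):
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + name.replace("`", "``") + "`"

    return "RENAME TABLE {}.{} TO {}.{}".format(
        quote(src_db), quote(table), quote(dst_db), quote(table)
    )


stmt = rename_table_sql("root__scratch_en", "root__citationdetective_p", "sentences")
print(stmt)
```

The destination database has to exist before such a statement is executed.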

Owner

Name: Aiko
Login: AikoChou
Kind: user
Location: Taiwan

Repositories: 7
Profile: https://github.com/AikoChou

Citation (citation-needed/embeddings/dictionaries.txt)

To run the models, you will need two dictionaries:
* Sentence dictionary: these are the embeddings for the words in the sentences
* Section dictionary: embeddings for the section titles.

Any word or section title that is not in these two dictionaries is assigned the UNK embedding.

All dictionaries in their pickle format can be found here: https://drive.google.com/drive/folders/1dlocPHPz6Giv9nS8rR4t6kes8nlJ3inX?usp=sharing

The file format is:
<dict type>_dict_<lan>.pck

Where:
* <dict type> in [word,section]
* <lan> in [en,fr,it]
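The naming scheme and the UNK fallback can be sketched as follows. The names `dict_filename` and `lookup` are hypothetical, and the toy vocabulary (a mapping from token to integer index with an `"UNK"` key) is an assumption, since the internal structure of the pickled dictionaries is not described here.

```python
DICT_TYPES = ("word", "section")
LANGS = ("en", "fr", "it")
UNK = "UNK"


def dict_filename(dict_type, lang):
    """Build a file name of the form <dict type>_dict_<lan>.pck."""
    if dict_type not in DICT_TYPES:
        raise ValueError("unknown dictionary type: {!r}".format(dict_type))
    if lang not in LANGS:
        raise ValueError("unknown language: {!r}".format(lang))
    return "{}_dict_{}.pck".format(dict_type, lang)


def lookup(vocab, token):
    """Return the entry for `token`, falling back to the UNK entry."""
    return vocab.get(token, vocab[UNK])


# Toy vocabulary, for illustration only.
toy_vocab = {"UNK": 0, "history": 1, "the": 2}
print(dict_filename("word", "en"))
print([lookup(toy_vocab, t) for t in ["the", "history", "zyzzyva"]])
```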

Dependencies

requirements.txt pypi

Keras ==2.1.5
Markdown ==3.1.1
PyYAML ==5.2
Werkzeug ==0.16.0
absl-py ==0.9.0
astor ==0.8.1
bleach ==1.5.0
certifi ==2019.11.28
chardet ==3.0.4
docopt ==0.6.2
gast ==0.3.2
grpcio ==1.26.0
h5py ==2.10.0
html5lib ==0.9999999
idna ==2.8
mwapi ==0.5.1
mwparserfromhell ==0.5.4
mysqlclient ==1.4.2.post1
nltk ==3.4.5
numpy ==1.18.0
pkg-resources ==0.0.0
protobuf ==3.11.2
psutil ==5.6.7
python-dateutil ==2.8.1
pytz ==2019.3
requests ==2.22.0
scipy ==1.4.1
six ==1.13.0
tensorboard ==1.7.0
tensorflow ==1.7.0
termcolor ==1.1.0
urllib3 ==1.25.7
