Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (17.0%) to scientific vocabulary
Keywords
Repository
A language corpus creation tool for ReliefWeb
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 5
- Releases: 8
Topics
Metadata Files
README.md
Corpusama
About
Corpusama is a language corpus management tool that provides a semi-automated pipeline for creating corpora from the ReliefWeb database of humanitarian texts (managed by the United Nations Office for the Coordination of Humanitarian Affairs).
The goal of building language corpora from ReliefWeb is to study humanitarian discourse: what concepts exist among actors, how their usage changes over time, what's debated about them, their practical/ideological implications, etc.
Corpusama can build corpora with texts from the ReliefWeb API. Contact ReliefWeb to discuss acceptable uses for your project: users are responsible for how they access and utilize ReliefWeb. This software is for nonprofit research; it is offered to ensure the reproducibility of research and to build on previous work. Feel free to reach out if you have questions or suggestions.
General requirements
ReliefWeb is a large database with over 1 million humanitarian reports, each of which may include PDF attachments and texts in multiple languages. Upwards of 500 GB of space may be required for managing and processing files. Downloading this data at a reasonable rate takes weeks. Dependencies also require up to about 10 GB of space and benefit from a fast CPU and GPU (a GPU is needed for processing large amounts of data).
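Before starting a long download, it can help to confirm that the target disk has room. The snippet below is a minimal sketch using only the Python standard library; the function names and the 500 GB threshold constant are illustrative (the threshold is taken from the estimate above) and are not part of Corpusama.

```python
import shutil

# Upper estimate quoted above for storing and processing all of ReliefWeb.
REQUIRED_GB = 500


def free_space_gb(path: str = ".") -> float:
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    free = free_space_gb(".")
    print(f"{free:.1f} GB free; enough for a full download: {has_enough_space('.')}")
```

Smaller date ranges need far less space, so the threshold can be lowered through the `required_gb` argument.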
Basic installation
Clone this repo and install dependencies in a virtual environment (tested on Python 3.12). These are the main packages: pip install click defusedxml nltk pandas PyMuPDF PyYAML requests stanza.
```bash
python3 -m venv .venv
source ${PWD}/.venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Reproducing corpora
Corpora can be generated using the following code snippet. The notes below describe the process and important considerations in more detail.
```bash
# INITIAL SETUP
git clone https://github.com/engisalor/corpusama && cd corpusama
# get dependencies and models, unit test
bash rw_corpora_setup.sh "

# RUN PERIODICALLY TO GENERATE UP-TO-DATE CORPORA
# generate EN, FR & ES corpora (.vert.xz) for a date range
bash rw_corpora_update.sh "
```
On package setup: rw_corpora_setup.sh
- Clone the repo and CD to the directory.
- Stanza and NLTK models will be downloaded to `~/`. These resources will be reused until updated manually. Requires up to about 10 GB for dependencies and models.
- After installing dependencies, `unittest` will run to ensure proper setup. As of 2024/11/01 no tests should fail.
- See ReliefWeb's terms and conditions before using its service/data. An email is required for making API calls (stored in files ending in `*secret.yml`).
On generating corpora: rw_corpora_update.sh
- This script produces a series of compressed vertical corpus files in a CoNLLU-based format using Stanza NLP. It:
  - fetches ReliefWeb data for a date range
  - processes with Stanza NLP into CoNLLU
  - converts to SkE-compatible vertical format
  - runs a secondary pipeline to add sentence-level language metadata and UUIDs for doc and docx structures (technically optional)
- After completion, `.vert.xz` files can be stored and used. The rest of the project can be deleted if CoNLLU and other files aren't desired. Downloaded data will keep accumulating for every run. Storing all of ReliefWeb takes ~500 GB and weeks to download/process. This script is intended for sequential, chronological updates only, not a mix of overlapping or disjointed dates.
- This script reuses `config/reliefweb_2000+.yml` to define settings. A custom date range is supplied to define what texts to download. Use a YYYY-MM-DD format. The example below collects data for January 2020:
```bash
# get a month of data
bash rw_corpora_update.sh "2020-01-01" "2020-02-01"
```
This produces these vertical files (and intermediate formats) as long as documents of each language are detected in the chosen date range:
- reliefweb_en_2020-01-01_2020-02-01.1.txt.vert.xz
- reliefweb_fr_2020-01-01_2020-02-01.1.txt.vert.xz
- reliefweb_es_2020-01-01_2020-02-01.1.txt.vert.xz
- Vertical files are ready to be compiled in Sketch Engine. See these directories for corpus configuration files:
- https://github.com/engisalor/corpusama/tree/main/registry
- https://github.com/engisalor/corpusama/tree/main/registry_subcorp
- This script has been tested for use with a recent Core i7 laptop with an NVIDIA GPU. Testing is done on Fedora Linux and Ubuntu to a lesser extent. Default settings attempt to be reasonable and reduce errors, but a number of issues may arise throughout the process. Verifying output data at each phase and becoming familiar with the code base are recommended. If in doubt, please reach out.
See more in-depth explanations of software and data formats below.
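The date format and output naming described above can be sketched in a few lines of Python. This is an illustration only: `expected_vert_files` is a hypothetical helper, not part of Corpusama, the `.1` batch number is copied from the example filenames, and requiring the start date to precede the end date is an assumption about how the update script should be called.

```python
from datetime import datetime

LANGS = ("en", "fr", "es")  # languages named in the example output above


def expected_vert_files(start: str, end: str, langs=LANGS, batch: int = 1) -> list[str]:
    """Return the vertical filenames expected for a YYYY-MM-DD date range."""
    # strptime raises ValueError for malformed or impossible dates
    start_d = datetime.strptime(start, "%Y-%m-%d").date()
    end_d = datetime.strptime(end, "%Y-%m-%d").date()
    if start_d >= end_d:  # assumption: ranges run forward in time
        raise ValueError("start date must be earlier than end date")
    return [
        f"reliefweb_{lang}_{start_d.isoformat()}_{end_d.isoformat()}.{batch}.txt.vert.xz"
        for lang in langs
    ]


if __name__ == "__main__":
    for name in expected_vert_files("2020-01-01", "2020-02-01"):
        print(name)
```

A file only appears if documents in that language were detected in the range, so a missing name from this list is not necessarily an error.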
Corpus sizes
|Name|ID|Types|Tokens|Docs|
|-|-|-|-|-|
|ReliefWeb English 2023|rw_en23|1,683,494,268|2,079,244,974|884,528|
|ReliefWeb French 2023|rw_fr23|210,112,455|248,413,974|109,592|
|ReliefWeb Spanish 2023|rw_es23|125,983,910|150,712,952|79,697|

- Dates cover January 1, 2000 through December 31, 2023 (until next update)
Corpus structures
- `<doc>` delineates documents, which may be HTML text or extracted PDF text
  - contains the most pertinent metadata for linguistic analysis
  - `doc.id` refers to the Report ID (which can be shared by its 1 HTML text and 0+ associated PDFs)
  - `doc.file_id` refers to PDF content for a report
- `<docx>` is the XZ archive containing a set of documents
- `<s>` is a sentence boundary
  - `s.id` refers to the sentence number in a document, starting at 1
  - `s.lang`, if implemented, refers to the Stanza language identification result for the sentence, with English, Spanish, French, and None as the possible values (None being sentences too short to analyze)
- a `ref` value is given for `doc` and `docx` structures, which may be a unique sequential number or a UUID (preferably the latter)
Subcorpora
- `doc_html` and `doc_pdf` split the corpus by document type
- `lang_*` splits the corpus by sentence language: mostly used to identify pockets of unwanted noise; does not strictly refer to each language ID result (see the `registry_subcorp` files for precise definitions)
- `source_single` and `source_multi` split the corpus by documents that have only one 'author' in the `doc.source__name` value or multiple authors
Stanza
Stanza is a Python NLP package from Stanford. Models for languages may need to be downloaded with its download() function if this doesn't happen automatically.
Deprecated packages
Earlier versions relied on FreeLing and fastText NLP tools. See the history of this README for old installation instructions. These tools perform better on machines without a dedicated GPU, whereas Stanza can run on a CPU but more slowly.
Configuration files
Various settings are supplied to build corpora. The config/ directory stores YAML files with many settings. The config/reliefweb_2000+.yml is one example. It specifies a few things:
```yaml
# the API where data is collected from
source: reliefweb
# an SQL schema for the database that stores API data
schema: corpusama/database/schema/reliefweb.sql
# the name of the database
db_name: data/reliefweb_2000+.db
# the column containing textual data (i.e., corpus texts)
text_column: body_html
# the daily maximum number of API calls
quota: 1000
# a dictionary specifying how to throttle API calls
wait_dict: {"0": 1, "5": 49, "10": 99, "20": 499, "30": null}
# API parameters used to generate calls
parameters:
```
In this case, reliefweb_2000+ has parameters to get all text-based reports (in any language) starting from 1 January 2000.
Each configuration file is accompanied by another that supplies secrets. It has the same filename but uses the suffix .secret.yml, e.g., reliefweb_2000+.secret.yml:
```yml
# where to store PDF files locally
pdf_dir: /a/local/filepath/
# API url and appname
url: https://api.reliefweb.int/v1/reports?appname=
```
To get started, make a secrets file for reliefweb_2000+ or design your own configuration files.
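The naming convention for secrets files can be checked programmatically before running anything. The sketch below uses only `pathlib`; `secrets_path` and `missing_secrets` are hypothetical helpers written for this illustration, not functions provided by Corpusama.

```python
from pathlib import Path


def secrets_path(config_path: str) -> Path:
    """Return the companion secrets file for a YAML config file."""
    path = Path(config_path)
    if path.suffix not in {".yml", ".yaml"}:
        raise ValueError(f"not a YAML config file: {config_path}")
    # reliefweb_2000+.yml -> reliefweb_2000+.secret.yml
    return path.with_name(f"{path.stem}.secret.yml")


def missing_secrets(config_dir: str = "config") -> list[Path]:
    """List configs in `config_dir` that lack a companion secrets file."""
    configs = [
        p for p in sorted(Path(config_dir).glob("*.yml"))
        if not p.name.endswith(".secret.yml")
    ]
    return [p for p in configs if not secrets_path(str(p)).exists()]


if __name__ == "__main__":
    print(secrets_path("config/reliefweb_2000+.yml"))
    for cfg in missing_secrets("config"):
        print(f"no secrets file found for {cfg}")
```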
Usage example
This is an example to verify installation and describe the general workflow of corpus creation.
Corpusama
The corpusama package has all the modules needed to manage data collection and corpus preparation. It's controlled with a Corpus object. Modules can also be imported and used independently, but this is mostly useful for development purposes.
How to build a database:
```py
from corpusama.corpus.corpus import Corpus

# instantiate a Corpus object
corp = Corpus("config/reliefweb_2000+.yml")
# this makes/opens a database in data/
# the database is accessible via corp.db
# view the database with a GUI application like https://sqlitebrowser.org/
# (helpful for inspecting each stage of data manipulation)

# get records that haven't been downloaded yet
corp.rw.get_new_records(1)  # this example stops after 1 API call
# (downloads up to 1000 records chronologically)

# download associated PDFs
corp.rw.get_pdfs()
# (most reports don't have PDFs; some have several)

# extract PDF text
corp.rw.extract_pdfs()
# doesn't include OCR capabilities
# extracts to TXT files in same parent dir as PDF

# run language identification on texts
corp.make_langid("pdf")  # TXT files extracted from PDFs
corp.make_langid("raw")  # HTML data stored within API responses

# make corpus XML attributes for a language
corp.make_attribute("fr")
# there should be a few French documents in the first 1000 API results (example breaks if otherwise)

# export the combined texts into one TXT file
df = corp.export_text("fr")
# produces reliefweb_fr.1.txt
# these files can be processed with a pipeline to make a vertical file
```
Export format
The above workflow generates text files with ReliefWeb reports surrounded by <doc> XML tags, by default in batches of up to 10,000 reports per output file. Here is a fragment of a single report:
```xml
<doc id="302405" file_id="0" country__iso3="pse" country__shortname="oPt" date__original="2009-03-25T00:00:00+00:00" date__original__year="2009" format__name="News and Press Release" primary_country__iso3="pse" primary_country__shortname="oPt" source__name="Palestinian Centre for Human Rights" source__shortname="PCHR" source__type__name="Non-governmental Organization" theme__name="Health|Protection and Human Rights" title="OPT: PCHR appeals for action to save lives of Gaza Strip patients" url="https://reliefweb.int/node/302405" >
The Palestinian Centre for human rights
(PCHR) is appealing to the Ministries of Health in the Ramallah and Gaza
Governments to take all possible measures to facilitate referrals for Gazan
patients who need urgent medical treatment outside Gaza.
The Centre is alarmed at the deterioration
of patient's health following two key political decisions on healthcare
provision. In January 2009, the Ramallah Ministry of Health (MOH) cancelled
financial coverage for all Palestinian patients in Israeli hospitals, including
those who in the middle of long term treatment.
[...]
</doc>
```
Once the content of these files is inspected, they can be compressed with xz or a similar tool before further processing.
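For those who prefer to stay in Python rather than call the `xz` command, the standard-library `lzma` module writes the same `.xz` container. The following is a minimal sketch, not part of Corpusama: `compress_xz` and `count_docs` are hypothetical helper names, and the document count assumes each opening `<doc` tag starts its own line, as in the fragment above.

```python
import lzma
import shutil
from pathlib import Path


def compress_xz(txt_path) -> Path:
    """Compress `txt_path` to `<txt_path>.xz`, leaving the original in place."""
    src = Path(txt_path)
    dst = src.with_name(src.name + ".xz")
    with src.open("rb") as fin, lzma.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, so large files are fine
    return dst


def count_docs(xz_path) -> int:
    """Count opening <doc ...> tags in a compressed export as a sanity check."""
    with lzma.open(xz_path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.startswith("<doc "))
```

Comparing `count_docs` on the compressed file with the number of reports expected in the batch is a quick way to catch a truncated export.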
Pipelines
TLDR
These are the main commands for using the Stanza pipeline to process the above-mentioned TXT files. Use the --help flag to learn more.
- base script: `python3 ./pipeline/stanza/base_pipeline.py`
- generate CoNLLU files: `python3 ./pipeline/stanza/base_pipeline.py to-conll`
- convert CoNLLU to vertical: `python3 ./pipeline/stanza/base_pipeline.py conll-to-vert`
More details
Files in the pipeline/ directory are used to complete corpus creation. The current version of Corpusama relies on the Stanza pipeline. Run python pipeline/stanza/base_pipeline.py --help for an overview. This has been tested with an NVIDIA 4070m GPU (8 GB). Also adjust the arguments below in stanza.Pipeline when fine-tuning memory management:
```py
self.nlp = stanza.Pipeline(
    tokenize_batch_size=32,  # Stanza default=32
    mwt_batch_size=50,  # Stanza default=50
    pos_batch_size=100,  # Stanza default=5000
    lemma_batch_size=50,  # Stanza default=50
    depparse_batch_size=5000,  # Stanza default=5000
)
```
If possible, the dependency parsing batch size should be "set larger than the number of words in the longest sentence in your input document" (see documentation). Managing memory issues in the GPU-based Stanza pipeline may be necessary. See error_corrections.md for a list of steps taken to build the corpora with an Nvidia 4070.
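One way to apply the batch-size advice is to measure the longest sentence in CoNLL-U output from a trial run on a small sample. The function below is a rough sketch that is not part of the pipeline: `longest_sentence_length` is a hypothetical name, and it assumes standard CoNLL-U layout (comment lines start with `#`, sentences are separated by blank lines, and the first field of a token line is its ID).

```python
def longest_sentence_length(conllu_text: str) -> int:
    """Return the word count of the longest sentence in a CoNLL-U string."""
    longest = current = 0
    for raw in conllu_text.splitlines():
        line = raw.strip()
        if not line:  # a blank line ends the current sentence
            longest, current = max(longest, current), 0
        elif line.startswith("#"):  # metadata such as "# sent_id = 0"
            continue
        else:
            token_id = line.split(None, 1)[0]
            # count plain word IDs only; skip ranges (1-2) and empty nodes (1.1)
            if token_id.isdigit():
                current += 1
    return max(longest, current)  # covers a final sentence with no trailing blank line
```

If the result approaches `depparse_batch_size`, either raise that value or expect higher GPU memory use on the longest sentences.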
The first output format is .conllu (see https://universaldependencies.org/format.html). Here's a sample:
```bash
# newdoc
# id = 302405
# file_id = 0
# country__iso3 = pse
# country__shortname = oPt
# date__original = 2009-03-25T00:00:00+00:00
# date__original__year = 2009
# format__name = News and Press Release
# primary_country__iso3 = pse
# primary_country__shortname = oPt
# source__name = Palestinian Centre for Human Rights
# source__shortname = PCHR
# source__type__name = Non-governmental Organization
# theme__name = Health|Protection and Human Rights
# title = OPT: PCHR appeals for action to save lives of Gaza Strip patients
# url = https://reliefweb.int/node/302405
# text = The Palestinian Centre for human rights (PCHR) is appealing to the Ministries of Health in the Ramallah and Gaza Governments to take all possible measures to facilitate referrals for Gazan patients who need urgent medical treatment outside Gaza.
# sent_id = 0
1 The the DET DT Definite=Def|PronType=Art 3 det _ start_char=0|end_char=3
2 Palestinian Palestinian ADJ NNP Degree=Pos 3 amod _ start_char=4|end_char=15
3 Centre Centre PROPN NNP Number=Sing 11 nsubj _ start_char=16|end_char=22
[...]
```
CoNLLU can then be converted to a vertical format recognized by Sketch Engine:
```xml
<doc id="302405" file_id="0" country__iso3="pse" country__shortname="oPt" date__original="2009-03-25T00:00:00+00:00" date__original__year="2009" format__name="News and Press Release" primary_country__iso3="pse" primary_country__shortname="oPt" source__name="Palestinian Centre for Human Rights" source__shortname="PCHR" source__type__name="Non-governmental Organization" theme__name="Health|Protection and Human Rights" title="OPT: PCHR appeals for action to save lives of Gaza Strip patients" url="https://reliefweb.int/node/302405">
<s id="0">
1 The the DET DT Definite=Def|PronType=Art 3 det _ start_char=0|end_char=3
2 Palestinian Palestinian ADJ NNP Degree=Pos 3 amod _ start_char=4|end_char=15
3 Centre Centre PROPN NNP Number=Sing 11 nsubj _ start_char=16|end_char=22
```
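A vertical file like the sample above can be inspected with a short reader before it is compiled. The sketch below is illustrative and not part of Corpusama: `summarize_vertical` is a hypothetical helper, and it assumes one structure tag or one token per line and attribute values that contain no unescaped double quotes.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
DOC_OPEN = re.compile(r"<doc[\s>]")  # matches <doc ...> but not <docx ...>
SENT_OPEN = re.compile(r"<s[\s>]")


def summarize_vertical(lines):
    """Yield (attributes, sentence_count, token_count) for each <doc>."""
    attrs, sents, tokens = None, 0, 0
    for raw in lines:
        line = raw.strip()
        if DOC_OPEN.match(line):
            attrs, sents, tokens = dict(ATTR_RE.findall(line)), 0, 0
        elif line.startswith("</doc>") and attrs is not None:
            yield attrs, sents, tokens
            attrs = None
        elif attrs is None or not line:
            continue  # outside a document (e.g. <docx> wrappers) or a blank line
        elif SENT_OPEN.match(line):
            sents += 1
        elif not line.startswith("<"):
            tokens += 1  # one token per line inside a document
```

Because the function accepts any iterable of lines, it can be fed a file object opened with `lzma.open(path, "rt", encoding="utf-8")` to read a `.vert.xz` file without decompressing it to disk.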
The TXT, CoNLLU and vertical formats have their own use cases for NLP tasks. Viewing the corpus (e.g., in Sketch Engine) also requires making a corpus configuration file and other steps beyond this introduction.
Generating and verifying checksums is also recommended for sharing and versioning:
```bash
# generate
sha256sum reliefweb* > hashes.txt
# verify
sha256sum -c hashes.txt
```
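On systems without `sha256sum`, the same hashes can be produced with Python's `hashlib`. This is a sketch under the assumption that the output should match the two-space `<hash>  <filename>` layout that `sha256sum -c` reads; `sha256_file` and `write_hashes` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_hashes(directory=".", pattern="reliefweb*", out="hashes.txt") -> list[str]:
    """Write one `<hash>  <filename>` line per matching file and return the lines."""
    base = Path(directory)
    lines = [
        f"{sha256_file(p)}  {p.name}"
        for p in sorted(base.glob(pattern))
        if p.is_file()
    ]
    (base / out).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```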
Resources
- Stanza Treebanks
- Universal Dependencies
- Ancora treebank documentation (Spanish pipeline)
- French GSD treebank documentation (French pipeline)
- Penn treebank tagset (English pipeline)
Acknowledgements
Support comes from the Humanitarian Encyclopedia and the LexiCon research group at the University of Granada. See the paper below for references and funding disclosures.
Dependencies have included these academic works, which have their corresponding bibliographies: Stanza, FreeLing and fastText.
Subdirectories with a ske_ prefix are from Sketch Engine under AGPL, MPL, and/or other pertinent licenses. See their bibliography and website with open-source corpus tools.
Other attributions are indicated in individual files. NoSketch Engine is a related software used downstream in some projects.
Citation
Please consider citing the papers below. CITATION.cff can be used if citing the software directly.
```bibtex
@inproceedings{isaacshumanitarian2023,
  location = {Brno, Czech Republic},
  title = {Humanitarian reports on {ReliefWeb} as a domain-specific corpus},
  url = {https://elex.link/elex2023/wp-content/uploads/elex2023_proceedings.pdf},
  pages = {248--269},
  booktitle = {Electronic lexicography in the 21st century. Proceedings of the {eLex} 2023 conference},
  publisher = {Lexical Computing},
  author = {Isaacs, Loryn},
  editor = {Medveď, Marek and Měchura, Michal and Kosem, Iztok and Kallas, Jelena and Tiberius, Carole and Jakubíček, Miloš},
  date = {2023}
}

@inproceedings{isaacs-etal-2024-humanitarian-corpora,
  title = "Humanitarian Corpora for {E}nglish, {F}rench and {S}panish",
  author = "Isaacs, Loryn and Chamb{\'o}, Santiago and Le{\'o}n-Ara{\'u}z, Pilar",
  editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
  booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
  month = may,
  year = "2024",
  location = "Torino, Italia",
  publisher = "ELRA and ICCL",
  url = "https://aclanthology.org/2024.lrec-main.738",
  pages = "8418--8426",
}
```
Owner
- Login: engisalor
- Kind: user
- Repositories: 2
- Profile: https://github.com/engisalor
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it as below and cite relevant papers indicated in README.md.
authors:
  - family-names: Isaacs
    given-names: Loryn
    orcid: https://orcid.org/0000-0003-0267-4853
title: Corpusama
version: 0.4.0 # x-release-please-version
date-released: 2023-04-17
repository-code: https://github.com/engisalor/corpusama
license: GPL3+
GitHub Events
Total
- Create event: 1
- Release event: 1
- Issues event: 2
- Issue comment event: 1
- Push event: 22
- Pull request event: 2
- Pull request review event: 1
Last Year
- Create event: 1
- Release event: 1
- Issues event: 2
- Issue comment event: 1
- Push event: 22
- Pull request event: 2
- Pull request review event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 12
- Average time to close issues: 5 months
- Average time to close pull requests: about 1 month
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.67
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 12
Past Year
- Issues: 4
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- engisalor (6)
Pull Request Authors
- github-actions[bot] (12)