paracrawl-context

Extracting parallel data with original document context from raw ParaCrawl data

https://github.com/proyag/paracrawl-context

Repository

Basic Info

Host: GitHub
Owner: Proyag
License: mit
Language: Shell
Default Branch: main
Size: 37.1 KB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 2
Releases: 0

Created over 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

ParaCrawl-Context

Extracting parallel data with original document context from raw ParaCrawl data.
For details, see the associated ACL 2024 paper: Document-Level Machine Translation with Large-Scale Public Parallel Corpora.

Released data

Datasets for 5 language pairs have been released at https://huggingface.co/datasets/Proyag/paracrawl_context.
If you want to compile your own datasets, read on.

Environment setup

Create a conda env:

```bash
conda create -n context python=3.10
conda activate context
```

Install Python packages and command-line tools:

```bash
pip install -r requirements.txt
sudo apt install parallel pigz pv
```

Required files

For all language pairs:
* classified-fasttext: Raw URLs and base64-encoded documents from the ParaCrawl crawls that the corpora were extracted from. These can be downloaded from the "Language Classified Web Text from ParaCrawl" section on this page. Not everything was released here, so our process will be lossy.

Per language pair:
* TMX file: TMX file from the official ParaCrawl releases.
* TSV file: TSV file from the official ParaCrawl releases. This contains all the parallel text that is also in the TMX file, but this format is more convenient for some steps.

You can get both required files with ./download-paracrawl.sh SRC TRG. Use three-letter ISO 639-2 language codes. If you only know the two-letter ISO 639-1 codes, use ./lang_codes.sh TWO_LETTER_CODE to get the three-letter code. If the links in download-paracrawl.sh are broken, you can download the TMX and TSV files from https://paracrawl.eu, place them in data/released, and rename them to use three-letter language codes.

Run the document extraction pipeline

Step 1: Extract URLs

In this step, we extract the URLs and corresponding line numbers from the TMX file. Note that each line can be associated with one or more URLs where the line was present, and each URL can be associated with many line numbers.

```bash
./extract-urls.sh SRC TRG
```
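To make the many-to-many mapping between URLs and line numbers concrete, here is a minimal Python sketch. It is not the code in extract-urls.sh. It assumes that the TMX stores each source URL in a `<prop type="source-document">` element inside the `<tuv>` for that language, that `xml:lang` holds the language code exactly as written in the file, and that translation units are numbered in file order starting from 1. The function name `extract_urls` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hypothetical TMX sample, only for illustration.
SAMPLE_TMX = b"""<?xml version="1.0"?>
<tmx version="1.4"><header/><body>
<tu tuid="1">
 <tuv xml:lang="en"><prop type="source-document">http://a.example/en</prop><prop type="source-document">http://b.example/en</prop><seg>Hello</seg></tuv>
 <tuv xml:lang="mt"><prop type="source-document">http://a.example/mt</prop><seg>Bongu</seg></tuv>
</tu>
<tu tuid="2">
 <tuv xml:lang="en"><prop type="source-document">http://a.example/en</prop><seg>Thanks</seg></tuv>
 <tuv xml:lang="mt"><prop type="source-document">http://a.example/mt</prop><seg>Grazzi</seg></tuv>
</tu>
</body></tmx>"""


def extract_urls(tmx_file, lang):
    """Map each URL to the 1-based numbers of the translation units it appears in."""
    url_to_lines = defaultdict(list)
    line_no = 0
    # Stream the file so that large TMX files need not be loaded at once.
    for _, elem in ET.iterparse(tmx_file, events=("end",)):
        if elem.tag != "tu":
            continue
        line_no += 1
        for tuv in elem.iter("tuv"):
            if tuv.get(XML_LANG) != lang:
                continue
            for prop in tuv.iter("prop"):
                if prop.get("type") == "source-document" and prop.text:
                    url_to_lines[prop.text.strip()].append(line_no)
        elem.clear()  # drop the children of the finished <tu> to save memory
    return dict(url_to_lines)


if __name__ == "__main__":
    for url, lines in extract_urls(io.BytesIO(SAMPLE_TMX), "en").items():
        print(url, lines)
```

In the sample, the first translation unit lists two URLs and one URL occurs in both units, which covers both directions of the mapping.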

Step 2: Run join

This step joins the extracted URLs with the {URL, document} pairs from the classified-fasttext data.

First, edit the RAWDATA_DIR variable in run-join.sh to point to your copy of classified-fasttext. Then run ./run-join.sh SRC TRG COLLECTION LANG, where COLLECTION is one of the collections in classified-fasttext, such as wide00006 or philipp, and LANG is either SRC or TRG. You can also loop over all the collections and both LANGs, but this step is very heavy and long-running.

For example:

```bash
./run-join.sh eng mlt wide00006 mlt
```

To run everything, run:

```bash
for collection in GWB-20191109192916 hieu marta philipp wide00006 wide00015 wide00016; do
    for lang in SRC TRG; do
        ./run-join.sh SRC TRG ${collection} ${lang}
    done
done
```

Remember that this will take a long time to run for most language pairs.
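Conceptually, the join in this step pairs every URL found in the TMX file with the document crawled from that URL. The Python sketch below illustrates the idea on in-memory data; it is not what run-join.sh does internally. The function name `join_urls_with_documents`, the input shapes, and the UTF-8 decoding are assumptions made for the example.

```python
import base64


def join_urls_with_documents(url_to_lines, url_doc_pairs):
    """Yield (line_number, url, document_text) for every URL present in both inputs.

    url_to_lines:  dict mapping a URL to the line numbers it is associated with.
    url_doc_pairs: iterable of (url, base64_encoded_document) pairs.
    """
    for url, b64_doc in url_doc_pairs:
        line_numbers = url_to_lines.get(url)
        if line_numbers is None:
            continue  # crawled document that no line in the corpus refers to
        text = base64.b64decode(b64_doc).decode("utf-8", errors="replace")
        for line_no in line_numbers:
            yield line_no, url, text


if __name__ == "__main__":
    encoded = base64.b64encode("Hello\nThanks".encode("utf-8")).decode("ascii")
    urls = {"http://a.example/en": [1, 2], "http://missing.example/en": [3]}
    docs = [("http://a.example/en", encoded), ("http://other.example/en", encoded)]
    for row in join_urls_with_documents(urls, docs):
        print(row)
```

URLs that have no matching document, like `http://missing.example/en` in the usage example, produce no output rows. This is why the process is lossy when parts of the raw crawl were not released.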

Step 3: Extract contexts

The main part of this process is highly parallelisable. get-contexts.sh can be run in two ways: locally or as a job array on a SLURM cluster.

Usage: ./get-contexts.sh [-n N_JOBS] [-s] [-a SLURM_ARGS...] [-c CONTEXT] [-f] SRC TRG

Arguments are:
* -s: Enables SLURM mode. Runs locally if not provided.
* -n N_JOBS: Number of parallel jobs if run locally, otherwise number of parallel jobs per SLURM array job. Default: 4.
* -a SLURM_ARGS: SLURM job arguments. Remember to wrap them in quotes. For example, -a "-A ACCOUNT -p PARTITION --nodes 1 --time 6:00:00"
* -c CONTEXT: Number of tokens per retrieved context (including special sentence delimiter tokens). Default: 512.
* -f: Force re-splitting of the joined files. Useful if splitting was interrupted.

NOTE: Each job will hold one side of the sentence-level parallel corpus in memory, so take that into account when choosing N_JOBS.
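One way to picture the -c token budget is the Python sketch below. It is an illustration only: how get-contexts.sh tokenises text and which part of the document it keeps are not described here, so whitespace tokens and "the lines preceding the matched sentence" are both assumptions, and `build_context` is a hypothetical name. The sketch does show how the delimiter tokens count towards the budget.

```python
DOCLINE = "<docline>"  # line-break token used in the output files


def build_context(doc_lines, sentence_index, max_tokens=512):
    """Return up to max_tokens tokens of text preceding doc_lines[sentence_index].

    Assumption: tokens are whitespace-separated, and each original line break
    becomes one DOCLINE token that counts towards the budget.
    """
    if max_tokens <= 0:
        return ""
    tokens = []
    for line in doc_lines[:sentence_index]:
        tokens.extend(line.split())
        tokens.append(DOCLINE)
    # Keep the tokens closest to the sentence when the budget is exceeded.
    return " ".join(tokens[-max_tokens:])


if __name__ == "__main__":
    document = ["a b", "c d e", "target sentence"]
    print(build_context(document, sentence_index=2, max_tokens=4))
```

With a budget of 4, the usage example keeps three word tokens and one delimiter token from the line closest to the target sentence.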

Output data

The final output files can be found in data/contexts_per_line/SRC-TRG.{SRC,TRG}.context512.per_line.gz. These are gzipped TSV files where the columns are line_number, URL, sentence, context. You can use the line numbers to match these with the lines from the original ParaCrawl TMX/TSV file. The context field has up to 1000 contexts (the same line may have come from many different sources) separated by ||| as a delimiter by default. Line breaks in the original context have been replaced by a special <docline> token.
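The output format described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: it takes the four tab-separated columns, the ||| delimiter and the <docline> token at face value, treats the line number as an integer, and leaves the URL column untouched. The function name `read_contexts` is hypothetical, and the file name in the usage example uses placeholder language codes.

```python
import gzip


def read_contexts(path, delimiter="|||", docline="<docline>"):
    """Yield (line_number, url, sentence, contexts) from a per_line.gz file.

    contexts is a list of strings, one per source document, with the
    <docline> token turned back into a real line break.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for row in handle:
            # Split on the first three tabs only; the rest is the context field.
            line_number, url, sentence, context = row.rstrip("\n").split("\t", 3)
            contexts = [
                part.strip().replace(docline, "\n")
                for part in context.split(delimiter)
            ]
            yield int(line_number), url, sentence, contexts


if __name__ == "__main__":
    import sys

    # Example: python read_contexts.py data/contexts_per_line/eng-mlt.mlt.context512.per_line.gz
    for line_number, url, sentence, contexts in read_contexts(sys.argv[1]):
        print(line_number, url, len(contexts), sentence[:60])
```

Matching the yielded line numbers against the lines of the ParaCrawl TSV file then recovers the sentence pair that each set of contexts belongs to.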

Owner

Name: Proyag Pal
Login: Proyag
Kind: user
Location: Edinburgh, Scotland
Company: @EdinburghNLP

Website: proyag.github.io
Twitter: ProyagPal
Repositories: 19
Profile: https://github.com/Proyag

PhD student at the University of Edinburgh. Working on machine translation

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Pal"
    given-names: "Proyag"
  - family-names: "Birch"
    given-names: "Alexandra"
  - family-names: "Heafield"
    given-names: "Kenneth"
title: "ParaCrawl-Context"
version: 1.0.0
date-released: 2024-05-22
url: "https://github.com/proyag/paracrawl-context"
preferred-citation:
  type: conference-paper
  authors:
    - family-names: "Pal"
      given-names: "Proyag"
    - family-names: "Birch"
      given-names: "Alexandra"
    - family-names: "Heafield"
      given-names: "Kenneth"
  # doi: "10.0000/00000"
  collection-title: "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics"
  conference:
    name: 62nd Annual Meeting of the Association for Computational Linguistics
    city: Bangkok
    country: Thailand
    date-start: 2024-08-11
    date-end: 2024-08-16
  month: 8
  # start: # First page number
  # end: # Last page number
  title: "Document-Level Machine Translation with Large-Scale Public Parallel Corpora"
  year: 2024

GitHub Events

Total

Watch event: 1
Fork event: 1

Last Year

Watch event: 1
Fork event: 1

Dependencies

requirements.txt pypi

lxml *
pgzip *
