https://github.com/amazon-science/idiom-mt


Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 16 MB
Statistics
  • Stars: 7
  • Watchers: 13
  • Forks: 2
  • Open Issues: 4
  • Releases: 0
Created over 3 years ago · Last pushed almost 3 years ago

README.md

Idiom-MT

If you use this repository for your research, please cite:

Automatic Evaluation and Analysis of Idioms in Neural Machine Translation. Christos Baziotis, Prashant Mathur, Eva Hasler. EACL 2023.

The goal of the project is to address the problem of literal translations in machine translation systems. This problem is particularly pronounced during the translation of idiomatic expressions, such as “couch potato” or “once in a blue moon”, which tend to be translated word-for-word into the target language. This project contains the following:

  1. Methods for targeted and automatic evaluation of idioms in context.
  2. Models that are more robust to literal translations.
  3. Analysis of translation models that explores how different models represent idiomatic expressions, by varying the available context, as well as how these different representations are reflected in the output (translation) of the machine translation system.

Project Structure

This is how the codebase is organized.

```text
Idiom-MT
├── analysis                  # Contains code and jupyter notebooks with exploratory analysis
├── checkpoints               # Contains the checkpoints of all pretrained and finetuned models
├── data                      # Contains the raw + preprocessed data used in our experiments
├── data-bin                  # Contains the data in binarized form for fairseq training
├── experiments               # Contains the scripts for launching experiments + their logs
├── literal_translatability   # The package for estimating the literal translatability in parallel data
├── metrics                   # The packages with the evaluation metrics developed as part of the project
├── phrase_extractor          # The package with the phrase-matching sentence extraction tool
├── prototype                 # Ignore: random scripts for prototyping models and ideas
├── tools                     # Third-party tools, such as Moses, fast_align etc.
├── user                      # The fairseq's --user-dir, which contains all our custom fairseq code (plugins+extensions)
├── utils                     # Helper scripts, such as for collecting results or filtering parallel data
├── tok                       # Contains the (sentencepiece) tokenized data (from ./data/)
└── vocab                     # Contains the sentencepiece models used for tokenization (./data/ --> ./tok/)
```

Prerequisites

Install Requirements

1. Create Environment (Optional): Ideally, you should create a dedicated environment for the project. Use python=3.7 because, at the time of writing, there are some issues that prevent remote development with PyCharm from macOS.

    conda create -n idiom-mt python=3.7
    conda activate idiom-mt

2. Install PyTorch (guide) with the desired CUDA version if you want to use the GPU:

    pip install torch torchvision torchaudio

IMPORTANT: for A100 do the following:

    conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

3. Install dependencies: Then, install the rest of the requirements:

pip install -r requirements.txt

Optional (you can skip this): If you want to install apex for faster fairseq training, you may encounter compilation issues. If the instructions in fairseq's repository don't work for you, you could try this command:

    CUDA_HOME=/usr/local/cuda-11.1 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

First check which CUDA version is installed in /usr/local/, and then use the one that matches your PyTorch installation. Oddly, for me (and others) it worked with cuda-11.1 even though I had PyTorch with cuda-11.0.

4. spaCy models: Then install the language support models for spaCy from https://spacy.io/usage/models. For example, to download the small English and Chinese web models and the small news models for the other languages, run:

```shell
for lang in en zh; do
  python -m spacy download ${lang}_core_web_sm
done

for lang in fr de el it pt es ru; do
  python -m spacy download ${lang}_core_news_sm
done
```

5. Third-party libraries: Finally, install the third-party libraries for preprocessing, such as Moses:

    bash install-tools.sh

(Optional) Test environment with mBART fine-tuning

First, download the EN-DE News Commentary v13 dataset:

    bash download_data_prototype.sh

Then, download mBART:

    bash download_mbart.sh

Then, preprocess the data for mBART finetuning:

    ./prepare_data_mbart_finetuning.sh \
      data/news_commentary_en_de_v13 \
      tok/news_commentary_en_de_v13_mbart \
      data-bin/news_commentary_en_de_v13_mbart \
      train dev test \
      en de \
      en_XX de_DE

Then, preprocess the data for randomly initialized NMT training:

    ./prepare_data_translation.sh \
      data/news_commentary_en_de_v13 \
      tok/news_commentary_en_de_v13_random \
      data-bin/news_commentary_en_de_v13_random \
      train dev test \
      en de

Reproduce the Experiments

(If you don't have time for reading this, check prepare_data_TLDR.sh)

Step 1. Download and prepare the training/dev/test splits

First, download and prepare the data that will be used for the NMT experiments. Use prepare_data.sh and specify the language pair and the idiom list. The idiom list determines both the idiom test data and the regular-vs-idiom training data splits.

    SRC_LANG=en
    TRG_LANG=fr
    IDIOMS_LIST=./data/idioms_data/all_idioms.en
    bash prepare_data.sh $SRC_LANG $TRG_LANG $IDIOMS_LIST
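
The directories that this step produces follow a naming pattern built from the language pair. The sketch below reproduces that pattern as inferred from the en-fr layout shown in this README. The function `derived_dirs` and its dictionary keys are hypothetical names used only for illustration, and the real script may build the names differently (for example, the `.idioms.en` suffix is assumed here to come from the source language).

```python
from pathlib import Path


def derived_dirs(src: str, trg: str, data_root: str = "data") -> dict:
    """Return the data directories listed in this README for one language pair.

    The pattern is inferred from the en-fr example; it is an assumption,
    not taken from prepare_data.sh itself.
    """
    root = Path(data_root)
    base = f"parallel_{src}_{trg}"
    return {
        "parallel": root / base,                          # downloaded + cleaned data
        "idiom_matches": root / f"{base}.idioms.{src}",   # phrase extractor output
        "splits": root / f"{base}.idioms.{src}.splits",   # ordinary vs special pairs
        "idiom_test": root / f"{base}.idiom_test",        # idiom test set
        "idiom_train": root / f"{base}.idiom_train",      # idiom training split
        "regular": root / f"{base}.regular",              # regular train/dev/test
    }


dirs = derived_dirs("en", "fr")
for name, path in dirs.items():
    print(f"{name:14s} {path.as_posix()}")
```

Printing the mapping for another pair, such as `derived_dirs("en", "de")`, is a quick way to check which directories to expect before running the script.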

Details of how the script works and how it organizes the data (You can skip this)

This script does the following:

  1. Downloads the (WMT) test sets and the (WMT/Europarl) parallel data into data/parallel_en_fr/.
  2. Filters the parallel data based on length and discards sentence pairs with very uneven lengths.

         data/parallel_en_fr
         ├── dev.en
         ├── dev.fr
         ├── test.en
         ├── test.fr
         ├── train.en
         ├── train.en.clean
         ├── train.fr
         └── train.fr.clean

  3. Extracts the sentence pairs from the parallel data that have an idiom on the source side, based on the given idiom list, and writes them to data/parallel_en_fr.idioms.en/. Check the phrase_extractor/phrase_extractor.py tool for details.

         data/parallel_en_fr.idioms.en
         ├── annotations.tsv
         ├── matches.log
         ├── samples
         ├── sentences.txt
         ├── spans.txt
         └── stats.txt

  4. Splits the parallel data (from 1.) into two groups in data/parallel_en_fr.idioms.en.splits/:

    • ordinary: these are pairs that don't have an idiom on the source side
    • special: these are pairs that DO have an idiom on the source side

     This step uses phrase_extractor/split_data_by_line_id.py.

         data/parallel_en_fr.idioms.en.splits
         ├── ordinary.en
         ├── ordinary.fr
         ├── special.en
         ├── special.fr
         └── special.spans.en

  5. Separates the special pairs into train and test splits. This step uses utils/train_test_split_idiom_pairs.py.

         data/parallel_en_fr.idioms.en.splits/splits
         ├── discarded.en
         ├── discarded.fr
         ├── discarded.spans.en
         ├── freqs.txt
         ├── test.en
         ├── test.fr
         ├── test.spans.en
         ├── train.en
         ├── train.fr
         └── train.spans.en

  6. Creates the idiom test set in data/parallel_en_fr.idiom_test, the idiom training split in data/parallel_en_fr.idiom_train, and the train/dev/test split with the regular parallel data in data/parallel_en_fr.regular, excluding the pairs that went into the idiom train/test data.

         data/parallel_en_fr.idiom_test
         ├── test.en
         ├── test.fr
         └── test.spans.en
         data/parallel_en_fr.idiom_train  # this can also be used for upsampling
         ├── train.en
         ├── train.fr
         └── train.spans.en
         data/parallel_en_fr.regular
         ├── dev.en
         ├── dev.fr
         ├── test.en
         ├── test.fr
         ├── train.en
         └── train.fr
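
To make the filtering and splitting steps above more concrete, here is a minimal Python sketch of the idea. It is not the repository's implementation: the thresholds `max_len` and `max_ratio` are assumed values, the function names are hypothetical, and a plain case-insensitive regular expression stands in for whatever matching the phrase_extractor tool performs.

```python
import re


def keep_pair(src, trg, max_len=100, max_ratio=3.0):
    """Length filter: drop empty, overlong, or very uneven pairs.

    Lengths are counted in whitespace tokens; both thresholds are assumptions.
    """
    n_src, n_trg = len(src.split()), len(trg.split())
    if n_src == 0 or n_trg == 0:
        return False
    if max(n_src, n_trg) > max_len:
        return False
    return max(n_src, n_trg) / min(n_src, n_trg) <= max_ratio


def find_idiom(sentence, idioms):
    """Return (start, end, idiom) for the first listed idiom found, else None."""
    for idiom in idioms:
        pattern = r"\b" + re.escape(idiom) + r"\b"
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            return match.start(), match.end(), idiom
    return None


def split_pairs(pairs, idioms):
    """Split filtered pairs into 'ordinary' (no idiom) and 'special' (idiom in source)."""
    ordinary, special = [], []
    for src, trg in pairs:
        if not keep_pair(src, trg):
            continue  # discarded by the length filter
        span = find_idiom(src, idioms)
        if span is None:
            ordinary.append((src, trg))
        else:
            special.append((src, trg, span))  # keep the character span of the idiom
    return ordinary, special


idioms = ["couch potato", "once in a blue moon"]
pairs = [
    ("He is such a couch potato .", "C'est un vrai pantouflard ."),
    ("The committee approved the report .", "La commission a approuvé le rapport ."),
    ("Yes .", "Oui , bien sûr , nous sommes tout à fait d'accord avec vous ."),
]
ordinary, special = split_pairs(pairs, idioms)
print(len(ordinary), "ordinary pair(s),", len(special), "special pair(s)")
```

In this toy input the third pair is dropped by the length filter because the target is far longer than the source, the first pair lands in the special group, and the second in the ordinary group.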

Step 2. Binarize the parallel data

This step binarizes the train/dev/test data for training with Fairseq. We need to binarize/preprocess the data in a different way for each training process.

NMT from random initialization. This involves training a joint sentencepiece model on the source+target training data, segmenting the text with that sentencepiece model and finally binarizing the data. See the source code for the script usage instructions.

```shell
# First, train SPM and binarize the regular training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.regular \
  ./tok/parallel_en_fr.regular.random \
  ./data-bin/parallel_en_fr.regular.random \
  train dev test en fr

# Next, reuse the pretrained SPM and binarize the idiom training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.idiom_train \
  ./tok/parallel_en_fr.idiom_train.random \
  ./data-bin/parallel_en_fr.idiom_train.random \
  train '' '' en fr \
  ./vocab/parallel_en_fr.regular
```

Also, create a split which contains both the regular and the idiom data, by symlinking them into data-bin/parallel_en_fr.regular+idiom.random.

    bash combine_data.sh \
      data-bin/parallel_en_fr.idiom_train.random \
      data-bin/parallel_en_fr.regular.random \
      data-bin/parallel_en_fr.regular+idiom.random \
      en fr
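
As an illustration of what combining two binarized directories by symlinking can look like, here is a small Python sketch. It assumes a particular naming scheme: files from the regular directory keep their names, and the idiom training files are linked as `train1.*` so that they do not collide. fairseq can read extra training shards named this way, but whether combine_data.sh uses this scheme is an assumption, and `combine_by_symlink` is a hypothetical helper, not part of the repository.

```python
import os
import tempfile
from pathlib import Path


def combine_by_symlink(idiom_dir, regular_dir, out_dir):
    """Symlink the files of both directories into out_dir.

    Assumed naming: regular files keep their names, idiom 'train.*' files
    become 'train1.*'. Other idiom files (e.g. dictionaries) are skipped
    because the regular directory already provides them.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(regular_dir).iterdir()):
        os.symlink(path.resolve(), out / path.name)
    for path in sorted(Path(idiom_dir).iterdir()):
        if path.name.startswith("train."):
            new_name = "train1." + path.name[len("train."):]
            os.symlink(path.resolve(), out / new_name)
    return sorted(p.name for p in out.iterdir())


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    regular = Path(tmp, "regular")
    idiom = Path(tmp, "idiom")
    regular.mkdir()
    idiom.mkdir()
    for name in ("train.en-fr.en.bin", "valid.en-fr.en.bin", "dict.en.txt"):
        (regular / name).touch()
    for name in ("train.en-fr.en.bin", "dict.en.txt"):
        (idiom / name).touch()
    names = combine_by_symlink(idiom, regular, Path(tmp, "combined"))
    print(names)
```

Symlinking avoids copying the binarized data, so the joint split costs almost no extra disk space.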

NMT from mBART initialization. This involves segmenting the text with mBART's sentencepiece model and then binarizing the data. See the source code for the script usage instructions.

```shell
# tokenize and binarize the different splits
for split in regular idiom_train; do
  bash prepare_data_mbart_finetuning.sh \
    ./data/parallel_en_fr.${split} \
    ./tok/parallel_en_fr.${split}.mbart \
    ./data-bin/parallel_en_fr.${split}.mbart \
    train dev test en_XX fr_XX
done

# symlink the regular and idiom data to create the joint split
bash combine_data.sh \
  data-bin/parallel_en_fr.idiom_train.mbart \
  data-bin/parallel_en_fr.regular.mbart \
  data-bin/parallel_en_fr.regular+idiom.mbart \
  en_XX fr_XX
```

Step 3. Launch the experiments

For running experiments, read the documentation in ./experiments/README.md.

Checkpoints and results

The checkpoints of each model are saved in the ./checkpoints/ directory. Inside each model's folder, you will find the checkpoints (last.pt and best.pt) for that particular model, and all the model outputs and scores. Here is an example of the structure for the model enfr_joint.random:

  • ./checkpoints/enfr_joint.random/: besides the checkpoints you will find the model outputs and scores for the generic MT eval
  • ./checkpoints/enfr_joint.random/parallel_en_fr.idiom_test/: model outputs and scores for the idiom-specific eval
  • ./checkpoints/enfr_joint.random/analysis/: model outputs and scores for all the analysis methods. The outputs.json contains all the results together.

Results

To collect all the results, run:

    bash collect_results.sh

It will save the results in .csv files, organized by language pair and experiments-vs-analysis:

  • enfr.results.analysis.csv
  • enfr.results.experiments.csv

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization
