https://github.com/amazon-science/idiom-mt
Repository
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 16 MB
Statistics
- Stars: 7
- Watchers: 13
- Forks: 2
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
Idiom-MT
If you use this repository for your research, please cite:
Automatic Evaluation and Analysis of Idioms in Neural Machine Translation. Christos Baziotis, Prashant Mathur, Eva Hasler. EACL 2023.
The goal of the project is to address the problem of literal translations in machine translation systems. This problem is particularly pronounced during the translation of idiomatic expressions, such as “couch potato” or “once in a blue moon”, which tend to be translated word-for-word into the target language. This project contains the following:
- Methods for targeted and automatic evaluation of idioms in context.
- Models that are more robust to literal translations.
- Analysis of translation models that explores how different models represent idiomatic expressions, by varying the available context, as well as how these different representations are reflected in the output (translation) of the machine translation system.
Project Structure
This is how the codebase is organized.
```text
Idiom-MT
├── analysis                  # Contains code and Jupyter notebooks with exploratory analysis
├── checkpoints               # Contains the checkpoints of all pretrained and finetuned models
├── data                      # Contains the raw + preprocessed data used in our experiments
├── data-bin                  # Contains the data in binarized form for fairseq training
├── experiments               # Contains the scripts for launching experiments + their logs
├── literal_translatability   # The package for estimating the literal translatability in parallel data
├── metrics                   # The packages with the evaluation metrics developed as part of the project
├── phrase_extractor          # The package with the phrase-matching sentence extraction tool
├── prototype                 # Ignore: random scripts for prototyping models and ideas
├── tools                     # Third-party tools, such as Moses, fast_align etc.
├── user                      # The fairseq --user-dir, which contains all our custom fairseq code (plugins+extensions)
├── utils                     # Helper scripts, such as for collecting results or filtering parallel data
├── tok                       # Contains the (sentencepiece) tokenized data (from ./data/)
└── vocab                     # Contains the sentencepiece models used for tokenization (./data/ --> ./tok/)
```
Prerequisites
Install Requirements
1. Create Environment (Optional): Ideally, you should create a dedicated environment
for the project. Use python=3.7 because, at the time of writing, there are
issues that otherwise prevent remote development with PyCharm from macOS.
conda create -n idiom-mt python=3.7
conda activate idiom-mt
2. Install PyTorch (see the official installation guide) with the desired CUDA version if you want to use the GPU:
shell
pip install torch torchvision torchaudio
IMPORTANT: for A100 do the following:
shell
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
3. Install dependencies: Then, install the rest of the requirements:
pip install -r requirements.txt
Not required (you can skip that): If you want to install apex for faster fairseq training, you may encounter compilation issues. If the instructions in fairseq's repository don't work for you, you could try this command:
text
CUDA_HOME=/usr/local/cuda-11.1 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Check first which CUDA version you have installed in /usr/local/ and then use the one
that matches your PyTorch installation. Weirdly, for me (and others)
it worked with cuda-11.1 even though I had PyTorch with cuda-11.0.
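The version check described above can be sketched as two small helpers. This is only an illustration: `list_cuda_versions` and `torch_cuda_version` are hypothetical names, not part of the repository, and the sketch assumes CUDA toolkits are installed as `cuda-<version>` directories under the given prefix.

```shell
# Hypothetical helper: list the CUDA toolkit versions found under a prefix
# (default: /usr/local), assuming directories named cuda-<version>.
list_cuda_versions() {
  local prefix="${1:-/usr/local}" dir
  for dir in "$prefix"/cuda-*; do
    # Skip non-directories and the unexpanded pattern when nothing matches.
    [ -d "$dir" ] && echo "${dir##*/cuda-}"
  done
  return 0
}

# Hypothetical helper: print the CUDA version PyTorch was built with
# (prints "None" for CPU-only builds).
torch_cuda_version() {
  python -c "import torch; print(torch.version.cuda)"
}
```

Running `list_cuda_versions` and `torch_cuda_version` side by side shows which `CUDA_HOME` candidates exist and which one is closest to your PyTorch build.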
4. spaCy models: Then install the language support models for spaCy from https://spacy.io/usage/models. For example, to download the small models for English, Chinese, and several European languages, run:
```shell
for lang in en zh; do
  python -m spacy download ${lang}_core_web_sm
done

for lang in fr de el it pt es ru; do
  python -m spacy download ${lang}_core_news_sm
done
```
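To check which of these models are still missing after the download, something like the following could be used. It is a sketch under assumptions: the helper names are hypothetical, the mapping covers only the languages listed above (web-trained small models for en and zh, news-trained small models for the rest), and it relies on spaCy models being installed as importable Python packages.

```shell
# Hypothetical helper: map a language code to the small spaCy model name
# used above (web-trained for en/zh, news-trained for the other languages).
spacy_model_name() {
  local lang="$1"
  case "$lang" in
    en|zh) echo "${lang}_core_web_sm" ;;
    *)     echo "${lang}_core_news_sm" ;;
  esac
}

# Hypothetical helper: print the models that cannot be imported
# in the current Python environment.
missing_spacy_models() {
  local lang model
  for lang in "$@"; do
    model="$(spacy_model_name "$lang")"
    python -c "import ${model}" 2>/dev/null || echo "$model"
  done
}
```

For example, `missing_spacy_models en zh fr de el it pt es ru` prints nothing once every model is installed.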
5. Third-party libraries:
Finally, install the third-party libraries for preprocessing, such as Moses:
shell
bash install-tools.sh
(Optional) Test environment with mBART fine-tuning
First, download the EN-DE News Commentary v13 dataset:
shell
bash download_data_prototype.sh
Then, download mBART:
shell
bash download_mbart.sh
Then, preprocess the data for mBART finetuning:
shell
./prepare_data_mbart_finetuning.sh \
data/news_commentary_en_de_v13 \
tok/news_commentary_en_de_v13_mbart \
data-bin/news_commentary_en_de_v13_mbart \
train dev test \
en de \
en_XX de_DE
Then, preprocess the data for randomly initialized NMT training:
shell
./prepare_data_translation.sh \
data/news_commentary_en_de_v13 \
tok/news_commentary_en_de_v13_random \
data-bin/news_commentary_en_de_v13_random \
train dev test \
en de
Reproduce the Experiments
(If you don't have time for reading this, check prepare_data_TLDR.sh)
Step 1. Download and prepare the training/dev/test splits
First, you need to download and prepare the data that will be used for the NMT
experiments. Use prepare_data.sh and just specify the language pair that you
want to use, and the idiom list, which is going to affect the idiom test data as
well as the regular-vs-idiom training data splits.
shell
SRC_LANG=en
TRG_LANG=fr
IDIOMS_LIST=./data/idioms_data/all_idioms.en
bash prepare_data.sh $SRC_LANG $TRG_LANG $IDIOMS_LIST
Details of how the script works and how it organizes the data (you can skip this)
This script does the following:
1. Downloads the (WMT) test sets and the (WMT/Europarl) parallel data into data/parallel_en_fr/, then filters the parallel data based on length and discards sentence pairs with very uneven lengths.
   data/parallel_en_fr
   ├── dev.en
   ├── dev.fr
   ├── test.en
   ├── test.fr
   ├── train.en
   ├── train.en.clean
   ├── train.fr
   └── train.fr.clean
2. Extracts sentence pairs from the parallel data that have an idiom on the source side, based on a given idiom list, into data/parallel_en_fr.idioms.en/. Check the phrase_extractor/phrase_extractor.py tool for details.
   data/parallel_en_fr.idioms.en
   ├── annotations.tsv
   ├── matches.log
   ├── samples
   ├── sentences.txt
   ├── spans.txt
   └── stats.txt
3. Splits the parallel data (from 1.) into two groups in data/parallel_en_fr.idioms.en.splits/:
   - ordinary: these are pairs that don't have an idiom on the source side
   - special: these are pairs that DO have an idiom on the source side
   This step uses phrase_extractor/split_data_by_line_id.py.
   data/parallel_en_fr.idioms.en.splits
   ├── ordinary.en
   ├── ordinary.fr
   ├── special.en
   ├── special.fr
   └── special.spans.en
4. Separates the special pairs into train and test splits. This step uses utils/train_test_split_idiom_pairs.py.
   data/parallel_en_fr.idioms.en.splits/splits
   ├── discarded.en
   ├── discarded.fr
   ├── discarded.spans.en
   ├── freqs.txt
   ├── test.en
   ├── test.fr
   ├── test.spans.en
   ├── train.en
   ├── train.fr
   └── train.spans.en
5. Creates the idiom test set in data/parallel_en_fr.idiom_test, the idiom training split in data/parallel_en_fr.idiom_train, and the train/dev/test split with the regular parallel data in data/parallel_en_fr.regular, excluding the pairs that went into the idiom train/test data.
   data/parallel_en_fr.idiom_test
   ├── test.en
   ├── test.fr
   └── test.spans.en
   data/parallel_en_fr.idiom_train # this can also be used for upsampling
   ├── train.en
   ├── train.fr
   └── train.spans.en
   data/parallel_en_fr.regular
   ├── dev.en
   ├── dev.fr
   ├── test.en
   ├── test.fr
   ├── train.en
   └── train.fr
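To make the ordinary-vs-special split more concrete, here is a minimal sketch of splitting a file by line id. It is not the repository's split_data_by_line_id.py: `split_by_line_id` is a hypothetical helper, and the input format (a file with one 1-based line number per line, naming the sentences that contain an idiom) is an assumption.

```shell
# Hypothetical sketch: split a text file into "special" lines (those whose
# 1-based line number appears in the ids file) and "ordinary" lines (the rest).
split_by_line_id() {
  local ids="$1" input="$2" special="$3" ordinary="$4"
  # Create/truncate both outputs so they exist even if one group is empty.
  : > "$special"
  : > "$ordinary"
  awk -v ids="$ids" -v special="$special" -v ordinary="$ordinary" '
    # Load the line ids of the idiom-bearing sentences.
    BEGIN { while ((getline id < ids) > 0) keep[id] = 1 }
    # Route each input line to one of the two output files.
    (FNR in keep) { print > special; next }
    { print > ordinary }
  ' "$input"
}
```

Calling the function once per language side with the same ids file (for example on train.en and then on train.fr) keeps the two sides of each pair aligned.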
Step 2. Binarize the parallel data
This step binarizes the train/dev/test data for training with Fairseq. We need to binarize/preprocess the data in a different way for each training process.
NMT from random initialization. This involves training a joint sentencepiece model on the source+target training data, segmenting the text with that sentencepiece model and finally binarizing the data. See the source code for the script usage instructions.
```shell
# First, train SPM and binarize the regular training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.regular \
  ./tok/parallel_en_fr.regular.random \
  ./data-bin/parallel_en_fr.regular.random \
  train dev test en fr

# Next, reuse the pretrained SPM and binarize the idiom training data
bash prepare_data_translation.sh \
  ./data/parallel_en_fr.idiom_train \
  ./tok/parallel_en_fr.idiom_train.random \
  ./data-bin/parallel_en_fr.idiom_train.random \
  train '' '' en fr \
  ./vocab/parallel_en_fr.regular
```
Also, create a split which contains both the regular and the idiom data, by
symlinking them into data-bin/parallel_en_fr.regular+idiom.random.
shell
bash combine_data.sh \
data-bin/parallel_en_fr.idiom_train.random \
data-bin/parallel_en_fr.regular.random \
data-bin/parallel_en_fr.regular+idiom.random \
en fr
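For intuition, a joint split built from symlinks could look roughly like the sketch below. This is not the actual combine_data.sh: `link_joint_split` and `abs_path` are hypothetical helpers, the file names assume the usual fairseq-preprocess layout (`train.<src>-<trg>.<lang>.bin/.idx`, `dict.<lang>.txt`), and the sketch assumes fairseq's translation task picks up additional training shards named train1, train2, and so on alongside train, which is worth verifying against your fairseq version.

```shell
# Hypothetical helper: absolute path of an existing file.
abs_path() {
  echo "$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
}

# Hypothetical sketch: expose two binarized datasets as one directory.
# Everything from the base dataset is linked under its own name; the extra
# dataset contributes only its training shard, linked as "train1.*".
link_joint_split() {
  local extra="$1" base="$2" out="$3" src="$4" trg="$5" f name
  mkdir -p "$out"
  for f in "$base"/*; do
    [ -e "$f" ] || continue
    ln -sfn "$(abs_path "$f")" "$out/$(basename "$f")"
  done
  for f in "$extra"/train."$src-$trg".*; do
    [ -e "$f" ] || continue
    name="$(basename "$f")"
    ln -sfn "$(abs_path "$f")" "$out/train1.${name#train.}"
  done
}
```

Symlinking avoids copying the binarized files, so the joint directory costs almost no extra disk space.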
NMT from mBART initialization. This involves segmenting the text with mBART's sentencepiece model and then binarizing the data. See the source code for the script usage instructions.
```shell
# tokenize and binarize the different splits
for split in regular idiom_train; do
  bash prepare_data_mbart_finetuning.sh \
    ./data/parallel_en_fr.${split} \
    ./tok/parallel_en_fr.${split}.mbart \
    ./data-bin/parallel_en_fr.${split}.mbart \
    train dev test en_XX fr_XX
done

# symlink the regular and idiom data to create the joint split
bash combine_data.sh \
  data-bin/parallel_en_fr.idiom_train.mbart \
  data-bin/parallel_en_fr.regular.mbart \
  data-bin/parallel_en_fr.regular+idiom.mbart \
  en_XX fr_XX
```
Step 3. Launch the experiments
For running experiments, read the documentation in ./experiments/README.md.
Checkpoints and results
The checkpoints of each model are saved in the ./checkpoints/ directory.
Inside each model's folder, you will find the checkpoints (last.pt and best.pt)
for that particular model, and all the model outputs and scores. Here is an
example of the structure for the model enfr_joint.random:
- ./checkpoints/enfr_joint.random/: besides the checkpoints, you will find the model outputs and scores for the generic MT eval.
- ./checkpoints/enfr_joint.random/parallel_en_fr.idiom_test/: model outputs and scores for the idiom-specific eval.
- ./checkpoints/enfr_joint.random/analysis/: model outputs and scores for all the analysis methods. The outputs.json contains all the results together.
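Given this layout, a quick sanity check can flag model folders that lack one of the two checkpoints. The helper below is a hypothetical sketch (it is not shipped with the repository) and assumes each model lives in its own subfolder of the checkpoints directory, with last.pt and best.pt directly inside it.

```shell
# Hypothetical helper: report model folders that lack last.pt or best.pt.
find_incomplete_models() {
  local root="${1:-./checkpoints}" dir ckpt
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    for ckpt in last.pt best.pt; do
      [ -f "$dir$ckpt" ] || echo "${dir%/}: missing $ckpt"
    done
  done
}
```

Running `find_incomplete_models ./checkpoints` before collecting results prints nothing when every model folder has both files.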
Results
To collect all the results, run:
shell
bash collect_results.sh
It will save the results in .csv files, organized by language pair and experiments-vs-analysis:
- enfr.results.analysis.csv
- enfr.results.experiments.csv
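Since the column layout of these files is not documented here, a first step is to look at the header row. The helper below is a hypothetical sketch that assumes a plain comma-separated header without quoted commas.

```shell
# Hypothetical helper: print the column names of a CSV file, one per line.
# Assumes the first row is a header and contains no quoted commas.
csv_columns() {
  head -n 1 "$1" | tr ',' '\n'
}
```

For example, `csv_columns enfr.results.experiments.csv` lists the available columns before you load the file into a spreadsheet or notebook.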
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 2
- Total pull requests: 8
- Average time to close issues: about 7 hours
- Average time to close pull requests: 7 days
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.63
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 6
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- leileilin (1)
Pull Request Authors
- dependabot[bot] (4)
- shuoyangd (2)