coppermt

[ACL 2021, Findings] Cognate Prediction Per Machine Translation

https://github.com/clefourrier/coppermt

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

acl2021 cognate-prediction cognates fairseq low-resource-languages low-resource-machine-translation machine-translation nmt smt
Last synced: 6 months ago

Repository

[ACL 2021, Findings] Cognate Prediction Per Machine Translation

Basic Info
  • Host: GitHub
  • Owner: clefourrier
  • License: gpl-3.0
  • Language: JavaScript
  • Default Branch: master
  • Homepage:
  • Size: 37.2 MB
Statistics
  • Stars: 10
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
acl2021 cognate-prediction cognates fairseq low-resource-languages low-resource-machine-translation machine-translation nmt smt
Created over 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

CopperMT - Cognate Prediction per MT

This repository contains the code for ACL 2021 Findings paper: Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?

Overview

We provide a pipeline, based on fairseq, to train bilingual or multilingual NMT models, with either pretraining or backtranslation. These can also be compared to SMT models (using MOSES). The scripts can be used as is to reproduce our results, or modified to fit your own analyses.

Our results on cognate prediction for some Romance languages, comparing a multilingual transformer and an RNN, are shown in the figure in the repository README.

Repository organisation

  • inputs
  • rawdata and splitdata (contain bilingual aligned datasets or monolingual source2source or target2target datasets)
    • parameters (parameter files for the models)
  • pipeline
  • data (extractor for EtymDB; dataraw works on raw aligned data, data on split aligned data)
    • neural_translation (scripts to use our Multi Encoder multi Decoder Architecture, MEDeA)
  • statistical_translation (scripts to use and finetune MOSES)
  • utils (for BLEU computation)
    • various mains and parameters.cfg

How to use

Remark: This code has been tested on Unix-like systems (macOS, Ubuntu, Manjaro).

1) Install the requirements

requirements.txt

You can create a virtualenv:

```bash
virtualenv -p python3 pyenv
source pyenv/bin/activate
pip install -r requirements.txt
```

If you intend to extract and phonetize data, please install espeak manually.

git submodules (optional)

If you want to do SMT (and use MOSES and mgiza) or extract cognate data yourself (using EtymDB), you will need the submodules. You can skip the MOSES install if you have already installed it elsewhere on your machine.

1) Install boost (>1.64), and follow the MOSES documentation regarding extra packages you might need depending on your distribution.

2) Initialize the submodules:

```bash
git submodule init
git submodule update
```

3) Finish the install of mgiza:

```bash
cd submodules/mgiza/mgizapp
cmake .
make
make install
```

The MGIZA binary and the script merge_alignment.py need to be copied into the binary directory that Moses will look up for word alignment tools (in our case, submodules/mgiza/mgizapp/bin):

```bash
cp scripts/merge_alignment.py bin/
```

4) Finish the install of moses:

```bash
cd ../../mosesdecoder
bjam -j4 -q -d2
```

(On macOS you might need to check out the clang-error branch and correct the errors during the bjam build.)

2) Edit the parameter files

Edit your parameter files, changing MEDeA_DIR to the path of your installation.
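As a hypothetical illustration of this step (only MEDeA_DIR is named in this README; every other key below is an assumed placeholder, not the repository's actual configuration schema), a parameters.cfg could look like:

```shell
# parameters.cfg -- illustrative sketch only.
# MEDeA_DIR is the variable named in the README; the rest are hypothetical.
MEDeA_DIR=/home/user/coppermt   # path of your CopperMT installation
SOURCE_LANG=es                  # hypothetical: source language code
TARGET_LANG=it                  # hypothetical: target language code
```

Check the parameter files shipped in inputs/parameters for the actual keys expected by the scripts.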

3) Reproduce the paper results: launch the scripts

```bash
cd pipeline
bash main_<your script of choice>.sh parameters.cfg
```

If you want to run the code on other datasets

You can choose to extract cognate data from EtymDB using pipeline/data/extractor_script_cognates.py, monolingual data from YaMTG using pipeline/data/extractor_script_monolingual.py, or use your own data.

If you use your own data, you will first need to phonetize it. In this paper, we used espeak text-to-speech, which is available for a number of widely spoken languages (and Latin, for example). If your languages are available, use code similar to what can be found in pipeline/data/extractor_script_monolingual.py, in the phonetize case. You will then need to tokenize the result (separate all phones with spaces, while keeping diacritics attached to the relevant phone, handling double consonants in a homogeneous fashion, etc.), for which there is a utility function clean_phones at line 113 of pipeline/data/management/from_file/utils.py. (This step is done automatically for EtymDB/YaMTG extraction.)
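The tokenization described above (space-separated phones with diacritics kept on their phone) can be sketched as follows. This is a simplified stand-in for illustration, not the repository's actual clean_phones implementation:

```python
import unicodedata

def tokenize_phones(ipa_word: str) -> str:
    """Split an IPA string into space-separated phones, keeping
    combining diacritics attached to the preceding phone.
    Simplified sketch; clean_phones in the repository handles more cases
    (e.g. homogeneous treatment of double consonants)."""
    phones = []
    for ch in ipa_word:
        # Combining marks (e.g. a nasalization tilde) stick to the previous phone
        if phones and unicodedata.combining(ch):
            phones[-1] += ch
        else:
            phones.append(ch)
    return " ".join(phones)
```

For example, `tokenize_phones("ka\u0303ta")` keeps the combining tilde on the vowel, yielding four phones rather than five characters.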

Then, you will need to create binary files so that fairseq can manage your data, which should be possible almost without modification using the script pipeline/neural_translation/data_preprocess.sh. (You will need to adapt the vocabulary size to your own vocabulary.)
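Binarization in fairseq is done with the standard fairseq-preprocess command; a minimal sketch of such an invocation is shown below (the language pair and paths are illustrative assumptions, not the script's actual values):

```shell
# Binarize space-separated phone data for fairseq
# (illustrative language pair and file prefixes)
fairseq-preprocess \
    --source-lang es --target-lang it \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --destdir data-bin/es-it \
    --workers 4
```

This expects plain-text files such as data/train.es and data/train.it containing one tokenized word per line, and writes the binarized dataset and dictionaries to the destination directory.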

Finally, pipeline/data/data_raw_shuffle_split.sh will shuffle your datasets to create random train/dev/test sets for you.
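The shuffle-and-split step amounts to the following, sketched here in Python (the function name, proportions, and seed are illustrative assumptions, not the shell script's actual behavior):

```python
import random

def shuffle_split(pairs, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle aligned word pairs and split them into random
    train/dev/test sets (illustrative proportions)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    n_dev = int(len(pairs) * dev_frac)
    n_test = int(len(pairs) * test_frac)
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test
```

Fixing the random seed makes the split reproducible across runs, which matters when comparing NMT and SMT models on the same data.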

Licence

All code here is mine (clefourrier) except for the spm_train.py script (in pipeline/neural_translation/), which comes from fairseq (under the MIT licence) and has been added here for convenience. My code is under the GNU GPL 3.

Attribution

If you use the code, models or algorithms, please cite:

```bibtex
@inproceedings{fourrier-etal-2021-cognate,
    title = "Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?",
    author = "Fourrier, Cl{\'e}mentine and Bawden, Rachel and Sagot, Beno{\^\i}t",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = "aug",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.75",
    doi = "10.18653/v1/2021.findings-acl.75",
    pages = "847--861",
}
```

Owner

  • Name: Clémentine Fourrier
  • Login: clefourrier
  • Kind: user
  • Location: France
  • Company: @huggingface

Researcher at 🤗

Citation (CITATION.bib)

@inproceedings{fourrier-etal-2021-cognate,
    title = "Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?",
    author = "Fourrier, Cl{\'e}mentine  and
      Bawden, Rachel  and
      Sagot, Beno{\^\i}t",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = "aug",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.75",
    doi = "10.18653/v1/2021.findings-acl.75",
    pages = "847--861",
}

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • fairseq *
  • jupyter *
  • matplotlib *
  • networkx *
  • numpy *
  • pandas *
  • sacrebleu *
  • sentencepiece *
  • sklearn *
  • torch *
  • torchvision *