https://github.com/baohaoliao/multillama

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BaohaoLiao
  • License: mit
  • Language: Ruby
  • Default Branch: main
  • Size: 25.4 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Code of conduct

README.md

# ALMA: Advanced Language Model-based translator

ALMA (Advanced Language Model-based TrAnslator) is a many-to-many LLM-based translation model, which adopts a new translation model paradigm: it begins with fine-tuning on monolingual data and is further optimized using high-quality parallel data. This two-step fine-tuning process ensures strong translation performance.

ALMA-R (NEW!) builds upon ALMA models with further LoRA fine-tuning using our proposed Contrastive Preference Optimization (CPO), as opposed to the Supervised Fine-tuning used in ALMA. CPO fine-tuning requires our triplet preference data for preference learning. ALMA-R can now match or even exceed GPT-4 or WMT winners!

The original ALMA repository can be found here.

News 🌟

⭐ Mar.22 2024 The CPO method is now merged into huggingface trl! See details here.

⭐ Jan.16 2024 ALMA-R is released! See our new paper for more details: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation.

⭐ Jan.16 2024 The ALMA paper: A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models has been accepted at ICLR 2024! Check out more details here!

Contents 📄

:star: Supports :star:
- AMD and Nvidia Cards
- Data Parallel Evaluation
- Also supports LLaMA-1, LLaMA-2, OPT, Falcon, BLOOM, MPT
- LoRA Fine-tuning
- Monolingual data fine-tuning, parallel data fine-tuning

Download ALMA(-R) Models and Dataset 🚀

We release six translation models presented in the paper:
- ALMA-7B
- ALMA-7B-LoRA
- ALMA-7B-R (NEW!): Further LoRA fine-tuning upon ALMA-7B-LoRA with contrastive preference optimization.
- ALMA-13B
- ALMA-13B-LoRA
- ALMA-13B-R (NEW!): Further LoRA fine-tuning upon ALMA-13B-LoRA with contrastive preference optimization (BEST MODEL!).

We have also provided the WMT'22 and WMT'23 translation outputs from ALMA-13B-LoRA and ALMA-13B-R in the outputs directory. These outputs also include our baseline outputs and can be used directly for subsequent evaluations.

Model checkpoints are released at huggingface:

| Models | Base Model Link | LoRA Link |
|:-------------:|:---------------:|:---------:|
| ALMA-7B | haoranxu/ALMA-7B | - |
| ALMA-7B-LoRA | haoranxu/ALMA-7B-Pretrain | haoranxu/ALMA-7B-Pretrain-LoRA |
| ALMA-7B-R (NEW!) | haoranxu/ALMA-7B-R (LoRA merged) | - |
| ALMA-13B | haoranxu/ALMA-13B | - |
| ALMA-13B-LoRA | haoranxu/ALMA-13B-Pretrain | haoranxu/ALMA-13B-Pretrain-LoRA |
| ALMA-13B-R (NEW!) | haoranxu/ALMA-13B-R (LoRA merged) | - |

Note that ALMA-7B-Pretrain and ALMA-13B-Pretrain are NOT translation models. They have only undergone stage 1 monolingual fine-tuning (20B tokens for the 7B model and 12B tokens for the 13B model) and should be used together with their LoRA models.

Datasets used by ALMA and ALMA-R are also released at huggingface now (NEW!):

| Datasets | Train / Validation | Test |
|:-------------:|:---------------:|:---------:|
| Human-Written Parallel Data (ALMA) | train and validation | WMT'22 |
| Triplet Preference Data | train | WMT'22 and WMT'23 |

A quick start for translating with our best system (ALMA-13B-R). An example of translating "我爱机器翻译。" into English:
```
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-13B-R", padding_side='left')

# Add the source sentence into the prompt template
prompt = "Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
```

The general translation prompt is:

    Translate this from <source language name> into <target language name>:
    <source language name>: <source language sentence>
    <target language name>:
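As a minimal sketch, the template can be filled in with a small helper. `build_prompt` is a hypothetical name and is not part of this repository. The wording follows the quick-start example, which uses "to" between the language names, whereas the general template is written with "into"; which of the two the models expect is not confirmed here.

```python
def build_prompt(src_lang: str, tgt_lang: str, sentence: str) -> str:
    """Fill the translation prompt template for one source sentence.

    Hypothetical helper: the wording mirrors the quick-start example.
    """
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {sentence}\n"
        f"{tgt_lang}:"
    )


prompt = build_prompt("Chinese", "English", "我爱机器翻译。")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hard-coded prompt from the quick start.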

Environment Setup 🔧

    conda create -n alma-r python=3.11
    conda activate alma-r

If you use Nvidia GPUs, install torch with CUDA 11.8:

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you use AMD GPUs, install torch with ROCm 5.6:

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6

Then install the other dependencies:

    bash install_alma.sh

Evaluation 💻

Evaluation on ALMA-13B-R

This is a quick start to evaluate our ALMA-13B-R model. To produce translation outputs for WMT'22 in both en→cs and cs→en directions, run the following command. (To evaluate WMT'23 instead, pass `--override_test_data_path haoranxu/WMT23-Test`; see `evals/alma_13b_r_wmt23.sh` for an example.)
```
accelerate launch --config_file configs/deepspeed_eval_config_bf16.yaml \
    run_llmmt.py \
    --model_name_or_path haoranxu/ALMA-13B-R \
    --do_predict \
    --low_cpu_mem_usage \
    --language_pairs en-cs,cs-en \
    --mmt_data_path ./human_written_data/ \
    --per_device_eval_batch_size 1 \
    --output_dir ./your_output_dir/ \
    --predict_with_generate \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --bf16 \
    --seed 42 \
    --num_beams 5 \
    --overwrite_cache \
    --overwrite_output_dir
```

The generated outputs will be saved in `your_output_dir`. The translation file for the en→cs direction is named `test-en-cs`, and the file for the cs→en direction is `test-cs-en`. We have prepared a bash script to make running the evaluation easier:
```
bash evals/alma_13b_r.sh ${your_output_dir} ${test_pairs}
```
The variable `${test_pairs}` denotes the translation directions you wish to evaluate. It supports testing multiple directions at once; for example, you can use `de-en,en-de,en-cs,cs-en`. Once the bash script finishes, both the BLEU scores and COMET results are displayed automatically.

Note that this will perform data-parallel evaluation supported by deepspeed: that is, placing a single full copy of your model onto each available GPU and splitting batches across GPUs to evaluate on K GPUs K times faster than on one. For those with limited GPU memory, we offer an alternative method. The user can pass --multi_gpu_one_model to run the process by distributing a single model across multiple GPUs. Please see evaluation examples in evals/alma_13b_r.sh or evals/*no_parallel files.
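The idea behind data-parallel evaluation can be illustrated with a toy sketch in plain Python. This is not how deepspeed or accelerate implement it; `shard` and `gather` are hypothetical helpers that only show how a test set might be split across K workers and the per-worker results merged back into the original order.

```python
from typing import List, Sequence


def shard(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the strided slice of items handled by one worker."""
    return list(items[rank::world_size])


def gather(shards: Sequence[Sequence[str]]) -> List[str]:
    """Interleave per-worker results back into the original order."""
    world_size = len(shards)
    total = sum(len(part) for part in shards)
    merged: List[str] = [""] * total
    for rank, part in enumerate(shards):
        merged[rank::world_size] = list(part)
    return merged


sentences = [f"sentence {i}" for i in range(10)]
num_gpus = 4  # K in the text above; an illustrative value
shards = [shard(sentences, rank, num_gpus) for rank in range(num_gpus)]
restored = gather(shards)
print([len(part) for part in shards])
```

Each worker sees roughly 1/K of the data, which is where the idealized K-fold speed-up comes from; every worker still needs enough memory for a full copy of the model.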

Few-Shot In-Context Learning

To append examples to the prompt, pass the `--few_shot_eval_path` flag and specify the location of the shot files. As a demonstration, you can execute the following command:

    bash evals/llama-2-13b-5-shot.sh ${OUTPUT_DIR} ${test_pairs}

Training 🔥

Here we show how to:
- (NEW!) run Contrastive Preference Optimization upon ALMA models (ALMA→ALMA-R)
- fine-tune LLaMA-2-7B on monolingual OSCAR data (stage 1)
- fine-tune on human-written parallel data once stage 1 is completed, including full-weight and LoRA fine-tuning (stage 2)

CPO Fine-Tuning

To run the CPO fine-tuning with our triplet preference data, run the following command:

    bash runs/cpo_ft.sh ${your_output_dir}

OSCAR Monolingual Fine-Tuning

To execute the OSCAR monolingual fine-tuning, use the following command:

    bash runs/mono_ft.sh ${your_output_dir}

Parallel Data Fine-Tuning (Full-Weight)

Once the monolingual data fine-tuning is complete, proceed to the parallel data fine-tuning using the full-weight approach. Execute the following command:

    bash runs/parallel_ft.sh ${your_output_dir} ${training_pairs}

where `training_pairs` lists the translation directions to train on. The default is all 10 directions: `de-en,cs-en,is-en,zh-en,ru-en,en-de,en-cs,en-is,en-zh,en-ru`.
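For a custom subset of directions, the comma-separated argument can be assembled programmatically. The sketch below is only an illustration: the variable names are hypothetical, and it simply reproduces the default string of 10 directions listed above.

```python
# Language codes paired with English in the default setup.
languages = ["de", "cs", "is", "zh", "ru"]

# X→en directions first, then en→X, matching the default order above.
to_english = [f"{lang}-en" for lang in languages]
from_english = [f"en-{lang}" for lang in languages]

training_pairs = ",".join(to_english + from_english)
print(training_pairs)
```

Dropping a language code from `languages` yields the corresponding shorter list of directions to pass to the script.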

Parallel Data Fine-Tuning (LoRA)

In Stage 2, there's also an option to employ LoRA for fine-tuning on the parallel data. To do so, execute the following command:

    bash runs/parallel_ft_lora.sh ${your_output_dir} ${training_pairs}

Data Information 💾

The human-written training dataset, along with the WMT'22 test dataset, can be found in the `human_written_data` directory. Within this directory, there are five subfolders, each representing one of the five language pairs. Each of these subfolders contains the training, development, and test sets for its respective language pair.
```
-deen
    -train.de-en.json
    -valid.de-en.json
    -test.de-en.json
    -test.en-de.json
    ....
-csen
-isen
-ruen
.....
```

The data format in json files must be:
```
{
  "translation":
  {
      "src(de)": "source sentence",
      "tgt(en)": "target sentence"
  }
}
```
Within this directory, there are two additional subfolders specifically designed for few-shot in-context learning:
- `Filtered-5-shot`: This contains the "filtered shots" as referenced in the paper.
- `HW-5-shot`: This contains the "randomly extracted human-written data" mentioned in the paper.
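A record in this format might be produced as sketched below. This reads `src(de)` and `tgt(en)` as placeholders for plain language-code keys such as `de` and `en`, which is an assumption; check it against the files in `human_written_data` before relying on it. `make_record` is a hypothetical helper, and whether the files hold one record per line or a single JSON document is not stated here.

```python
import json


def make_record(src_lang: str, tgt_lang: str, src_text: str, tgt_text: str) -> dict:
    """Build one translation record keyed by language code (assumed layout)."""
    return {"translation": {src_lang: src_text, tgt_lang: tgt_text}}


record = make_record("de", "en", "Guten Morgen.", "Good morning.")

# ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```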

FAQs ❓

What language directions do ALMA and ALMA-R support?

Currently, ALMA supports 10 directions: English↔German, English↔Czech, English↔Icelandic, English↔Chinese, English↔Russian. However, it may surprise us in other directions :)

When should I stop fine-tuning at stage 1?

Our 7B and 13B models are trained on 20B and 12B tokens, respectively. However, as indicated in the paper, fine-tuning on 1B tokens should already boost performance substantially. The number of steps required to train on 1 billion tokens depends on your batch size. In our case, the batch size is calculated as follows: 16 GPUs * 4 (batch size per GPU) * 4 (gradient accumulation steps) = 256. With a sequence length of 512, we need roughly 8,000 steps to train on 1 billion tokens: 10^9 / (256 * 512) ≈ 7,600 steps. You may choose to fine-tune for more steps to get better performance.
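The same arithmetic can be adapted to other setups with a few lines of Python. This is a sketch under the assumption that every sequence is packed to the full sequence length; `steps_for_tokens` is a hypothetical helper, not a function from this repository.

```python
import math


def steps_for_tokens(
    target_tokens: int,
    num_gpus: int,
    per_gpu_batch: int,
    grad_accum: int,
    seq_len: int,
) -> int:
    """Optimizer steps needed to see target_tokens, assuming full sequences."""
    sequences_per_step = num_gpus * per_gpu_batch * grad_accum
    tokens_per_step = sequences_per_step * seq_len
    return math.ceil(target_tokens / tokens_per_step)


# Setup described above: 16 GPUs * 4 per GPU * 4 accumulation steps = 256.
steps = steps_for_tokens(10**9, num_gpus=16, per_gpu_batch=4, grad_accum=4, seq_len=512)
print(steps)
```

With fewer GPUs or a smaller per-GPU batch, the step count grows proportionally for the same token budget.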

How to decide the interleave probability at stage 1?

Please find the reasons for interleave probability selection for stage 1 in Appendix D.1 in the paper!

Reference

Please find more details for ALMA models in our paper or the summary of the paper.

    @misc{xu2023paradigm,
      title={A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models},
      author={Haoran Xu and Young Jin Kim and Amr Sharaf and Hany Hassan Awadalla},
      year={2023},
      eprint={2309.11674},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

Please also find more detailed information for the ALMA-R model with Contrastive Preference Optimization in the paper.

    @misc{xu2024contrastive,
      title={Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation},
      author={Haoran Xu and Amr Sharaf and Yunmo Chen and Weiting Tan and Lingfeng Shen and Benjamin Van Durme and Kenton Murray and Young Jin Kim},
      year={2024},
      eprint={2401.08417},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

Owner

  • Name: baohao
  • Login: BaohaoLiao
  • Kind: user
  • Location: Netherlands
  • Company: University of Amsterdam

PhD candidate @ltl-uva for NLP
