https://github.com/google-research/mrl_eval
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ✓ Committers with academic emails (1 of 4 committers, 25.0%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 6.5%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: google-research
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 130 KB
Statistics
- Stars: 7
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MRLEval - a benchmark for morphologically rich languages
Note: This is not an officially supported Google product.
Introduction
This repository contains code for downloading and preprocessing data, fine-tuning models, running inference, and evaluating predictions on a range of natural language tasks in Hebrew, Modern Standard Arabic, and Levantine Arabic. The tasks are detailed below.
Scripts are included to fine-tune encoder-decoder models and decoder LLMs and to generate test set predictions, using both Hugging Face Transformers and T5X. An evaluation script calculates performance metrics from these predictions. Baseline results on all tasks using mT5-XL are provided.
Tasks
The following tasks are supported:
| Language | Name | Task | Metric | Paper / Page |
|------------------------|--------------------|----------------------------|-----------|------------------------------------------------------------------------------------------------------------|
| Hebrew | HeQ | Question Answering | TLNLS | Paper |
| Hebrew | HeQ-QG | Question Generation | Rouge | Paper |
| Hebrew | HeSum | Summarization | Rouge | Paper |
| Hebrew | HebSummaries | Summarization | Rouge | Page |
| Hebrew | HeSentiment | Sentiment Analysis | Macro F1 | Page |
| Hebrew | Nemo-Token | NER (token level) | F1 | Paper |
| Hebrew | Nemo-Morph | NER (morph level) | F1 | Paper |
| Hebrew | HebNLI | Natural Language Inference | Macro F1 | Page |
| Hebrew | HebCo | Coreference Resolution | Macro F1 | Page |
|                         |                    |                            |           |                                                                                                              |
| Modern Standard Arabic | ArQ-MSA-QA | Question Answering | TLNLS | Page |
| Modern Standard Arabic | ArQ-MSA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | ArTyDiQA-QA | Question Answering | TyDiQA-F1 | Page |
| Modern Standard Arabic | ArTyDiQA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | IAHLT-NER | Named Entity Recognition | F1 | Page |
| Modern Standard Arabic | ArXLSum | Summarization | Rouge | Page |
| Modern Standard Arabic | ArabicNLI | Natural Language Inference | Macro F1 | Page |
|                         |                    |                            |           |                                                                                                              |
| Levantine Arabic | ArSentiment | Sentiment Analysis | Macro F1 | Page |
| Levantine Arabic | ArCoref | Coreference | Macro F1 | Page |
| Levantine Arabic | ArQ-Spoken-QA | Question Answering | TLNLS | Page |
| Levantine Arabic | ArQ-Spoken-QG | Question Generation | Rouge | Page |
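Several of the generation tasks above (question generation, summarization) are scored with Rouge. As a rough illustration of that metric, the rouge_score package from this repository's dependencies can be used as follows; the strings and scorer settings here are illustrative, not the repository's exact evaluation configuration:
```python
# Illustrative Rouge computation with the rouge_score package (listed in
# the dependencies). The repository's own evaluation code may configure
# the scorer differently, e.g. for Hebrew/Arabic tokenization.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "the model produced a short summary of the article"
prediction = "the model wrote a short summary of the article"

for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```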
Setup
Note that this package requires Python 3.10 or higher.
First, clone the repository:
```bash
git clone https://github.com/google-research/mrl_eval.git
```
Then, install the requirements, preferably in a new virtual environment.
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
Data
Download and preprocess raw data for all tasks:
```bash
bash mrl_eval/datasets/download_raw_data.sh
bash mrl_eval/datasets/ingest_all_datasets.sh
```
Evaluation
To evaluate the score of model predictions, run:
```bash
python -m mrl_eval.evaluation.evaluate --dataset {dataset} --predictions_path path/to/prediction/file
```
The options for dataset are:
- heq
- heq_question_gen
- hesum
- hebsummaries
- hesentiment
- nemo_token
- nemo_morph
- hebnli
- hebco
- arabic_nli
- arq_MSA
- arq_MSA_question_gen
- arq_spoken
- arq_spoken_question_gen
- arsentiment
- arcoref
- artydiqa
- artydiqa_question_gen
- ar_xlsum
- iahlt_ner
Your predictions file is expected to be a jsonl file in the following format:
```json
{"input": {"id": "example_id_1"}, "prediction": "prediction1"}
{"input": {"id": "example_id_2"}, "prediction": "prediction2"}
...
```
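For instance, a minimal Python sketch for producing a file in this format (the ids, outputs, and file name below are made up):
```python
# Sketch: write model predictions as jsonl in the format expected by
# mrl_eval.evaluation.evaluate. The example ids and outputs are placeholders.
import json

predictions = [
    ("example_id_1", "prediction1"),
    ("example_id_2", "prediction2"),
]

with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for example_id, prediction in predictions:
        record = {"input": {"id": example_id}, "prediction": prediction}
        # ensure_ascii=False keeps Hebrew/Arabic text readable in the file.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```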
Baseline results
We fine-tune an mT5-XL model per task as a first baseline. Results are shown in the table below.
| Language               | Model  | Task          | Metric           | Value              |
|------------------------|--------|---------------|------------------|--------------------|
| Hebrew                 | mT5-XL | HeQ           | TLNLS            | 87.1               |
| Hebrew                 | mT5-XL | HeQ-QG        | R1/R2/RL         | 40.2 / 22.0 / 39.7 |
| Hebrew                 | mT5-XL | HeSum         | R1/R2/RL         | 17.9 / 7.2 / 15.0  |
| Hebrew                 | mT5-XL | HebSummaries  | R1/R2/RL         | 23.9 / 10.1 / 16.6 |
| Hebrew                 | mT5-XL | NEMO          | Token / Morph F1 | 86.3 / 84.8        |
| Hebrew                 | mT5-XL | Sentiment     | Macro F1         | 85.0               |
| Hebrew                 | mT5-XL | HebNLI        | Macro F1         | 84.6               |
| Hebrew                 | mT5-XL | HebCo         | Macro F1         | 49.3               |
|                        |        |               |                  |                    |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QA    | TLNLS            | 79.5               |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QG    | R1/R2/RL         | 35.8 / 17.2 / 35.5 |
| Modern Standard Arabic | mT5-XL | ArTyDi-QA     | TyDiQA-F1        | 87.4               |
| Modern Standard Arabic | mT5-XL | ArTyDi-QG     | R1/R2/RL         | 60.6 / 44.1 / 60.5 |
| Modern Standard Arabic | mT5-XL | IAHLT-NER     | Token F1         | 64.6               |
| Modern Standard Arabic | mT5-XL | ArabicNLI     | Macro F1         | 82.2               |
| Modern Standard Arabic | mT5-XL | ArXLSum       | R1/R2/RL         | 26.5 / 11.4 / 23.4 |
|                        |        |               |                  |                    |
| Levantine Arabic       | mT5-XL | ArSentiment   | Macro F1         | 71.2               |
| Levantine Arabic       | mT5-XL | ArCoref       | Macro F1         | 50.1               |
| Levantine Arabic       | mT5-XL | ArQ-Spoken-QA | TLNLS            | 81.8               |
| Levantine Arabic       | mT5-XL | ArQ-Spoken-QG | R1/R2/RL         | 35.6 / 16.6 / 35.3 |
Fine-tuning and inference
Hugging Face
To finetune on a specific dataset:
```bash
python -m mrl_eval.hf.finetune --dataset {dataset}
```
By default, this will train mT5-XL. To train a different model (e.g. a decoder LLM), specify its HF model name as follows:
```bash
python -m mrl_eval.hf.finetune --dataset {dataset} --model "google/gemma-2-9b"
```
Decoder models are trained by default with LoRA in half precision; a rough sketch of that setup follows.
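As an illustration only (the exact target modules and hyperparameters used by mrl_eval.hf.finetune may differ; the values below are assumptions), a LoRA setup with the peft library looks roughly like this:
```python
# Illustrative LoRA configuration with peft; not the repository's exact setup.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.bfloat16,  # half precision
)
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```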
The options for dataset are the same as above.
Once the training is done, the script will print the path to the best checkpoint.
To generate responses for the inputs of the test set:
```bash
python -m mrl_eval.hf.generate --dataset {dataset} --checkpoint_path path/to/checkpoint
```
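Under the hood, generation from a fine-tuned encoder-decoder checkpoint amounts to something like the following sketch (the checkpoint path, input text, and generation settings are placeholders, not the script's exact behavior):
```python
# Sketch: manual generation from a fine-tuned encoder-decoder checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint_path = "path/to/checkpoint"  # printed by the finetune script
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)

inputs = tokenizer("some task input text", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```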
T5X
Establishing a GCP project
First, follow the guidelines at XManager for setting up a Google Cloud project. You will be using two cloud resources: a bucket for storing your training outputs (logs, model checkpoints) and a Compute Engine VM where you will run the project. We will be using the bucket path in the training and inference scripts. Follow the instructions at T5X to request an appropriate VM. We will be setting up the project environment inside this VM.
Setting up MRLEval in GCP
Second, build the environment inside your Compute Engine VM. All of the following should happen from your GCP VM:
1. Follow the instructions to install T5X as well as XManager.
We will be using the path to the cloned T5X repo in the training and inference scripts.
2. Clone MRLEval
- No need to install the requirements; this will be handled implicitly by XManager via the fine-tune and inference script arguments.
3. Run Data Ingestion.
Download and preprocess raw data for all tasks (note the save_tfrecord flag):
```bash
bash mrl_eval/datasets/download_raw_data.sh
bash mrl_eval/datasets/ingest_all_datasets.sh save_tfrecord
```
At this point your project structure will be similar to:
```
${HOME}
└── some_dir
    ├── main_project_dir
    │   ├── mrl_eval        # where you cloned mrl_eval
    │   └── mrl_eval_data   # data outputs, created during data ingestion (step 3)
    └── cloned_t5x_repo     # where you cloned t5x
```
- It is important that your ingested datasets are located in the data directory that shares the root main_project_dir with the cloned mrl_eval repo. This happens on its own when ingesting the data (step 3).
- The naming in the following section refers to this example.
4. Define the following variables before running the scripts:
```bash
export GOOGLE_CLOUD_BUCKET_NAME=<your_bucket_name>  # without the gs:// prefix
export PROJECT_DIR_ROOT=${HOME}/some_dir/main_project_dir
export T5X_DIR=${HOME}/some_dir/cloned_t5x_repo
```
5. Finetuning mT5-XL and Running Inference
The finetuning script expects two arguments: your name for the experiment and a path to a gin configuration defining the training on a given task. All finetune and inference configurations for mT5-XL can be found under mrl_eval/models/gin/finetune_gin_configs and mrl_eval/models/gin/inference_gin_configs respectively.
To finetune mT5-XL on a given task, e.g. summarization (hesum), run:
```bash
cd ${HOME}/some_dir/main_project_dir/scripts
sh xm_finetune.sh <your_chosen_name_for_the_experiment> mrl_eval/models/gin/finetune_gin_configs/finetune_mt5_xl_hesum.gin
```
Similarly, to run inference on a checkpoint in your bucket (checkpoints are saved to your bucket), run:
```bash
cd ${HOME}/some_dir/main_project_dir/scripts
sh xm_infer.sh <your_chosen_name_for_the_inference> <task_eval_gin> <the_path_to_the_checkpoint>
```
e.g. to evaluate the hesum checkpoint at gs://my_bucket/t5x/hesum_exp/20240722/logs/checkpoint_1004096, run:
```bash
cd ${HOME}/some_dir/main_project_dir/scripts
sh xm_infer.sh infer_mt5xl_hesum mrl_eval/models/gin/inference_gin_configs/eval_mt5_xl_hesum.gin gs://my_bucket/t5x/hesum_exp/20240722/logs/checkpoint_1004096
```
Owner
- Name: Google Research
- Login: google-research
- Kind: organization
- Location: Earth
- Website: https://research.google
- Repositories: 226
- Profile: https://github.com/google-research
GitHub Events
Total
- Watch event: 2
- Member event: 1
- Push event: 5
- Fork event: 1
Last Year
- Watch event: 2
- Member event: 1
- Push event: 5
- Fork event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Uri Shaham | u****m@g****m | 4 |
| Guy Mor-Lan | g****r@m****l | 4 |
| dependabot[bot] | 4****] | 1 |
| Uri Shaham | u****1@e****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 3 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 3 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
Pull Request Authors
- dependabot[bot] (1)
Dependencies
- Levenshtein *
- accelerate *
- immutabledict *
- pandas *
- rich *
- rouge_score *
- sentencepiece *
- tensorflow ==2.12.1
- tqdm *
- transformers *