https://github.com/amazon-science/memerag
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 13.9%, to scientific vocabulary)
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Authors: María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Overview
This repository contains the MEMERAG dataset. Its intended uses are:
1. Model selection: evaluate and compare different LLMs for their effectiveness as judges in the "LLM-as-a-judge" setting.
2. Prompt selection: optimize prompts for LLMs acting as judges in RAG evaluation tasks.
This is a meta-evaluation benchmark. Data in the benchmark should not be used to train models.
Datasets
We provide three variants of the meta-evaluation datasets:
1. MEMERAG (data/memerag/): The original dataset
2. MEMERAG-EXT (data/memerag_ext/): Extended dataset with faithfulness and relevance labels from 5 human annotators
3. MEMERAG-EXT with Majority Voting (data/memerag_ext_w_majority_voting/): Contains labels obtained through majority voting over the 5 annotations (a small aggregation sketch follows this list).
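The majority-voting variant collapses the five per-sentence annotations into a single label. The sketch below shows one way such an aggregation can be computed; the function name and the tie-breaking behaviour are illustrative assumptions, not taken from the repository's preprocessing code.

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most frequent of the five annotator labels.

    Illustrative only: the repository's actual aggregation (including
    how ties are resolved) may differ.
    """
    return Counter(labels).most_common(1)[0][0]

# Example: five faithfulness annotations for one answer sentence
votes = ["Supported", "Supported", "Not Supported",
         "Supported", "Challenging to determine"]
print(majority_vote(votes))  # -> Supported
```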
Each dataset provides RAG meta-evaluation support for English (EN), German (DE), Spanish (ES), French (FR), and Hindi (HI), and is distributed in JSONL format with the following structure (a short loading sketch follows the field list):
- queryid: Unique identifier for a query; this is the same as the MIRACL query ID.
- query: The actual question asked
- context: Passages used for generating the answer
- answer: List of dictionaries, each containing:
  - sentence_id: The ID of the sentence
  - sentence: A sentence from the answer
  - finegrainedfactuality: Fine-grained faithfulness label
  - factuality: Faithfulness label, one of Supported, Not Supported, and Challenging to determine
  - relevance: Label representing answer relevance
  - comments: Notes by annotators
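To make the field layout concrete, the following sketch reads one record from a dataset file and walks over the per-sentence labels. The file name en.jsonl is an assumed placeholder for whatever per-language file sits under data/memerag/, and the key spellings should be double-checked against the actual data files.

```python
import json
from pathlib import Path

# Assumed file name; check the actual per-language layout under data/memerag/.
path = Path("data/memerag/en.jsonl")

with path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Key names follow the field list above; verify them against the data.
        print(record["queryid"], "-", record["query"])
        for sent in record["answer"]:
            print(f'  [{sent["sentence_id"]}] {sent["factuality"]}: {sent["sentence"]}')
        break  # inspect only the first record
```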
Setup
- Create a new virtual environment with Python 3.10:

```shell
conda create -n yourenvname python=3.10
```

- Install dependencies:

```shell
pip install -r requirements.txt
```
Run the benchmark
The Python script in src/run_benchmark.py will run the automated evaluator for a specified prompt and language combination on the MEMERAG dataset.
Usage
```shell
python run_benchmark.py [arguments]
```
Required Arguments
--lang: Target language for evaluation (choices: 'en', 'es', 'de', 'fr', 'hi')
--dataset_name: The dataset you would like to use. Valid values are `memerag` and `memerag_ext_w_majority_vote`
--model_id: LLM model to use (choices: "gpt4o_mini", "llama3_2-90b", "llama3_2-11b", "qwen_2_5-32B")
--sys_prompt_path: Path to the system prompt template file. A sample system prompt can be found in the *prompts* directory
--task_prompt_path: Path to the task prompt template file. A sample task prompt can be found in the *prompts* directory
Optional Arguments
--temperature: Temperature parameter for LLM (default: 0.1)
--top_p: Top-p sampling parameter (default: 0.1)
--bedrock_region: AWS region for Bedrock models
--aws_profile: AWS profile name
--num_proc: Number of parallel LLM API calls (default: 4)
--num_retries: Number of retry attempts for LLM calls (default: 6)
Supported Evaluator Models
GPT-4o Mini (`gpt4o_mini`)
- Requires setting the `OPENAI_API_KEY` environment variable.

Llama 3.2 11B and 90B (`llama3_2-11b`, `llama3_2-90b`)
- Can be used via AWS Bedrock. Please provide appropriate AWS credentials in `~/.aws/credentials`.

Qwen 2.5 32B (`qwen_2_5-32B`)
- Can be used via the vLLM inference server.
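The README does not document here how the benchmark script locates a running vLLM server, so a quick connectivity check with the `openai` client (already in the dependency list) can help before launching a run. The base URL and served model name below are assumptions about a typical local vLLM deployment; adjust them to your setup.

```python
from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint serving Qwen 2.5 32B;
# change base_url and model to match how the server was launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```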
Example
Run Llama 3.2 90B as an automatic evaluator with the AG + COT prompt.
```shell
cd src
python run_benchmark.py --lang de --dataset_name memerag --model_id llama3_2-90b \
    --sys_prompt_path prompts/ag_cot/sys_prompt.md --task_prompt_path prompts/ag_cot/task_prompt.md \
    --num_proc 5
```
The script prints balanced accuracy and saves judgments to llm_judged_dataset.csv in the specified output directory.
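Balanced accuracy averages per-class recall, so it is not inflated when one faithfulness label dominates the data. A minimal sketch of recomputing such a score from the saved judgments with scikit-learn (already a dependency) follows; the column names are assumptions, not the file's documented schema.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Column names are hypothetical; inspect llm_judged_dataset.csv for the
# actual schema written by run_benchmark.py.
df = pd.read_csv("llm_judged_dataset.csv")
score = balanced_accuracy_score(df["human_label"], df["llm_label"])
print(f"Balanced accuracy: {score:.3f}")
```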
Citation
If you found the benchmark useful, please consider citing our work.
```bibtex
@misc{Cruz2025,
      title={MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation},
      author={María Andrea Cruz Blandón and Jayasimha Talur and Bruno Charron and Dong Liu and Saab Mansour and Marcello Federico},
      year={2025},
      eprint={2502.17163},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17163},
}
```
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Delete event: 1
- Push event: 1
- Public event: 1
- Pull request event: 2
- Create event: 1
Last Year
- Issues event: 1
- Watch event: 3
- Delete event: 1
- Push event: 1
- Public event: 1
- Pull request event: 2
- Create event: 1
Dependencies
- annotated-types ==0.7.0
- anyio ==4.8.0
- boto3 ==1.37.0
- botocore ==1.37.0
- certifi ==2025.1.31
- charset-normalizer ==3.4.1
- distro ==1.9.0
- exceptiongroup ==1.2.2
- h11 ==0.16.0
- httpcore ==1.0.7
- httpx ==0.28.1
- idna ==3.10
- jinja2 ==3.1.5
- jiter ==0.8.2
- jmespath ==1.0.1
- joblib ==1.4.2
- jsonpatch ==1.33
- jsonpointer ==3.0.0
- langchain-aws ==0.2.13
- langchain-core ==0.3.39
- langsmith ==0.3.11
- markupsafe ==3.0.2
- numpy ==1.26.4
- openai ==1.64.0
- orjson ==3.10.15
- packaging ==24.2
- pandas ==2.2.3
- pydantic ==2.10.6
- pydantic-core ==2.27.2
- python-dateutil ==2.9.0.post0
- pytz ==2025.1
- pyyaml ==6.0.2
- requests ==2.32.3
- requests-toolbelt ==1.0.0
- s3transfer ==0.11.2
- scikit-learn ==1.6.1
- scipy ==1.15.2
- six ==1.17.0
- sniffio ==1.3.1
- tenacity ==9.0.0
- threadpoolctl ==3.5.0
- tqdm ==4.67.1
- typing-extensions ==4.12.2
- tzdata ==2025.1
- urllib3 ==2.3.0
- zstandard ==0.23.0