https://github.com/amazon-science/memerag
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 13.9%, to scientific vocabulary)
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Authors: María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Overview
This repository contains the MEMERAG dataset. Its intended uses are:
1. Model selection: evaluate and compare different LLMs for their effectiveness as judges in the "LLM-as-a-judge" setting.
2. Prompt selection: optimize prompts for LLMs acting as judges in RAG evaluation tasks.
This is a meta-evaluation benchmark. Data in the benchmark should not be used to train models.
Datasets
We provide three variants of the meta-evaluation datasets:
1. MEMERAG (data/memerag/): The original dataset
2. MEMERAG-EXT (data/memerag_ext/): Extended dataset with faithfulness and relevance labels from 5 human annotators
3. MEMERAG-EXT with Majority Voting (data/memerag_ext_w_majority_voting/): Contains labels obtained through majority voting over the 5 annotations (a small aggregation sketch follows this list).
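The majority-voting variant collapses the five per-sentence annotations into a single label. The sketch below shows one way such an aggregation can be computed; the function name and the tie-breaking behaviour are illustrative assumptions, not taken from the repository's preprocessing code.

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most frequent of the five annotator labels.

    Illustrative only: the repository's actual aggregation (including
    how ties are resolved) may differ.
    """
    return Counter(labels).most_common(1)[0][0]

# Example: five faithfulness annotations for one answer sentence
votes = ["Supported", "Supported", "Not Supported",
         "Supported", "Challenging to determine"]
print(majority_vote(votes))  # -> Supported
```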
Each dataset provides RAG meta-evaluation support for English (EN), German (DE), Spanish (ES), French (FR), and Hindi (HI), and is distributed in JSONL format with the following structure (a short loading sketch follows the field list):
- queryid: Unique identifier for a query; this is the same as the MIRACL query ID.
- query: The actual question asked
- context: Passages used for generating the answer
- answer: List of dictionaries, each containing:
  - sentence_id: The ID of the sentence
  - sentence: A sentence from the answer
  - finegrainedfactuality: Fine-grained faithfulness label
  - factuality: Faithfulness label, one of Supported, Not Supported, and Challenging to determine
  - relevance: Label representing answer relevance
  - comments: Notes by annotators
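To make the field layout concrete, the following sketch reads one record from a dataset file and walks over the per-sentence labels. The file name en.jsonl is an assumed placeholder for whatever per-language file sits under data/memerag/, and the key spellings should be double-checked against the actual data files.

```python
import json
from pathlib import Path

# Assumed file name; check the actual per-language layout under data/memerag/.
path = Path("data/memerag/en.jsonl")

with path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Key names follow the field list above; verify them against the data.
        print(record["queryid"], "-", record["query"])
        for sent in record["answer"]:
            print(f'  [{sent["sentence_id"]}] {sent["factuality"]}: {sent["sentence"]}')
        break  # inspect only the first record
```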
Setup
- Create a new virtual environment with Python 3.10:

```shell
conda create -n yourenvname python=3.10
```

- Install dependencies:

```shell
pip install -r requirements.txt
```
Run the benchmark
The Python script in src/run_benchmark.py will run the automated evaluator for a specified prompt and language combination on the MEMERAG dataset.
Usage
```shell
python run_benchmark.py [arguments]
```
Required Arguments
--lang: Target language for evaluation (choices: 'en', 'es', 'de', 'fr', 'hi')
--dataset_name: The dataset you would like to use. Valid values are `memerag` and `memerag_ext_w_majority_vote`
--model_id: LLM model to use (choices: "gpt4o_mini", "llama3_2-90b", "llama3_2-11b", "qwen_2_5-32B")
--sys_prompt_path: Path to the system prompt template file. A sample system prompt can be found in the *prompts* directory
--task_prompt_path: Path to the task prompt template file. A sample task prompt can be found in the *prompts* directory
Optional Arguments
--temperature: Temperature parameter for LLM (default: 0.1)
--top_p: Top-p sampling parameter (default: 0.1)
--bedrock_region: AWS region for Bedrock models
--aws_profile: AWS profile name
--num_proc: Number of parallel LLM API calls (default: 4)
--num_retries: Number of retry attempts for LLM calls (default: 6)
Supported Evaluator Models
GPT-4o Mini (`gpt4o_mini`)
- Requires setting the `OPENAI_API_KEY` environment variable.

Llama 3.2 11B and 90B (`llama3_2-11b`, `llama3_2-90b`)
- Can be used via AWS Bedrock. Please provide appropriate AWS credentials in `~/.aws/credentials`.

Qwen 2.5 32B (`qwen_2_5-32B`)
- Can be used via the vLLM inference server.
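The README does not document here how the benchmark script locates a running vLLM server, so a quick connectivity check with the `openai` client (already in the dependency list) can help before launching a run. The base URL and served model name below are assumptions about a typical local vLLM deployment; adjust them to your setup.

```python
from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint serving Qwen 2.5 32B;
# change base_url and model to match how the server was launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```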
Example
Run Llama 3.2 90B as an automatic evaluator with the AG + COT prompt.
```shell
cd src
python run_benchmark.py --lang de --dataset_name memerag --model_id llama3_2-90b \
    --sys_prompt_path prompts/ag_cot/sys_prompt.md --task_prompt_path prompts/ag_cot/task_prompt.md \
    --num_proc 5
```
The script prints balanced accuracy and saves judgments to llm_judged_dataset.csv in the specified output directory.
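Balanced accuracy averages per-class recall, so it is not inflated when one faithfulness label dominates the data. A minimal sketch of recomputing such a score from the saved judgments with scikit-learn (already a dependency) follows; the column names are assumptions, not the file's documented schema.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Column names are hypothetical; inspect llm_judged_dataset.csv for the
# actual schema written by run_benchmark.py.
df = pd.read_csv("llm_judged_dataset.csv")
score = balanced_accuracy_score(df["human_label"], df["llm_label"])
print(f"Balanced accuracy: {score:.3f}")
```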
Citation
If you found the benchmark useful, please consider citing our work.
```bibtex
@misc{Cruz2025,
      title={MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation},
      author={María Andrea Cruz Blandón and Jayasimha Talur and Bruno Charron and Dong Liu and Saab Mansour and Marcello Federico},
      year={2025},
      eprint={2502.17163},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17163},
}
```
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Delete event: 1
- Push event: 1
- Public event: 1
- Pull request event: 2
- Create event: 1
Last Year
- Issues event: 1
- Watch event: 3
- Delete event: 1
- Push event: 1
- Public event: 1
- Pull request event: 2
- Create event: 1
Dependencies
- annotated-types ==0.7.0
- anyio ==4.8.0
- boto3 ==1.37.0
- botocore ==1.37.0
- certifi ==2025.1.31
- charset-normalizer ==3.4.1
- distro ==1.9.0
- exceptiongroup ==1.2.2
- h11 ==0.16.0
- httpcore ==1.0.7
- httpx ==0.28.1
- idna ==3.10
- jinja2 ==3.1.5
- jiter ==0.8.2
- jmespath ==1.0.1
- joblib ==1.4.2
- jsonpatch ==1.33
- jsonpointer ==3.0.0
- langchain-aws ==0.2.13
- langchain-core ==0.3.39
- langsmith ==0.3.11
- markupsafe ==3.0.2
- numpy ==1.26.4
- openai ==1.64.0
- orjson ==3.10.15
- packaging ==24.2
- pandas ==2.2.3
- pydantic ==2.10.6
- pydantic-core ==2.27.2
- python-dateutil ==2.9.0.post0
- pytz ==2025.1
- pyyaml ==6.0.2
- requests ==2.32.3
- requests-toolbelt ==1.0.0
- s3transfer ==0.11.2
- scikit-learn ==1.6.1
- scipy ==1.15.2
- six ==1.17.0
- sniffio ==1.3.1
- tenacity ==9.0.0
- threadpoolctl ==3.5.0
- tqdm ==4.67.1
- typing-extensions ==4.12.2
- tzdata ==2025.1
- urllib3 ==2.3.0
- zstandard ==0.23.0