https://github.com/amazon-science/memerag

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

benchmark evaluation rag
Last synced: 5 months ago

Repository

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.94 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Topics
benchmark evaluation rag
Created 11 months ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Authors: María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico

Overview

This repository contains the MEMERAG dataset. Its intended uses are:

  1. Model selection: Evaluate and compare different LLMs for their effectiveness as judges in the "LLM-as-a-judge" setting.
  2. Prompt selection: Optimize prompts for LLMs acting as judges in RAG evaluation tasks.

This is a meta-evaluation benchmark. Data in the benchmark should not be used to train models.

Datasets

We provide three variants of the meta-evaluation datasets:

  1. MEMERAG (data/memerag/): The original dataset.
  2. MEMERAG-EXT (data/memerag_ext/): Extended dataset with faithfulness and relevance labels from 5 human annotators.
  3. MEMERAG-EXT with Majority Voting (data/memerag_ext_w_majority_voting/): Contains labels obtained through majority voting over the 5 annotations (a small voting sketch follows this list).
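
To make the majority-voting variant concrete, here is a minimal sketch of collapsing five annotators' labels into one; the example labels are illustrative and not taken from the dataset.

```python
from collections import Counter

# Minimal sketch of majority voting over five annotators' faithfulness labels.
# The example labels below are illustrative, not values taken from the dataset.
annotations = ["Supported", "Supported", "Not Supported", "Supported", "Challenging to determine"]
majority_label, votes = Counter(annotations).most_common(1)[0]
print(majority_label, votes)  # Supported 3
```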

Each dataset covers English (EN), German (DE), Spanish (ES), French (FR), and Hindi (HI) and is provided in JSONL format with the following structure (a loading sketch follows the list):

  • query_id: Unique identifier for a query. This is the same as the MIRACL query_id.
  • query: The actual question asked
  • context: Passages used for generating the answer
  • answer: List of dictionaries containing
    • sentence_id: The id of the sentence
    • sentence: A sentence from the answer
    • fine_grained_factuality: Fine-grained faithfulness label
    • factuality: Faithfulness label, one of Supported, Not Supported, or Challenging to determine
    • relevance: Label representing answer relevance
    • comments: Notes by annotators.
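
A minimal loading sketch for this structure, assuming a per-language file such as data/memerag/en.jsonl (the exact file names are an assumption; check the data/ directories):

```python
import json
from collections import Counter

# Hedged sketch: read one MEMERAG JSONL file and tally per-sentence faithfulness labels.
# The file name data/memerag/en.jsonl is an assumption; adjust it to the actual data/ layout.
path = "data/memerag/en.jsonl"

label_counts = Counter()
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)          # one query with its context and annotated answer
        for sent in record["answer"]:      # per-sentence annotations
            label_counts[sent["factuality"]] += 1

print(label_counts)
```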

Setup

  1. Create a new virtual environment with Python 3.10:

     ```shell
     conda create -n yourenvname python=3.10
     ```

  2. Install dependencies:

     ```shell
     pip install -r requirements.txt
     ```

Run the benchmark

The Python script in src/run_benchmark.py will run the automated evaluator for a specified prompt and language combination on the MEMERAG dataset.

Usage

python run_benchmark.py [arguments]

Required Arguments
  • --lang: Target language for evaluation (choices: 'en', 'es', 'de', 'fr', 'hi')
  • --dataset_name: The dataset you would like to use. Valid values are `memerag` and `memerag_ext_w_majority_vote`
  • --model_id: LLM model to use (choices: "gpt4o_mini", "llama3_2-90b", "llama3_2-11b", "qwen_2_5-32B")
  • --sys_prompt_path: Path to the system prompt template file. A sample system prompt can be found in the *prompts* directory
  • --task_prompt_path: Path to the task prompt template file. A sample task prompt can be found in the *prompts* directory

Optional Arguments
  • --temperature: Temperature parameter for the LLM (default: 0.1)
  • --top_p: Top-p sampling parameter (default: 0.1)
  • --bedrock_region: AWS region for Bedrock models
  • --aws_profile: AWS profile name
  • --num_proc: Number of parallel LLM API calls (default: 4)
  • --num_retries: Number of retry attempts for LLM calls (default: 6)

Supported Evaluator Models

  1. GPT-4o Mini (gpt4o_mini)
    • Requires setting the OPENAI_API_KEY environment variable.
  2. Llama 3.2 (11B and 90B) Models
    • llama3_2-90b and llama3_2-11b can be used via AWS Bedrock. Please provide appropriate AWS credentials in ~/.aws/credentials (see the sketch after this list).
  3. Qwen 2.5 32B (qwen_2_5-32B)
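
A minimal sketch of the credential setup implied above, assuming a "default" AWS profile and the us-east-1 region (both are placeholders, not values required by the benchmark):

```python
import os
import boto3

# Hedged sketch of the credential setup implied above; profile and region are placeholders.
# GPT-4o Mini: the OpenAI client reads OPENAI_API_KEY from the environment.
assert "OPENAI_API_KEY" in os.environ, "export OPENAI_API_KEY before running gpt4o_mini"

# Llama 3.2 via Bedrock: credentials come from ~/.aws/credentials; the explicit
# profile and region mirror the --aws_profile and --bedrock_region flags.
session = boto3.Session(profile_name="default", region_name="us-east-1")
bedrock = session.client("bedrock-runtime")
print(bedrock.meta.region_name)
```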

Example

Run Llama 3.2 90B as an automatic evaluator with the AG + CoT prompt.

```shell
cd src

python run_benchmark.py --lang de --dataset_name memerag --model_id llama3_2-90b \
    --sys_prompt_path prompts/ag_cot/sys_prompt.md --task_prompt_path prompts/ag_cot/task_prompt.md --num_proc 5
```

The script prints balanced accuracy and saves judgments to llm_judged_dataset.csv in the specified output directory.
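
To sanity-check the reported number, a minimal sketch that recomputes balanced accuracy from the saved CSV; the column names human_label and llm_label are assumptions, so check the header actually written by run_benchmark.py:

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Hedged sketch: recompute the benchmark metric from the saved judgments.
# Column names ("human_label", "llm_label") are assumptions; check the actual CSV header.
df = pd.read_csv("llm_judged_dataset.csv")
print(balanced_accuracy_score(df["human_label"], df["llm_label"]))
```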

Citation

If you found the benchmark useful, please consider citing our work.

```bibtex
@misc{Cruz2025,
      title={MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation},
      author={María Andrea Cruz Blandón and Jayasimha Talur and Bruno Charron and Dong Liu and Saab Mansour and Marcello Federico},
      year={2025},
      eprint={2502.17163},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17163},
}
```

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Issues event: 1
  • Watch event: 3
  • Delete event: 1
  • Push event: 1
  • Public event: 1
  • Pull request event: 2
  • Create event: 1
Last Year
  • Issues event: 1
  • Watch event: 3
  • Delete event: 1
  • Push event: 1
  • Public event: 1
  • Pull request event: 2
  • Create event: 1

Dependencies

requirements.txt pypi
  • annotated-types ==0.7.0
  • anyio ==4.8.0
  • boto3 ==1.37.0
  • botocore ==1.37.0
  • certifi ==2025.1.31
  • charset-normalizer ==3.4.1
  • distro ==1.9.0
  • exceptiongroup ==1.2.2
  • h11 ==0.16.0
  • httpcore ==1.0.7
  • httpx ==0.28.1
  • idna ==3.10
  • jinja2 ==3.1.5
  • jiter ==0.8.2
  • jmespath ==1.0.1
  • joblib ==1.4.2
  • jsonpatch ==1.33
  • jsonpointer ==3.0.0
  • langchain-aws ==0.2.13
  • langchain-core ==0.3.39
  • langsmith ==0.3.11
  • markupsafe ==3.0.2
  • numpy ==1.26.4
  • openai ==1.64.0
  • orjson ==3.10.15
  • packaging ==24.2
  • pandas ==2.2.3
  • pydantic ==2.10.6
  • pydantic-core ==2.27.2
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.1
  • pyyaml ==6.0.2
  • requests ==2.32.3
  • requests-toolbelt ==1.0.0
  • s3transfer ==0.11.2
  • scikit-learn ==1.6.1
  • scipy ==1.15.2
  • six ==1.17.0
  • sniffio ==1.3.1
  • tenacity ==9.0.0
  • threadpoolctl ==3.5.0
  • tqdm ==4.67.1
  • typing-extensions ==4.12.2
  • tzdata ==2025.1
  • urllib3 ==2.3.0
  • zstandard ==0.23.0