https://github.com/amazon-science/auto-rag-eval

Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

evaluation genai llm machine-learning
Last synced: 5 months ago

Repository

Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"

Basic Info
Statistics
  • Stars: 80
  • Watchers: 3
  • Forks: 13
  • Open Issues: 3
  • Releases: 0
Topics
evaluation genai llm machine-learning
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

This repository is the companion of the ICML 2024 paper Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation (Blog)

Goal: for a given knowledge corpus:
  • Leverage an LLM to generate a multiple-choice exam associated with the task of interest.
  • Evaluate variants of RAG systems on this exam.
  • Evaluate and iteratively improve the exam.

The only thing you need to experiment with this code is a JSON file containing your knowledge corpus in the format described below.

I - Package Structure

  • Data: For each use case, contains:
    • Preprocessing Code
    • Knowledge Corpus Data
    • Exam Data (Raw and Processed)
    • Retrieval Index
  • ExamGenerator: Code to generate and process the multiple-choice exam from the knowledge corpus and LLM generator(s).
  • ExamEvaluator: Code to evaluate an exam for a given (Retrieval System, LLM, ExamCorpus) combination, relying on the lm-harness library.
  • LLMServer: Unified LLM endpoints to generate the exam.
  • RetrievalSystems: Unified retrieval system classes (e.g. DPR, BM25, Embedding Similarity...); see the sketch after this list.
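
As an illustration of what such a unified interface can look like, here is a minimal sketch. The class and method names are hypothetical, not the repo's actual API, and the BM25 variant assumes the rank_bm25 package is available.

```python
# Hypothetical sketch of a unified retriever interface (not the repo's actual classes).
from typing import Dict, List

from rank_bm25 import BM25Okapi  # assumption: rank_bm25 is installed


class BaseRetriever:
    """Common interface so DPR, BM25 and embedding retrievers are interchangeable."""

    def get_context_from_query(self, query: str, top_k: int = 3) -> List[Dict]:
        raise NotImplementedError


class BM25Retriever(BaseRetriever):
    def __init__(self, corpus: List[Dict]):
        # corpus entries follow the knowledge-corpus format described in section II
        self.docs = corpus
        self.bm25 = BM25Okapi([doc["text"].split() for doc in corpus])

    def get_context_from_query(self, query: str, top_k: int = 3) -> List[Dict]:
        scores = self.bm25.get_scores(query.split())
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.docs[i] for i in ranked[:top_k]]
```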

II - Exam Data Generation Process

We illustrate our methodology on four tasks of interest: AWS DevOps troubleshooting, StackExchange Q&A, SEC Filings Q&A and Arxiv Q&A. We then show how to adapt the methodology to any task.

StackExchange

Run the commands below, where question-date is the date of the raw data generation. Add --save-exam if you want to save the exam, and remove it if you are only interested in analytics.

```bash
cd auto-rag-eval
rm -rf Data/StackExchange/KnowledgeCorpus/main/*
python3 -m Data.StackExchange.preprocessor
python3 -m ExamGenerator.question_generator --task-domain StackExchange
python3 -m ExamGenerator.multi_choice_exam --task-domain StackExchange --question-date "question-date" --save-exam
```

Arxiv

```bash
cd auto-rag-eval
rm -rf Data/Arxiv/KnowledgeCorpus/main/*
python3 -m Data.Arxiv.preprocessor
python3 -m ExamGenerator.question_generator --task-domain Arxiv
python3 -m ExamGenerator.multi_choice_exam --task-domain Arxiv --question-date "question-date" --save-exam
```

Sec Filings

```bash
cd auto-rag-eval
rm -rf Data/SecFilings/KnowledgeCorpus/main/*
python3 -m Data.SecFilings.preprocessor
python3 -m ExamGenerator.question_generator --task-domain SecFilings
python3 -m ExamGenerator.multi_choice_exam --task-domain SecFilings --question-date "question-date" --save-exam
```

Add your own task MyOwnTask

Create file structure

```bash
cd src/llm_automated_exam_evaluation/Data/
mkdir MyOwnTask
mkdir MyOwnTask/KnowledgeCorpus
mkdir MyOwnTask/KnowledgeCorpus/main
mkdir MyOwnTask/RetrievalIndex
mkdir MyOwnTask/RetrievalIndex/main
mkdir MyOwnTask/ExamData
mkdir MyOwnTask/RawExamData
```

Create documentation corpus

Store in MyOwnTask/KnowledgeCorpus/main a JSON file which contains a list of documents, each in the format below. See DevOps/html_parser.py, DevOps/preprocessor.py or StackExchange/preprocessor.py for examples.

```json
{
    "source": "my_own_source",
    "docs_id": "Doc1022",
    "title": "Dev Desktop Set Up",
    "section": "How to [...]",
    "text": "Documentation text, should be long enough to make informative questions but short enough to fit into context",
    "start_character": "N/A",
    "end_character": "N/A",
    "date": "N/A"
}
```
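
As a minimal sketch, such a file can be produced with a few lines of Python. The file name knowledge_corpus.json is an assumption; the directory and the field names come from the example above.

```python
# Minimal sketch: write a knowledge corpus file in the format above.
# The file name "knowledge_corpus.json" is an assumption, not fixed by the repo.
import json

docs = [
    {
        "source": "my_own_source",
        "docs_id": "Doc1022",
        "title": "Dev Desktop Set Up",
        "section": "How to [...]",
        "text": "Documentation text, long enough to support informative questions "
                "but short enough to fit into the generator's context window.",
        "start_character": "N/A",
        "end_character": "N/A",
        "date": "N/A",
    },
]

with open("Data/MyOwnTask/KnowledgeCorpus/main/knowledge_corpus.json", "w") as f:
    json.dump(docs, f, indent=2)
```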

Generate Exam and Retrieval index

First generate the raw exam and the retrieval index. Note that you might need to add support for your own LLM; more on this below. You may also want to modify the prompt used for exam generation in the LLMExamGenerator class in ExamGenerator/question_generator.py.

bash python3 -m ExamGenerator.question_generator --task-domain MyOwnTask

Once this is done (it can take a couple of hours depending on the documentation size), generate the processed exam. To do so, check MyRawExamDate in RawExamData (e.g. 2023091223) and run:

bash python3 -m ExamGenerator.multi_choice_exam --task-domain MyOwnTask --question-date MyRawExamDate --save-exam

Bring your own LLM

We currently support endpoints for Bedrock (Claude) in the LLMServer file. To bring your own LLM, you only need a class with an inference function that takes a prompt as input and outputs both the prompt and the completed text. Modify the LLMExamGenerator class in ExamGenerator/question_generator.py to incorporate it. Different LLMs generate different types of questions, so you might want to modify the raw exam parsing in ExamGenerator/multi_choice_questions.py. You can experiment using the failed_questions.ipynb notebook from ExamGenerator.
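
A minimal sketch of such a wrapper is shown below. Only the contract (prompt in, prompt plus completed text out) comes from the repo; the class, method and client names are hypothetical.

```python
# Hypothetical bring-your-own-LLM wrapper: only the contract (prompt in,
# prompt + completed text out) comes from the repo; the rest is illustrative.
from typing import Dict


class MyLLM:
    def __init__(self, client):
        self.client = client  # e.g. an SDK client for your model endpoint (assumption)

    def invoke(self, prompt: str) -> Dict[str, str]:
        completed_text = self.client.complete(prompt)  # assumption: your client's own API
        return {"prompt": prompt, "completed_text": completed_text}
```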

III - Exam Evaluation Process

We leverage the lm-harness package to evaluate the (LLM & Retrieval) system on the generated exam. To do so, follow these steps:

Create a benchmark

Create a benchmark folder for your task, here DevOpsExam; see ExamEvaluator/DevOpsExam for the template. It contains a code file preprocess_exam.py for prompt templates and, more importantly, a set of tasks to evaluate models on:

  • DevOpsExam contains the tasks associated with ClosedBook (no retrieval) and OpenBook (Oracle Retrieval).
  • DevOpsRagExam contains the tasks associated with the retrieval variants (DPR/Embeddings/BM25...).

The provided script task_evaluation.sh illustrates the evaluation of Llamav2:Chat:13B and Llamav2:Chat:70B on the task, using In-Context Learning (ICL) with respectively 0, 1 and 2 samples.
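
For reference, a hedged sketch of an equivalent programmatic lm-evaluation-harness call is shown below. The exact API and backend names depend on the harness version, the model identifier is illustrative, and the custom exam task must already be registered with the harness (which is what the benchmark folder and task_evaluation.sh handle in the repo).

```python
# Hedged sketch of a programmatic lm-evaluation-harness run; argument names follow
# recent harness versions, and the model id / task name are illustrative.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                              # Hugging Face backend (recent versions)
    model_args="pretrained=meta-llama/Llama-2-13b-chat-hf",  # hypothetical model id
    tasks=["DevOpsExam"],                                    # custom task defined in the benchmark folder
    num_fewshot=0,                                           # 0/1/2-shot ICL as in task_evaluation.sh
)
print(results["results"])
```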

Citation

To cite this work, please use:

```bibtex
@misc{autorageval2024,
      title={Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation},
      author={Gauthier Guinet and Behrooz Omidvar-Tehrani and Anoop Deoras and Laurent Callot},
      year={2024},
      eprint={2405.13622},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 19
  • Fork event: 5
Last Year
  • Watch event: 19
  • Fork event: 5

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 10
  • Total Committers: 2
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.1
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Gguinet (4****t): 9 commits
  • Amazon GitHub Automation (5****o): 1 commit

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • harshpp707 (2)
  • emorisse (2)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

  • pyproject.toml (pypi)