grouse

Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models

https://github.com/illuin-tech/grouse

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models

Basic Info
  • Host: GitHub
  • Owner: illuin-tech
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 1.26 MB
Statistics
  • Stars: 12
  • Watchers: 6
  • Forks: 3
  • Open Issues: 0
  • Releases: 8
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · Changelog · License · Citation

README.md

GroUSE

arXiv · Hugging Face · Blog · Tutorial


Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.

Install

```bash
pip install grouse
```

Then, set up your OpenAI credentials: create a `.env` file by copying `.env.dist`, fill in your OpenAI API key and organization ID, and export the environment variables with `export $(cat .env | xargs)`.
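
As a sketch, the file might look like this (the variable names below are assumptions; defer to the `.env.dist` template for the exact ones):

```bash
# Hypothetical .env contents; copy .env.dist for the authoritative variable names
OPENAI_API_KEY=sk-your-key
OPENAI_ORGANIZATION=org-your-organization-id
```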

Command Line Usage

Evaluation of the Grounded Question Answering task

You can build a dataset in a jsonl file with the following format per line:

json { "references": ["", ...], // List of references "input": "", // Query "actual_output": "", // Predicted answer generated by the model we want to evaluate "expected_output": "" // Ground truth answer to the input }

You can also check the example file example_data/grounded_qa.jsonl.
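
If you build the dataset programmatically, a minimal sketch using the jsonlines library (already a dependency of grouse) could look like this; the record values and output filename are illustrative:

```python
import jsonlines

# Illustrative records following the format described above
records = [
    {
        "references": ["Paris is the capital of France."],
        "input": "What is the capital of France?",
        "actual_output": "The capital of France is Marseille.[1]",
        "expected_output": "The capital of France is Paris.[1]",
    },
]

# Write one JSON object per line, as `grouse evaluate` expects
with jsonlines.open("my_grounded_qa.jsonl", mode="w") as writer:
    writer.write_all(records)
```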

Then, run this command:

```bash
grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o
```

We recommend using GPT-4 as the evaluator model, since the prompts were optimized for it, but you can change the model and prompts with these optional arguments:

- `--evaluator_model_name`: Name of the evaluator model. It can be any LiteLLM model. The default model is GPT-4.
- `--prompts_path`: Path to the folder containing the evaluator prompts. By default, the prompts are those optimized for GPT-4.
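
For example, to judge generations with GPT-4o through LiteLLM (the dataset path and output directory here are illustrative):

```bash
grouse evaluate example_data/grounded_qa.jsonl outputs/gpt-4o \
    --evaluator_model_name gpt-4o
```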

Unit Testing of Evaluators with GroUSE

Meta-evaluation consists of evaluating GQA evaluator models against the GroUSE unit tests.

```bash
grouse meta-evaluate gpt-4o meta-outputs/gpt-4o
```

Optional arguments:

- `--prompts_path`: Path to the folder containing the evaluator prompts. By default, the prompts are those optimized for GPT-4.
- `--train_set`: Optional flag to meta-evaluate on the train set (16 tests) instead of the test set (144 tests). The train set is meant to be used during the prompt engineering phase.
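
For instance, while iterating on prompts you could run the small train split first (the output directory is arbitrary):

```bash
# Meta-evaluate on the 16-test train set instead of the 144-test test set
grouse meta-evaluate gpt-4o meta-outputs/gpt-4o-train --train_set
```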

Plot Matrices of Unit Test Success

You can plot the unit test results as matrices:

```bash
grouse plot meta-outputs/gpt-4o
```

The resulting matrices look like this:

[Figure: result matrices plot]

Python Usage

```python
from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.[1]",
    expected_output="The capital of France is Paris.[1]",
    references=["Paris is the capital of France."],
)

evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
```
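
Since evaluate() takes a list of samples, you can score a whole dataset in one call. A sketch, assuming the jsonl fields map one-to-one onto EvaluationSample's constructor as the examples above suggest:

```python
import jsonlines

from grouse import EvaluationSample, GroundedQAEvaluator

# Load every dataset record into an EvaluationSample
with jsonlines.open("example_data/grounded_qa.jsonl") as reader:
    samples = [EvaluationSample(**record) for record in reader]

evaluator = GroundedQAEvaluator()
evaluator.evaluate(samples)
```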

Tutorial

You can check the tutorial to get started with some examples.


Citation

```bibtex
@misc{muller2024grousebenchmarkevaluateevaluators,
    title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering},
    author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
    year={2024},
    eprint={2409.06595},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2409.06595},
}
```

Owner

  • Name: ILLUIN Technology
  • Login: illuin-tech
  • Kind: organization
  • Email: contact@illuin.tech
  • Location: Paris, France

Illuin Technology is a team of makers motivated by the challenges of AI and the new modes of user interaction that this intelligence enables.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Muller"
  given-names: "Sacha"
  email: "sacha.muller@illuin.tech"
- family-names: "Loison"
  given-names: "António"
  email: "antonio.loison@illuin.tech"
- family-names: "Omrani"
  given-names: "Bilel"
  email: "bilel.omrani@illuin.tech"
- family-names: "Viaud"
  given-names: "Gautier"
  email: "gautier.viaud@illuin.tech"
title: "GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering"
date-released: 2024-09-11
url: "https://github.com/illuin-tech/grouse"
preferred-citation:
  type: article
  authors:
  - family-names: "Muller"
    given-names: "Sacha"
    email: "sacha.muller@illuin.tech"
  - family-names: "Loison"
    given-names: "António"
    email: "antonio.loison@illuin.tech"
  - family-names: "Omrani"
    given-names: "Bilel"
    email: "bilel.omrani@illuin.tech"
  - family-names: "Viaud"
    given-names: "Gautier"
  doi: "arXiv.2409.06595"
  month: 9
  title: "GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering"
  year: 2024
  url: "https://arxiv.org/pdf/2409.06595"

GitHub Events

Total
  • Release event: 1
  • Push event: 2
  • Fork event: 1
  • Create event: 2
Last Year
  • Release event: 1
  • Push event: 2
  • Fork event: 1
  • Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • antonioloison (12)
  • sachamuller (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 71 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 9
  • Total maintainers: 1
pypi.org: grouse

Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models.

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 71 Last month
Rankings
  • Dependent packages count: 10.5%
  • Average: 34.7%
  • Dependent repos count: 58.9%
Maintainers: 1
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • Jinja2 >=3.1.0,<4.0.0
  • aiohttp >=3.9.0,<4.0.0
  • click >=8.1.0,<8.1.7
  • datasets ==2.20.0
  • diskcache >=5.6.0,<6.0.0
  • importlib-resources >=6.4.0,<7.0.0
  • jsonlines >=4.0.0,<5.0.0
  • litellm >=1.41.0,<2.0.0
  • matplotlib >=3.9.0,<4.0.0
  • numpy >=1.21.2,<3.0.0
  • pydantic >=2.5.0,<3.0.0
  • tqdm >=4.66.0,<5.0.0