https://github.com/amazon-science/beyondcorrelation

Implementation of the paper: Beyond Correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary

Keywords

autoevaluation correlation llm
Last synced: 6 months ago

Repository

Implementation of the paper: Beyond Correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 691 KB
Statistics
  • Stars: 7
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
autoevaluation correlation llm
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

BeyondCorrelation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

This is the implementation of our paper accepted at ICLR 2025:

Beyond Correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Authors: Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth

Welcome to the repository for our paper, which takes a deeper look at how well automatic evaluation methods (like LLM-as-a-Judge) align with human evaluations. If you're using automatic metrics to evaluate generative models, this project is for you!

📋 What is the problem?

Automatic evaluation methods, such as LLM-as-a-Judge methods, are often validated by comparing them to human evaluations using correlation scores. However, there's no consensus on which correlation metric to use.

This package offers a comprehensive suite for computing various correlation metrics to help you diagnose issues with automatic evaluation.
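
As a rough illustration of the kind of comparison such a suite automates (this is not the package's API; the toy labels below are made up), one can score machine labels against the per-item median human label under several common correlation metrics:

```python
# Toy illustration (not this package's API): compare machine labels against the
# per-item median human label with several common correlation metrics.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

human_labels = np.array([[3, 4, 2, 5, 1],    # annotator 1
                         [3, 5, 2, 4, 2],    # annotator 2
                         [4, 4, 3, 5, 1]])   # annotator 3
machine_labels = np.array([3, 4, 2, 5, 2])

human_median = np.median(human_labels, axis=0)
for name, metric in [("pearson", pearsonr),
                     ("spearman", spearmanr),
                     ("kendall", kendalltau)]:
    value, _ = metric(machine_labels, human_median)
    print(f"{name}: {value:.2f}")
```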

🔍 Key Insights from Our Work

We uncover two critical observations about the relationship between human and machine evaluations:

  1. When human labels are inconsistent or uncertain (e.g., opinions or preferences), machines can appear to correlate better with the majority human label than humans do with each other, giving a false impression that automatic evaluations are as good as or better than human agreement (see the simulation sketch after this list).
  2. When human labels are consistent, the machine-human correlation drops significantly below human-to-human (HH) agreement.
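
To make observation 1 concrete, here is a small illustrative simulation (not taken from the paper's code; the annotator count and noise levels are arbitrary assumptions) showing how a machine judge can correlate better with the aggregated human label than two humans correlate with each other, simply because individual human labels are noisy:

```python
# Illustrative simulation (not the paper's code): noisy human labels make the
# machine look better-aligned with the aggregated human label than humans are
# with each other, even though the machine is not a better judge.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 1000
true_quality = rng.normal(size=n_items)

human_noise = 1.5      # large noise -> inconsistent human labels
machine_noise = 0.8
humans = true_quality + rng.normal(scale=human_noise, size=(3, n_items))
machine = true_quality + rng.normal(scale=machine_noise, size=n_items)

human_median = np.median(humans, axis=0)

hh, _ = spearmanr(humans[0], humans[1])     # human-to-human agreement
mh, _ = spearmanr(machine, human_median)    # machine vs. aggregated human label
print(f"human-human correlation:        {hh:.2f}")
print(f"machine vs. human-median corr.: {mh:.2f}")  # typically higher than hh
```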

🚀 Getting Started

GEval Example

We recommend trying the GEval example first:

```bash
PYTHONPATH=. python3 examples/geval_example.py
```

The script will download the SummEval dataset and GEval results, then compute the correlation between machine labels and human labels.

Running the script produces a visualization of the human and machine labels binned by the human label median. The Jensen-Shannon distance introduced in our paper is also shown in the figure.
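
For reference, the Jensen-Shannon distance between the human and machine label distributions in a bin can be computed directly with scipy (the counts below are made-up placeholders, not SummEval/GEval numbers):

```python
# Illustrative Jensen-Shannon distance between human and machine label
# distributions (counts are placeholders, not real SummEval/GEval numbers).
import numpy as np
from scipy.spatial.distance import jensenshannon

human_counts = np.array([2, 5, 20, 40, 33], dtype=float)    # ratings 1..5
machine_counts = np.array([0, 3, 10, 55, 32], dtype=float)

p = human_counts / human_counts.sum()
q = machine_counts / machine_counts.sum()
print(f"Jensen-Shannon distance: {jensenshannon(p, q, base=2):.3f}")
```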

Besides the visualization, the script also computes various correlation metrics on data subsets selected by human label uncertainty.

| name | proportion | size | H_median_dispersion | M_median_dispersion | krippendorff_HH | krippendorff_MM | krippendorff_H-median_M-median | ... |
|------|------------|------|---------------------|---------------------|-----------------|-----------------|--------------------------------|-----|
| All | 100.0 | 1600 | 4 | 3 | 0.55 | 0.74 | 0.31 | |
| human_labels_num_unique = 1 (perfect agreement) | 12.5 | 200 | 4 | 3 | 1.00 | 0.78 | 0.25 | |
| human_labels_num_unique = 2 | 66.2 | 1059 | 4 | 3 | 0.53 | 0.75 | 0.30 | |
| human_labels_num_unique = 3 | 21.3 | 341 | 2 | 3 | 0.19 | 0.70 | 0.32 | |
| ... | | | | | | | | |
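
A rough sketch of how such subsets can be formed and scored (it mirrors the table's columns but is not the repo's code; the `krippendorff` PyPI package and the array layout are assumptions):

```python
# Sketch (not the repo's code): restrict to items with a given number of
# distinct human labels and compute Krippendorff's alpha on that subset.
import numpy as np
import krippendorff  # pip install krippendorff

def subset_alphas(human_labels, machine_labels, num_unique):
    """human_labels: (n_annotators, n_items); machine_labels: (n_items,)."""
    uniques = np.array([len(np.unique(col)) for col in human_labels.T])
    mask = uniques == num_unique
    # Human-human agreement on the subset.
    hh = krippendorff.alpha(reliability_data=human_labels[:, mask],
                            level_of_measurement="ordinal")
    # Agreement between the human median and the machine label on the subset.
    human_median = np.median(human_labels[:, mask], axis=0)
    hm = krippendorff.alpha(
        reliability_data=np.vstack([human_median, machine_labels[mask]]),
        level_of_measurement="ordinal")
    return hh, hm
```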

More Examples

For the other examples, you will first need to set up AWS Bedrock access in order to generate machine labels with LLMs.
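
If you have not used Bedrock before, a minimal call via boto3 looks roughly like the sketch below (the model id and request/response fields are illustrative of the Meta Llama format; verify against the Bedrock documentation and your account's model access):

```python
# Rough sketch of invoking an LLM through AWS Bedrock with boto3 (model id and
# request/response fields shown for the Meta Llama format; check the Bedrock
# docs for your account and region).
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def llm_judge(prompt: str) -> str:
    body = json.dumps({"prompt": prompt, "max_gen_len": 128, "temperature": 0.0})
    response = client.invoke_model(modelId="meta.llama3-70b-instruct-v1:0",
                                   body=body)
    return json.loads(response["body"].read())["generation"]
```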

You can produce results for datasets on JudgeBench using the following command.

```bash
PYTHONPATH=. python3 examples/judge_bench_example.py \
    --exp_dir judge_bench_topical_chat_experiment \
    --dataset_name topical_chat \
    --dataset_path judge_bench_topical_chat_experiment/data/topical_chat_short.json \
    --llm llama_3_70B
```

The following command produces results for the SNLI dataset.

```bash
PYTHONPATH=. python3 examples/snli_example.py \
    --exp_dir snli_experiment \
    --dataset_path snli_experiment/data/snli_1.0_dev.jsonl \
    --llm llama_3_70B
```

Apply it to your own data

Please follow the Toy Example notebook.

Citation

If you use our tools in your project, please cite our work:

```
@inproceedings{elangovan2025beyond,
    title={Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and {LLM}-as-a-judge},
    author={Aparna Elangovan and Lei Xu and Jongwoo Ko and Mahsa Elyasi and Ling Liu and Sravan Babu Bodapati and Dan Roth},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=E8gYIrbP00}
}
```

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 14
  • Push event: 1
  • Public event: 1
  • Pull request event: 2
  • Fork event: 2
Last Year
  • Watch event: 14
  • Push event: 1
  • Public event: 1
  • Pull request event: 2
  • Fork event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • elangovana (1)

Dependencies

setup.py pypi