https://github.com/ai4bharat/fbi
FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.3%) to scientific vocabulary
Repository
FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists
Basic Info
- Host: GitHub
- Owner: AI4Bharat
- Language: Python
- Default Branch: main
- Size: 8.12 MB
Statistics
- Stars: 29
- Watchers: 2
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
FBI: Finding Blindspots in Evaluator LLMs with Interpretable Checklists
🏆 EMNLP 2024 Outstanding Paper Award 🏆
We present FBI, our novel meta-evaluation framework designed to assess the robustness of evaluator LLMs across diverse tasks and evaluation strategies. Please refer to our paper and blog for more details.
Setup
To run perturbation generation and evaluation, you need to install the required packages by running the following command:
```bash
pip install -r requirements.txt
```
Set up the required API keys before making API calls:
```bash
export OPENAI_API_KEY=[ADD_YOUR_OPENAI_API_KEY]
export CLAUDE_API_KEY=[ADD_YOUR_CLAUDE_API_KEY]
export GEMINI_API_KEY=[ADD_YOUR_GEMINI_API_KEY]
export LLAMA3_API_KEY=[ADD_LLAMA3_API_KEY]
export LLAMA3_BASE_URL=[ADD_LLAMA3_BASE_URL]
```
We use hosted services for the Llama-3-70B-Instruct model.
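As a quick sanity check before launching long batch jobs, you can confirm that each variable is actually exported. This is a generic bash idiom, not a script shipped with this repository:
```bash
# Generic check (not part of this repo): report which of the required
# environment variables are set, without printing their values.
for v in OPENAI_API_KEY CLAUDE_API_KEY GEMINI_API_KEY LLAMA3_API_KEY LLAMA3_BASE_URL; do
    if [ -n "${!v}" ]; then echo "$v is set"; else echo "$v is NOT set"; fi
done
```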
Generate Perturbations
To generate perturbations, specify the desired task ability, model, and other parameters. This will create a `FILENAME.jsonl` file that can be used in a Batch API call to get the model outputs.
Sample perturbations across all 22 categories are available here!
Use the following command to generate the jsonl:
```bash
python -m perturbations.perturbations \
    --data_dir DATA_DIR \
    --file_name PATH_TO_PROMPTS_AND_ANSWERS \
    --subset TASK_ABILITY \
    --model MODEL_NAME \
    --temperature TEMP \
    --top_p TOP_P \
    --max_tokens MAX_TOKENS \
    --frequency_penalty FREQ_PEN \
    --debug
```
- `DATA_DIR`: Directory containing the data.
- `PATH_TO_PROMPTS_AND_ANSWERS`: Path to the file with prompts and answers.
- `TASK_ABILITY`: Type of task (choose from `['factual', 'reasoning', 'instruction-following', 'long-form']`).
- `MODEL_NAME`: Name of the model to use.
- `TEMP`: Sampling temperature (controls the randomness of predictions).
- `TOP_P`: Top-p sampling parameter (controls the diversity of predictions).
- `MAX_TOKENS`: Maximum number of tokens to generate.
- `FREQ_PEN`: Frequency penalty (controls the repetition of tokens).
- `--debug`: Optional flag to enable debug mode.
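For illustration, a filled-in invocation might look like the following. The paths are hypothetical placeholders, the sampling values are arbitrary choices rather than settings prescribed by the paper, and we assume `gpt-4o` is an accepted model name here, as it is for the evaluators below:
```bash
# Hypothetical example: perturb answers for the 'reasoning' subset with
# GPT-4o. Paths and sampling values are illustrative placeholders only.
python -m perturbations.perturbations \
    --data_dir data \
    --file_name data/prompts_and_answers.tsv \
    --subset reasoning \
    --model gpt-4o \
    --temperature 0.7 \
    --top_p 0.95 \
    --max_tokens 1024 \
    --frequency_penalty 0.0
```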
Once you have the .jsonl file, create a batch request using the following command:
```bash
python batch_call.py \
    --create_batch \
    --input_file_name FILENAME.jsonl \
    --data_path DATA_DIR \
    --job_desc "JOB_DESC"
```
- `--create_batch`: Flag to indicate the creation of a batch request.
- `FILENAME.jsonl`: The name of the input file (in `.jsonl` format) containing the batch requests.
- `DATA_DIR`: Path to the directory where the data is stored.
- `JOB_DESC`: Description of the job for the batch request.
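For example, submitting the perturbation file generated above might look like this (the file name and job description are hypothetical placeholders):
```bash
# Hypothetical example: submit the generated JSONL as a batch job.
python batch_call.py \
    --create_batch \
    --input_file_name perturbations_reasoning.jsonl \
    --data_path data \
    --job_desc "FBI perturbation generation - reasoning subset"
```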
Run Evaluation
To run evaluations, we again create a batch jsonl file `FILENAME.jsonl` and use it to get model outputs, either via the Batch API or the regular API. We support the following LLM evaluation strategies (please refer to the paper for more details on each strategy):
Single Answer Evaluation
- Vanilla*: Scoring the answer without any explanation. (`single_vanilla`)
- Vanilla: Scoring the answer by first generating the explanation. (`single_vanilla_cot`)
- Rubric: Scoring the answer with the given rubrics. (`single_rubrics`)
- Axis: Scoring the given answer along a specified axis. (`single_axes`)
- Axis+Rubrics: Scoring the given answer along a specified axis with the given rubrics. (`single_axes_rubrics`)
Pairwise Comparison
- Pairwise*: Choosing the best answer without any explanation. (`compare_vanilla`)
- Pairwise: Choosing the best answer by first generating the explanation. (`compare_vanilla_cot`)
- Rules: Choosing the best answer according to the given rules. (`compare_rules`)
- Axis: Choosing the best answer along the specified axis. (`compare_axes`)
- Axis+Rules: Choosing the best answer along the specified axis and according to the given rules. (`compare_axes_rules`)
Reference-guided Evaluation
- Reference: Scoring the answer, given the reference answer, by first generating an explanation. (`reference_based`)
Use the following command to generate the jsonl file:
```bash
# Note: pass either --all or --axes, not both; see the flag descriptions below.
python llm_evaluators/<EVAL_METHOD>.py \
    --file_name PATH_TO_TSV_FILE_HAVING_DATA \
    --out_file_name PATH_TO_OUTPUT_JSONL \
    --model MODEL_NAME \
    --all \
    --axes "cont_qual" "task_qual" \
    --p_mode
```
- `file_name`: The path of the input data file to be evaluated. (Only TSV input is supported right now.)
- `out_file_name`: The path of the output JSONL file where the requests will be saved.
- `model`: The name of the model to be used for evaluation. Choices: `['gpt-4o', 'gpt-4-turbo', 'gpt-3.5-turbo-0125', 'llama3-70b', 'claude3-opus', 'gemini-1.5-flash', 'gemini-1.5-pro']`
- `all`: Run all available metrics. This option cannot be used together with `--axes`. (Supported only in the Axis, Axis+Rubrics, and Axis+Rules evaluators.)
- `axes`: Run specific metrics; provide a list of metric names to be evaluated. This option cannot be used together with `--all`. (Supported only in the Axis, Axis+Rubrics, and Axis+Rules evaluators.)
- `p_mode`: Run in perturbed-first mode, i.e., exchange the positions of the gold and perturbed answers. (Supported only in the Pairwise Comparison evaluators.)
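As a concrete example, assuming the strategy identifiers listed above double as script names under `llm_evaluators/` (e.g. `compare_axes.py` for the pairwise Axis evaluator; this mapping and the file paths are assumptions for illustration), an invocation on two axes might look like:
```bash
# Hypothetical example: pairwise Axis evaluation with GPT-4o on two axes,
# in perturbed-first mode. Script name and paths are illustrative.
python llm_evaluators/compare_axes.py \
    --file_name data/eval_pairs.tsv \
    --out_file_name outputs/compare_axes_requests.jsonl \
    --model gpt-4o \
    --axes "cont_qual" "task_qual" \
    --p_mode
```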
Once the `.jsonl` file is created, we can create a batch request (supported only for OpenAI models) using the command:
```bash
python batch_call.py \
    --create_batch \
    --input_file_name FILENAME.jsonl \
    --data_path DATA_DIR \
    --job_desc "JOB_DESC"
```
The flags are the same as described for the batch request command above.
For other models and for running in regular API mode, use the command:
```bash
python parallel_call.py \
    --input_file_name FILENAME.jsonl \
    --output_file_name PATH_FOR_OUTPUT_JSONL \
    --n_jobs NUMBER_OF_PARALLEL_REQUESTS
```
- `input_file_name`: The path of the input file (in `.jsonl` format).
- `output_file_name`: The path of the output file (for storing the outputs).
- `n_jobs`: Number of parallel requests to send to the model API.
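For instance, sending the evaluator requests generated above to a non-OpenAI model with eight concurrent workers might look like this (the paths are hypothetical placeholders, and the worker count is an arbitrary choice):
```bash
# Hypothetical example: run the requests through the regular API with
# 8 parallel workers. Paths are illustrative placeholders.
python parallel_call.py \
    --input_file_name outputs/compare_axes_requests.jsonl \
    --output_file_name outputs/compare_axes_results.jsonl \
    --n_jobs 8
```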
Citation
If you use this repository or our models, please cite our work:
```bibtex
@inproceedings{doddapaneni-etal-2024-finding,
    title = "Finding Blind Spots in Evaluator {LLM}s with Interpretable Checklists",
    author = "Doddapaneni, Sumanth and
      Khan, Mohammed Safi Ur Rahman and
      Verma, Sshubam and
      Khapra, Mitesh M",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.911/",
    doi = "10.18653/v1/2024.emnlp-main.911",
    pages = "16279--16309",
    abstract = "Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs, that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50{\%} of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. \textit{These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications.}"
}
```
Owner
- Name: AI4Bhārat
- Login: AI4Bharat
- Kind: organization
- Email: opensource@ai4bharat.org
- Location: India
- Website: https://ai4bharat.org
- Twitter: AI4Bharat
- Repositories: 37
- Profile: https://github.com/AI4Bharat
Artificial-Intelligence-For-Bhārat: Building open-source AI solutions for India!
GitHub Events
Total
- Watch event: 9
- Push event: 4
- Fork event: 1
Last Year
- Watch event: 9
- Push event: 4
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 13
- Total pull requests: 1
- Average time to close issues: 12 days
- Average time to close pull requests: about 2 hours
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- safikhanSoofiyani (13)
Pull Request Authors
- safikhanSoofiyani (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- langchain *
- openai *
- pandas *