openqa-eval

ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models

https://github.com/ehsk/openqa-eval

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

evaluation large-language-models open-domain-qa openai-api question-answering
Last synced: 6 months ago

Repository

ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models

Basic Info
Statistics
  • Stars: 46
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
evaluation large-language-models open-domain-qa openai-api question-answering
Created almost 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme · License · Citation

README.md

Evaluating Open-Domain Question Answering in the Era of Large Language Models

This repository hosts the code for our ACL 2023 paper: https://arxiv.org/abs/2305.06984.

Overview

Lexical matching is the standard evaluation method for open-domain question answering (QA), but it fails when plausible answers are missing from the list of gold answers. In this study, we manually examined the answers of several open-domain QA models and found that

  1. The true performance of all models is severely underestimated by lexical matching;
  2. The performance of LLMs increases by nearly +60%, and the few-shot LLM (InstructGPT, text-davinci-003) actually achieves state-of-the-art results;
  3. Automated evaluation methods (BERT-based or LLM-based) are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers usually generated by LLMs.

Please see our paper for more details.

Requirements

The code needs Python 3.8+ (we tested it with Python 3.8).

To install from the repo:

```bash
pip install git+https://github.com/ehsk/OpenQA-eval.git
```

To install from source:

```bash
git clone git@github.com:ehsk/OpenQA-eval.git
cd OpenQA-eval
pip install -e .
```

Data

We worked on the Natural Questions-open (Lee et al., ACL 2019) test set, which consists of 3,610 questions. We randomly sampled 301 questions for annotation.

In the data directory, we provide the answers generated by all open-domain QA models along with the output of the four evaluation mechanisms, described in the paper:

```bash
data
├── model_outputs                                   # Answers generated by 12 open-domain QA models
│   ├── NQ301_text-davinci-003_fewshot-n64.jsonl    # InstructGPT (few-shot)
│   ├── NQ301_text-davinci-003_zeroshot.jsonl       # InstructGPT (zero-shot)
│   ├── NQ_ANCE-plus_FiD.jsonl                      # ANCE+ & Fusion-In-Decoder
│   └── ...
├── NQ301_BEM.tsv                # BEM predictions for all generated answers
├── NQ301_gpt-4.tsv              # GPT4-eval output for all generated answers
├── NQ301_human.tsv              # Human judgments for all generated answers
└── NQ301_text-davinci-003.tsv   # InstructGPT-eval output for all generated answers
```

The annotations can also be viewed online here.

Evaluation

The evaluation script takes a prediction file in the JSONL format shown below and measures its performance with different metrics.

json lines {"question": "who is under the mask of darth vader", "answer": ["Anakin Skywalker"], "prediction": "Anakin Skywalker"} {"question": "which is the default file extension for an audio file in windows media player", "answer": ["Windows Playlist ( WPL )"], "prediction": "WMA"}

The following command computes only two lexical matching metrics: EM (exact-match accuracy) and macro-averaged F1.

```bash
python -m oqaeval /path/to/prediction_file.jsonl
```
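EM and token-level F1 are the standard SQuAD-style lexical-matching metrics. As a rough sketch of how they are typically computed (using the common SQuAD normalization rules; this is not necessarily oqaeval's exact implementation):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

def token_f1(prediction, gold_answers):
    """Maximum token-overlap F1 between the prediction and any gold answer."""
    def f1_single(pred, gold):
        pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
        common = Counter(pred_toks) & Counter(gold_toks)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(prediction, g) for g in gold_answers)

# Macro-averaged scores are then simple means over all questions, e.g.:
# em = sum(exact_match(p, a) for p, a in pairs) / len(pairs)
```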

To evaluate using an LLM, like InstructGPT-eval in the paper, pass the model name (text-davinci-003 or gpt-4) as an argument:

```bash
python -m oqaeval /path/to/prediction_file.jsonl --model text-davinci-003
```

This calls the OpenAI API, so the environment variable OPENAI_API_KEY needs to be set first. Bear in mind that running this command will incur charges to your OpenAI account. We did not see a significant difference between GPT-4 and InstructGPT, so we recommend using the cheaper model (InstructGPT).
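Under the hood, an LLM-based judge of this kind sends the question, the gold answers, and the candidate answer to the model and asks for a verdict. A minimal sketch with the legacy (pre-1.0) openai Python SDK; the prompt wording below is a hypothetical illustration, not the prompt used in the paper or by oqaeval:

```python
import os
import openai  # legacy (pre-1.0) SDK interface

openai.api_key = os.environ["OPENAI_API_KEY"]

def llm_judge(question, gold_answers, prediction, model="text-davinci-003"):
    """Ask the LLM whether a candidate answer is correct.

    The prompt is a hypothetical illustration, not the exact prompt
    used in the paper or in oqaeval.
    """
    prompt = (
        f"Question: {question}\n"
        f"Reference answers: {'; '.join(gold_answers)}\n"
        f"Candidate answer: {prediction}\n"
        "Is the candidate answer correct? Answer yes or no:"
    )
    response = openai.Completion.create(
        model=model, prompt=prompt, max_tokens=2, temperature=0
    )
    return response["choices"][0]["text"].strip().lower().startswith("yes")
```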

To evaluate using our provided annotation files, including human judgments, you can simply run:

```bash
python -m oqaeval /path/to/prediction_file.jsonl --annotation data/NQ301_human.tsv
```

The above command evaluates only the 301 annotated questions and skips the rest of the prediction file.
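If you want to work with the annotation file directly (for example, to restrict a prediction file to the annotated subset yourself), something like the following works. The column name "question" is an assumption for illustration, so check the header of data/NQ301_human.tsv for the actual layout:

```python
import csv
import json

def load_annotations(path):
    """Group human judgments by question (column name assumed, not verified)."""
    annotations = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            annotations.setdefault(row["question"], []).append(row)
    return annotations

def annotated_subset(pred_path, annotations):
    """Keep only predictions whose question is among the annotated ones."""
    kept = []
    with open(pred_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["question"] in annotations:
                kept.append(record)
    return kept
```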

Bugs or questions

If you have any questions or encounter any problems, feel free to open an issue.

Citation

If you want to cite our paper, please use:

```bibtex
@article{kamalloo2023evaluating,
  title   = "{Evaluating Open-Domain Question Answering in the Era of Large Language Models}",
  author  = {Kamalloo, Ehsan and Dziri, Nouha and Clarke, Charles L. A. and Rafiei, Davood},
  journal = {arXiv preprint arXiv:2305.06984},
  year    = {2023}
}
```

License

This work is licensed under the MIT license. See LICENSE for details.

Owner

  • Name: Ehsan
  • Login: ehsk
  • Kind: user
  • Location: Waterloo, Canada

Postdoc at University of Waterloo, PhD from University of Alberta.

Citation (CITATION.cff)

cff-version: "1.2.0"
date-released: 2023-05
message: "If you use our work, please cite it using these metadata."
title: "Evaluating Open-Domain Question Answering in the Era of Large Language Models"
url: "https://github.com/ehsk/OpenQA-eval"
authors:
  - family-names: Kamalloo
    given-names: Ehsan
  - family-names: Dziri
    given-names: Nouha
  - family-names: L. A. Clarke
    given-names: Charles
  - family-names: Rafiei
    given-names: Davood
preferred-citation:
  type: conference-paper
  authors:
  - family-names: Kamalloo
    given-names: Ehsan
  - family-names: Dziri
    given-names: Nouha
  - family-names: L. A. Clarke
    given-names: Charles
  - family-names: Rafiei
    given-names: Davood
  month: 07
  title: "Evaluating Open-Domain Question Answering in the Era of Large Language Models"
  year: 2023
  url:  "https://arxiv.org/abs/2305.06984"
  booktitle: "Association for Computational Linguistics"

GitHub Events

Total
  • Watch event: 6
Last Year
  • Watch event: 6

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 4
  • Total Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 4
  • Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Ehsan (e****o@g****m): 4 commits

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 1 day
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 1 day
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • rhofour (1)
Pull Request Authors