https://github.com/amazon-science/factual-confidence-of-llms

Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary

Keywords

confidence, factual, factuality, llm, llms, robustness
Last synced: 5 months ago

Repository

Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"

Basic Info
Statistics
  • Stars: 8
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
confidence, factual, factuality, llm, llms, robustness
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

This repository contains the code used for experiments from: Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators.

(Figure: descriptive diagram of a sentence being processed by multiple testing methods)

This repository groups together 5 types of methods used to estimate factual confidence in LLMs, which can then be used to reproduce the experiments and test them on question-answering datasets:

  • Verbalised (prompt based)
  • Trained probe (requires training)
  • Surrogate token probability (prompt based)
  • Average sequence probability (see the sketch after this list)
  • Model consistency
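As a rough illustration of the average sequence probability estimator, here is a minimal sketch (not the repository's code; the model name is a placeholder): a statement is scored by the mean log-probability its tokens receive under the model.

```python
# Minimal sketch of the "average sequence probability" estimator: score a statement
# by the mean log-probability its tokens receive under the model.
# Not the repository's code; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_seq_logprob(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token,
    # so its negation is the average token log-probability.
    return -out.loss.item()

print(avg_seq_logprob("Paris is the capital of France."))
```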

We additionally set up a paraphrasing pipeline, using strong filtering to ensure semantic preservation. This allows testing a model's confidence in a fact across different phrasings and translations.
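As a hedged sketch of the kind of semantic-preservation filter such a pipeline can use, the following checks mutual entailment with an off-the-shelf NLI model; the model choice and the threshold-free criterion are assumptions, not the repository's actual filter.

```python
# Illustrative semantic-preservation filter: keep a paraphrase only if an NLI model
# predicts entailment in both directions. The model choice is an assumption, and this
# is a stand-in for the repository's actual filtering, not its implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"  # assumption: any NLI model could be used here
tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)
nli_model.eval()

def entails(premise: str, hypothesis: str) -> bool:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[int(logits.argmax(dim=-1))]
    return label.upper() == "ENTAILMENT"

def keep_paraphrase(original: str, paraphrase: str) -> bool:
    # Mutual entailment as a proxy for semantic equivalence.
    return entails(original, paraphrase) and entails(paraphrase, original)

print(keep_paraphrase("Paris is the capital of France.",
                      "France's capital city is Paris."))
```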

Getting Started

Installation

The project uses poetry for dependency management and packaging. The latest version and instructions can be found on https://python-poetry.org. Official installer:

```shell
curl -sSL https://install.python-poetry.org | python3 -
```

Then install the project's dependencies:

```shell
poetry install
```

Using poetry takes care of all dependencies and therefore removes the need for requirements.txt. Should you still need that file for any reason, it can be generated using:

```shell
poetry export -f requirements.txt --output requirements.txt --without-hashes
```

Accelerate

This project uses Hugging Face's accelerate for GPU management. Feel free to run `accelerate config` to get the most out of it.
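For reference, here is a minimal sketch of how accelerate typically wraps a model and dataloader (illustrative only, not the repository's scripts); running `accelerate config` once and then launching scripts with `accelerate launch` (for example `accelerate launch main.py`) applies the saved configuration.

```python
# Minimal sketch of a typical accelerate setup (illustrative, not the repository's code).
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # picks up the settings saved by `accelerate config`

model = torch.nn.Linear(16, 2)                              # stand-in for an actual LLM
loader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

# prepare() moves the model and data to the right device(s) for the chosen setup
model, loader = accelerator.prepare(model, loader)

model.eval()
with torch.no_grad():
    for (batch,) in loader:
        logits = model(batch)  # the batch is already on the correct device
```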

Usage

data generation pipeline:

Datasets have at least the following columns: ["text", "uuid", "is_factual"]. If the paraphrasing option is used, an additional "paraphrase" column is used.
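For illustration, a dataset with this schema might look as follows (made-up rows, not the repository's data files):

```python
# Illustrative example of the expected dataset schema (made-up rows, not the repo's data).
import pandas as pd

df = pd.DataFrame(
    [
        {"text": "Paris is the capital of France.", "uuid": "q-0001", "is_factual": True},
        {"text": "Paris is the capital of Spain.", "uuid": "q-0002", "is_factual": False},
    ]
)

# Optional column consumed by the paraphrasing option
df["paraphrase"] = ["France's capital city is Paris.", "The capital of Spain is Paris."]

assert {"text", "uuid", "is_factual"}.issubset(df.columns)
```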

To prepare the True/False LAMA T-REx dataset, use datasetprep.py, which will create a test and train set in a data folder at the repository root. To experiment with the PopQA dataset:

  • Download the csv file from the following link (tested on 25/06/2024)
  • Run slot_filling.py to get a specific model's ability to correctly answer each question and generate the "is_factual" column (a hedged sketch of the idea follows below)
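Conceptually, that labeling step amounts to checking whether the model's greedy answer contains the gold answer. Prompt format, model, and matching rule below are illustrative assumptions, not the script's actual behaviour.

```python
# Hedged sketch of deriving an "is_factual"-style label: check whether the model's
# greedy answer to a question contains the gold answer. Prompt format, model, and
# matching rule are illustrative assumptions, not the script's actual behaviour.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def model_answers_correctly(question: str, gold_answer: str) -> bool:
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    completion = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return gold_answer.lower() in completion.lower()

# is_factual label for one row:
# model_answers_correctly("What is the capital of France?", "Paris")
```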

to run experiments:

  1. Run the training pipeline (the "hidden" method)
  2. Run main.py (all results are saved except for consistency)
  3. Run the consistency pipeline

Example scripts: scripts/main.sh, scripts/mainpop.sh, scripts/maintranslated.sh, scripts/mainpiklama.sh.

OpenAI results are computed by running either evaluation/openaisurrogate.py, evaluation/openaiverbalized.py, or datagen/openaisampler.py, followed by the consistency pipeline. Don't forget to set the variable in your environment before running: OPENAI_KEY=$mysecretkey (a minimal sketch of how the key is read follows below).
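For reference, a minimal sketch of reading the key from the environment before calling the OpenAI API; the model name and prompt are placeholders, not the repository's settings.

```python
# Minimal sketch of reading the key from the environment before any OpenAI call.
# The model name and prompt are placeholders, not the repository's settings.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_KEY"])  # variable name as used above

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Is the following statement true or false? Paris is the capital of France."}],
)
print(response.choices[0].message.content)
```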

training pipeline - run, in order:

  1. evaluation/extracthiddenlayers.py (runs a given model on a given dataset, and saves the hidden states + labels for training)
  2. trainscorer_2 (takes the hidden states from the previous script as input, runs gradient descent, and saves the resulting model; see the sketch below)

Example script: scripts/extracthidden.sh
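A minimal sketch of the trained-probe idea, assuming hidden states and binary factuality labels have already been extracted; dimensions, epochs, and data here are made up, not the repository's configuration.

```python
# Illustrative linear probe trained on extracted hidden states (not the repository's
# trainscorer code; dimensions, epochs, and data here are made up).
import torch
import torch.nn as nn

hidden = torch.randn(1000, 4096)                # stand-in for saved hidden states
labels = torch.randint(0, 2, (1000,)).float()   # stand-in for is_factual labels

probe = nn.Linear(4096, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    optimizer.zero_grad()
    logits = probe(hidden).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

# The probe's sigmoid output is then used as the factual-confidence estimate.
confidence = torch.sigmoid(probe(hidden)).squeeze(-1)
```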

consistency pipeline - run, in order:

  1. slot_filling.py (checks, either for PopQA or for LAMA, whether a model outputs the expected answer to a given prompt; this serves as the labels. If these were already generated for previous experiments, skip this step)
    • For the LAMA dataset, an alternative is to run comparative_knowledge.py, which tests whether the model is more likely to output the true fact or the hardest false fact. This requires the Wikidata graphs.
  2. datagen/sampling.py (generates n completions and saves them as csv (raw) and tsv (processed by the cleanupsampling function))
  3. evaluation/consistency_utils.py (takes the .tsv file as input and returns a .pt file matching uuids with consistency scores; see the sketch below)

example scripts: scripts/sf.sh, scripts/sampling.sh
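The consistency score itself boils down to measuring agreement between the sampled completions. A hedged sketch of that idea follows; the agreement measure here is a simple stand-in, not the repository's exact metric.

```python
# Illustrative consistency score: pairwise agreement between sampled completions.
# The repository's pipeline does its own cleanup and matching; this only shows the core idea.
from itertools import combinations

def consistency_score(completions: list[str]) -> float:
    """Fraction of completion pairs that agree (here: exact match after normalisation)."""
    normalised = [c.strip().lower() for c in completions]
    pairs = list(combinations(normalised, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

samples = ["Paris", "Paris", "Lyon", "Paris"]
print(consistency_score(samples))  # 0.5 -> 3 agreeing pairs out of 6
```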

paraphrasing pipeline:

  • datagen/paraphrases/genparaphrasing.py (saves a .csv version of the dataset with an additional "paraphrase" column)
  • run main.py, with the paraphrase flag set to True

to draw graphs from data see:

  • graphing/draw_graphs.py (bar plots and method correlation plot; further directions are commented at the start of the file)
  • graphing/consistency_analysis.py (gets AUPRC numbers from the sampling pipeline, which then need to be manually added to the bar plot)
  • graphing/paraphgraphutils.py (computes the micro-average across paraphrases, the macro-average, and the normalized standard deviation; see the sketch below)
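For reference, a hedged sketch of what micro-average, macro-average, and a normalized standard deviation across paraphrases can look like; the array shape and the exact normalization are assumptions, not the script's definitions.

```python
# Illustrative micro/macro aggregation of a confidence score across paraphrases.
# Rows are facts, columns are paraphrases of the same fact; the values are made up,
# and the normalization used here is an assumption, not the script's exact definition.
import numpy as np

scores = np.array([
    [0.9, 0.8, 0.85],   # fact 1 under three phrasings
    [0.4, 0.6, 0.5],    # fact 2 under three phrasings
])

micro_avg = scores.mean()               # average over every (fact, paraphrase) pair
macro_avg = scores.mean(axis=1).mean()  # per-fact average first, then across facts
# The two differ when facts have different numbers of paraphrases.

# Per-fact standard deviation divided by the per-fact mean, averaged across facts.
norm_std = (scores.std(axis=1) / scores.mean(axis=1)).mean()

print(micro_avg, macro_avg, norm_std)
```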

References

Please cite as [1].

[1] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, and L. Màrquez. "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

```bibtex
@inproceedings{mahaut-etal-2024-factual,
    title = "Factual Confidence of {LLM}s: on Reliability and Robustness of Current Estimators",
    author = {Mahaut, Mat{\'e}o and Aina, Laura and Czarnowska, Paula and Hardalov, Momchil and M{\"u}ller, Thomas and M{\`a}rquez, Llu{\'\i}s},
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics",
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.250",
    pages = "4554--4570"
}
```

License

  • This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 7
  • Push event: 1
Last Year
  • Watch event: 7
  • Push event: 1

Dependencies

pyproject.toml pypi
  • black ^23.10.0 develop
  • flake8 ^6.1.0 develop
  • isort ^5.12.0 develop
  • matplotlib ^3.7 develop
  • pytest ^7.4.2 develop
  • accelerate ^0.23.0
  • bitsandbytes ^0.42.0
  • boto3 ^1.34.15
  • datasets ^2.14.5
  • deepspeed ^0.11.1
  • numpy ^1.2
  • openai ^1.6.1
  • pandas ^2.1.4
  • protobuf ^4.25.1
  • python ^3.9
  • scikit-learn ^1.3.1
  • seaborn ^0.13.0
  • sentencepiece ^0.1.99
  • spacy ^3.7.2
  • torch ^2.0.1
  • torcheval ^0.0.7
  • transformers ^4.34.0
  • typer ^0.9.0
  • word2num ^0.1.1
  • word2number ^1.1
requirements.txt pypi
  • accelerate *
  • peft *
  • torch *
  • transformers *