https://github.com/amazon-science/factual-confidence-of-llms
Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 16.6%, to scientific vocabulary)
Repository
Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2406.13415
- Size: 182 KB
Statistics
- Stars: 8
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators
This repository contains the code used for experiments from: Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators.

This repository brings together five types of methods used to estimate factual confidence in LLMs, which can then be used to reproduce the paper's experiments and to test the methods on question-answering datasets:
- Verbalised (prompt based)
- Trained probe (requires training)
- Surrogate token probability (prompt based)
- Average sequence probability
- Model consistency
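As a rough point of reference, the sketch below shows the simplest of these estimators, average sequence probability, computed with Hugging Face transformers. The model name, prompt, and scoring details are illustrative assumptions, not the repository's implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

statement = "The capital of France is Paris."
inputs = tokenizer(statement, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Log-probability of each token given its preceding context, then the
# geometric mean of token probabilities as a crude confidence score.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_log_probs = log_probs.gather(-1, inputs.input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
avg_seq_prob = token_log_probs.mean().exp().item()
print(f"average sequence probability: {avg_seq_prob:.4f}")
```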
We additionally set up a paraphrasing pipeline, using strong filtering to ensure semantic preservation. This makes it possible to test a model on the same fact across different phrasings and translations.
Getting Started
Installation
The project uses poetry for dependency management and packaging. The latest version and installation instructions can be found at https://python-poetry.org. Install it with the official installer:

```shell
curl -sSL https://install.python-poetry.org | python3 -
```

Then, from the repository root, install the project's dependencies:

```shell
poetry install
```
Using poetry takes care of all dependencies and therefore removes the need for a requirements.txt file. Should you still need that file for any reason, it can be generated with:

```shell
poetry export -f requirements.txt --output requirements.txt --without-hashes
```
Accelerate
This project uses Hugging Face's accelerate for GPU management. Feel free to run `accelerate config` to get the most out of it.
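The repository's scripts handle this internally; purely as orientation for how accelerate is typically wired into a training loop (a generic sketch, not code from this repository), the prepare/backward pattern looks like:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # picks up the settings chosen via `accelerate config`

# Toy model and data, purely to show the prepare()/backward() pattern.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward() under accelerate
    optimizer.step()
    optimizer.zero_grad()
```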
Usage
data generation pipeline:
Data has at least the following columns: ["text", "uuid", "is_factual"]. If the paraphrasing option is used, an additional "paraphrase" column is expected (a minimal example of this schema is sketched after the dataset preparation steps below).
To prepare the True/False LAMA T-REx dataset, use datasetprep.py, which will create a test and a train set in a data folder at the repository root. To experiment with the PopQA dataset:
- Download the csv file from the following link (tested on 25/06/2024)
- Run slotfilling.py to get a specific model's ability to correctly answer each question, and to generate the "is_factual" column
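For orientation, here is a minimal, hypothetical example of a dataset following this schema; the rows and uuids are made up and not part of the repository:

```python
import pandas as pd

# Hypothetical rows following the expected schema.
df = pd.DataFrame(
    {
        "text": [
            "The capital of France is Paris.",
            "The capital of France is Rome.",
        ],
        "uuid": ["fact-0001", "fact-0002"],
        "is_factual": [True, False],
        # Only present when the paraphrasing option is used:
        "paraphrase": [
            "Paris is the capital city of France.",
            "Rome is the capital city of France.",
        ],
    }
)
print(df)
```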
to run experiments:
- run the training pipeline ("hidden" method)
- run main.py (all results are saved except for consistency)
- run the consistency pipeline

example scripts: scripts/main.sh, scripts/mainpop.sh, scripts/maintranslated.sh, scripts/mainpiklama.sh

For OpenAI results, run either evaluation/openaisurrogate.py, evaluation/openaiverbalized.py, or datagen/openaisampler.py, followed by the consistency pipeline. Don't forget to set the key variable in your environment before running: OPENAI_KEY=$mysecretkey
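As a minimal sketch of how that environment variable could be consumed with the openai client (the variable name comes from this README, but the model name and prompt are placeholders, and this is not the repository's code):

```python
import os

from openai import OpenAI

# The README expects the key under OPENAI_KEY rather than the client's default
# OPENAI_API_KEY, so pass it explicitly.
client = OpenAI(api_key=os.environ["OPENAI_KEY"])

# Hypothetical call; the actual prompts live in the evaluation scripts.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "True or false: the capital of France is Paris."}],
)
print(response.choices[0].message.content)
```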
training pipeline - run, in order:
example script: scripts/extracthidden.sh
1. evaluation/extracthiddenlayers.py (runs a given model on a given dataset and saves the hidden representations + labels for training)
2. trainscorer_2 (takes the hidden representations from the previous script as input, runs gradient descent, and saves the resulting model)
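For intuition, here is a self-contained sketch of the trained-probe idea: a linear classifier over hidden states trained by gradient descent. The synthetic data and dimensions are placeholders; the real hidden representations come from step 1:

```python
import torch

# Stand-ins for saved hidden states and is_factual labels; shapes are arbitrary.
hidden_dim, n_examples = 4096, 512
X = torch.randn(n_examples, hidden_dim)
y = torch.randint(0, 2, (n_examples,)).float()

probe = torch.nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    logits = probe(X).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The sigmoid of the probe output serves as the factual-confidence score.
confidence = torch.sigmoid(probe(X[:5]).squeeze(-1))
print(confidence)
```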
consistency pipeline - run, in order:
- slot_filling.py (checks, either for PopQA or for LAMA, whether a model outputs the expected answer to a given prompt; this serves as the labels. If these were already generated for previous experiments, this step can be skipped)
- alternatively, for the LAMA dataset, comparative_knowledge.py can be run instead; it tests whether the model is more likely to output the true fact or the hardest false fact. This requires Wikidata graphs.
- datagen/sampling.py (generates n completions and saves them as csv (raw) and tsv (processed by the cleanupsampling function))
- evaluation/consistency_utils.py (takes as input the .tsv file, returns a .pt file matching uuids with consistency scores)
example scripts: scripts/sf.sh, scripts/sampling.sh
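As a simplified illustration of a consistency score (agreement of sampled completions with the most frequent answer), here is a sketch with made-up data; the repository's evaluation/consistency_utils.py may use a different agreement criterion, and the column names are assumptions:

```python
from collections import Counter

import pandas as pd

# Made-up sampled completions for two questions.
samples = pd.DataFrame({
    "uuid":       ["q1", "q1", "q1", "q2", "q2", "q2"],
    "completion": ["Paris", "Paris", "Lyon", "Rome", "Madrid", "Berlin"],
})

def consistency(answers):
    """Share of answers agreeing with the most frequent (normalized) answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

scores = samples.groupby("uuid")["completion"].apply(lambda s: consistency(list(s)))
print(scores)  # q1 ≈ 0.67, q2 ≈ 0.33
```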
paraphrasing pipeline:
- datagen/paraphrases/genparaphrasing.py (saves a .csv version of the dataset with an additional "paraphrase" column)
- run main.py, with the paraphrase flag set to True
to draw graphs from data see:
- graphing/draw_graphs.py (bar plots and method correlation plot; further directions are commented at the start of the file)
- graphing/consistency_analysis.py (gets AUPRC numbers from the sampling pipeline, which then need to be manually added to the bar plot)
- graphing/paraphgraphutils.py (computes micro-average across paraphrases, macro-average, and normalized standard deviation)
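For intuition on the averaging terminology, here is a small made-up example of micro- vs. macro-averaging a per-paraphrase quantity and a normalized standard deviation; column names and numbers are illustrative only, not the script's actual computation:

```python
import pandas as pd

# One row per paraphrase of a fact; "correct" is some per-paraphrase outcome.
df = pd.DataFrame({
    "uuid":    ["a", "a", "a", "b", "b"],
    "correct": [1,   1,   0,   1,   0],
})

micro = df["correct"].mean()                     # pool every paraphrase equally
per_fact = df.groupby("uuid")["correct"].mean()  # average within each fact first
macro = per_fact.mean()                          # then average across facts
norm_std = per_fact.std() / per_fact.mean()      # spread across facts, normalized by the mean

print(micro, macro, norm_std)
```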
References
Please cite as [1].
[1] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, and L. Màrquez, "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators," Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 4554–4570.
@inproceedings{mahaut-etal-2024-factual,
title = "Factual Confidence of {LLM}s: on Reliability and Robustness of Current Estimators",
author = {Mahaut, Mat{\'e}o and
Aina, Laura and
Czarnowska, Paula and
Hardalov, Momchil and
M{\"u}ller, Thomas and
M{\`a}rquez, Llu{\'\i}s},
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics",
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.250",
pages = "4554--4570"
}
License
- This project is licensed under the Apache-2.0 License.
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 7
- Push event: 1
Last Year
- Watch event: 7
- Push event: 1
Dependencies
- black ^23.10.0 develop
- flake8 ^6.1.0 develop
- isort ^5.12.0 develop
- matplotlib ^3.7 develop
- pytest ^7.4.2 develop
- accelerate ^0.23.0
- bitsandbytes ^0.42.0
- boto3 ^1.34.15
- datasets ^2.14.5
- deepspeed ^0.11.1
- numpy ^1.2
- openai ^1.6.1
- pandas ^2.1.4
- protobuf ^4.25.1
- python ^3.9
- scikit-learn ^1.3.1
- seaborn ^0.13.0
- sentencepiece ^0.1.99
- spacy ^3.7.2
- torch ^2.0.1
- torcheval ^0.0.7
- transformers ^4.34.0
- typer ^0.9.0
- word2num ^0.1.1
- word2number ^1.1
- accelerate *
- peft *
- torch *
- transformers *