https://github.com/amazon-science/factual-confidence-of-llms

Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary

Keywords

confidence, factual, factuality, llm, llms, robustness
Last synced: 5 months ago

Repository

Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"

Basic Info
Statistics
  • Stars: 8
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
confidence, factual, factuality, llm, llms, robustness
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

This repository contains the code used for experiments from: Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators.

(Figure: descriptive diagram of a sentence being processed by multiple testing methods)

This repository groups together 5 types of methods used to estimate factual confidence in LLMs, which can then be used to reproduce the experiments and test them on question-answering datasets:

  • Verbalised (prompt based)
  • Trained probe (requires training)
  • Surrogate token probability (prompt based)
  • Average sequence probability (see the sketch after this list)
  • Model consistency
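As a rough illustration of the average sequence probability estimator, here is a minimal sketch (not the repository's code; the model name is a placeholder): a statement is scored by the mean log-probability its tokens receive under the model.

```python
# Minimal sketch of the "average sequence probability" estimator: score a statement
# by the mean log-probability its tokens receive under the model.
# Not the repository's code; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_seq_logprob(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token,
    # so its negation is the average token log-probability.
    return -out.loss.item()

print(avg_seq_logprob("Paris is the capital of France."))
```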

We additionally set up a paraphrasing pipeline, using strong filtering to ensure semantic preservation. This allows testing a model's confidence in a fact across different phrasings and translations.
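As a hedged sketch of the kind of semantic-preservation filter such a pipeline can use, the following checks mutual entailment with an off-the-shelf NLI model; the model choice and the threshold-free criterion are assumptions, not the repository's actual filter.

```python
# Illustrative semantic-preservation filter: keep a paraphrase only if an NLI model
# predicts entailment in both directions. The model choice is an assumption, and this
# is a stand-in for the repository's actual filtering, not its implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"  # assumption: any NLI model could be used here
tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)
nli_model.eval()

def entails(premise: str, hypothesis: str) -> bool:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[int(logits.argmax(dim=-1))]
    return label.upper() == "ENTAILMENT"

def keep_paraphrase(original: str, paraphrase: str) -> bool:
    # Mutual entailment as a proxy for semantic equivalence.
    return entails(original, paraphrase) and entails(paraphrase, original)

print(keep_paraphrase("Paris is the capital of France.",
                      "France's capital city is Paris."))
```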

Getting Started

Installation

The project uses poetry for dependency management and packaging. The latest version and instructions can be found on https://python-poetry.org. Official installer:

```shell
curl -sSL https://install.python-poetry.org | python3 -
```

Then install the project's dependencies:

```shell
poetry install
```

Using poetry takes care of all dependencies and therefore removes the need for requirements.txt. Should you still need that file for any reason, it can be generated using:

```shell
poetry export -f requirements.txt --output requirements.txt --without-hashes
```

Accelerate

This project uses Hugging Face's accelerate for GPU management. Feel free to run `accelerate config` to get the most out of it.
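For reference, here is a minimal sketch of how accelerate typically wraps a model and dataloader (illustrative only, not the repository's scripts); running `accelerate config` once and then launching scripts with `accelerate launch` (for example `accelerate launch main.py`) applies the saved configuration.

```python
# Minimal sketch of a typical accelerate setup (illustrative, not the repository's code).
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # picks up the settings saved by `accelerate config`

model = torch.nn.Linear(16, 2)                              # stand-in for an actual LLM
loader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

# prepare() moves the model and data to the right device(s) for the chosen setup
model, loader = accelerator.prepare(model, loader)

model.eval()
with torch.no_grad():
    for (batch,) in loader:
        logits = model(batch)  # the batch is already on the correct device
```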

Usage

data generation pipeline:

Datasets have at least the following columns: ["text", "uuid", "is_factual"]. If the paraphrasing option is used, an additional "paraphrase" column is used.
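For illustration, a dataset with this schema might look as follows (made-up rows, not the repository's data files):

```python
# Illustrative example of the expected dataset schema (made-up rows, not the repo's data).
import pandas as pd

df = pd.DataFrame(
    [
        {"text": "Paris is the capital of France.", "uuid": "q-0001", "is_factual": True},
        {"text": "Paris is the capital of Spain.", "uuid": "q-0002", "is_factual": False},
    ]
)

# Optional column consumed by the paraphrasing option
df["paraphrase"] = ["France's capital city is Paris.", "The capital of Spain is Paris."]

assert {"text", "uuid", "is_factual"}.issubset(df.columns)
```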

To prepare the True/False LAMA T-REx dataset, use datasetprep.py, which will create a test and train set in a data folder at the repository root. To experiment with the PopQA dataset:

  • Download the csv file from the following link (tested on 25/06/2024)
  • Run slot_filling.py to get a specific model's ability to correctly answer each question and generate the "is_factual" column (a hedged sketch of the idea follows below)
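Conceptually, that labeling step amounts to checking whether the model's greedy answer contains the gold answer. Prompt format, model, and matching rule below are illustrative assumptions, not the script's actual behaviour.

```python
# Hedged sketch of deriving an "is_factual"-style label: check whether the model's
# greedy answer to a question contains the gold answer. Prompt format, model, and
# matching rule are illustrative assumptions, not the script's actual behaviour.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def model_answers_correctly(question: str, gold_answer: str) -> bool:
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    completion = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return gold_answer.lower() in completion.lower()

# is_factual label for one row:
# model_answers_correctly("What is the capital of France?", "Paris")
```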

to run experiments:

  1. Run the training pipeline (the "hidden" method)
  2. Run main.py (all results are saved except for consistency)
  3. Run the consistency pipeline

Example scripts: scripts/main.sh, scripts/mainpop.sh, scripts/maintranslated.sh, scripts/mainpiklama.sh.

OpenAI results are computed by running either evaluation/openaisurrogate.py, evaluation/openaiverbalized.py, or datagen/openaisampler.py, followed by the consistency pipeline. Don't forget to set the variable in your environment before running: OPENAI_KEY=$mysecretkey (a minimal sketch of how the key is read follows below).
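For reference, a minimal sketch of reading the key from the environment before calling the OpenAI API; the model name and prompt are placeholders, not the repository's settings.

```python
# Minimal sketch of reading the key from the environment before any OpenAI call.
# The model name and prompt are placeholders, not the repository's settings.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_KEY"])  # variable name as used above

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Is the following statement true or false? Paris is the capital of France."}],
)
print(response.choices[0].message.content)
```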

training pipeline - run, in order:

  1. evaluation/extracthiddenlayers.py (runs a given model on a given dataset, and saves the hidden states + labels for training)
  2. trainscorer_2 (takes the hidden states from the previous script as input, runs gradient descent, and saves the resulting model; see the sketch below)

Example script: scripts/extracthidden.sh
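A minimal sketch of the trained-probe idea, assuming hidden states and binary factuality labels have already been extracted; dimensions, epochs, and data here are made up, not the repository's configuration.

```python
# Illustrative linear probe trained on extracted hidden states (not the repository's
# trainscorer code; dimensions, epochs, and data here are made up).
import torch
import torch.nn as nn

hidden = torch.randn(1000, 4096)                # stand-in for saved hidden states
labels = torch.randint(0, 2, (1000,)).float()   # stand-in for is_factual labels

probe = nn.Linear(4096, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):
    optimizer.zero_grad()
    logits = probe(hidden).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

# The probe's sigmoid output is then used as the factual-confidence estimate.
confidence = torch.sigmoid(probe(hidden)).squeeze(-1)
```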

consistency pipeline - run, in order:

  1. slot_filling.py (checks, either for PopQA or for LAMA, whether a model outputs the expected answer to a given prompt; this serves as the labels. If these were already generated for previous experiments, skip this step)
    • For the LAMA dataset, an alternative is to run comparative_knowledge.py, which tests whether the model is more likely to output the true fact or the hardest false fact. This requires the Wikidata graphs.
  2. datagen/sampling.py (generates n completions and saves them as csv (raw) and tsv (processed by the cleanupsampling function))
  3. evaluation/consistency_utils.py (takes the .tsv file as input and returns a .pt file matching uuids with consistency scores; see the sketch below)

example scripts: scripts/sf.sh, scripts/sampling.sh
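The consistency score itself boils down to measuring agreement between the sampled completions. A hedged sketch of that idea follows; the agreement measure here is a simple stand-in, not the repository's exact metric.

```python
# Illustrative consistency score: pairwise agreement between sampled completions.
# The repository's pipeline does its own cleanup and matching; this only shows the core idea.
from itertools import combinations

def consistency_score(completions: list[str]) -> float:
    """Fraction of completion pairs that agree (here: exact match after normalisation)."""
    normalised = [c.strip().lower() for c in completions]
    pairs = list(combinations(normalised, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

samples = ["Paris", "Paris", "Lyon", "Paris"]
print(consistency_score(samples))  # 0.5 -> 3 agreeing pairs out of 6
```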

paraphrasing pipeline:

  • datagen/paraphrases/genparaphrasing.py (saves a .csv version of the dataset with an additional "paraphrase" column)
  • run main.py, with the paraphrase flag set to True

to draw graphs from data see:

  • graphing/draw_graphs.py (bar plots and method correlation plot; further directions are commented at the start of the file)
  • graphing/consistency_analysis.py (gets AUPRC numbers from the sampling pipeline, which then need to be manually added to the bar plot)
  • graphing/paraphgraphutils.py (computes the micro-average across paraphrases, the macro-average, and the normalized standard deviation; see the sketch below)
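For reference, a hedged sketch of what micro-average, macro-average, and a normalized standard deviation across paraphrases can look like; the array shape and the exact normalization are assumptions, not the script's definitions.

```python
# Illustrative micro/macro aggregation of a confidence score across paraphrases.
# Rows are facts, columns are paraphrases of the same fact; the values are made up,
# and the normalization used here is an assumption, not the script's exact definition.
import numpy as np

scores = np.array([
    [0.9, 0.8, 0.85],   # fact 1 under three phrasings
    [0.4, 0.6, 0.5],    # fact 2 under three phrasings
])

micro_avg = scores.mean()               # average over every (fact, paraphrase) pair
macro_avg = scores.mean(axis=1).mean()  # per-fact average first, then across facts
# The two differ when facts have different numbers of paraphrases.

# Per-fact standard deviation divided by the per-fact mean, averaged across facts.
norm_std = (scores.std(axis=1) / scores.mean(axis=1)).mean()

print(micro_avg, macro_avg, norm_std)
```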

References

Please cite as [1].

[1] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, and L. Màrquez. "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.

```bibtex
@inproceedings{mahaut-etal-2024-factual,
    title = "Factual Confidence of {LLM}s: on Reliability and Robustness of Current Estimators",
    author = {Mahaut, Mat{\'e}o and Aina, Laura and Czarnowska, Paula and Hardalov, Momchil and M{\"u}ller, Thomas and M{\`a}rquez, Llu{\'\i}s},
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics",
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.250",
    pages = "4554--4570"
}
```

License

  • This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 7
  • Push event: 1
Last Year
  • Watch event: 7
  • Push event: 1

Dependencies

pyproject.toml pypi
  • black ^23.10.0 develop
  • flake8 ^6.1.0 develop
  • isort ^5.12.0 develop
  • matplotlib ^3.7 develop
  • pytest ^7.4.2 develop
  • accelerate ^0.23.0
  • bitsandbytes ^0.42.0
  • boto3 ^1.34.15
  • datasets ^2.14.5
  • deepspeed ^0.11.1
  • numpy ^1.2
  • openai ^1.6.1
  • pandas ^2.1.4
  • protobuf ^4.25.1
  • python ^3.9
  • scikit-learn ^1.3.1
  • seaborn ^0.13.0
  • sentencepiece ^0.1.99
  • spacy ^3.7.2
  • torch ^2.0.1
  • torcheval ^0.0.7
  • transformers ^4.34.0
  • typer ^0.9.0
  • word2num ^0.1.1
  • word2number ^1.1
requirements.txt pypi
  • accelerate *
  • peft *
  • torch *
  • transformers *