rl-llm-calibration-test

An attempt to replicate parts of the paper "Language models (mostly) know what they know" on open datasets and models.

https://github.com/aakarsh/rl-llm-calibration-test

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 9 committers (11.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Keywords

llama2 llm
Last synced: 6 months ago

Repository

An attempt to replicate parts of the paper "Language models (mostly) know what they know" on open datasets and models.

Basic Info
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 1
  • Open Issues: 10
  • Releases: 0
Topics
llama2 llm
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Citation

README.md

LLM Calibration Benchmark

This repository runs calibration benchmarks on several popular, openly available language models.

Installation

pip install -r requirements.txt

When running in the Colab environment, it is recommended to use:

pip install -r requirements-colab.txt

Unit Tests

Running the unit tests requires the pytest module, invoked as follows:

python -m pytest test

Running Individual Experiments

Any individual experiment can be rerun using the following command:

python ../llm_calibration/run_experiment.py --model_name='meta-llama/Llama-2-13b-hf' --dataset='STEM'

Each experiment produces a JSON result file, which can be parsed offline to generate the requisite plots.
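As a sketch of what offline parsing might look like — the field names below (`confidence`, `correct`) are assumptions for illustration, not this repository's actual output schema — a result file could be summarized as:

```python
import json

# Hypothetical result-file layout: a JSON list of records, each carrying
# the model's "confidence" in its chosen option and a "correct" flag.
# The actual schema of this repository's output files may differ.

def load_results(path):
    """Load a JSON result file into a list of record dicts."""
    with open(path) as fh:
        return json.load(fh)

def summarize(records):
    """Compute overall accuracy and mean confidence for a run."""
    total = len(records)
    accuracy = sum(r["correct"] for r in records) / total
    mean_conf = sum(r["confidence"] for r in records) / total
    return {"n": total, "accuracy": accuracy, "mean_confidence": mean_conf}

# Example usage (hypothetical file name):
# stats = summarize(load_results("llama-2-13b_stem_results.json"))
```

Comparing `accuracy` against `mean_confidence` per run gives a first, coarse view of over- or under-confidence before binning into full calibration plots.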

Owner

  • Name: Aakarsh Nair
  • Login: aakarsh
  • Kind: user
  • Location: Portland, OR
  • Company: www.nentei.com

“The present moment is the only moment available to us and it is the door to all other moments.” ~TNH

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Examining Calibration of Large Language Models in Question
  Answering
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Aakarsh
    family-names: Nair
    email: aakarsh.nair@student.uni-tuebingen.de
    affiliation: Eberhard Karls Universität Tübingen
  - given-names: Ilinca
    family-names: Vandici
    affiliation: Eberhard Karls Universität Tübingen
  - given-names: Linus
    family-names: Rösener
    affiliation: Eberhard Karls Universität Tübingen
repository-code: 'https://github.com/aakarsh/rl-llm-calibration-test'
abstract: >
  In this paper, we examine the issue of calibration of large language
  models: that is, the interaction between the \emph{confidence} of a
  predicted answer on a question-answering task and its \emph{empirical
  likelihood of being correct}.

  We replicate elements of a previous calibration study
  \cite{kadavath2022language} on several multiple-choice datasets (MMLU,
  LogicQA, TruthfulQA) and on open-ended question-answering datasets
  translated into the multiple-choice format (TriviaQA, HumanEval, GSM8k).

  We find that models' calibration ability scales with model size.
  Moreover, models fine-tuned for conversation improve in calibration and
  accuracy under multi-shot prompting. However, we also observe that for
  tasks beyond a model's reasoning capability (complex logical and
  scientific reasoning), fine-tuning harms accuracy and leads to
  overconfidence in model predictions.
license: GPL-1.0+
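The abstract's notion of calibration — agreement between a model's stated confidence and its empirical accuracy — is commonly summarized as expected calibration error (ECE). The sketch below is illustrative only and is not taken from this repository's code:

```python
# Sketch of expected calibration error (ECE): bin predictions by
# confidence, then compare each bin's mean confidence with its accuracy.
# A perfectly calibrated model has ECE == 0.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: 0/1 outcomes per prediction."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Example usage:
# ece = expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1])
```

Overconfidence of the kind the abstract reports on hard reasoning tasks shows up here as bins whose mean confidence exceeds their accuracy.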

GitHub Events

Total
Last Year

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 202
  • Total Committers: 9
  • Avg Commits per committer: 22.444
  • Development Distribution Score (DDS): 0.213
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
aakarsh nair a****h@n****m 159
linusroesener 5****r 14
Aakarsh Nair a****r@g****m 13
Bela Linus Roesner l****r@s****e 5
ILINCA VANDICI i****u@g****m 4
Ubuntu u****u@i****l 4
zxoxo45@uni-tuebingen.de t****5@u****n 1
zxoxo45@uni-tuebingen.de t****5@u****n 1
zxoxo45@uni-tuebingen.de t****5@u****n 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 10
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • aakarsh (10)
Pull Request Authors
  • ilinkaa (7)
  • aakarsh (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/docker-image.yml actions
  • actions/checkout v3 composite
Dockerfile docker
  • python 3.9 build
requirements-colab.txt pypi
  • accelerate ==0.28.0
  • bitsandbytes ==0.42.0
  • bitsandbytes-cuda110 ==0.26.0.post2
  • datasets ==2.18.0
  • docker-pycreds ==0.4.0
  • docstring_parser ==0.16
  • evaluate ==0.4.1
  • httpx ==0.26.0
  • huggingface-hub ==0.20.3
  • hydra-core ==1.3.2
  • idna ==3.3
  • intervaltree ==3.1.0
  • multidict ==6.0.5
  • multimethod ==1.8
  • multiprocess ==0.70.16
  • nltk ==3.8.1
  • pip ==23.0.1
  • pluggy ==1.0.0
  • ptyprocess ==0.7.0
  • pycodestyle ==2.7.0
  • pycparser ==2.21
  • pycryptodomex ==3.14.1
  • python-dateutil ==2.8.2
  • python-lsp-black ==1.0.0
  • python-lsp-jsonrpc ==1.0.0
  • python-lsp-server ==1.2.4
  • python-slugify ==5.0.2
  • rope ==0.22.0
  • tiktoken ==0.5.2
  • tokenizers ==0.13.3
  • toml ==0.10.2
  • tomli ==2.0.1
  • toolz ==0.11.2
  • transformers ==4.31.0
  • trl ==0.8.1
  • untokenize ==0.1.1
  • urllib3 ==1.26.11
  • visions ==0.7.4
requirements.txt pypi
  • accelerate ==0.28.0
  • bitsandbytes ==0.42.0
  • bitsandbytes-cuda110 ==0.26.0.post2
  • datasets ==2.18.0
  • docker-pycreds ==0.4.0
  • docstring_parser ==0.16
  • evaluate ==0.4.1
  • httpx ==0.26.0
  • huggingface-hub ==0.20.3
  • hydra-core ==1.3.2
  • idna ==3.3
  • intervaltree ==3.1.0
  • ipykernel ==6.9.1
  • ipython ==8.4.0
  • ipython-genutils ==0.2.0
  • jupyter-client ==6.1.12
  • jupyter-core ==4.10.0
  • jupyter-server ==1.18.1
  • jupyterlab ==3.2.9
  • jupyterlab-pygments ==0.1.2
  • jupyterlab-server ==2.12.0
  • matplotlib ==3.5.1
  • matplotlib-inline ==0.1.2
  • multidict ==6.0.5
  • multimethod ==1.8
  • multiprocess ==0.70.16
  • nbclassic ==0.3.5
  • nbclient ==0.5.13
  • nbconvert ==6.4.4
  • nbformat ==5.3.0
  • nltk ==3.8.1
  • np ==1.0.2
  • numba ==0.55.1
  • numpy ==1.21.6
  • numpydoc ==1.4.0
  • pip ==23.0.1
  • plotly ==5.7.0
  • pluggy ==1.0.0
  • psutil ==5.9.0
  • ptyprocess ==0.7.0
  • pycodestyle ==2.7.0
  • pycparser ==2.21
  • pycryptodomex ==3.14.1
  • pytest ==7.1.1
  • python-dateutil ==2.8.2
  • python-lsp-black ==1.0.0
  • python-lsp-jsonrpc ==1.0.0
  • python-lsp-server ==1.2.4
  • python-slugify ==5.0.2
  • pytz ==2022.1
  • rope ==0.22.0
  • safetensors ==0.4.2
  • scikit-image ==0.19.2
  • scikit-learn ==1.1.1
  • scipy ==1.8.0
  • seaborn ==0.11.2
  • tiktoken ==0.5.2
  • tokenizers ==0.13.3
  • toml ==0.10.2
  • tomli ==2.0.1
  • toolz ==0.11.2
  • torch ==1.13.1
  • torchinfo ==1.7.0
  • torchvision ==0.14.1
  • tornado ==6.1
  • transformers ==4.31.0
  • trl ==0.8.1
  • untokenize ==0.1.1
  • urllib3 ==1.26.11
  • visions ==0.7.4