rl-llm-calibration-test

An attempt to replicate parts of the paper "Language models (mostly) know what they know" on open datasets and models.

https://github.com/aakarsh/rl-llm-calibration-test

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 9 committers (11.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Keywords

llama2 llm
Last synced: 6 months ago

Repository

An attempt to replicate parts of the paper "Language models (mostly) know what they know" on open datasets and models.

Basic Info
Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 1
  • Open Issues: 10
  • Releases: 0
Topics
llama2 llm
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Citation

README.md

LLM Calibration Benchmark

This repository runs calibration benchmarks on several popular, openly available language models.

Installation

pip install -r requirements.txt

When running in the Colab environment, it is recommended to use:

pip install -r requirements-colab.txt

Unit Tests

Running the unit tests requires the pytest module, invoked as follows:

python -m pytest test

Running Individual Experiments

Any individual experiment can be rerun using the following command:

python ../llm_calibration/run_experiment.py --model_name='meta-llama/Llama-2-13b-hf' --dataset='STEM'

Each experiment produces a JSON result file, which can be parsed offline to generate the requisite plots.
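As a sketch of what offline parsing might look like — the field names below (`confidence`, `correct`) are assumptions for illustration, not this repository's actual output schema — a result file could be summarized as:

```python
import json

# Hypothetical result-file layout: a JSON list of records, each carrying
# the model's "confidence" in its chosen option and a "correct" flag.
# The actual schema of this repository's output files may differ.

def load_results(path):
    """Load a JSON result file into a list of record dicts."""
    with open(path) as fh:
        return json.load(fh)

def summarize(records):
    """Compute overall accuracy and mean confidence for a run."""
    total = len(records)
    accuracy = sum(r["correct"] for r in records) / total
    mean_conf = sum(r["confidence"] for r in records) / total
    return {"n": total, "accuracy": accuracy, "mean_confidence": mean_conf}

# Example usage (hypothetical file name):
# stats = summarize(load_results("llama-2-13b_stem_results.json"))
```

Comparing `accuracy` against `mean_confidence` per run gives a first, coarse view of over- or under-confidence before binning into full calibration plots.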

Owner

  • Name: Aakarsh Nair
  • Login: aakarsh
  • Kind: user
  • Location: Portland, OR
  • Company: www.nentei.com

“The present moment is the only moment available to us and it is the door to all other moments.” ~TNH

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Examining Calibration of Large Language Models in Question
  Answering
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Aakarsh
    family-names: Nair
    email: aakarsh.nair@student.uni-tuebingen.de
    affiliation: Eberhard Karls Universität Tübingen
  - given-names: Ilinca
    family-names: Vandici
    affiliation: Eberhard Karls Universität Tübingen
  - given-names: Linus
    family-names: Rösener
    affiliation: Eberhard Karls Universität Tübingen
repository-code: 'https://github.com/aakarsh/rl-llm-calibration-test'
abstract: >
  In this paper, we examine the issue of calibration of large language
  models: that is, the interaction between the \emph{confidence} of a
  predicted answer on a question-answering task and its \emph{empirical
  likelihood of being correct}.

  We replicate elements of a previous calibration study
  \cite{kadavath2022language} on several multiple-choice datasets (MMLU,
  LogicQA, TruthfulQA) and on open-ended question-answering datasets
  translated into the multiple-choice format (TriviaQA, HumanEval, GSM8k).

  We find that models' calibration ability scales with model size.
  Moreover, models fine-tuned for conversation improve in calibration and
  accuracy under multi-shot prompting. However, we also observe that for
  tasks beyond a model's reasoning capability (complex logical and
  scientific reasoning), fine-tuning harms accuracy and leads to
  overconfidence in model predictions.
license: GPL-1.0+
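The abstract's notion of calibration — agreement between a model's stated confidence and its empirical accuracy — is commonly summarized as expected calibration error (ECE). The sketch below is illustrative only and is not taken from this repository's code:

```python
# Sketch of expected calibration error (ECE): bin predictions by
# confidence, then compare each bin's mean confidence with its accuracy.
# A perfectly calibrated model has ECE == 0.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: 0/1 outcomes per prediction."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Example usage:
# ece = expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1])
```

Overconfidence of the kind the abstract reports on hard reasoning tasks shows up here as bins whose mean confidence exceeds their accuracy.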

GitHub Events

Total
Last Year

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 202
  • Total Committers: 9
  • Avg Commits per committer: 22.444
  • Development Distribution Score (DDS): 0.213
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
aakarsh nair a****h@n****m 159
linusroesener 5****r 14
Aakarsh Nair a****r@g****m 13
Bela Linus Roesner l****r@s****e 5
ILINCA VANDICI i****u@g****m 4
Ubuntu u****u@i****l 4
zxoxo45@uni-tuebingen.de t****5@u****n 1
zxoxo45@uni-tuebingen.de t****5@u****n 1
zxoxo45@uni-tuebingen.de t****5@u****n 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 10
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • aakarsh (10)
Pull Request Authors
  • ilinkaa (7)
  • aakarsh (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/docker-image.yml actions
  • actions/checkout v3 composite
Dockerfile docker
  • python 3.9 build
requirements-colab.txt pypi
  • accelerate ==0.28.0
  • bitsandbytes ==0.42.0
  • bitsandbytes-cuda110 ==0.26.0.post2
  • datasets ==2.18.0
  • docker-pycreds ==0.4.0
  • docstring_parser ==0.16
  • evaluate ==0.4.1
  • httpx ==0.26.0
  • huggingface-hub ==0.20.3
  • hydra-core ==1.3.2
  • idna ==3.3
  • intervaltree ==3.1.0
  • multidict ==6.0.5
  • multimethod ==1.8
  • multiprocess ==0.70.16
  • nltk ==3.8.1
  • pip ==23.0.1
  • pluggy ==1.0.0
  • ptyprocess ==0.7.0
  • pycodestyle ==2.7.0
  • pycparser ==2.21
  • pycryptodomex ==3.14.1
  • python-dateutil ==2.8.2
  • python-lsp-black ==1.0.0
  • python-lsp-jsonrpc ==1.0.0
  • python-lsp-server ==1.2.4
  • python-slugify ==5.0.2
  • rope ==0.22.0
  • tiktoken ==0.5.2
  • tokenizers ==0.13.3
  • toml ==0.10.2
  • tomli ==2.0.1
  • toolz ==0.11.2
  • transformers ==4.31.0
  • trl ==0.8.1
  • untokenize ==0.1.1
  • urllib3 ==1.26.11
  • visions ==0.7.4
requirements.txt pypi
  • accelerate ==0.28.0
  • bitsandbytes ==0.42.0
  • bitsandbytes-cuda110 ==0.26.0.post2
  • datasets ==2.18.0
  • docker-pycreds ==0.4.0
  • docstring_parser ==0.16
  • evaluate ==0.4.1
  • httpx ==0.26.0
  • huggingface-hub ==0.20.3
  • hydra-core ==1.3.2
  • idna ==3.3
  • intervaltree ==3.1.0
  • ipykernel ==6.9.1
  • ipython ==8.4.0
  • ipython-genutils ==0.2.0
  • jupyter-client ==6.1.12
  • jupyter-core ==4.10.0
  • jupyter-server ==1.18.1
  • jupyterlab ==3.2.9
  • jupyterlab-pygments ==0.1.2
  • jupyterlab-server ==2.12.0
  • matplotlib ==3.5.1
  • matplotlib-inline ==0.1.2
  • multidict ==6.0.5
  • multimethod ==1.8
  • multiprocess ==0.70.16
  • nbclassic ==0.3.5
  • nbclient ==0.5.13
  • nbconvert ==6.4.4
  • nbformat ==5.3.0
  • nltk ==3.8.1
  • np ==1.0.2
  • numba ==0.55.1
  • numpy ==1.21.6
  • numpydoc ==1.4.0
  • pip ==23.0.1
  • plotly ==5.7.0
  • pluggy ==1.0.0
  • psutil ==5.9.0
  • ptyprocess ==0.7.0
  • pycodestyle ==2.7.0
  • pycparser ==2.21
  • pycryptodomex ==3.14.1
  • pytest ==7.1.1
  • python-dateutil ==2.8.2
  • python-lsp-black ==1.0.0
  • python-lsp-jsonrpc ==1.0.0
  • python-lsp-server ==1.2.4
  • python-slugify ==5.0.2
  • pytz ==2022.1
  • rope ==0.22.0
  • safetensors ==0.4.2
  • scikit-image ==0.19.2
  • scikit-learn ==1.1.1
  • scipy ==1.8.0
  • seaborn ==0.11.2
  • tiktoken ==0.5.2
  • tokenizers ==0.13.3
  • toml ==0.10.2
  • tomli ==2.0.1
  • toolz ==0.11.2
  • torch ==1.13.1
  • torchinfo ==1.7.0
  • torchvision ==0.14.1
  • tornado ==6.1
  • transformers ==4.31.0
  • trl ==0.8.1
  • untokenize ==0.1.1
  • urllib3 ==1.26.11
  • visions ==0.7.4