rl-llm-calibration-test
Attempt at replicating parts of the paper "Language models (mostly) know what they know" on open datasets and models.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 9 committers (11.1%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.7%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: aakarsh
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/pdf/2207.05221.pdf
- Size: 24.7 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 10
- Releases: 0
Topics
Metadata Files
README.md
LLM Calibration Benchmark
This repository runs calibration benchmarks on several popular, openly available language models.
Installation
pip install -r requirements.txt
When running in the Colab environment, it is recommended to use:
pip install -r requirements-colab.txt
Unit Tests
Running the unit tests requires the pytest module, invoked as follows:
python -m pytest test
Running Individual Experiments
Any individual experiment can be rerun with the following command:
python ../llm_calibration/run_experiment.py --model_name='meta-llama/Llama-2-13b-hf' --dataset='STEM'
Each run produces JSON result files, which can be parsed offline to generate the requisite plots.
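The schema of the result files is not documented here; as an illustration, assuming each record carries the model's confidence in its chosen answer and a correctness flag (hypothetical field names), a calibration summary such as expected calibration error (ECE) can be computed offline roughly like this:

```python
# Sketch of offline calibration analysis. The field names "confidence" and
# "correct" are assumptions; the real files written by run_experiment.py may
# differ. In practice these records would be loaded from the JSON output
# with json.load() instead of being defined inline.
records = [
    {"confidence": 0.95, "correct": True},
    {"confidence": 0.80, "correct": True},
    {"confidence": 0.70, "correct": False},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.40, "correct": False},
    {"confidence": 0.30, "correct": False},
]

def expected_calibration_error(records, n_bins=5):
    """Bin predictions by confidence and compare mean confidence to
    empirical accuracy in each bin (standard ECE)."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(r["confidence"] for r in b) / len(b)
        acc = sum(r["correct"] for r in b) / len(b)
        ece += len(b) / len(records) * abs(conf - acc)
    return ece

print(f"ECE: {expected_calibration_error(records):.3f}")  # prints ECE: 0.217
```

A well-calibrated model has ECE near zero: within each confidence bin, the fraction of correct answers matches the stated confidence.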
Owner
- Name: Aakarsh Nair
- Login: aakarsh
- Kind: user
- Location: Portland, OR
- Company: www.nentei.com
- Website: https://www.aakarsh.io
- Twitter: aakarsh
- Repositories: 365
- Profile: https://github.com/aakarsh
“The present moment is the only moment available to us and it is the door to all other moments.” ~TNH
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
Examining Calibration of Large Language Models in
Question Answering
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Aakarsh
family-names: Nair
email: aakarsh.nair@student.uni-tuebingen.de
affiliation: Eberhard Karls Universität Tübingen
- given-names: Ilinca
family-names: Vandici
affiliation: Eberhard Karls Universität Tübingen
- given-names: Linus
family-names: Rösener
affiliation: Eberhard Karls Universität Tübingen
repository-code: 'https://github.com/aakarsh/rl-llm-calibration-test'
abstract: >
In this paper, we examine the calibration of large language
models: the relationship between the \emph{confidence} of a
predicted answer on a question-answering task and its
\emph{empirical likelihood of being correct}.
We replicate elements of a previous calibration study
\cite{kadavath2022language} on several multiple-choice
datasets (MMLU, LogicQA, TruthfulQA) and open-ended
question-answering datasets translated into the
multiple-choice format (TriviaQA, HumanEval, GSM8k).
We find that calibration improves with model size.
Moreover, models fine-tuned for conversation improve in
calibration and accuracy under multi-shot prompting.
However, we also observe that for tasks beyond a model's
reasoning capability (complex logical and scientific
reasoning), fine-tuning harms accuracy and leads to
overconfidence in model predictions.
license: GPL-1.0+
GitHub Events
Total
Last Year
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| aakarsh nair | a****h@n****m | 159 |
| linusroesener | 5****r | 14 |
| Aakarsh Nair | a****r@g****m | 13 |
| Bela Linus Roesner | l****r@s****e | 5 |
| ILINCA VANDICI | i****u@g****m | 4 |
| Ubuntu | u****u@i****l | 4 |
| zxoxo45@uni-tuebingen.de | t****5@u****n | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- aakarsh (10)
Pull Request Authors
- ilinkaa (7)
- aakarsh (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v3 composite
- python 3.9 build
- accelerate ==0.28.0
- bitsandbytes ==0.42.0
- bitsandbytes-cuda110 ==0.26.0.post2
- datasets ==2.18.0
- docker-pycreds ==0.4.0
- docstring_parser ==0.16
- evaluate ==0.4.1
- httpx ==0.26.0
- huggingface-hub ==0.20.3
- hydra-core ==1.3.2
- idna ==3.3
- intervaltree ==3.1.0
- multidict ==6.0.5
- multimethod ==1.8
- multiprocess ==0.70.16
- nltk ==3.8.1
- pip ==23.0.1
- pluggy ==1.0.0
- ptyprocess ==0.7.0
- pycodestyle ==2.7.0
- pycparser ==2.21
- pycryptodomex ==3.14.1
- python-dateutil ==2.8.2
- python-lsp-black ==1.0.0
- python-lsp-jsonrpc ==1.0.0
- python-lsp-server ==1.2.4
- python-slugify ==5.0.2
- rope ==0.22.0
- tiktoken ==0.5.2
- tokenizers ==0.13.3
- toml ==0.10.2
- tomli ==2.0.1
- toolz ==0.11.2
- transformers ==4.31.0
- trl ==0.8.1
- untokenize ==0.1.1
- urllib3 ==1.26.11
- visions ==0.7.4
- accelerate ==0.28.0
- bitsandbytes ==0.42.0
- bitsandbytes-cuda110 ==0.26.0.post2
- datasets ==2.18.0
- docker-pycreds ==0.4.0
- docstring_parser ==0.16
- evaluate ==0.4.1
- httpx ==0.26.0
- huggingface-hub ==0.20.3
- hydra-core ==1.3.2
- idna ==3.3
- intervaltree ==3.1.0
- ipykernel ==6.9.1
- ipython ==8.4.0
- ipython-genutils ==0.2.0
- jupyter-client ==6.1.12
- jupyter-core ==4.10.0
- jupyter-server ==1.18.1
- jupyterlab ==3.2.9
- jupyterlab-pygments ==0.1.2
- jupyterlab-server ==2.12.0
- matplotlib ==3.5.1
- matplotlib-inline ==0.1.2
- multidict ==6.0.5
- multimethod ==1.8
- multiprocess ==0.70.16
- nbclassic ==0.3.5
- nbclient ==0.5.13
- nbconvert ==6.4.4
- nbformat ==5.3.0
- nltk ==3.8.1
- np ==1.0.2
- numba ==0.55.1
- numpy ==1.21.6
- numpydoc ==1.4.0
- pip ==23.0.1
- plotly ==5.7.0
- pluggy ==1.0.0
- psutil ==5.9.0
- ptyprocess ==0.7.0
- pycodestyle ==2.7.0
- pycparser ==2.21
- pycryptodomex ==3.14.1
- pytest ==7.1.1
- python-dateutil ==2.8.2
- python-lsp-black ==1.0.0
- python-lsp-jsonrpc ==1.0.0
- python-lsp-server ==1.2.4
- python-slugify ==5.0.2
- pytz ==2022.1
- rope ==0.22.0
- safetensors ==0.4.2
- scikit-image ==0.19.2
- scikit-learn ==1.1.1
- scipy ==1.8.0
- seaborn ==0.11.2
- tiktoken ==0.5.2
- tokenizers ==0.13.3
- toml ==0.10.2
- tomli ==2.0.1
- toolz ==0.11.2
- torch ==1.13.1
- torchinfo ==1.7.0
- torchvision ==0.14.1
- tornado ==6.1
- transformers ==4.31.0
- trl ==0.8.1
- untokenize ==0.1.1
- urllib3 ==1.26.11
- visions ==0.7.4