https://github.com/andreasmadsen/llm-introspection
Interpretability faithfulness and introspection in conversational LLMs
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: AndreasMadsen
- Language: Python
- Default Branch: main
- Size: 229 KB
Statistics
- Stars: 9
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Are self-explanations from Large Language Models faithful?
This is the code for the paper "Are self-explanations from Large Language Models faithful?".
Large language models are increasingly used by the public in the form of chat models. These chat systems often provide detailed and highly convincing explanations for their answers, even when not explicitly prompted to do so, which makes users more confident in the models. However, are the explanations true? If they are not, this confidence is unsupported, which can be dangerous.
We measure the truthfulness (i.e. interpretability-faithfulness) of the explanations that LLMs provide, so-called self-explanations. We do so by holding the models accountable to their own explanations, using self-consistency checks. We find that the truthfulness is highly dependent on the model and the specific task, which suggests we should not have general confidence in these explanations.
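To make the idea of a self-consistency check concrete, here is a minimal sketch of one plausible form of it: ask the model for a counterfactual edit, then ask the same model to classify the edit. The function names and the toy stand-ins below are assumptions for illustration only. They are not part of this module's API, and the paper's actual checks may differ in detail.

```python
from typing import Callable


def counterfactual_is_consistent(
    classify: Callable[[str], str],
    make_counterfactual: Callable[[str, str], str],
    text: str,
    target: str,
) -> bool:
    """Check a counterfactual explanation against the model's own prediction.

    `classify` and `make_counterfactual` are hypothetical wrappers around
    calls to the same chat model. The explanation counts as consistent only
    if the model, asked to classify its own edit, predicts the target label.
    """
    edited = make_counterfactual(text, target)
    return classify(edited) == target


# Toy stand-ins for model calls, so the sketch runs without a GPU.
def toy_classify(text: str) -> str:
    return 'negative' if 'bad' in text else 'positive'


def toy_counterfactual(text: str, target: str) -> str:
    if target == 'negative':
        return text.replace('good', 'bad')
    return text.replace('bad', 'good')


def toy_lazy_counterfactual(text: str, target: str) -> str:
    # A "counterfactual" that changes nothing, which the check should reject.
    return text


consistent = counterfactual_is_consistent(
    toy_classify, toy_counterfactual, 'a good movie', 'negative')
inconsistent = counterfactual_is_consistent(
    toy_classify, toy_lazy_counterfactual, 'a good movie', 'negative')
```

Averaging such boolean outcomes over a dataset gives one faithfulness number per model and task.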
Install
This module is not published on PyPI, but you can install it directly with:
```bash
python -m pip install -e .
```
API
The module provides APIs to create custom experiments or reuse existing ones, so it can be adapted to new models, datasets, or tasks. The API is async.
Docstrings for every class (TGIClient, IMDBDataset, Llama2Model, etc.) are provided in the source files. For a complete example of how to use these classes together, please see the ./experiments/analysis.py file.
High level overview of each class
```mermaid
graph TD;
    TGI-->GPU[[e.g. 4x A100 GPUs]];
    VLLM-->GPU[[e.g. 4x A100 GPUs]];
    Cache-->FS[[e.g. Lustre FS]];
    Client--sqlite-->Cache;
    Client-->Offline;
    Client--http-->TGI;
    Client--http/not used-->VLLM;
    Model-->Client;
    Task-->Model;
    Task-->Dataset;
    AsyncMap-->Dataset;
    AsyncMap-->Task;
    Experiment--configurable--->Task;
    Experiment--configurable--->Dataset;
    Experiment-->AsyncMap;
    Experiment--configurable--->Model;
    Experiment--configurable--->Client;
```
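The Client-to-Cache edge in the diagram can be pictured with a small sketch of a generation cache backed by SQLite. This is an illustration under assumptions, not the module's GenerationCache: the class name, table schema, and `cached_generate` helper are hypothetical, and the sketch uses the synchronous standard-library `sqlite3` for brevity, whereas the module's API is async.

```python
import sqlite3
from typing import Callable, Optional


class ToyGenerationCache:
    """Hypothetical prompt -> response cache stored in SQLite."""

    def __init__(self, path: str = ':memory:') -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS generations ('
            'prompt TEXT PRIMARY KEY, response TEXT NOT NULL)'
        )

    def get(self, prompt: str) -> Optional[str]:
        row = self._db.execute(
            'SELECT response FROM generations WHERE prompt = ?', (prompt,)
        ).fetchone()
        return None if row is None else row[0]

    def put(self, prompt: str, response: str) -> None:
        # The connection context manager commits the transaction on success.
        with self._db:
            self._db.execute(
                'INSERT OR REPLACE INTO generations VALUES (?, ?)',
                (prompt, response),
            )


def cached_generate(
    cache: ToyGenerationCache,
    prompt: str,
    generate: Callable[[str], str],
) -> str:
    """Return a cached response, calling `generate` only on a cache miss."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    response = generate(prompt)
    cache.put(prompt, response)
    return response
```

Caching like this means a rerun of an experiment with the same prompts does not have to query the inference server again.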
Short example
```python
import asyncio
import pathlib

from introspect.client import TGIClient
from introspect.dataset import IMDBDataset
from introspect.model import Llama2Model
from introspect.tasks import SentimentCounterfactualTask
from introspect.util import AsyncMap
from introspect.database import GenerationCache

async def main():
    cache = GenerationCache('custom_experiment')
    client = TGIClient('http://127.0.0.1:3000', cache)
    dataset = IMDBDataset(persistent_dir=pathlib.Path('.'))
    model = Llama2Model(client)
    task = SentimentCounterfactualTask(model)
    aggregator = task.make_aggregator()

    async with cache:
        async for answer in AsyncMap(task, dataset.test(), max_tasks=20):
            aggregator.add_answer(answer)

    print(aggregator.results)

if __name__ == '__main__':
    asyncio.run(main())
```
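The `max_tasks=20` argument above suggests that AsyncMap bounds how many requests are in flight at once. The following is a minimal sketch of that general pattern using only `asyncio`. It is not the module's AsyncMap implementation, and `bounded_map` is a hypothetical name. Unlike a production implementation, it schedules one task per item up front and relies on a semaphore to limit concurrency.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, TypeVar

T = TypeVar('T')
R = TypeVar('R')


async def bounded_map(
    fn: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    max_tasks: int,
) -> AsyncIterator[R]:
    """Yield fn(item) results as they finish, with at most max_tasks running."""
    semaphore = asyncio.Semaphore(max_tasks)

    async def run(item: T) -> R:
        # Only max_tasks coroutines can be inside this block at a time.
        async with semaphore:
            return await fn(item)

    tasks = [asyncio.ensure_future(run(item)) for item in items]
    # Results arrive in completion order, not input order.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo() -> list:
    async def double(x: int) -> int:
        await asyncio.sleep(0)  # stand-in for an HTTP request
        return 2 * x

    return sorted([r async for r in bounded_map(double, range(5), max_tasks=2)])
```

Calling `asyncio.run(demo())` runs the toy example. Results are sorted there because completion order is not guaranteed.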
Experiments
Rather than writing your own code, the experiments from the paper can be run
by using the provided CLI: python experiments/analysis.py.
For example:
```bash
python experiments/analysis.py \
  --persistent-dir $SCRATCH/introspect \
  --endpoint http://127.0.0.1:3000 \
  --task redacted \
  --task-config '' \
  --model-name llama2-70b \
  --dataset IMDB \
  --split test \
  --seed 0
```
Arguments
- `--persistent-dir` controls where data is stored.
- `--endpoint` is the URL of the REST API used for inference. By default the TGI client is used, but you can specify another client with `--client`.
- `--task` selects the main experiment: either `classify`, `counterfactual`, `redacted`, or `importance`.
- `--task-config` selects among a number of prompt variations. Which variations are allowed depends on the task.
  - Classify: `c-persona-you`, `c-persona-human`, otherwise objective persona. `m-removed` for the `[REMOVED]` token, otherwise `[REDACTED]`.
  - Counterfactual: `e-persona-you`, `e-persona-human`, otherwise objective persona. `e-implcit-target` for the implicit counterfactual target, otherwise explicit is used.
  - Redacted and Importance: `e-persona-you`, `e-persona-human`, otherwise objective persona. `m-removed` for the `[REMOVED]` token, otherwise `[REDACTED]`.
- `--model-name` specifies the model, either `llama2-70b`, `llama2-7b`, `falcon-40b`, `falcon-7b`, or `mistral-v1-7b`. This is a shorthand that resolves to the appropriate Hugging Face repo and model type. You can also specify these manually with `--model-id` and `--model-type` respectively.
- `--dataset` selects the dataset. Included datasets are `IMDB`, `RTE`, `bAbI-1`, and `MCTest`.
- `--split` is either `train`, `valid`, or `test`.
- `--seed` is the seed used for inference.
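When sweeping over several seeds or tasks, it can help to build the command line programmatically. The sketch below only assembles the argument list from the flags documented above. `build_analysis_command` is a hypothetical helper, not part of this repository, and it assumes `--task-config` is passed as a single string as in the example command.

```python
import sys
from typing import List


def build_analysis_command(
    *,
    persistent_dir: str,
    endpoint: str,
    task: str,
    model_name: str,
    dataset: str,
    split: str = 'test',
    seed: int = 0,
    task_config: str = '',
) -> List[str]:
    """Assemble an argv list for experiments/analysis.py (hypothetical helper)."""
    return [
        sys.executable, 'experiments/analysis.py',
        '--persistent-dir', persistent_dir,
        '--endpoint', endpoint,
        '--task', task,
        '--task-config', task_config,
        '--model-name', model_name,
        '--dataset', dataset,
        '--split', split,
        '--seed', str(seed),
    ]


# One command per seed; pass each list to subprocess.run(cmd, check=True)
# from the repository root to actually launch the experiment.
commands = [
    build_analysis_command(
        persistent_dir='./introspect-data',  # assumed local path
        endpoint='http://127.0.0.1:3000',
        task='redacted',
        model_name='llama2-7b',
        dataset='IMDB',
        seed=seed,
    )
    for seed in range(3)
]
```

Passing a list to `subprocess.run` avoids shell quoting issues, for example with the empty `--task-config` value.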
Running on an HPC setup
For downloading the required resources, we provide an experiment/download.py script.
Finally, we provide scripts for submitting all jobs to a Slurm queue, in jobs/. The jobs automatically use $SCRATCH/introspect as the persistent dir.
Owner
- Name: Andreas Madsen
- Login: AndreasMadsen
- Kind: user
- Location: Copenhagen, Denmark
- Company: MILA
- Website: https://andreasmadsen.github.io/
- Twitter: andreas_madsen
- Repositories: 151
- Profile: https://github.com/AndreasMadsen
Researching interpretability for Machine Learning because society needs it.
GitHub Events
Total
- Issues event: 2
- Watch event: 4
- Issue comment event: 4
- Fork event: 2
Last Year
- Issues event: 2
- Watch event: 4
- Issue comment event: 4
- Fork event: 2
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: about 22 hours
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 5.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: about 22 hours
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 5.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- dschaehi (1)
Dependencies
- aiohttp >= 3.8.0,<4.0.0
- aiosqlite >= 0.19.0,<0.20.0
- asyncstdlib >= 3.10.0,<4.0.0
- bottleneck >= 1.3.6
- datasets >= 2.14.6,<2.15.0
- fastparquet >= 2023.2.0
- numexpr >= 2.8.4
- numpy >= 1.25.2
- pandas >= 2.0.0,<3.0.0
- plotnine >= 0.12.0
- regex >= 2023.8.8
- tblib >= 2.0.0,<3.0.0
- text-generation >= 0.6.0
- tqdm >= 4.66.1