https://github.com/andreasmadsen/llm-introspection

Interpretability faithfulness and introspection in conversational LLMs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.8%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: AndreasMadsen
Language: Python
Default Branch: main
Size: 229 KB

Statistics

Stars: 9
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme

Are self-explanations from Large Language Models faithful?

This repository contains the code for the paper "Are self-explanations from Large Language Models faithful?"

Large language models are increasingly used by the public in the form of chat models. These chat systems often provide detailed and highly convincing explanations for their answers, even when not explicitly prompted to do so, which makes users more confident in these models. However, are the explanations true? If they are not, this confidence is unsupported, which can be dangerous.

We measure the truthfulness (i.e. interpretability-faithfulness) of the explanations that LLMs provide, so-called self-explanations. We do so by holding the models accountable to their own explanations, using self-consistency checks. We find that the truthfulness is highly dependent on the model and the specific task, which suggests that we should not have general confidence in these explanations.
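As a rough illustration of what a counterfactual self-consistency check could look like, the sketch below asks a model to classify a text, asks it to edit the text so that the opposite label applies, and then asks it to classify its own edit. The explanation counts as faithful when the second prediction matches the requested target. This is not the repository's implementation: `ToyModel`, `classify`, `counterfactual`, and `counterfactual_is_faithful` are hypothetical names, and the keyword rule merely stands in for an LLM.

```python
class ToyModel:
    """Hypothetical stand-in for a chat model; a real setup would query an LLM."""

    def classify(self, text: str) -> str:
        # Toy rule: the text is positive unless it contains the word "bad".
        return 'negative' if 'bad' in text.lower().split() else 'positive'

    def counterfactual(self, text: str, target: str) -> str:
        # Toy edit: swap sentiment words to try to reach the requested label.
        words = text.split()
        if target == 'negative':
            return ' '.join('bad' if w == 'good' else w for w in words)
        return ' '.join('good' if w == 'bad' else w for w in words)


def counterfactual_is_faithful(model: ToyModel, text: str) -> bool:
    """Return True if the model's own edit flips the model's own prediction."""
    original = model.classify(text)
    target = 'negative' if original == 'positive' else 'positive'
    edited = model.counterfactual(text, target)
    return model.classify(edited) == target


texts = ['a good movie', 'a bad movie', 'a fine movie']
faithful = [counterfactual_is_faithful(ToyModel(), t) for t in texts]
# Fraction of texts for which the self-explanation was consistent.
faithfulness = sum(faithful) / len(faithful)
```

The third text fails the check because the toy edit leaves it unchanged, so the prediction does not flip. The important property is that the same model both produces and evaluates the explanation, so no human annotation is needed.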

Install

This module is not published on PyPI, but you can install it directly from a local checkout with:

```bash
python -m pip install -e .
```

API

The module provides APIs to create custom experiments or reuse existing ones. As such, it can be adapted to new models, datasets, or tasks. The API is async.

Docstrings for every class (`TGIClient`, `IMDBDataset`, `Llama2Model`, etc.) are provided in the source files. For a complete example of how to use these classes together, please see the `./experiments/analysis.py` file.

High level overview of each class

```mermaid
graph TD;
    TGI-->GPU[[e.g. 4x A100 GPUs]];
    VLLM-->GPU[[e.g. 4x A100 GPUs]];
    Cache-->FS[[e.g. Lustre FS]];
    Client--sqlite-->Cache;
    Client-->Offline;
    Client--http-->TGI;
    Client--http/not used-->VLLM;
    Model-->Client;
    Task-->Model;
    Task-->Dataset;
    AsyncMap-->Dataset;
    AsyncMap-->Task;
    Experiment--configurable--->Task;
    Experiment--configurable--->Dataset;
    Experiment-->AsyncMap;
    Experiment--configurable--->Model;
    Experiment--configurable--->Client;
```
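The `Client--sqlite-->Cache` edge in the diagram suggests that generations are cached in SQLite so that repeated prompts do not hit the inference server again. The sketch below shows that general pattern only, under several assumptions: it uses the synchronous standard-library `sqlite3` module rather than an async driver, and `ToyGenerationCache`, `generate`, and `fake_backend` are hypothetical names that do not reflect the repository's actual `GenerationCache` schema or interface.

```python
import sqlite3
from typing import Callable, Optional


class ToyGenerationCache:
    """Hypothetical prompt -> completion cache backed by SQLite."""

    def __init__(self, path: str = ':memory:') -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS generations '
            '(prompt TEXT PRIMARY KEY, completion TEXT NOT NULL)'
        )

    def get(self, prompt: str) -> Optional[str]:
        row = self._db.execute(
            'SELECT completion FROM generations WHERE prompt = ?', (prompt,)
        ).fetchone()
        return None if row is None else row[0]

    def put(self, prompt: str, completion: str) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                'INSERT OR REPLACE INTO generations VALUES (?, ?)',
                (prompt, completion),
            )


def generate(cache: ToyGenerationCache, prompt: str,
             backend: Callable[[str], str]) -> str:
    """Return the cached completion, calling `backend` only on a cache miss."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    completion = backend(prompt)
    cache.put(prompt, completion)
    return completion


calls = []


def fake_backend(prompt: str) -> str:
    # Stand-in for an HTTP request to an inference server.
    calls.append(prompt)
    return prompt.upper()


cache = ToyGenerationCache()
first = generate(cache, 'is this review positive?', fake_backend)
second = generate(cache, 'is this review positive?', fake_backend)
```

Only the first call reaches the backend; the second is served from the cache. On a shared cluster, pointing `path` at a file on persistent storage would let interrupted jobs resume without recomputing earlier generations.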

Short example

```python
import asyncio
import pathlib

from introspect.client import TGIClient
from introspect.dataset import IMDBDataset
from introspect.model import Llama2Model
from introspect.tasks import SentimentCounterfactualTask
from introspect.util import AsyncMap
from introspect.database import GenerationCache

async def main():
    # Set up the cache, the inference client, and the dataset.
    cache = GenerationCache('custom_experiment')
    client = TGIClient('http://127.0.0.1:3000', cache)
    dataset = IMDBDataset(persistent_dir=pathlib.Path('.'))

    # Wrap the client in a model, then build the task and its aggregator.
    model = Llama2Model(client)
    task = SentimentCounterfactualTask(model)
    aggregator = task.make_aggregator()

    # Run the task over the test split with up to 20 concurrent tasks.
    async with cache:
        async for answer in AsyncMap(task, dataset.test(), max_tasks=20):
            aggregator.add_answer(answer)

    print(aggregator.results)

if __name__ == '__main__':
    asyncio.run(main())
```
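The `max_tasks=20` argument in the example above indicates that `AsyncMap` bounds how many task calls run concurrently. The following is a minimal sketch of that idea using only `asyncio`; it is not the library's `AsyncMap`. `bounded_map`, `fake_task`, and `demo` are hypothetical names, and the sketch assumes a plain (synchronous) iterable as input and creates all tasks eagerly, which the real implementation may not do.

```python
import asyncio


async def bounded_map(fn, items, max_tasks=20):
    """Apply async `fn` to each item with at most `max_tasks` calls in flight.

    Yields results in completion order, which may differ from input order.
    """
    semaphore = asyncio.Semaphore(max_tasks)

    async def worker(item):
        # The semaphore limits how many workers run `fn` at the same time.
        async with semaphore:
            return await fn(item)

    tasks = [asyncio.create_task(worker(item)) for item in items]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def fake_task(x):
    # Stand-in for an inference request.
    await asyncio.sleep(0.01)
    return x * 2


async def demo():
    results = [y async for y in bounded_map(fake_task, range(5), max_tasks=2)]
    return sorted(results)


results = asyncio.run(demo())
```

Because results arrive in completion order, an aggregator that consumes them should not depend on the order of observations.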

Experiments

Rather than writing your own code, you can run the experiments from the paper using the provided CLI: `python experiments/analysis.py`.

For example:

```bash
python experiments/analysis.py \
  --persistent-dir $SCRATCH/introspect \
  --endpoint http://127.0.0.1:3000 \
  --task redacted \
  --task-config '' \
  --model-name llama2-70b \
  --dataset IMDB \
  --split test \
  --seed 0
```

Arguments

- `--persistent-dir` controls where data is stored.
- `--endpoint` is the URL of the REST API used for inference. By default the TGI client is used; however, you can specify another client with `--client`.
- `--task` selects the main experiment: `classify`, `counterfactual`, `redacted`, or `importance`.
- `--task-config` selects among a number of prompt variations. Which prompt variations are allowed depends on the task:
  - Classify: `c-persona-you` or `c-persona-human`; otherwise the objective persona is used. `m-removed` selects the [REMOVED] token; otherwise [REDACTED] is used.
  - Counterfactual: `e-persona-you` or `e-persona-human`; otherwise the objective persona is used. `e-implcit-target` selects the implicit counterfactual target; otherwise the explicit target is used.
  - Redacted and Importance: `e-persona-you` or `e-persona-human`; otherwise the objective persona is used. `m-removed` selects the [REMOVED] token; otherwise [REDACTED] is used.
- `--model-name` specifies the model: `llama2-70b`, `llama2-7b`, `falcon-40b`, `falcon-7b`, or `mistral-v1-7b`. This is a shorthand that resolves to the appropriate Hugging Face repo and model type. You can also specify these manually with `--model-id` and `--model-type` respectively.
- `--dataset` selects the dataset. Included datasets are IMDB, RTE, bAbI-1, and MCTest.
- `--split` selects the split: `train`, `valid`, or `test`.
- `--seed` is the seed used for inference.
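To make the shape of these arguments concrete, here is a hypothetical `argparse` parser that mirrors the documented flags. It is a sketch only, not the parser in `experiments/analysis.py`: the defaults, the use of `choices`, and treating `--task-config` as a list of values are all assumptions made for illustration, and `--client`, `--model-id`, and `--model-type` are omitted.

```python
import argparse
import pathlib

# Hypothetical mirror of the documented flags; all defaults are assumptions.
parser = argparse.ArgumentParser(description='Sketch of the documented CLI flags.')
parser.add_argument('--persistent-dir', type=pathlib.Path, default=pathlib.Path('.'))
parser.add_argument('--endpoint', type=str, default='http://127.0.0.1:3000')
parser.add_argument('--task', default='classify',
                    choices=['classify', 'counterfactual', 'redacted', 'importance'])
parser.add_argument('--task-config', nargs='*', default=[])
parser.add_argument('--model-name', default='llama2-7b',
                    choices=['llama2-70b', 'llama2-7b', 'falcon-40b',
                             'falcon-7b', 'mistral-v1-7b'])
parser.add_argument('--dataset', default='IMDB',
                    choices=['IMDB', 'RTE', 'bAbI-1', 'MCTest'])
parser.add_argument('--split', default='test', choices=['train', 'valid', 'test'])
parser.add_argument('--seed', type=int, default=0)

# Parse an explicit argument list instead of sys.argv so the sketch is runnable.
args = parser.parse_args([
    '--persistent-dir', '/tmp/introspect',
    '--task', 'redacted',
    '--task-config', 'e-persona-you', 'm-removed',
    '--model-name', 'llama2-7b',
    '--dataset', 'IMDB',
    '--split', 'test',
    '--seed', '0',
])
```

Note that `argparse` converts dashes in flag names to underscores, so `--persistent-dir` is read back as `args.persistent_dir` and `--task-config` as `args.task_config`.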

Running on an HPC setup

For downloading the required resources, we provide an experiment/download.py script.

Finally, we provide scripts for submitting all jobs to a Slurm queue in jobs/. The jobs automatically use $SCRATCH/introspect as the persistent dir.

Owner

Name: Andreas Madsen
Login: AndreasMadsen
Kind: user
Location: Copenhagen, Denmark
Company: MILA

Website: https://andreasmadsen.github.io/
Twitter: andreas_madsen
Repositories: 151
Profile: https://github.com/AndreasMadsen

Researching interpretability for Machine Learning because society needs it.

GitHub Events

Total

Issues event: 2
Watch event: 4
Issue comment event: 4
Fork event: 2

Last Year

Issues event: 2
Watch event: 4
Issue comment event: 4
Fork event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: about 22 hours
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 5.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: about 22 hours
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 5.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

dschaehi (1)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

pyproject.toml pypi

aiohttp >= 3.8.0,<4.0.0
aiosqlite >= 0.19.0,<0.20.0
asyncstdlib >= 3.10.0,<4.0.0
bottleneck >= 1.3.6
datasets >= 2.14.6,<2.15.0
fastparquet >= 2023.2.0
numexpr >= 2.8.4
numpy >= 1.25.2
pandas >= 2.0.0,<3.0.0
plotnine >= 0.12.0
regex >= 2023.8.8
tblib >= 2.0.0,<3.0.0
text-generation >= 0.6.0
tqdm >= 4.66.1
