https://github.com/andreasmadsen/llm-introspection

Interpretability faithfulness and introspection in conversational LLMs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AndreasMadsen
  • Language: Python
  • Default Branch: main
  • Size: 229 KB
Statistics
  • Stars: 9
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme

README.md

Are self-explanations from Large Language Models faithful?

This is the code for the paper Are self-explanations from Large Language Models faithful?.

Large language models are increasingly being used by the public in the form of chat models. These chat systems often provide detailed and highly convincing explanations for their answers, even when not explicitly prompted to do so. This makes users more confident in these models. However, are the explanations true? If they are not, this confidence is unfounded, which can be dangerous.

We measure the truthfulness (i.e. interpretability-faithfulness) of the explanations that LLMs provide, so-called self-explanations. We do so by holding the models accountable to their own explanations, using self-consistency checks. We find that the truthfulness is highly dependent on the model and the specific task, which suggests that we should not have general confidence in these explanations.
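To make the idea of a self-consistency check concrete, here is a minimal sketch for a counterfactual explanation. It is not the repository's implementation: `counterfactual_is_consistent` and `toy_classify` are hypothetical names, and the toy classifier stands in for a call to the same LLM that produced the explanation. The assumption is that a counterfactual edit counts as consistent when the model's own prediction on the edited text equals the target the edit was meant to reach.

```python
from typing import Callable


def counterfactual_is_consistent(
    classify: Callable[[str], str],
    counterfactual: str,
    target: str,
) -> bool:
    """Check a counterfactual explanation against the model's own prediction.

    `classify` is assumed to wrap the same model that wrote the
    counterfactual, so the model is held accountable to itself.
    """
    return classify(counterfactual) == target


def toy_classify(text: str) -> str:
    # Hypothetical stand-in for an LLM sentiment prediction.
    return 'negative' if 'terrible' in text.lower() else 'positive'


# The "model" claims this edit turns a positive review into a negative one.
edited_review = 'The movie was terrible.'
consistent = counterfactual_is_consistent(toy_classify, edited_review, target='negative')
print(consistent)
```

Averaging this boolean over a dataset would give one possible faithfulness score per model and task.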

install

This module is not published on PyPI, but you can install it directly with:

```bash
python -m pip install -e .
```

API

The module provides APIs to create custom experiments or reuse existing ones. As such, it is possible to adapt it to new models, datasets, or tasks. The API is async.

Docstrings for every class (TGIClient, IMDBDataset, Llama2Model, etc.) are provided in the source files. For a complete example of how to use these classes together, please see the ./experiments/analysis.py file.

High level overview of each class

```mermaid
graph TD;
    TGI-->GPU[[e.g. 4x A100 GPUs]];
    VLLM-->GPU[[e.g. 4x A100 GPUs]];
    Cache-->FS[[e.g. Lustre FS]];
    Client--sqlite-->Cache;
    Client-->Offline;
    Client--http-->TGI;
    Client--http/not used-->VLLM;
    Model-->Client;
    Task-->Model;
    Task-->Dataset;
    AsyncMap-->Dataset;
    AsyncMap-->Task;
    Experiment--configurable--->Task;
    Experiment--configurable--->Dataset;
    Experiment-->AsyncMap;
    Experiment--configurable--->Model;
    Experiment--configurable--->Client;
```
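The Client-to-Cache edge in the overview suggests that generations are stored in SQLite so that repeated prompts do not hit the inference server again. The sketch below illustrates that pattern only. `PromptCache`, its table schema, and `fake_generate` are hypothetical; the repository's GenerationCache is async, and its actual schema is not shown here.

```python
import sqlite3
from typing import Callable, List


class PromptCache:
    """Hypothetical synchronous prompt -> output cache backed by SQLite."""

    def __init__(self, path: str = ':memory:') -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            'CREATE TABLE IF NOT EXISTS generations ('
            'prompt TEXT PRIMARY KEY, output TEXT NOT NULL)'
        )

    def get_or_generate(self, prompt: str, generate: Callable[[str], str]) -> str:
        row = self._db.execute(
            'SELECT output FROM generations WHERE prompt = ?', (prompt,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: skip the expensive generation
        output = generate(prompt)
        self._db.execute(
            'INSERT INTO generations (prompt, output) VALUES (?, ?)',
            (prompt, output),
        )
        self._db.commit()
        return output


calls: List[str] = []


def fake_generate(prompt: str) -> str:
    # Stand-in for an HTTP request to an inference server.
    calls.append(prompt)
    return prompt.upper()


cache = PromptCache()
first = cache.get_or_generate('hello', fake_generate)
second = cache.get_or_generate('hello', fake_generate)
print(first, second, len(calls))
```

Keying on the prompt alone is a simplification; a real cache would likely also need the model name and sampling settings in the key.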

Short example

```python
import asyncio
import pathlib

from introspect.client import TGIClient
from introspect.dataset import IMDBDataset
from introspect.model import Llama2Model
from introspect.tasks import SentimentCounterfactualTask
from introspect.util import AsyncMap
from introspect.database import GenerationCache

async def main():
    cache = GenerationCache('custom_experiment')
    client = TGIClient('http://127.0.0.1:3000', cache)
    dataset = IMDBDataset(persistent_dir=pathlib.Path('.'))
    model = Llama2Model(client)
    task = SentimentCounterfactualTask(model)
    aggregator = task.make_aggregator()

    async with cache:
        async for answer in AsyncMap(task, dataset.test(), max_tasks=20):
            aggregator.add_answer(answer)

    print(aggregator.results)

if __name__ == '__main__':
    asyncio.run(main())
```
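The `max_tasks=20` argument in the short example suggests that AsyncMap limits how many observations are processed concurrently. How AsyncMap is implemented is not documented in this README, so the following is only a sketch of the general technique using the standard library. `bounded_map` and `double` are hypothetical names, and yielding results in completion order is an assumption.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')


async def bounded_map(
    fn: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    max_tasks: int,
) -> AsyncIterator[R]:
    """Apply `fn` to every item with at most `max_tasks` calls in flight."""
    semaphore = asyncio.Semaphore(max_tasks)

    async def run(item: T) -> R:
        async with semaphore:  # blocks while max_tasks calls are running
            return await fn(item)

    tasks = [asyncio.create_task(run(item)) for item in items]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # results arrive in completion order


async def demo() -> List[int]:
    async def double(x: int) -> int:
        await asyncio.sleep(0.01)  # stand-in for an inference request
        return 2 * x

    return sorted([y async for y in bounded_map(double, range(5), max_tasks=2)])


results = asyncio.run(demo())
print(results)
```

Bounding concurrency this way keeps an inference server from being flooded with requests while still overlapping network latency.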

Experiments

Rather than writing your own code, the experiments from the paper can be run by using the provided CLI: python experiments/analysis.py.

For example:

```bash
python experiments/analysis.py \
  --persistent-dir $SCRATCH/introspect \
  --endpoint http://127.0.0.1:3000 \
  --task redacted \
  --task-config '' \
  --model-name llama2-70b \
  --dataset IMDB \
  --split test \
  --seed 0
```

Arguments

  • --persistent-dir controls where data is stored.
  • --endpoint is the URL of the REST API used for inference. By default the TGI client is used; however, you can specify another client with --client.
  • --task selects the main experiment: classify, counterfactual, redacted, or importance.
  • --task-config selects among a number of prompt variations. Which variations are allowed depends on the task.
    • Classify: c-persona-you or c-persona-human; otherwise the objective persona is used. m-removed for the [REMOVED] token; otherwise [REDACTED] is used.
    • Counterfactual: e-persona-you or e-persona-human; otherwise the objective persona is used. e-implcit-target for the implicit counterfactual target; otherwise the explicit target is used.
    • Redacted and Importance: e-persona-you or e-persona-human; otherwise the objective persona is used. m-removed for the [REMOVED] token; otherwise [REDACTED] is used.
  • --model-name specifies the model: llama2-70b, llama2-7b, falcon-40b, falcon-7b, or mistral-v1-7b. This is a shorthand that resolves to the appropriate Hugging Face repo and model type. You can also specify these manually with --model-id and --model-type respectively.
  • --dataset selects the dataset. Included datasets are IMDB, RTE, bAbI-1, and MCTest.
  • --split selects the split: train, valid, or test.
  • --seed is the seed used for inference.
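For readers who want to wrap the CLI in their own tooling, the documented flags can be mirrored with argparse roughly as follows. This is a sketch, not the parser in experiments/analysis.py: `build_parser` is a hypothetical name, and the defaults, the required flags, and the choice to accept --task-config as a space-separated list are all assumptions.

```python
import argparse
import pathlib


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description='Sketch of the experiment CLI')
    parser.add_argument('--persistent-dir', type=pathlib.Path, required=True,
                        help='Where data is stored')
    parser.add_argument('--endpoint', type=str, default='http://127.0.0.1:3000',
                        help='URL of the REST API used for inference (default is an assumption)')
    parser.add_argument('--task', required=True,
                        choices=['classify', 'counterfactual', 'redacted', 'importance'])
    # Assumption: prompt variations are passed as a space-separated list.
    parser.add_argument('--task-config', nargs='*', default=[])
    parser.add_argument('--model-name',
                        choices=['llama2-70b', 'llama2-7b', 'falcon-40b',
                                 'falcon-7b', 'mistral-v1-7b'])
    parser.add_argument('--dataset', choices=['IMDB', 'RTE', 'bAbI-1', 'MCTest'])
    parser.add_argument('--split', choices=['train', 'valid', 'test'], default='test')
    parser.add_argument('--seed', type=int, default=0)
    return parser


args = build_parser().parse_args([
    '--persistent-dir', '/tmp/introspect',
    '--task', 'redacted',
    '--task-config', 'e-persona-you', 'm-removed',
    '--model-name', 'llama2-70b',
    '--dataset', 'IMDB',
    '--split', 'test',
    '--seed', '0',
])
print(args.task, args.task_config, args.seed)
```

Restricting values with `choices` makes a misspelled task or dataset fail at parse time rather than partway through a long inference run.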

Running on an HPC setup

For downloading the required resources, we provide an experiment/download.py script.

Finally, we provide scripts for submitting all jobs to a Slurm queue, in jobs/. The jobs automatically use $SCRATCH/introspect as the persistent dir.

Owner

  • Name: Andreas Madsen
  • Login: AndreasMadsen
  • Kind: user
  • Location: Copenhagen, Denmark
  • Company: MILA

Researching interpretability for Machine Learning because society needs it.

GitHub Events

Total
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 4
  • Fork event: 2
Last Year
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 4
  • Fork event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: about 22 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: about 22 hours
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dschaehi (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

pyproject.toml pypi
  • aiohttp >= 3.8.0,<4.0.0
  • aiosqlite >= 0.19.0,<0.20.0
  • asyncstdlib >= 3.10.0,<4.0.0
  • bottleneck >= 1.3.6
  • datasets >= 2.14.6,<2.15.0
  • fastparquet >= 2023.2.0
  • numexpr >= 2.8.4
  • numpy >= 1.25.2
  • pandas >= 2.0.0,<3.0.0
  • plotnine >= 0.12.0
  • regex >= 2023.8.8
  • tblib >= 2.0.0,<3.0.0
  • text-generation >= 0.6.0
  • tqdm >= 4.66.1