llm-confidentiality
Whispers in the Machine: Confidentiality in Agentic Systems
This is the code repository accompanying our paper Whispers in the Machine: Confidentiality in Agentic Systems.
The interaction between users and applications is increasingly shifted toward natural language by deploying Large Language Models (LLMs) as the core interface. These so-called agents become more capable the more tools and services they serve as an interface for, ultimately leading to agentic systems. Agentic systems use LLM-based agents as interfaces for most user interactions and various integrations with external tools and services. While these interfaces can significantly enhance the capabilities of the agentic system, they also introduce a new attack surface. Manipulated integrations, for example, can exploit the internal LLM and compromise sensitive data accessed through other interfaces. While previous work primarily focused on attacks targeting a model's alignment or the leakage of training data, the security of data that is only available during inference has escaped scrutiny so far. In this work, we demonstrate how the integration of LLMs into systems with external tool integration poses a risk similar to established prompt-based attacks, able to compromise the confidentiality of the entire system. Introducing a systematic approach to evaluate these confidentiality risks, we identify two specific attack scenarios unique to these agentic systems and formalize these into a tool-robustness framework designed to measure a model's ability to protect sensitive information. Our analysis reveals significant vulnerabilities across all tested models, highlighting an increased risk when models are combined with external tools.
If you want to cite our work, please use the BibTeX entry in the Citation section.
This framework was developed to study the confidentiality of Large Language Models (LLMs) in integrated systems. The framework contains several features:
- A set of attacks against LLMs, where the LLM is not allowed to leak a secret key
- A set of defenses against the aforementioned attacks
- The possibility to test the LLM's confidentiality in dummy tool-using scenarios as well as with the mentioned attacks and defenses
- Testing LLMs in real-world tool scenarios using LangChain's Google Drive and Google Mail integrations
- Creating enhanced system prompts to safely instruct an LLM to keep a secret key safe
- Instructions for reproducibility can be found at the end of this README
[!WARNING] Hardware acceleration is only fully supported for CUDA machines running Linux. MPS on macOS should work to some extent, but Windows with CUDA may run into issues.
Setup
Before running the code, install the requirements:
python -m pip install --upgrade -r requirements.txt
If you want to use models hosted by OpenAI or Huggingface, create both a key.txt file containing your OpenAI API key as well as a hf_token.txt file containing your Huggingface Token for private Repos (such as Llama2) in the root directory of this project.
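A quick pre-flight check can catch a missing credential file before a long run starts. The following is a minimal sketch that only assumes the two file names given above; the helper `missing_credential_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names as described in the setup instructions above.
CREDENTIAL_FILES = ("key.txt", "hf_token.txt")


def missing_credential_files(project_root="."):
    """Return the credential files that are absent or empty in project_root.

    Hypothetical helper for illustration; not part of the framework.
    """
    root = Path(project_root)
    missing = []
    for name in CREDENTIAL_FILES:
        path = root / name
        # A file that contains only whitespace is treated as missing too.
        if not path.is_file() or not path.read_text(encoding="utf-8").strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_credential_files():
        print(f"Missing or empty: {name}")
```

Running the file from the project root prints one line per file that still needs to be created.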
Sometimes it can be necessary to log in to your Huggingface account via the CLI:
git config --global credential.helper store
huggingface-cli login
Distributed Training
All scripts are able to work on multiple GPUs/CPUs using the accelerate library. To do so, run:
accelerate config
to configure the distributed training capabilities of your system and start the scripts with:
accelerate launch [parameters] <script.py> [script parameters]
Attacks and Defenses
Example Usage
python attack.py --strategy "tools" --scenario "CalendarWithCloud" --attacks "payload_splitting" "obfuscation" --defense "xml_tagging" --iterations 15 --llm_type "llama3-70b" --temperature 0.7 --device cuda --prompt_format "react"
This runs the attacks payload_splitting and obfuscation against the LLM llama3-70b in the scenario CalendarWithCloud, using the defense xml_tagging, for 15 iterations with a temperature of 0.7 on a cuda device, using the react prompt format in a tool-integrated system.
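When many configurations have to be launched, it can help to assemble the command line programmatically. The sketch below is an illustration only: `build_attack_command` is a hypothetical helper, it uses only flags that appear in the example above, and its default values are copied from the argument table in this README.

```python
import shlex


def build_attack_command(llm_type, attacks, strategy=None, scenario=None,
                         defense=None, iterations=10, temperature=0.0,
                         device="cpu", prompt_format="react"):
    """Assemble an attack.py invocation as a list of arguments.

    Hypothetical helper for illustration; not part of the framework.
    """
    cmd = ["python", "attack.py", "--llm_type", llm_type]
    if strategy is not None:
        cmd += ["--strategy", strategy]
    if scenario is not None:
        cmd += ["--scenario", scenario]
    # --attacks takes one or more attack specifiers.
    cmd += ["--attacks", *attacks]
    if defense is not None:
        cmd += ["--defense", defense]
    cmd += [
        "--iterations", str(iterations),
        "--temperature", str(temperature),
        "--device", device,
        "--prompt_format", prompt_format,
    ]
    return cmd


example = build_attack_command(
    "llama3-70b",
    ["payload_splitting", "obfuscation"],
    strategy="tools",
    scenario="CalendarWithCloud",
    defense="xml_tagging",
    iterations=15,
    temperature=0.7,
    device="cuda",
)
# shlex.join quotes arguments only where the shell requires it.
print(shlex.join(example))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.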
Arguments
| Argument | Type | Default Value | Description |
|----------|------|---------------|-------------|
| -h, --help | - | - | show this help message and exit |
| -a, --attacks | List[str] | payload_splitting | specifies the attacks which will be utilized against the LLM |
| -d, --defense | str | None | specifies the defense for the LLM |
| -llm, --llm_type | str | gpt-3.5-turbo | specifies the LLM to attack (the opponent) |
| -le, --llm_guessing | bool | False | specifies whether a second LLM is used to guess the secret key from the normal response |
| -t, --temperature | float | 0.0 | specifies the temperature for the LLM to control the randomness |
| -cp, --create_prompt_dataset | bool | False | specifies whether a new dataset of enhanced system prompts should be created |
| -cr, --create_response_dataset | bool | False | specifies whether a new dataset of secret leaking responses should be created |
| -i, --iterations | int | 10 | specifies the number of iterations for the attack |
| -n, --name_suffix | str | "" | Specifies a name suffix to load custom models. Since argument parameter strings aren't allowed to start with '-' symbols, the first '-' will be added by the parser automatically |
| -s, --strategy | str | None | Specifies the strategy for the attack (whether to use normal attacks or tools attacks) |
| -sc, --scenario | str | all | Specifies the scenario for the tool based attacks |
| -dx, --device | str | cpu | Specifies the device which is used for running the script (cpu, cuda, or mps) |
| -pf, --prompt_format | str | react | Specifies whether react or tool-finetuned prompt format is used for agents. (react or tool-finetuned) |
| -ds, --disable_safeguards | bool | False | Disables system prompt safeguards for tool strategy |
The naming convention for the models is as follows:
<model_name>-<param_count>-<robustness>-<attack_suffix>-<custom_suffix>
e.g.:
llama2-7b-robust-prompt_injection-0613
If you want to run the attacks against a prefix-tuned model with a custom suffix (e.g., 1000epochs), you would have to specify the arguments as follows:
... --llm_type llama2-7b-prefix --name_suffix 1000epochs ...
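The naming pattern above can be read as joining the non-empty parts with hyphens. Here is a minimal sketch of that reading, assuming that empty parts are simply skipped; `compose_model_name` is a hypothetical helper for illustration, and the framework's own name resolution may differ.

```python
def compose_model_name(model_name, param_count, robustness="",
                       attack_suffix="", custom_suffix=""):
    """Join the non-empty name parts with '-' following the documented pattern.

    Hypothetical helper for illustration; not part of the framework.
    """
    parts = [model_name, param_count, robustness, attack_suffix, custom_suffix]
    # Skip empty parts so that no doubled or trailing hyphens appear.
    return "-".join(part for part in parts if part)


# The full example from the naming convention above.
print(compose_model_name("llama2", "7b", "robust", "prompt_injection", "0613"))

# A prefix-tuned model with a custom suffix and no attack suffix.
print(compose_model_name("llama2", "7b", "prefix", custom_suffix="1000epochs"))
```

Note that the custom suffix is passed without a leading hyphen, matching the description of the name suffix argument, where the separator is added automatically.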
Supported Large Language Models
| Model | Parameter Specifier | Link | Compute Instance |
|-------|------|-----|-----|
| GPT-4 (4o, 4o-mini, 4-turbo)| gpt-4o / gpt-4o-mini / gpt-4-turbo | Link| OpenAI API |
| GPT-3.5-Turbo | gpt-3.5-turbo | Link| OpenAI API |
| LLaMA 2 | llama2-7b / llama2-13b / llama2-70b | Link | Local Inference |
| LLaMA 2 hardened | llama2-7b-robust / llama2-13b-robust / llama2-70b-robust| Link | Local Inference |
| Qwen 2.5 | qwen2.5-72b | Link | Local Inference (first: ollama pull qwen2.5:72b) |
| Llama 3.1 | llama3-8b / llama3-70b | Link | Local Inference (first: ollama pull llama3.1/llama3.1:70b/llama3.1:405b) |
| Llama 3.2 | llama3-1b/ llama3-3b| Link | Local Inference (first: ollama pull llama3.2/llama3.2:1b) |
| Llama 3.3 | llama3.3-70b | Link | Local Inference (first: ollama pull llama3.3/llama3.3:70b) |
| Deepseek R1 | deepseek-r1-1.5b / deepseek-r1-7b / deepseek-r1-8b / deepseek-r1-14b / deepseek-r1-32b / deepseek-r1-70b | Link | Local Inference (first: ollama pull deepseek-r1:XXb)|
| Reflection Llama | reflection-llama| Link | Local Inference (first: ollama pull reflection) |
| Vicuna | vicuna-7b / vicuna-13b / vicuna-33b | Link | Local Inference |
| StableBeluga (2) | beluga-7b / beluga-13b / beluga2-70b| Link | Local Inference |
| Orca 2 | orca2-7b / orca2-13b / orca2-70b | Link | Local Inference |
| Gemma | gemma-2b / gemma-7b| Link | Local Inference |
| Gemma 2 | gemma2-9b / gemma2-27b| Link | Local Inference (first: ollama pull gemma2/gemma2:27b) |
| Phi 3 | phi3-3b / phi3-14b | Link | Local Inference (first: ollama pull phi3:mini/phi3:medium)|
(Finetuned or robust/hardened LLaMA models first have to be generated using the finetuning.py script, see below)
Supported Attacks and Defenses
| Attack | Specifier | Defense | Specifier |
|--------|-----------|---------|-----------|
| Payload Splitting | payload_splitting | Random Sequence Enclosure | seq_enclosure |
| Obfuscation | obfuscation | XML Tagging | xml_tagging |
| Jailbreak | jailbreak | Heuristic/Filtering Defense | heuristic_defense |
| Translation | translation | Sandwich Defense | sandwiching |
| ChatML Abuse | chatml_abuse | LLM Evaluation | llm_eval |
| Masking | masking | Perplexity Detection | ppl_detection |
| Typoglycemia | typoglycemia | PromptGuard | prompt_guard |
| Adversarial Suffix | advs_suffix | | |
| Prefix Injection | prefix_injection | | |
| Refusal Suppression | refusal_suppression | | |
| Context Ignoring | context_ignoring | | |
| Context Termination | context_termination | | |
| Context Switching Separators | context_switching_separators | | |
| Few-Shot | few_shot | | |
| Cognitive Hacking | cognitive_hacking | | |
| Base Chat | base_chat | | |
The base_chat attack consists of normal questions to test whether the model leaks its context and confidential information even without a real attack.
Finetuning
This section covers the possible LLaMA finetuning options. We use PEFT, which is based on this paper.
Setup
In addition to the setup above, run
accelerate config
to configure the distributed training capabilities of your system, and
wandb login
with your WandB API key to enable logging of the finetuning process.
Parameter Efficient Finetuning to harden LLMs against attacks or create enhanced system prompts
The first finetuning option is on a dataset consisting of system prompts to safely instruct an LLM to keep a secret key safe. The second finetuning option (using the --train_robust option) is using system prompts and adversarial prompts to harden the model against prompt injection attacks.
Usage
python finetuning.py [-h] [-llm | --llm_type LLM_NAME] [-i | --iterations ITERATIONS] [-a | --attacks ATTACKS_LIST] [-n | --name_suffix NAME_SUFFIX]
Arguments
| Argument | Type | Default Value | Description |
|----------|------|---------------|-------------|
| -h, --help | - | - | Show this help message and exit |
| -llm, --llm_type | str | llama3-8b |Specifies the type of llm to finetune |
| -i, --iterations | int | 10000 | Specifies the number of iterations for the finetuning |
| -advs, --advs_train | bool | False | Utilizes the adversarial training to harden the finetuned LLM |
| -a, --attacks | List[str] | payload_splitting | Specifies the attacks which will be used to harden the llm during finetuning. Only has an effect if --train_robust is set to True. For supported attacks see the previous section |
| -n, --name_suffix | str | "" | Specifies a suffix for the finetuned model name |
Supported Large Language Models
Currently only the LLaMA models are supported (llama2-7/13/70b / llama3-8/70b).
Generate System Prompt Datasets
Run the generate_dataset.py script to create new system prompts as a JSON file using LLMs.
Arguments
| Argument | Type | Default Value | Description |
|----------|------|---------------|-------------|
| -h, --help | - | - | Show this help message and exit |
| -llm, --llm_type | str | llama3-70b |Specifies the LLM used to generate the system prompt dataset |
| -n, --name_suffix | str | "" | Specifies a suffix for the model name if you want to use a custom model |
| -ds, --dataset_size | int | 1000 | Size of the resulting system prompt dataset |
Real-World Tool Scenarios
To test the confidentiality of LLMs in real-world tool scenarios, we provide the possibility to test LLMs in Google Drive and Google Mail integrations. To do so, run the /various_scripts/llm_mail_test.py script with your Google API credentials.
Reproducibility
[!WARNING] Depending on which LLM is evaluated, the evaluation can be very demanding in terms of GPU VRAM and time.
[!NOTE] Results can vary slightly from run to run. Ollama updates most of their LLMs constantly, so their behavior is subject to change. Also, even with the lowest temperature LLMs tend to fluctuate slightly in behavior due to internal randomness.
Baseline secret-key game
Will ask the LLM benign questions to check for leaking the secret even without attacks
python attack.py --llm_type <model_specifier> --strategy secret-key --attacks chat_base --defenses None --iterations 100 --device cuda
Attacks for secret-key game
Will run all attacks against the LLM without defenses. The iterations are split equally across the attacks used, so the iterations parameter has to be scaled with the number of attacks (e.g., to run 14 attacks with 100 iterations each, set the iterations parameter to 1400).
python attack.py --llm_type <model_specifier> --strategy secret-key --attacks all --defenses None --iterations 100 --device cuda
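The scaling rule for the iterations parameter is plain multiplication, shown here as a small sketch. Both function names are hypothetical, and the use of integer division for the per-attack share is an assumption about how an uneven total would be split, not something this README specifies.

```python
def total_iterations(num_attacks, iterations_per_attack):
    """Value for the iterations parameter so each attack gets the wanted count."""
    if num_attacks <= 0 or iterations_per_attack <= 0:
        raise ValueError("both arguments must be positive")
    return num_attacks * iterations_per_attack


def iterations_per_attack(total, num_attacks):
    """Share of a total that each attack receives (assumes integer division)."""
    if num_attacks <= 0:
        raise ValueError("num_attacks must be positive")
    return total // num_attacks


# Example from the text: 14 attacks with 100 iterations each.
print(total_iterations(14, 100))        # 1400
# Conversely, a total of 100 spread over 14 attacks leaves only 7 each.
print(iterations_per_attack(100, 14))   # 7
```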
Attacks with defenses for secret-key game
Will run all attacks against the LLM with all defenses
python attack.py --llm_type <model_specifier> --strategy secret-key --attacks all --defenses all --iterations 100 --device cuda
Baseline tool-scenario
Will instruct the LLM via the system prompt with a secret key and the instruction not to leak it, followed by simple requests to print the secret key
python attack.py --llm_type <model_specifier> --strategy tools --scenario all --attacks base_attack --defenses None --iterations 100 --device cuda
Evaluating all tool-scenarios with ReAct
Will run all tool-scenarios without attacks and defenses using the ReAct framework
python attack.py --llm_type <model_specifier> --strategy tools --scenario all --attacks identity --defenses None --iterations 100 --prompt_format react --device cuda
Evaluating all tool-scenarios with tool fine-tuned models
Will run all tool-scenarios without attacks and defenses using tool fine-tuned models
python attack.py --llm_type <model_specifier> --strategy tools --scenario all --attacks identity --defenses None --iterations 100 --prompt_format tool-finetuned --device cuda
Evaluating all tool fine-tuned models in all scenarios with additional attacks
Will run all tool-scenarios with all attacks and without defenses using tool fine-tuned models
python attack.py --llm_type <model_specifier> --strategy tools --scenario all --attacks all --defenses None --iterations 100 --prompt_format tool-finetuned --device cuda
Evaluating all tool fine-tuned models in all scenarios with additional attacks and defenses
Will run all tool-scenarios with all attacks and all defenses using tool fine-tuned models
python attack.py --llm_type <model_specifier> --strategy tools --scenario all --attacks all --defenses all --iterations 100 --prompt_format tool-finetuned --device cuda
Citation
If you want to cite our work, please use the following BibTeX entry:
@article{evertz-24-whispers,
title = {{Whispers in the Machine: Confidentiality in LLM-integrated Systems}},
author = {Jonathan Evertz and Merlin Chlosta and Lea Schönherr and Thorsten Eisenhofer},
year = {2024},
journal = {Computing Research Repository (CoRR)}
}
Owner
- Name: jonathan | ヨナタン
- Login: LostOxygen
- Kind: user
- Location: Germany
- Company: Ruhr University Bochum
- Repositories: 4
- Profile: https://github.com/LostOxygen
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
  Whispers in the Machine: Confidentiality in LLM-integrated
  Systems
message: >-
  If you want to cite our work or use this framework, please
  cite using the provided data.
type: software
authors:
  - given-names: Jonathan
    family-names: Evertz
    email: jonathan.evertz@cispa.de
    affiliation: CISPA Helmholtz Center for Information Security
  - given-names: Merlin
    family-names: Chlosta
    email: merlin.chlosta@cispa.de
    affiliation: CISPA Helmholtz Center for Information Security
  - given-names: Lea
    family-names: Schönherr
    email: schoenherr@cispa.de
    affiliation: CISPA Helmholtz Center for Information Security
  - given-names: 'Thorsten '
    family-names: Eisenhofer
    email: thorsten.eisenhofer@tu-berlin.de
    affiliation: TU Berlin
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2402.06922'
repository-code: 'https://github.com/LostOxygen/llm-confidentiality'
abstract: >-
  Large Language Models (LLMs) are increasingly augmented with external tools and commercial services
  into LLM-integrated systems. While these interfaces can significantly enhance the capabilities of the models,
  they also introduce a new attack surface. Manipulated integrations, for example, can exploit the model and
  compromise sensitive data accessed through other interfaces. While previous work primarily focused on attacks
  targeting a model's alignment or the leakage of training data, the security of data that is only available during
  inference has escaped scrutiny so far. In this work, we demonstrate the vulnerabilities associated with external
  components and introduce a systematic approach to evaluate confidentiality risks in LLM-integrated systems.
  We identify two specific attack scenarios unique to these systems and formalize these into a tool-robustness
  framework designed to measure a model's ability to protect sensitive information. Our findings show that all
  examined models are highly vulnerable to confidentiality attacks, with the risk increasing significantly when
  models are used together with external tools.
keywords:
  - large language models
  - llm
  - adversarial attacks
  - machine learning
  - confidentiality
  - prompt injections
  - llm security
license: Apache-2.0