tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Keywords
Repository
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
Basic Info
- Host: GitHub
- Owner: rishub-tamirisa
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2408.00761
- Size: 365 KB
Statistics
- Stars: 59
- Watchers: 1
- Forks: 7
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
by Rishub Tamirisa*, Bhrugu Bharathi*, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika
See our project page and paper on arXiv.
We introduce a novel method, Tampering Attack Resistance (TAR), which is the first defense to withstand a significant number of open-weight fine-tuning attacks on LLMs, while preserving model capabilities.
Table of Contents
- 📰 Updates 📰
- 🛡️ What are Tamper-Resistant Safeguards? 🛡️
- 🌐 Overview 🌐
- ☕ Quick Start ☕
- 📁 Directory Structure
- 🤗 Models and Datasets
- 🙏 Citation 🙏
📰 Updates 📰
- **[2024/10/14] TAR-Bio-v2:** We identified a data contamination issue in our instruction-following retain dataset; we've resolved the issue and trained a new model: 🤗 Llama-3-8B-Instruct-TAR-Bio-v2. Please use this model for evaluations, thanks!
- **[2024/08/07] TAR Release:** Initial code release, including the red-teaming evaluation, baseline implementations, and 🤗 Huggingface models!
🛡️ What are Tamper-Resistant Safeguards? 🛡️
Tamper-Resistant Safeguards are security measures designed for open-weight large language models (LLMs) to protect against malicious modifications of the model's weights. Unlike traditional safeguards that focus on preventing input-based attacks, these advanced safeguards prevent adversaries with access to full model weights from recovering performance on harmful capabilities. We demonstrate in our extensive red-teaming evaluation that Tamper-Resistant Safeguards created via TAR are the first to be robust to a significant number of open-weight fine-tuning attacks.
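The tension between retaining benign capabilities and resisting weight tampering can be illustrated on a toy scalar model: the defender keeps a benign "retain" loss low while penalizing the adversary's success *after* a simulated fine-tuning attack. This is only an illustrative sketch of the adversarial-training idea, not the paper's actual objective; all function names, losses, and constants below are made up.

```python
def attack(theta, steps=2, lr=0.1):
    """Simulated tampering: the adversary fine-tunes theta by gradient
    descent on the harmful-capability loss (theta - 1)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - 1.0)
    return theta

def defender_loss(theta):
    """Toy tamper-resistance objective: low benign loss near theta = 0,
    plus a reward for a *high* harmful loss after the simulated attack."""
    retain = theta ** 2
    harm_after_attack = (attack(theta) - 1.0) ** 2
    return retain - 0.5 * harm_after_attack

def num_grad(f, x, eps=1e-5):
    # Central finite difference; avoids differentiating through the attack.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

theta = 0.5
for _ in range(200):
    theta -= 0.05 * num_grad(defender_loss, theta)
```

The minimizer settles slightly below zero: the defender sacrifices a little retain loss to sit in a region from which the short simulated attack cannot fully recover the harmful optimum at theta = 1.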
🌐 Overview 🌐
This repository contains implementations for TAR (including the Random Mapping initial safeguard), the red-teaming evaluation used in the paper, and baseline methods.
☕ Quick Start ☕
📦 Setup
1. Clone and enter the repository:

   ```bash
   git clone https://github.com/rishub-tamirisa/tamper-resistance.git
   cd tamper-resistance
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the dotenv (`.env`):
   - In the root level of the repository, create a `.env` file following the format of the included `dotenv` file.
   - We've already included the FSDP configs used for running the method in the `configs` folder. You can use these or create your own. For running TAR with FSDP v1, it's important that `fsdp_use_orig_params=false` and `fsdp_sharding_strategy=1`.
   - Finally, set the environment variables:

     ```bash
     source .env
     ```

> [!CAUTION]
> Do not push your `.env` file to a public repository. Since it contains your Huggingface token and other secrets, it could lead to unauthorized access to your Huggingface account. We've already included it in the `.gitignore` file to prevent this.
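For reference, the two FSDP v1 settings called out above would appear in a 🤗 `accelerate` config roughly as follows. This is a partial, hypothetical fragment (the actual config files live in the `configs` folder), with all other keys omitted:

```yaml
# Hypothetical excerpt of an accelerate FSDP v1 config; see configs/ for the real files.
distributed_type: FSDP
fsdp_config:
  fsdp_use_orig_params: false
  fsdp_sharding_strategy: 1   # 1 = FULL_SHARD
```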
📁 Directory Structure
tar.py serves as the main entrypoint for running the TAR method. It uses python modules in the modules folder. Example usage is provided in the run_tar_bio.sh and run_tar_cyber.sh scripts.
The modules folder contains the following files:
- baselines.py: Entrypoint for running baseline methods
- dataloaders.py: Dataloader implementations
- objectives.py: Objective / loss function implementations
- fsdp_v1_utils.py: Utilities for FSDP v1
- training.py: All training loop implementations, including TAR
- utils.py: Helper functions
The red_teaming folder contains implementations for running all fine-tuning attacks discussed in the paper, as well as an FSDP-supported MMLU evaluation script.
🛠️ Running Tamper-Resistance Training
> [!NOTE]
> The current implementation assumes that models come from 🤗 Transformers, meaning they have the expected configs, subclasses, etc. However, the FSDP wrapping can be made compatible with any model. We plan to update the code to be more agnostic when we migrate to FSDP v2. (This repository also serves as a scalable first-order meta-learning implementation.)
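Since the note above mentions that the repository doubles as a scalable first-order meta-learning implementation, here is a minimal, framework-free sketch of the first-order idea on a toy 1-D problem (a Reptile-style update): the outer parameters move toward inner-loop-adapted parameters without backpropagating through the inner loop. Everything here is illustrative and unrelated to the repository's actual code.

```python
def inner_adapt(theta, task_target, lr=0.1, steps=5):
    # Inner loop: ordinary gradient descent on the task loss (theta - target)^2.
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - task_target)
    return theta

def first_order_meta_step(theta, task_target, outer_lr=0.5):
    # First-order meta-update: move theta toward the adapted parameters,
    # ignoring all second-order derivatives through the inner loop.
    adapted = inner_adapt(theta, task_target)
    return theta + outer_lr * (adapted - theta)

theta = 5.0
for step in range(200):
    target = 1.0 if step % 2 == 0 else -1.0  # alternate between two toy tasks
    theta = first_order_meta_step(theta, target)
```

The outer parameters settle near 0, the initialization from which either task's optimum can be reached in a few inner steps; that trade-off between tasks is exactly what the outer loop optimizes for.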
We provide scripts in the root-level folder for running TAR for biosecurity and cybersecurity: run_tar_bio.sh and run_tar_cyber.sh.
It's recommended to run Llama-3-8B-Instruct models (or similar size) on systems with 8xA100 80G or more VRAM due to full-parameter training and other overheads introduced by the first-order meta-learning implementation.
Note: the code is currently untested in multi-node environments; we expect to support this upon migrating to the recently released FSDP2 in PyTorch 2.4.
With the appropriate GPU setup, and assuming the .env is correctly set, simply run:
```bash
sh run_tar_bio.sh
```
➕ Running the Red-teaming evaluation
In the red_teaming folder, red_teaming_evaluation.py serves as the entrypoint for running the red-teaming evaluations from the paper. Most methods use full-parameter training, so scripts should be launched with `accelerate`, similar to the setup in the run_tar_bio.sh and run_tar_cyber.sh scripts.
Check out the README documentation in the red_teaming folder for full details, as well as the documentation in red_teaming/mmlu_eval for specific details on running the full evaluation.
🤗 Models and Datasets
We release models and datasets here: 🤗 Huggingface Collection.
🙏 Citation 🙏
If you find this repository useful in your research, please consider citing our paper:
```bibtex
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
      title={Tamper-Resistant Safeguards for Open-Weight LLMs},
      author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
      year={2024},
      eprint={2408.00761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.00761},
}
```
Owner
- Name: Rishub Tamirisa
- Login: rishub-tamirisa
- Kind: user
- Company: University of Illinois at Urbana Champaign
- Repositories: 4
- Profile: https://github.com/rishub-tamirisa
CS @ University of Illinois at Urbana Champaign
Citation (CITATION.bib)
```bibtex
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
      title={Tamper-Resistant Safeguards for Open-Weight LLMs},
      author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
      year={2024},
      eprint={2408.00761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.00761},
}
```
GitHub Events
Total
- Issues event: 6
- Watch event: 20
- Issue comment event: 2
- Push event: 2
- Fork event: 5
Last Year
- Issues event: 6
- Watch event: 20
- Issue comment event: 2
- Push event: 2
- Fork event: 5