tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Keywords
Repository
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
Basic Info
- Host: GitHub
- Owner: rishub-tamirisa
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2408.00761
- Size: 365 KB
Statistics
- Stars: 59
- Watchers: 1
- Forks: 7
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
by Rishub Tamirisa*, Bhrugu Bharathi*, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika
See our project page and paper on arXiv.
We introduce a novel method, Tampering Attack Resistance (TAR), which is the first defense to withstand a significant number of open-weight fine-tuning attacks on LLMs, while preserving model capabilities.
Table of Contents
- 📰 Updates 📰
- 🛡️ What are Tamper-Resistant Safeguards? 🛡️
- 🌐 Overview 🌐
- ☕ Quick Start ☕
- 📁 Directory Structure
- 🤗 Models and Datasets
- 🙏 Citation 🙏
📰 Updates 📰
- **[2024/10/14] TAR-Bio-v2:** We identified a data contamination issue in our instruction-following retain dataset; we've resolved the issue and trained a new model: 🤗 Llama-3-8B-Instruct-TAR-Bio-v2. Please use this model for evaluations, thanks!
- **[2024/08/07] TAR Release:** Initial code release, including the red-teaming evaluation, baseline implementations, and 🤗 Huggingface models!
🛡️ What are Tamper-Resistant Safeguards? 🛡️
Tamper-Resistant Safeguards are security measures designed for open-weight large language models (LLMs) to protect against malicious modifications of the model's weights. Unlike traditional safeguards that focus on preventing input-based attacks, these advanced safeguards prevent adversaries with access to full model weights from recovering performance on harmful capabilities. We demonstrate in our extensive red-teaming evaluation that Tamper-Resistant Safeguards created via TAR are the first to be robust to a significant number of open-weight fine-tuning attacks.
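The tension between retaining benign capabilities and resisting weight tampering can be illustrated on a toy scalar model: the defender keeps a benign "retain" loss low while penalizing the adversary's success *after* a simulated fine-tuning attack. This is only an illustrative sketch of the adversarial-training idea, not the paper's actual objective; all function names, losses, and constants below are made up.

```python
def attack(theta, steps=2, lr=0.1):
    """Simulated tampering: the adversary fine-tunes theta by gradient
    descent on the harmful-capability loss (theta - 1)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - 1.0)
    return theta

def defender_loss(theta):
    """Toy tamper-resistance objective: low benign loss near theta = 0,
    plus a reward for a *high* harmful loss after the simulated attack."""
    retain = theta ** 2
    harm_after_attack = (attack(theta) - 1.0) ** 2
    return retain - 0.5 * harm_after_attack

def num_grad(f, x, eps=1e-5):
    # Central finite difference; avoids differentiating through the attack.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

theta = 0.5
for _ in range(200):
    theta -= 0.05 * num_grad(defender_loss, theta)
```

The minimizer settles slightly below zero: the defender sacrifices a little retain loss to sit in a region from which the short simulated attack cannot fully recover the harmful optimum at theta = 1.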
🌐 Overview 🌐
This repository contains implementations for TAR (including the Random Mapping initial safeguard), the red-teaming evaluation used in the paper, and baseline methods.
☕ Quick Start ☕
📦 Setup
1. Clone and enter the repository:

   ```bash
   git clone https://github.com/rishub-tamirisa/tamper-resistance.git
   cd tamper-resistance
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up the dotenv (`.env`):
   - In the root level of the repository, create a `.env` file following the format of the included `dotenv` file.
   - We've already included the FSDP configs used for running the method in the `configs` folder. You can use these or create your own. For running TAR with FSDP v1, it's important that `fsdp_use_orig_params=false` and `fsdp_sharding_strategy=1`.
   - Finally, set the environment variables:

     ```bash
     source .env
     ```

> [!CAUTION]
> Do not push your `.env` file to a public repository. Since it contains your Huggingface token and other secrets, it could lead to unauthorized access to your Huggingface account. We've already included it in the `.gitignore` file to prevent this.
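For reference, the two FSDP v1 settings called out above would appear in a 🤗 `accelerate` config roughly as follows. This is a partial, hypothetical fragment (the actual config files live in the `configs` folder), with all other keys omitted:

```yaml
# Hypothetical excerpt of an accelerate FSDP v1 config; see configs/ for the real files.
distributed_type: FSDP
fsdp_config:
  fsdp_use_orig_params: false
  fsdp_sharding_strategy: 1   # 1 = FULL_SHARD
```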
📁 Directory Structure
tar.py serves as the main entrypoint for running the TAR method. It uses python modules in the modules folder. Example usage is provided in the run_tar_bio.sh and run_tar_cyber.sh scripts.
The modules folder contains the following files:
- baselines.py: Entrypoint for running baseline methods
- dataloaders.py: Dataloader implementations
- objectives.py: Objective / loss function implementations
- fsdp_v1_utils.py: Utilities for FSDP v1
- training.py: All training loop implementations, including TAR
- utils.py: Helper functions
The red_teaming folder contains implementations for running all fine-tuning attacks discussed in the paper, as well as an FSDP-supported MMLU evaluation script.
🛠️ Running Tamper-Resistance Training
> [!NOTE]
> The current implementation assumes that models come from 🤗 Transformers, meaning they have the expected configs, subclasses, etc. However, the FSDP wrapping can be made compatible with any model. We plan to update the code to be more agnostic when we migrate to FSDP v2. (This repository also serves as a scalable first-order meta-learning implementation.)
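Since the note above mentions that the repository doubles as a scalable first-order meta-learning implementation, here is a minimal, framework-free sketch of the first-order idea on a toy 1-D problem (a Reptile-style update): the outer parameters move toward inner-loop-adapted parameters without backpropagating through the inner loop. Everything here is illustrative and unrelated to the repository's actual code.

```python
def inner_adapt(theta, task_target, lr=0.1, steps=5):
    # Inner loop: ordinary gradient descent on the task loss (theta - target)^2.
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - task_target)
    return theta

def first_order_meta_step(theta, task_target, outer_lr=0.5):
    # First-order meta-update: move theta toward the adapted parameters,
    # ignoring all second-order derivatives through the inner loop.
    adapted = inner_adapt(theta, task_target)
    return theta + outer_lr * (adapted - theta)

theta = 5.0
for step in range(200):
    target = 1.0 if step % 2 == 0 else -1.0  # alternate between two toy tasks
    theta = first_order_meta_step(theta, target)
```

The outer parameters settle near 0, the initialization from which either task's optimum can be reached in a few inner steps; that trade-off between tasks is exactly what the outer loop optimizes for.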
We provide scripts in the root-level folder for running TAR for biosecurity and cybersecurity: run_tar_bio.sh and run_tar_cyber.sh.
It's recommended to run Llama-3-8B-Instruct models (or similar size) on systems with 8xA100 80G or more VRAM due to full-parameter training and other overheads introduced by the first-order meta-learning implementation.
Note: the code is currently untested in multi-node environments; we expect to support this upon migrating to the recently released FSDP2 in PyTorch 2.4.
With the appropriate GPU setup, and assuming the .env is correctly set, simply run:
```bash
sh run_tar_bio.sh
```
➕ Running the Red-teaming evaluation
In the red_teaming folder, red_teaming_evaluation.py serves as the entrypoint for running the red-teaming evaluations from the paper. Most methods use full-parameter training, so scripts should be launched with `accelerate`, similar to the setup in the run_tar_bio.sh and run_tar_cyber.sh scripts.
Check out the README documentation in the red_teaming folder for full details, as well as the documentation in red_teaming/mmlu_eval for specific details on running the full evaluation.
🤗 Models and Datasets
We release models and datasets here: 🤗 Huggingface Collection.
🙏 Citation 🙏
If you find this repository useful in your research, please consider citing our paper:
```bibtex
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
      title={Tamper-Resistant Safeguards for Open-Weight LLMs},
      author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
      year={2024},
      eprint={2408.00761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.00761},
}
```
Owner
- Name: Rishub Tamirisa
- Login: rishub-tamirisa
- Kind: user
- Company: University of Illinois at Urbana Champaign
- Repositories: 4
- Profile: https://github.com/rishub-tamirisa
CS @ University of Illinois at Urbana Champaign
Citation (CITATION.bib)
```bibtex
@misc{tamirisa2024tamperresistantsafeguardsopenweightllms,
      title={Tamper-Resistant Safeguards for Open-Weight LLMs},
      author={Rishub Tamirisa and Bhrugu Bharathi and Long Phan and Andy Zhou and Alice Gatti and Tarun Suresh and Maxwell Lin and Justin Wang and Rowan Wang and Ron Arel and Andy Zou and Dawn Song and Bo Li and Dan Hendrycks and Mantas Mazeika},
      year={2024},
      eprint={2408.00761},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.00761},
}
```
GitHub Events
Total
- Issues event: 6
- Watch event: 20
- Issue comment event: 2
- Push event: 2
- Fork event: 5
Last Year
- Issues event: 6
- Watch event: 20
- Issue comment event: 2
- Push event: 2
- Fork event: 5