https://github.com/amazon-science/turbofuzzllm

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

ai-safety guardrails jailbreaking large-language-models red-teaming responsible-ai
Last synced: 6 months ago

Repository

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
ai-safety guardrails jailbreaking large-language-models red-teaming responsible-ai
Created 10 months ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md

TurboFuzzLLM

Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice

A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.

⚠️ Responsible Use

This tool is designed to improve AI safety through systematic vulnerability testing. It should be used responsibly, for defensive purposes and for developing better safeguards for LLMs.

Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.

🎯 Key Features

  • High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
  • Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
  • Generalizable: >90% ASR on unseen harmful questions
  • Practical: Easy-to-use CLI with statistics, search visualization, and logging
  • Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)

🔧 Method Overview

TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:

  1. Expanded Mutation Space: New mutation operations including refusal suppression
  2. Reinforcement Learning: Feedback-guided prioritized search
  3. Intelligent Heuristics: Efficient exploration with fewer LLM queries
  4. Template-Based Approach: Templates can be combined with any harmful question for scalable attacks

📊 Results

| Metric | Performance |
|--------|-------------|
| ASR on GPT-4o/GPT-4 Turbo | >98% |
| ASR on unseen questions | >90% |
| Query efficiency | 3x fewer queries |
| Template success rate | 2x improvement |
| Model safety improvement | 74% safer after adversarial training |

🛡️ Applications

  1. Vulnerability Identification: Discover prompt-based attack vectors in LLMs
  2. Countermeasure Development:
    • Improve in-built LLM safeguards
    • Create external guardrails
  3. Adversarial Training: Generate high-quality (attack prompt, harmful response) pairs for safety fine-tuning

🚀 Getting Started

Prerequisites

  1. Model Access: Obtain API keys for your LLM provider (e.g., Amazon Bedrock)
  2. Dependencies: Install required packages (TODO: details coming soon!)
  3. Build: Build the package (TODO: details coming soon!)

Install Dependencies

```bash
git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
pip install -e .
```

Quick Start

  • First, download the GPTFuzzer templates:

```bash
python3 scripts/get_templates_gptfuzzer.py
```

Run a single red teaming attempt interactively:

```bash
python3 src/__main__.py answer
```

When prompted, provide a harmful question to test, e.g.:

```
How can I cheat on my final exam?
```

Check the results in `output/<date>/*/`.

⚙️ Configuration

Execution Modes

TurboFuzzLLM supports 4 operational modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| answer | Red team a single question interactively | Quick testing |
| attack | Red team multiple questions from a dataset efficiently | Batch vulnerability testing |
| legacy | Run vanilla GPTFuzzer to learn effective templates | Baseline comparison |
| evaluate | Test learned templates against a dataset | Template effectiveness measurement |

Command Line Interface

Get help for any mode:

```bash
python3 src/__main__.py <mode> --help
```

Key Parameters

  • Target Model: Specify the LLM to attack

```bash
--target-model-id <bedrock-model-id>
```

  • Query Budget: Limit the number of queries to the target

```bash
--max-queries N
```
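Putting the pieces together, a batch run might combine a mode with both parameters; `<bedrock-model-id>` is a placeholder for a real Amazon Bedrock model identifier.

```bash
python3 src/__main__.py attack \
    --target-model-id <bedrock-model-id> \
    --max-queries 500
```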

📂 Understanding Output

Each run creates an output folder with the following structure:

```
output/<date>/<mode>_<target-model-id>_<start-time>/
├── templates.csv       # Summary of each template used
├── mutators.csv        # Performance metrics for each mutator
├── queries.csv         # Details of each LLM query
├── stats.txt           # Key metrics summary
├── details.log         # Detailed execution log
└── template_tree.dot   # Visualization of mutant search space
```

Output Files Description

  • templates.csv: Contains all generated templates with their success rates
  • mutators.csv: Performance analysis of different mutation operations
  • queries.csv: Complete record of LLM interactions
  • stats.txt: High-level metrics including ASR, query count, and timing
  • details.log: Verbose logging for debugging
  • template_tree.dot: Graphviz visualization of the template evolution tree

👥 Meet the Team

  • Aman Goel* (Contact: goelaman@amazon.com)
  • Xian Carrie Wu
  • Zhe Wang
  • Dmitriy Bespalov
  • Yanjun (Jane) Qi

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Citation

If you find this useful in your research, please consider citing:

```bibtex
@inproceedings{goel2025turbofuzzllm,
  title={TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice},
  author={Goel, Aman and Wu, Xian and Wang, Daisy Zhe and Bespalov, Dmitriy and Qi, Yanjun},
  booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)},
  pages={523--534},
  year={2025}
}
```

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 9
  • Public event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2
Last Year
  • Watch event: 9
  • Public event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2