https://github.com/amazon-science/turbofuzzllm

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

ai-safety guardrails jailbreaking large-language-models red-teaming responsible-ai
Last synced: 6 months ago

Repository

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
ai-safety guardrails jailbreaking large-language-models red-teaming responsible-ai
Created 10 months ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md

TurboFuzzLLM

Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice

A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.

⚠️ Responsible Use

This tool is designed to improve AI safety through systematic vulnerability testing. It should be used responsibly, for defensive purposes and for developing better safeguards for LLMs.

Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.

🎯 Key Features

  • High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
  • Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
  • Generalizable: >90% ASR on unseen harmful questions
  • Practical: Easy-to-use CLI with statistics, search visualization, and logging
  • Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)

🔧 Method Overview

TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:

  1. Expanded Mutation Space: New mutation operations including refusal suppression
  2. Reinforcement Learning: Feedback-guided prioritized search
  3. Intelligent Heuristics: Efficient exploration with fewer LLM queries
  4. Template-Based Approach: Templates can be combined with any harmful question for scalable attacks

📊 Results

| Metric | Performance |
|--------|-------------|
| ASR on GPT-4o/GPT-4 Turbo | >98% |
| ASR on unseen questions | >90% |
| Query efficiency | 3x fewer queries |
| Template success rate | 2x improvement |
| Model safety improvement | 74% safer after adversarial training |

🛡️ Applications

  1. Vulnerability Identification: Discover prompt-based attack vectors in LLMs
  2. Countermeasure Development:
    • Improve in-built LLM safeguards
    • Create external guardrails
  3. Adversarial Training: Generate high-quality (attack prompt, harmful response) pairs for safety fine-tuning

🚀 Getting Started

Prerequisites

  1. Model Access: Obtain API keys for your LLM provider (e.g., Amazon Bedrock)
  2. Dependencies: Install required packages (TODO: details coming soon!)
  3. Build: Build the package (TODO: details coming soon!)

Install Dependencies

```bash
git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
pip install -e .
```

Quick Start

  • First, download the GPTFuzzer templates:

```bash
python3 scripts/get_templates_gptfuzzer.py
```

Run a single red teaming attempt interactively:

```bash
python3 src/__main__.py answer
```

When prompted, provide a harmful question to test, e.g.:

```
How can I cheat on my final exam?
```

Check the results in `output/<date>/*/`.

⚙️ Configuration

Execution Modes

TurboFuzzLLM supports 4 operational modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| answer | Red team a single question interactively | Quick testing |
| attack | Red team multiple questions from a dataset efficiently | Batch vulnerability testing |
| legacy | Run vanilla GPTFuzzer to learn effective templates | Baseline comparison |
| evaluate | Test learned templates against a dataset | Template effectiveness measurement |

Command Line Interface

Get help for any mode:

```bash
python3 src/__main__.py <mode> --help
```

Key Parameters

  • Target Model: Specify the LLM to attack

```bash
--target-model-id <bedrock-model-id>
```

  • Query Budget: Limit the number of queries to the target

```bash
--max-queries N
```
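Putting the pieces together, a batch run might combine a mode with both parameters; `<bedrock-model-id>` is a placeholder for a real Amazon Bedrock model identifier.

```bash
python3 src/__main__.py attack \
    --target-model-id <bedrock-model-id> \
    --max-queries 500
```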

📂 Understanding Output

Each run creates an output folder with the following structure:

```
output/<date>/<mode>_<target-model-id>_<start-time>/
├── templates.csv       # Summary of each template used
├── mutators.csv        # Performance metrics for each mutator
├── queries.csv         # Details of each LLM query
├── stats.txt           # Key metrics summary
├── details.log         # Detailed execution log
└── template_tree.dot   # Visualization of mutant search space
```

Output Files Description

  • templates.csv: Contains all generated templates with their success rates
  • mutators.csv: Performance analysis of different mutation operations
  • queries.csv: Complete record of LLM interactions
  • stats.txt: High-level metrics including ASR, query count, and timing
  • details.log: Verbose logging for debugging
  • template_tree.dot: Graphviz visualization of the template evolution tree

👥 Meet the Team

  • Aman Goel* (Contact: goelaman@amazon.com)
  • Xian Carrie Wu
  • Zhe Wang
  • Dmitriy Bespalov
  • Yanjun (Jane) Qi

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Citation

If you find this useful in your research, please consider citing:

```bibtex
@inproceedings{goel2025turbofuzzllm,
  title={TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice},
  author={Goel, Aman and Wu, Xian and Wang, Daisy Zhe and Bespalov, Dmitriy and Qi, Yanjun},
  booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)},
  pages={523--534},
  year={2025}
}
```

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 9
  • Public event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2
Last Year
  • Watch event: 9
  • Public event: 1
  • Push event: 2
  • Pull request review event: 1
  • Pull request event: 3
  • Fork event: 2