https://github.com/alignmentresearch/harmbench

Fork of HarmBench for getting R2D2 working

Last synced: 10 months ago · JSON representation

Repository

Fork of HarmBench for getting R2D2 working

Basic Info

Host: GitHub
Owner: AlignmentResearch
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage: https://harmbench.org
Size: 101 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Fork of centerforaisafety/HarmBench

Created over 1 year ago · Last pushed about 1 year ago

https://github.com/AlignmentResearch/HarmBench/blob/main/

HarmBench

# HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

#### A fast, scalable, and open-source framework for evaluating automated red teaming methods and LLM attacks/defenses ## Table of Contents - [ Latest News ](#-latest-news-) - [ What is HarmBench ](#%EF%B8%8F-what-is-harmbench-%EF%B8%8F) - [ Overview ](#-overview-) - [ Quick Start ](#-quick-start-) - [ Installation](#%EF%B8%8F-installation) - [ Running the Evaluation Pipeline](#%EF%B8%8F-running-the-evaluation-pipeline) - [ Using your own models in HarmBench](#-using-your-own-models-in-harmbench) - [ Using your own red teaming methods in HarmBench](#-using-your-own-red-teaming-methods-in-harmbench) - [ Classifiers](#-classifiers) - [ Documentation ](#-documentation-) - [ HarmBench's Roadmap ](#-harmbench-roadmap-) - [ Acknowledgement and Citation ](#-acknowledgements-and-citation-) ## Latest News * ***[2024/02/26] [HarmBench 1.0: adversarial training code+model, precomputed test cases, and various improvements](https://github.com/centerforaisafety/HarmBench/pull/19)*** * ***[2024/02/06] [Initial release of HarmBench with 33 evaluated LLMs and 18 red teaming methods]()*** ## What is HarmBench Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess these methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify key considerations previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of $18$ red teaming methods and $33$ target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. ## Overview

There are two primary ways to use HarmBench: (1) evaluating red teaming methods against a set of LLMs, and (2) evaluating LLMs against a set of red teaming methods. These use cases are both supported by the same evaluation pipeline, illustrated above. The process for running this evaluation pipeline is described below in the Quick Start section and in the [evaluation pipeline docs](./docs/evaluation_pipeline.md) HarmBench has out-of-the-box support for transformers-compatible LLMs, numerous closed-source APIs, and several multimodal models. To add new red teaming methods and models that aren't compatible with transformers, it may help to learn more about how the codebase is structured. We provide a set of docs to help with more advanced usage. ## Quick Start ### Installation ```bash git clone https://github.com/centerforaisafety/HarmBench.git cd HarmBench pip install -r requirements.txt python -m spacy download en_core_web_sm ``` ### Running The Evaluation Pipeline The HarmBench evaluation pipeline consists of three steps (with an optional merging step), shown in the figure above: (1) generating test cases, (1.5) merging test cases, (2) generating completions, (3) evaluating completions. We provide two ways of running the evaluation pipeline: - A higher-level run_pipeline.py utility that automates parts of the pipeline - Scripts for running individual steps of the pipeline manually #### Using run_pipeline.py The easiest way to run the evaluation pipeline is with `./scripts/run_pipeline.py`. Examples are given below. ```bash # Run GCG against all compatible models using a SLURM cluster python ./scripts/run_pipeline.py --methods GCG --models all --step all --mode slurm # Run all compatible attacks against Llama 2 7B Chat using a SLURM cluster python ./scripts/run_pipeline.py --methods all --models llama2_7b --step all --mode slurm # --- # Generate test cases for a subset of methods and models using a SLURM cluster python ./scripts/run_pipeline.py --methods ZeroShot,PEZ,TAP --models baichuan2_7b,mistral_7b,llama2_70b --step 1 --mode slurm # Merge test cases for the above methods on a local machine python ./scripts/run_pipeline.py --methods ZeroShot,PEZ,TAP --models baichuan2_7b,mistral_7b,llama2_70b --step 1.5 --mode local # Generate and evaluate completions for the above methods and models using a SLURM cluster python ./scripts/run_pipeline.py --methods ZeroShot,PEZ,TAP --models baichuan2_7b,mistral_7b,llama2_70b --step 2_and_3 --mode slurm ``` In the `slurm` mode, jobs are submitted with appropriate dependencies. In the `local` mode, steps of the pipeline are sequentially run on the current machine. In the `local_parallel` mode, Ray is used to parallelize jobs across GPUs in a single machine without needing to install Slurm. We recommend using the `slurm` mode (or submitting a PR for a different cluster manager) to enable parallelizing test case generation across multiple machines. For single machines with multiple GPUs, we recommend using `local_parallel` mode. Additional details on how to use `run_pipeline.py` are in the [evaluation pipeline docs](./docs/evaluation_pipeline.md). #### Running individual steps For more control over the evaluation pipeline, you can manually run individual steps using `scripts/generate_test_cases.sh`, `scripts/merge_test_cases.sh`, `scripts/generate_completions.sh`, and `scripts/evaluate_completions.sh`. For more information on how to do this, see the [evaluation pipeline docs](./docs/evaluation_pipeline.md). ### Using your own models in HarmBench You can easily add new Hugging Face transformers models in [configs/model_configs/models.yaml](configs/model_configs/models.yaml) by simply adding an entry for your model. This model can then be directly evaluated on most red teaming methods without modifying the method configs (using our dynamic experiment config parsing code, described in [./docs/configs.md](./docs/configs.md)). Some methods (AutoDAN, PAIR, TAP) require manually adding experiment configs for new models. ### Using your own red teaming methods in HarmBench All of our red teaming methods are implemented in [baselines](baselines), imported through [baselines/__init__.py](baselines/__init__.py), and managed by [configs/method_configs](configs/method_configs). You can easily improve on top of existing red teaming methods or add new methods by simply making a new subfolder in the `baselines` directory. New attacks are required to implement the methods in the `RedTeamingMethod` class in `baselines/baseline.py`. ## Classifiers HarmBench provide 3 classifier models: - [cais/HarmBench-Llama-2-13b-cls](https://huggingface.co/cais/HarmBench-Llama-2-13b-cls) for standard and contextual behaviors. Example to use can be found [here](https://github.com/centerforaisafety/HarmBench/blob/main/notebooks/run_classifier.ipynb) - [cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors](https://huggingface.co/cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors) for multimodal behaviors. Example to use can be found [here](https://github.com/centerforaisafety/HarmBench/blob/main/notebooks/run_classifier_multimodal.ipynb) - [cais/HarmBench-Mistral-7b-val-cls](https://huggingface.co/cais/HarmBench-Mistral-7b-val-cls) is a validation classifier and support standard, contextual and multimodal behaviors. ## Documentation For more details documentation about each component in the framework - [**Codebase Structure**](./docs/codebase_structure.md) - [**Method and Model Configs**](./docs/configs.md) - [**Evaluation Pipeline**](./docs/evaluation_pipeline.md) ## HarmBench Roadmap We plan to update to HarmBench over the next few weeks with the following: - Tutorials for adding attacks and defenses - Validation classifiers - Additional red teaming methods - Additional target models - Additional harmful behaviors - Support for system prompt attacks ## Acknowledgements and Citation We thank the following open-source repositories. If aspects of these repositories appearing in HarmBench are useful to you in your research, we ask that you consider citing the accompanying papers. [1] https://github.com/llm-attacks/llm-attacks [2] https://github.com/patrickrchao/JailbreakingLLMs [3] https://github.com/RICommunity/TAP [4] https://github.com/SheltonLiu-N/AutoDAN [5] https://github.com/lm-sys/FastChat/tree/main/fastchat [6] https://github.com/ray-project/ray [7] https://github.com/vllm-project/vllm [8] https://github.com/huggingface/transformers [10] https://github.com/CHATS-lab/persuasive_jailbreaker [11] https://github.com/huggingface/alignment-handbook If you find HarmBench useful in your research, please consider citing our [paper](https://arxiv.org/abs/2402.04249): ``` @article{mazeika2024harmbench, title={HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal}, author={Mantas Mazeika and Long Phan and Xuwang Yin and Andy Zou and Zifan Wang and Norman Mu and Elham Sakhaee and Nathaniel Li and Steven Basart and Bo Li and David Forsyth and Dan Hendrycks}, year={2024}, eprint={2402.04249}, archivePrefix={arXiv}, primaryClass={cs.LG} } ```

Owner

Name: FAR AI
Login: AlignmentResearch
Kind: organization
Email: hello@far.ai

Website: https://far.ai
Repositories: 16
Profile: https://github.com/AlignmentResearch

FAR AI is an alignment research non-profit working to ensure AI systems are trustworthy and beneficial to society.

GitHub Events

Total

Delete event: 1
Issue comment event: 2
Member event: 2
Push event: 54
Pull request review comment event: 11
Pull request review event: 3
Pull request event: 2
Create event: 2

Last Year

Delete event: 1
Issue comment event: 2
Member event: 2
Push event: 54
Pull request review comment event: 11
Pull request review event: 3
Pull request event: 2
Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/alignmentresearch/harmbench

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/AlignmentResearch/HarmBench/blob/main/

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels