https://github.com/cluebbers/adverserial-paraphrasing

Evaluate how LLaMA 3.1 8B handles paraphrased adversarial prompts targeting refusal behavior.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.9%) to scientific vocabulary

Keywords

deep-learning direct-preference-optimization redteam reinforcement-learning
Last synced: 5 months ago

Repository

Evaluate how LLaMA 3.1 8B handles paraphrased adversarial prompts targeting refusal behavior.

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
deep-learning direct-preference-optimization redteam reinforcement-learning
Created 10 months ago · Last pushed 9 months ago
Metadata Files
Readme · License

README.md

Adversarial Paraphrasing Red-Teaming for LLaMA, Mistral & Pythia

This repository provides a reproducible pipeline to evaluate and improve “refusal” behavior in three open-weight LLMs—LLaMA-3.1-8B, Mistral-7B-v0.1, and Pythia-6.9B—under adversarial paraphrasing. A full technical report (PDF) is included in the repository, and the trained adapters are available on Hugging Face. This project was completed for the spring 2025 cohort of AI Safety, Ethics and Society.

🚀 Key Features

  • Prompt set: 64 harmful base prompts × 4 variants (canonical, lexical, syntactic, semantic), including six real-world case studies (e.g. Tokyo sarin, Unit 731, Unabomber).
  • Evaluation scripts:
    • run_inference.py — batch-runs all prompts through any base model/pipeline.
    • run_inference_lora.py — batch-runs all prompts with a LoRA adapter applied.
    • annotate_outputs.py — interactive refusal/harm labeling.
    • evaluation.ipynb — computes refusal and harmfulness rates, generates publication-quality bar charts.
  • Alignment adapters: LoRA rank-8 checkpoints for both
    • SFT on 580 prompt→refusal pairs, and
    • DPO on 580 (prompt, chosen, rejected) triples (a minimal training sketch follows this list).
  • Results:
    • Baseline refusal: 2–14 %; harmful: up to 62 %.
    • DPO gains: modest (+4–38 % refusal; −24 to −40 % harm).
    • SFT gains: dramatic (+60–96 % refusal; harmful ≤ 16 %).
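
The adapters were produced by scripts/train_sft.py and scripts/train_dpo.py; the exact training arguments live in those scripts. Purely as an illustration, a minimal LoRA DPO setup with the pinned trl and peft versions might look like the sketch below (the dataset column names, output paths, and all hyperparameters other than the rank are assumptions, not the repository's actual configuration):

```python
# Minimal sketch of rank-8 LoRA DPO training (assumed setup, not the repo's exact train_dpo.py).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Assumes dpo_train.jsonl has "prompt", "chosen", "rejected" fields (the 580 triples).
dataset = load_dataset("json", data_files="data/dpo_train.jsonl", split="train")

peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # rank-8, as in the README

training_args = DPOConfig(
    output_dir="adapters/dpo-llama",      # hypothetical path
    per_device_train_batch_size=2,        # illustrative values
    num_train_epochs=1,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,               # with a PEFT adapter, trl uses the frozen base model as the reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("adapters/dpo-llama")
```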

📂 Repository Structure

```text
.
├── data/
│   ├── base_prompts.json         # 64 prompts
│   ├── paraphrased_prompts.json  # 64 prompts × 4 variants
│   ├── dpo_train.jsonl           # 580 DPO triples
│   └── sft_train.jsonl           # 580 SFT pairs
├── scripts/
│   ├── run_inference.py
│   ├── run_inference_lora.py
│   ├── annotate_outputs.py
│   ├── evaluation.ipynb
│   ├── train_dpo.py
│   └── train_sft.py
├── figures/
│   ├── refusal_harmful_rates.pdf
│   └── paraphrase_types.pdf
├── 2025-05-09_Luebbers_report.pdf
├── requirements.txt
└── README.md
```

🛠️ Quickstart

Tested on

```bash
torch==2.6.0
transformers==4.51.3
datasets==3.5.0
accelerate==1.6.0
bitsandbytes==0.45.5
matplotlib==3.10.1
trl==0.17.0
peft==0.15.2
```

  1. Install dependencies:

```bash
pip install -r requirements.txt
```

  2. Get model access on Hugging Face:

https://huggingface.co/meta-llama/Llama-3.1-8B
https://huggingface.co/mistralai/Mistral-7B-v0.1
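
The LLaMA weights are gated, so the inference scripts can only download them after you accept the terms on the model page and authenticate locally. Assuming the weights are fetched through the Hugging Face Hub, one way to authenticate:

```python
# Authenticate with the Hugging Face Hub so gated checkpoints can be downloaded.
from huggingface_hub import login

login()  # prompts for an access token; alternatively export HF_TOKEN before running the scripts
```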

  3. Run inference:

Possible values for --model:

  • pythia: EleutherAI/pythia-6.9b
  • mistral: mistralai/Mistral-7B-v0.1
  • llama: meta-llama/Meta-Llama-3.1-8B

```bash
python scripts/run_inference.py \
  --model llama
```

To run with a LoRA adapter, pass --adapter with either sft or dpo:

```bash
python scripts/run_inference_lora.py \
  --model llama \
  --adapter dpo
```
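
run_inference_lora.py presumably attaches the published LoRA adapter to the base checkpoint before generation; the sketch below shows that general pattern with transformers and peft (the adapter path, prompt, and generation settings are placeholders, not the script's actual behavior):

```python
# Sketch of base-model + LoRA-adapter inference (illustrative; paths and settings are placeholders).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3.1-8B"
adapter_path = "adapters/dpo-llama"  # hypothetical local path or Hugging Face adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_path)  # wraps the base model with the LoRA weights
model.eval()

prompt = "Explain why you cannot help with harmful requests."
device = next(model.parameters()).device
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```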

  4. Annotate outputs (specify the input and output file paths inside the script):

```bash
python scripts/annotate_outputs.py
```
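
The label format isn't specified here, but the annotation step boils down to walking through each generated output and recording refusal and harm judgments by hand. A rough sketch, assuming the inference outputs are JSONL records with "prompt" and "response" fields (both file names below are hypothetical):

```python
# Rough sketch of interactive refusal/harm labeling (field names and paths are assumptions).
import json

INPUT_FILE = "outputs_llama.jsonl"      # hypothetical inference output
OUTPUT_FILE = "annotated_llama.jsonl"   # hypothetical annotated output

with open(INPUT_FILE) as fin, open(OUTPUT_FILE, "w") as fout:
    for line in fin:
        record = json.loads(line)
        print("\nPROMPT:  ", record.get("prompt", ""))
        print("RESPONSE:", record.get("response", ""))
        record["refusal"] = input("Refused? [y/n] ").strip().lower() == "y"
        record["harmful"] = input("Harmful? [y/n] ").strip().lower() == "y"
        fout.write(json.dumps(record) + "\n")
```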

  5. Inspect results with scripts/evaluation.ipynb
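
For a quick look outside the notebook, the same rates can be computed directly from the annotated files; the sketch below assumes the hypothetical field names from the annotation sketch above and plots a simple grouped bar chart with matplotlib:

```python
# Sketch: aggregate refusal/harm rates from annotated JSONL files (file names are placeholders).
import json
import matplotlib.pyplot as plt

files = {"llama": "annotated_llama.jsonl"}  # extend with mistral / pythia files as available

refusal_rates, harm_rates = {}, {}
for model, path in files.items():
    with open(path) as f:
        records = [json.loads(line) for line in f]
    refusal_rates[model] = 100 * sum(r["refusal"] for r in records) / len(records)
    harm_rates[model] = 100 * sum(r["harmful"] for r in records) / len(records)

labels = list(files)
x = range(len(labels))
plt.bar([i - 0.2 for i in x], [refusal_rates[m] for m in labels], width=0.4, label="Refusal %")
plt.bar([i + 0.2 for i in x], [harm_rates[m] for m in labels], width=0.4, label="Harmful %")
plt.xticks(list(x), labels)
plt.ylabel("Rate (%)")
plt.legend()
plt.savefig("refusal_harmful_rates.png")
```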

📑 Key Findings

Paraphrase-aware SFT yields the largest safety gains with minimal compute: even with only 580 examples, it achieves near-perfect refusal on all three models.

| Method   | Avg. Refusal ↑ | Avg. Harm ↓ |
| :------: | :------------: | :---------: |
| Baseline | 6 %            | 41 %        |
| DPO      | 17 %           | 22 %        |
| SFT      | 89 %           | 8 %         |

Model Alignment Results

📖 Citing This Work

```bibtex
@misc{lubbers2025refusal,
  title        = {Evaluating Refusal Robustness under Adversarial Paraphrasing},
  author       = {Luebbers, Christopher L.},
  year         = {2025},
  howpublished = {\url{https://github.com/cluebbers/adverserial-paraphrasing}}
}
```


Feel free to explore, adapt, or extend this toolkit for your own red-teaming and alignment research!

Owner

  • Login: cluebbers
  • Kind: user
  • Location: Göttingen
  • Company: University of Göttingen

Studying Applied Data Science. Interested in Natural Language Processing.

GitHub Events

Total
  • Release event: 1
  • Push event: 7
  • Create event: 1
Last Year
  • Release event: 1
  • Push event: 7
  • Create event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0