detect-clip-backdoor-samples
[ICLR2025] Detecting Backdoor Samples in Contrastive Language Image Pretraining
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 14.3%, to scientific vocabulary)
Repository
[ICLR2025] Detecting Backdoor Samples in Contrastive Language Image Pretraining
Basic Info
- Host: GitHub
- Owner: HanxunH
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://hanxunh.github.io/Detect-CLIP-Backdoor-Samples/
- Size: 33.5 MB
Statistics
- Stars: 7
- Watchers: 1
- Forks: 2
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Detecting Backdoor Samples in Contrastive Language Image Pretraining
Code for the ICLR 2025 paper "Detecting Backdoor Samples in Contrastive Language Image Pretraining".
In this work, we introduce a simple yet highly efficient detection approach for web-scale datasets, specifically designed to detect backdoor samples in CLIP. Our method is highly scalable and capable of handling datasets ranging from millions to billions of samples.
- Key Insight: We identify a critical weakness of CLIP backdoor samples, rooted in the sparsity of their representation within their local neighborhood (see the figure below). This property enables the use of highly accurate and efficient local density-based detectors; a minimal illustrative sketch follows this list.
- Comprehensive Evaluation: We conduct a systematic study on the detectability of poisoning backdoor attacks on CLIP and demonstrate that existing detection methods, designed for supervised learning, often fail when applied to CLIP.
- Practical Implication: We uncover unintentional (natural) backdoors in the CC3M dataset, which have been injected into a popular open-source model released by OpenCLIP.
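To make the local-density insight concrete, below is a minimal, hedged sketch of a k-nearest-neighbor distance score over CLIP image embeddings. It illustrates the idea only; the function and the use of raw k-distance are our assumptions, not the repository's DAODetector.

```python
# Illustrative sketch of the local-density idea; NOT the repo's DAODetector.
import torch

def k_distance_scores(embeddings: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Distance to the k-th nearest neighbor within the batch.

    Backdoor samples tend to be sparse in their local neighborhood of the
    CLIP embedding space, so they receive larger scores.
    """
    # Pairwise Euclidean distances, using the higher-precision mode that the
    # usage example below also suggests.
    dists = torch.cdist(embeddings, embeddings,
                        compute_mode='donot_use_mm_for_euclid_dist')
    # Take the k+1 smallest to skip the zero self-distance on the diagonal.
    knn_dists, _ = dists.topk(k + 1, largest=False, dim=1)
    return knn_dists[:, -1]  # distance to the k-th real neighbor

# Example with random features standing in for CLIP image embeddings:
feats = torch.nn.functional.normalize(torch.randn(512, 768), dim=1)
scores = k_distance_scores(feats, k=16)  # shape [512]; higher = more suspicious
```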
Using the detection method on a pretrained CLIP encoder and its training images
We provide a collection of detectors for identifying backdoor samples in web-scale datasets. Below, we include examples to help you quickly get started with their usage.
```python
# model:  CLIP encoder trained on these images (using the OpenCLIP implementation)
# images: a randomly sampled batch of training images [b, c, h, w].
#         The larger the batch, the better.
# Note: if the CLIP encoder requires input normalization, ensure that
#       `images` are normalized accordingly.
import backdoor_sample_detector

compute_mode = 'donot_use_mm_for_euclid_dist'  # better precision
use_ddp = False  # change to True if using DDP

detector = backdoor_sample_detector.DAODetector(
    k=16, est_type='mle', gather_distributed=use_ddp, compute_mode=compute_mode
)
scores = detector(model=model, images=images)  # tensor with shape [b]
# A higher score indicates the sample is more likely to be a backdoor sample.
```
- We use all other samples within the batch as references for local neighborhood selection when calculating the scores. Alternatively, dedicated reference sets can also be used; for details, refer to the `get_pair_wise_distance` function.
- The current implementation assumes that the randomly sampled batch reflects the real poisoning rate of the full dataset. However, users may also employ a custom reference set for local neighborhood selection (a hedged sketch follows below). For further analysis, see Appendix B.5 of the paper.
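As a hedged sketch of this reference-set variant (the function name and signature below are illustrative assumptions, not the repo's `get_pair_wise_distance`):

```python
# Sketch: k-distance against a dedicated clean reference set instead of the
# batch itself. Names and structure here are illustrative assumptions.
import torch

def k_distance_vs_reference(batch_feats: torch.Tensor,
                            ref_feats: torch.Tensor,
                            k: int = 16) -> torch.Tensor:
    dists = torch.cdist(batch_feats, ref_feats)        # [B, R]
    knn_dists, _ = dists.topk(k, largest=False, dim=1)  # k nearest references
    return knn_dists[:, -1]                            # [B]; higher = more suspicious
```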
The unintentional (natural) backdoor samples found in CC3M and reverse-engineered from the OpenCLIP model (RN50 trained on CC12M)
We applied our detection method to a real-world web-scale dataset and identified several potential unintentional (natural) backdoor samples. Using these samples, we successfully reverse-engineered the corresponding trigger.
- These images appear 798 times in the dataset, accounting for approximately 0.03% of the CC3M dataset.
- These images share similar content and the same caption: “the birthday cake with candles in the form of a number icon.”
- We suspect that these images are natural (unintentional) backdoor samples that have been learned by models trained on the Conceptual Captions dataset.
Validate the reverse-engineered trigger
The following commands apply the trigger to the entire ImageNet validation set using the RN50 CLIP encoder pre-trained on cc12m, evaluated on the zero-shot classification task. An additional class with the target caption (“the birthday cake with candles in the form of a number icon”) is added. This setup is expected to confirm that the trigger achieves a 98.8% Attack Success Rate (ASR).
```shell
python3 birthday_cake_example.py --dataset ImageNet \
    --data_path PATH/TO/YOUR/DATASET \
    --cache_dir PATH/TO/YOUR/CHECKPOINT
```
To use the default path, simply drop the `--cache_dir` argument.
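The evaluation setup can be sketched as follows. This is a hedged illustration, not the repo's script: the prompt template and the truncated class list are our assumptions.

```python
# Hedged sketch of the ASR evaluation setup described above.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('RN50', pretrained='cc12m')
tokenizer = open_clip.get_tokenizer('RN50')
model.eval()

target_caption = "the birthday cake with candles in the form of a number icon"
imagenet_classnames = ['tench', 'goldfish']  # stand-in; use all 1,000 classes in practice
class_prompts = [f"a photo of a {c}" for c in imagenet_classnames]
class_prompts.append(target_caption)  # the extra class for the backdoor target
target_idx = len(class_prompts) - 1

with torch.no_grad():
    text_feats = model.encode_text(tokenizer(class_prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def predict(images: torch.Tensor) -> torch.Tensor:
    """Zero-shot prediction for a batch of preprocessed (trigger-stamped) images."""
    with torch.no_grad():
        img_feats = model.encode_image(images)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        return (img_feats @ text_feats.T).argmax(dim=-1)

# ASR = fraction of trigger-stamped validation images predicted as target_idx.
```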
What if there are no backdoor samples in the training set?
One might ask: what if the dataset is completely clean? We ran detection in the same way on the "clean" CC3M dataset, without simulating an adversary poisoning the training set. Beyond identifying potential natural backdoor samples, our detector can also flag noisy samples. For instance, many URLs in web-scale datasets have expired and now serve placeholder images, while the dataset still pairs those URLs with their original captions (see also Carlini's paper discussing this). After retrieval from the web, this mismatch between image content and text description creates inconsistencies. Using our detector, we can easily identify these mismatched samples as well. A collection of such samples is provided below.
Quick start
We provide a notebook for a quick-start demonstration. While we did not explicitly experiment with the detection performance at test time in the paper, our method should remain effective in such scenarios.
In QuickStart.ipynb, we include an example of test-time detection using the pre-trained model from our paper. For simplicity, we assume a low poisoning rate, allowing us to use the default implementation, which computes the backdoor score using the same batch of data as a reference. In cases where this assumption does not hold, using a small clean subset as a reference may be necessary.
The pre-trained weights can be found in this Google Drive link. Note: These pre-trained models contain backdoors.
HuggingFace
A collection of pre-trained models with injected backdoor triggers is available on HuggingFace. An example demonstrating how to use these models can be found in the HuggingFaceExample.ipynb notebook. These models correspond to the results reported in Tables 1, 10, and 11 of our paper, and they can be used for quick verification of backdoor sample detection or for experiments on detecting backdoored models.
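Loading one of these checkpoints might look like the following minimal sketch; the hub id is a placeholder, not a real model id (see the HuggingFaceExample.ipynb notebook for the actual usage):

```python
# Hedged sketch; 'ORG/MODEL_ID' is a placeholder for an id from the collection.
import open_clip

# OpenCLIP can load checkpoints directly from the HuggingFace Hub via the
# 'hf-hub:' prefix.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:ORG/MODEL_ID')
tokenizer = open_clip.get_tokenizer('hf-hub:ORG/MODEL_ID')
```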
Reproduce results from the paper
Due to the dynamic nature of web-scale datasets, some URLs may expire, making it difficult to reproduce the exact clean accuracy. However, the attack success rate and detection results remain unaffected.
For example, in this work, we successfully reproduced the results reported by Carlini & Terzis (2022), except for clean accuracy, as we could not access the complete CC3M dataset. Specifically, we were only able to retrieve 2.3 million image-text pairs from CC3M due to expired URLs.
- Step 1: Install the required packages from `requirements.txt`.
- Step 2: Prepare the datasets. Refer to img2dataset for guidance.
- Step 3: Check the `*.yaml` files in the `configs` folder and fill in the path to the dataset.
- Step 4: Run the following commands for pre-training, extracting backdoor scores, and calculating detection performance. The default implementation uses Distributed Data Parallel (DDP) within a SLURM environment. Adjustments may be necessary depending on your hardware setup. A non-DDP implementation is also provided.
```console
# Pre-training
srun python3 main_clip.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER
```
A metadata file named `train_poison_info.json` will be generated to record which samples were randomly selected as backdoor samples, along with additional information such as the location of the trigger in the image and the poisoned target text description. This metadata is essential for subsequent detection steps to "recreate" the poisoning set.
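For illustration, turning this metadata into ground-truth poison labels for evaluation might look like the sketch below; the key name is an assumption, so inspect the generated file for the actual schema.

```python
# Hedged sketch: 'poisoned_indices' is an assumed key, not the file's exact
# schema; inspect the generated JSON for the real field names.
import json

with open('train_poison_info.json') as f:
    poison_info = json.load(f)

poisoned = set(poison_info['poisoned_indices'])  # assumed key
num_train_samples = 2_300_000  # e.g., the retrieved CC3M subset size
labels = [1 if i in poisoned else 0 for i in range(num_train_samples)]
```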
```console
# Run detection and compute the backdoor scores
# Choose detectors from ['CD', 'IsolationForest', 'LID', 'KDistance', 'SLOF', 'DAO']
srun python3 extract_bd_scores.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER \
    --detectors DAO
```
A `*_scores.h5` file will be generated for the selected detector. It contains a list of scores, one per sample, where the index in the list corresponds to the sample's index in the training dataset.
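Reading the scores back might look like this minimal sketch; the dataset key inside the file is an assumption, so list the keys first to confirm.

```python
# Hedged sketch for inspecting a *_scores.h5 file (e.g., from the DAO detector).
import h5py
import numpy as np

with h5py.File('DAO_scores.h5', 'r') as f:    # filename is an assumption
    print(list(f.keys()))                     # confirm the actual dataset name
    scores = np.array(f[list(f.keys())[0]])   # read the first dataset
# scores[i] is the backdoor score of training sample i.
```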
```console
# Compute detection performance
python3 process_detection_scores.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER
```
This process computes the detection performance in terms of the area under the receiver operating characteristic curve (AUROC) for all detectors. A method will be skipped if its corresponding `*_scores.h5` file is missing.
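Given per-sample scores and ground-truth poison labels (see the sketches above), the AUROC computation itself is a one-liner; a minimal sketch with scikit-learn:

```python
# Minimal AUROC sketch; `labels` (1 = poisoned) and `scores` are the assumed
# variables from the metadata and *_scores.h5 sketches above.
from sklearn.metrics import roc_auc_score

auroc = roc_auc_score(labels, scores)  # poisoned samples should score higher
print(f'AUROC: {auroc:.4f}')
```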
Citation
If you use the code or pre-trained models in your work, please cite the accompanying paper:
```
@inproceedings{huang2025detecting,
  title={Detecting Backdoor Samples in Contrastive Language Image Pretraining},
  author={Hanxun Huang and Sarah Erfani and Yige Li and Xingjun Ma and James Bailey},
  booktitle={ICLR},
  year={2025},
}
```
Acknowledgements
This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.
Part of the code is based on the following repos:
- https://github.com/mlfoundations/open_clip
- https://github.com/BigML-CS-UCLA/RoCLIP
- https://github.com/HangerYang/SafeCLIP
- https://github.com/bboylyg/Multi-Trigger-Backdoor-Attacks
- https://github.com/HanxunH/CognitiveDistillation
Owner
- Name: Hanxun Huang
- Login: HanxunH
- Kind: user
- Location: Melbourne / Beijing
- Company: The University of Melbourne
- Repositories: 7
- Profile: https://github.com/HanxunH
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Huang"
    given-names: "Hanxun"
    orcid: "https://orcid.org/0000-0002-2793-6680"
  - family-names: "Erfani"
    given-names: "Sarah"
    orcid: "https://orcid.org/0000-0003-0885-0643"
  - family-names: "Li"
    given-names: "Yige"
    orcid: "https://orcid.org/0000-0001-5032-8571"
  - family-names: "Ma"
    given-names: "Xingjun"
    orcid: "https://orcid.org/0000-0003-2099-4973"
  - family-names: "Bailey"
    given-names: "James"
    orcid: "https://orcid.org/0000-0002-3769-3811"
title: "Detecting Backdoor Samples in Contrastive Language Image Pretraining"
version: 0.0.1
date-released: 2025-01-23
url: "https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples"
preferred-citation:
  type: conference-paper
  title: "Detecting Backdoor Samples in Contrastive Language Image Pretraining"
  authors:
    - family-names: "Huang"
      given-names: "Hanxun"
      orcid: "https://orcid.org/0000-0002-2793-6680"
    - family-names: "Erfani"
      given-names: "Sarah"
      orcid: "https://orcid.org/0000-0003-0885-0643"
    - family-names: "Li"
      given-names: "Yige"
      orcid: "https://orcid.org/0000-0001-5032-8571"
    - family-names: "Ma"
      given-names: "Xingjun"
      orcid: "https://orcid.org/0000-0003-2099-4973"
    - family-names: "Bailey"
      given-names: "James"
      orcid: "https://orcid.org/0000-0002-3769-3811"
  collection-title: "ICLR"
  year: 2025
```
GitHub Events
Total
- Issues event: 3
- Watch event: 10
- Issue comment event: 7
- Member event: 1
- Public event: 1
- Push event: 14
- Fork event: 2
Last Year
- Issues event: 3
- Watch event: 10
- Issue comment event: 7
- Member event: 1
- Public event: 1
- Push event: 14
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- akshitjindal1 (1)