detect-clip-backdoor-samples
[ICLR2025] Detecting Backdoor Samples in Contrastive Language Image Pretraining
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 14.3%, to scientific vocabulary)
Repository
[ICLR2025] Detecting Backdoor Samples in Contrastive Language Image Pretraining
Basic Info
- Host: GitHub
- Owner: HanxunH
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://hanxunh.github.io/Detect-CLIP-Backdoor-Samples/
- Size: 33.5 MB
Statistics
- Stars: 7
- Watchers: 1
- Forks: 2
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Detecting Backdoor Samples in Contrastive Language Image Pretraining
Code for the ICLR 2025 paper "Detecting Backdoor Samples in Contrastive Language Image Pretraining".
In this work, we introduce a simple yet highly efficient detection approach for web-scale datasets, specifically designed to detect backdoor samples in CLIP. Our method is highly scalable and capable of handling datasets ranging from millions to billions of samples.
- Key Insight: We identify a critical weakness of CLIP backdoor samples, rooted in the sparsity of their representation within their local neighborhood (see the figure below). This property enables the use of highly accurate and efficient local density-based detectors; a minimal illustrative sketch follows this list.
- Comprehensive Evaluation: We conduct a systematic study on the detectability of poisoning backdoor attacks on CLIP and demonstrate that existing detection methods, designed for supervised learning, often fail when applied to CLIP.
- Practical Implication: We uncover unintentional (natural) backdoors in the CC3M dataset, which have been injected into a popular open-source model released by OpenCLIP.
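To make the local-density insight concrete, below is a minimal, hedged sketch of a k-nearest-neighbor distance score over CLIP image embeddings. It illustrates the idea only; the function and the use of raw k-distance are our assumptions, not the repository's DAODetector.

```python
# Illustrative sketch of the local-density idea; NOT the repo's DAODetector.
import torch

def k_distance_scores(embeddings: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Distance to the k-th nearest neighbor within the batch.

    Backdoor samples tend to be sparse in their local neighborhood of the
    CLIP embedding space, so they receive larger scores.
    """
    # Pairwise Euclidean distances, using the higher-precision mode that the
    # usage example below also suggests.
    dists = torch.cdist(embeddings, embeddings,
                        compute_mode='donot_use_mm_for_euclid_dist')
    # Take the k+1 smallest to skip the zero self-distance on the diagonal.
    knn_dists, _ = dists.topk(k + 1, largest=False, dim=1)
    return knn_dists[:, -1]  # distance to the k-th real neighbor

# Example with random features standing in for CLIP image embeddings:
feats = torch.nn.functional.normalize(torch.randn(512, 768), dim=1)
scores = k_distance_scores(feats, k=16)  # shape [512]; higher = more suspicious
```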
Using the detection method on a pretrained CLIP encoder and its training images
We provide a collection of detectors for identifying backdoor samples in web-scale datasets. Below, we include examples to help you quickly get started with their usage.
```python
# model:  CLIP encoder trained on these images (using the OpenCLIP implementation)
# images: a randomly sampled batch of training images [b, c, h, w].
#         The larger the batch, the better.
# Note: if the CLIP encoder requires input normalization, ensure that
#       `images` are normalized accordingly.
import backdoor_sample_detector

compute_mode = 'donot_use_mm_for_euclid_dist'  # better precision
use_ddp = False  # change to True if using DDP

detector = backdoor_sample_detector.DAODetector(
    k=16, est_type='mle', gather_distributed=use_ddp, compute_mode=compute_mode
)
scores = detector(model=model, images=images)  # tensor with shape [b]
# A higher score indicates the sample is more likely to be a backdoor sample.
```
- We use all other samples within the batch as references for local neighborhood selection when calculating the scores. Alternatively, dedicated reference sets can also be used; for details, refer to the `get_pair_wise_distance` function.
- The current implementation assumes that the randomly sampled batch reflects the real poisoning rate of the full dataset. However, users may also employ a custom reference set for local neighborhood selection (a hedged sketch follows below). For further analysis, see Appendix B.5 of the paper.
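As a hedged sketch of this reference-set variant (the function name and signature below are illustrative assumptions, not the repo's `get_pair_wise_distance`):

```python
# Sketch: k-distance against a dedicated clean reference set instead of the
# batch itself. Names and structure here are illustrative assumptions.
import torch

def k_distance_vs_reference(batch_feats: torch.Tensor,
                            ref_feats: torch.Tensor,
                            k: int = 16) -> torch.Tensor:
    dists = torch.cdist(batch_feats, ref_feats)        # [B, R]
    knn_dists, _ = dists.topk(k, largest=False, dim=1)  # k nearest references
    return knn_dists[:, -1]                            # [B]; higher = more suspicious
```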
The unintentional (natural) backdoor samples found in CC3M and reverse-engineered from the OpenCLIP model (RN50 trained on CC12M)
We applied our detection method to a real-world web-scale dataset and identified several potential unintentional (natural) backdoor samples. Using these samples, we successfully reverse-engineered the corresponding trigger.
- These images appear 798 times in the dataset, accounting for approximately 0.03% of the CC3M dataset.
- These images share similar content and the same caption: “the birthday cake with candles in the form of a number icon.”
- We suspect that these images are natural (unintentional) backdoor samples that have been learned by models trained on the Conceptual Captions dataset.
Validate the reverse-engineered trigger
The following commands apply the trigger to the entire ImageNet validation set using the RN50 CLIP encoder pre-trained on cc12m, evaluated on the zero-shot classification task. An additional class with the target caption (“the birthday cake with candles in the form of a number icon”) is added. This setup is expected to confirm that the trigger achieves a 98.8% Attack Success Rate (ASR).
```shell
python3 birthday_cake_example.py --dataset ImageNet \
    --data_path PATH/TO/YOUR/DATASET \
    --cache_dir PATH/TO/YOUR/CHECKPOINT
```
To use the default path, simply drop the `--cache_dir` argument.
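The evaluation setup can be sketched as follows. This is a hedged illustration, not the repo's script: the prompt template and the truncated class list are our assumptions.

```python
# Hedged sketch of the ASR evaluation setup described above.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('RN50', pretrained='cc12m')
tokenizer = open_clip.get_tokenizer('RN50')
model.eval()

target_caption = "the birthday cake with candles in the form of a number icon"
imagenet_classnames = ['tench', 'goldfish']  # stand-in; use all 1,000 classes in practice
class_prompts = [f"a photo of a {c}" for c in imagenet_classnames]
class_prompts.append(target_caption)  # the extra class for the backdoor target
target_idx = len(class_prompts) - 1

with torch.no_grad():
    text_feats = model.encode_text(tokenizer(class_prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def predict(images: torch.Tensor) -> torch.Tensor:
    """Zero-shot prediction for a batch of preprocessed (trigger-stamped) images."""
    with torch.no_grad():
        img_feats = model.encode_image(images)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        return (img_feats @ text_feats.T).argmax(dim=-1)

# ASR = fraction of trigger-stamped validation images predicted as target_idx.
```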
What if there are no backdoor samples in the training set?
One might ask: what if the dataset is completely clean? We ran detection in the same way on the "clean" CC3M dataset, without simulating an adversary poisoning the training set. Beyond identifying potential natural backdoor samples, our detector can also flag noisy samples. For instance, many URLs in web-scale datasets have expired and now serve placeholder images, while the dataset still pairs those URLs with their original captions (see also Carlini's paper discussing this). After retrieval from the web, this mismatch between image content and text description creates inconsistencies. Using our detector, we can easily identify these mismatched samples as well. A collection of such samples is provided below.
Quick start
We provide a notebook for a quick-start demonstration. While we did not explicitly experiment with the detection performance at test time in the paper, our method should remain effective in such scenarios.
In QuickStart.ipynb, we include an example of test-time detection using the pre-trained model from our paper. For simplicity, we assume a low poisoning rate, allowing us to use the default implementation, which computes the backdoor score using the same batch of data as a reference. In cases where this assumption does not hold, using a small clean subset as a reference may be necessary.
The pre-trained weights can be found in this Google Drive link. Note: These pre-trained models contain backdoors.
HuggingFace
A collection of pre-trained models with injected backdoor triggers is available on HuggingFace. An example demonstrating how to use these models can be found in the HuggingFaceExample.ipynb notebook. These models correspond to the results reported in Tables 1, 10, and 11 of our paper, and they can be used for quick verification of backdoor sample detection or for experiments on detecting backdoored models.
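Loading one of these checkpoints might look like the following minimal sketch; the hub id is a placeholder, not a real model id (see the HuggingFaceExample.ipynb notebook for the actual usage):

```python
# Hedged sketch; 'ORG/MODEL_ID' is a placeholder for an id from the collection.
import open_clip

# OpenCLIP can load checkpoints directly from the HuggingFace Hub via the
# 'hf-hub:' prefix.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:ORG/MODEL_ID')
tokenizer = open_clip.get_tokenizer('hf-hub:ORG/MODEL_ID')
```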
Reproduce results from the paper
Due to the dynamic nature of web-scale datasets, some URLs may expire, making it difficult to reproduce the exact clean accuracy. However, the attack success rate and detection results remain unaffected.
For example, in this work, we successfully reproduced the results reported by Carlini & Terzis (2022), except for clean accuracy, as we could not access the complete CC3M dataset. Specifically, we were only able to retrieve 2.3 million image-text pairs from CC3M due to expired URLs.
- Step 1: Install the required packages from `requirements.txt`.
- Step 2: Prepare the datasets. Refer to img2dataset for guidance.
- Step 3: Check the `*.yaml` files in the `configs` folder and fill in the path to the dataset.
- Step 4: Run the following commands for pre-training, extracting backdoor scores, and calculating detection performance. The default implementation uses Distributed Data Parallel (DDP) within a SLURM environment. Adjustments may be necessary depending on your hardware setup. A non-DDP implementation is also provided.
```console
# Pre-training
srun python3 main_clip.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER
```
A metadata file named `train_poison_info.json` will be generated to record which samples were randomly selected as backdoor samples, along with additional information such as the location of the trigger in the image and the poisoned target text description. This metadata is essential for subsequent detection steps to "recreate" the poisoning set.
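For illustration, turning this metadata into ground-truth poison labels for evaluation might look like the sketch below; the key name is an assumption, so inspect the generated file for the actual schema.

```python
# Hedged sketch: 'poisoned_indices' is an assumed key, not the file's exact
# schema; inspect the generated JSON for the real field names.
import json

with open('train_poison_info.json') as f:
    poison_info = json.load(f)

poisoned = set(poison_info['poisoned_indices'])  # assumed key
num_train_samples = 2_300_000  # e.g., the retrieved CC3M subset size
labels = [1 if i in poisoned else 0 for i in range(num_train_samples)]
```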
```console
# Run detection and compute the backdoor scores
# Choose detectors from ['CD', 'IsolationForest', 'LID', 'KDistance', 'SLOF', 'DAO']
srun python3 extract_bd_scores.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER \
    --detectors DAO
```
A `*_scores.h5` file will be generated for the selected detector. It contains a list of scores, one per sample, where the index in the list corresponds to the sample's index in the training dataset.
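Reading the scores back might look like this minimal sketch; the dataset key inside the file is an assumption, so list the keys first to confirm.

```python
# Hedged sketch for inspecting a *_scores.h5 file (e.g., from the DAO detector).
import h5py
import numpy as np

with h5py.File('DAO_scores.h5', 'r') as f:    # filename is an assumption
    print(list(f.keys()))                     # confirm the actual dataset name
    scores = np.array(f[list(f.keys())[0]])   # read the first dataset
# scores[i] is the backdoor score of training sample i.
```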
```console
# Compute detection performance
python3 process_detection_scores.py --ddp --dist_eval \
    --exp_name pretrain \
    --exp_path PATH/TO/EXP_FOLDER \
    --exp_config PATH/TO/CONFIG/FOLDER
```
This process computes the detection performance in terms of the area under the receiver operating characteristic curve (AUROC) for all detectors. A method will be skipped if its corresponding `*_scores.h5` file is missing.
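Given per-sample scores and ground-truth poison labels (see the sketches above), the AUROC computation itself is a one-liner; a minimal sketch with scikit-learn:

```python
# Minimal AUROC sketch; `labels` (1 = poisoned) and `scores` are the assumed
# variables from the metadata and *_scores.h5 sketches above.
from sklearn.metrics import roc_auc_score

auroc = roc_auc_score(labels, scores)  # poisoned samples should score higher
print(f'AUROC: {auroc:.4f}')
```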
Citation
If you use the code or pre-trained models in your work, please cite the accompanying paper:
```
@inproceedings{huang2025detecting,
  title={Detecting Backdoor Samples in Contrastive Language Image Pretraining},
  author={Hanxun Huang and Sarah Erfani and Yige Li and Xingjun Ma and James Bailey},
  booktitle={ICLR},
  year={2025},
}
```
Acknowledgements
This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.
Part of the code is based on the following repos:
- https://github.com/mlfoundations/open_clip
- https://github.com/BigML-CS-UCLA/RoCLIP
- https://github.com/HangerYang/SafeCLIP
- https://github.com/bboylyg/Multi-Trigger-Backdoor-Attacks
- https://github.com/HanxunH/CognitiveDistillation
Owner
- Name: Hanxun Huang
- Login: HanxunH
- Kind: user
- Location: Melbourne / Beijing
- Company: The University of Melbourne
- Repositories: 7
- Profile: https://github.com/HanxunH
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Huang"
    given-names: "Hanxun"
    orcid: "https://orcid.org/0000-0002-2793-6680"
  - family-names: "Erfani"
    given-names: "Sarah"
    orcid: "https://orcid.org/0000-0003-0885-0643"
  - family-names: "Li"
    given-names: "Yige"
    orcid: "https://orcid.org/0000-0001-5032-8571"
  - family-names: "Ma"
    given-names: "Xingjun"
    orcid: "https://orcid.org/0000-0003-2099-4973"
  - family-names: "Bailey"
    given-names: "James"
    orcid: "https://orcid.org/0000-0002-3769-3811"
title: "Detecting Backdoor Samples in Contrastive Language Image Pretraining"
version: 0.0.1
date-released: 2025-01-23
url: "https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples"
preferred-citation:
  type: conference-paper
  title: "Detecting Backdoor Samples in Contrastive Language Image Pretraining"
  authors:
    - family-names: "Huang"
      given-names: "Hanxun"
      orcid: "https://orcid.org/0000-0002-2793-6680"
    - family-names: "Erfani"
      given-names: "Sarah"
      orcid: "https://orcid.org/0000-0003-0885-0643"
    - family-names: "Li"
      given-names: "Yige"
      orcid: "https://orcid.org/0000-0001-5032-8571"
    - family-names: "Ma"
      given-names: "Xingjun"
      orcid: "https://orcid.org/0000-0003-2099-4973"
    - family-names: "Bailey"
      given-names: "James"
      orcid: "https://orcid.org/0000-0002-3769-3811"
  collection-title: "ICLR"
  year: 2025
```
GitHub Events
Total
- Issues event: 3
- Watch event: 10
- Issue comment event: 7
- Member event: 1
- Public event: 1
- Push event: 14
- Fork event: 2
Last Year
- Issues event: 3
- Watch event: 10
- Issue comment event: 7
- Member event: 1
- Public event: 1
- Push event: 14
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- akshitjindal1 (1)