clip-ood

Official code for the paper "Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?" (ICLR 2024)

https://github.com/brendel-group/clip-ood

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 11
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?

This repository provides the code for all experiments shown in the paper Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?.

Setup

  • Use Python 3.9
  • Run pip install -r requirements.txt

Compute embeddings

Use code from this folder to compute ImageNet and LAION CLIP embeddings.

Compute ImageNet Embeddings

Run src/embeddings/imagenet/main.py to compute image embeddings and store labels. Text embeddings do not need to be computed, since for ImageNet they are simply a fixed 1000 x n class matrix.
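
The point about text embeddings can be sketched as follows: zero-shot classification only needs one text embedding per class, so the text side reduces to a fixed class matrix multiplied against the image embeddings. This is a minimal illustration with random data, not the repository's code; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # assumed CLIP embedding dimension
image_emb = rng.normal(size=(8, d))      # 8 image embeddings
class_emb = rng.normal(size=(1000, d))   # one text embedding per ImageNet class

# Normalize so dot products are cosine similarities.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

logits = image_emb @ class_emb.T         # (8, 1000) similarity matrix
pred = logits.argmax(axis=1)             # zero-shot class predictions
```

Because class_emb is constant across all images, it can be computed once and reused for every evaluation batch.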

Compute LAION Embeddings

We use https://github.com/rom1504/clip-retrieval to compute the LAION embeddings.

Deduplication of LAION400M

Run the src/deduplication scripts to de-duplicate LAION400M down to 200M datapoints. The main scripts are:

  • assign_clusters.py: runs k-means and assigns each CLIP embedding to a cluster.
  • save_cluster_embeddings_and_similarities.py: saves the embeddings of each cluster to a single file.
  • deduplicate.py: deduplicates each cluster and outputs the deduplicated paths per cluster, which can then be combined for sampling.
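
The cluster-then-deduplicate idea can be sketched in a few lines: assign each embedding to its nearest centroid, then within each cluster greedily drop any point whose cosine similarity to an already-kept point exceeds a threshold. This is an illustrative toy version, not the repository's implementation; the threshold, shapes, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))                     # toy CLIP embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# assign_clusters.py analogue: nearest centroid per embedding.
centroids = emb[rng.choice(len(emb), size=4, replace=False)]
clusters = (emb @ centroids.T).argmax(axis=1)

# deduplicate.py analogue: greedy within-cluster near-duplicate removal.
def deduplicate(cluster_emb, threshold=0.95):
    kept = []
    for i, e in enumerate(cluster_emb):
        # keep i only if it is not too similar to any already-kept point
        if all(e @ cluster_emb[j] < threshold for j in kept):
            kept.append(i)
    return kept

kept_per_cluster = {c: deduplicate(emb[clusters == c]) for c in range(4)}
```

Restricting the pairwise comparisons to points within the same cluster is what keeps this tractable at LAION400M scale.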

Compute similarities between datasets and get paths for pruned datasets

Run the src/sims_and_paths scripts to compute similarities of the eval datasets to LAION in CLIP embedding space and to obtain paths.

  • compute_sims.py: computes similarities of one small LAION or ImageNet-train embedding chunk to a given dataset and returns the top-k candidates per eval datapoint.
  • combine_sims.py: combines the per-chunk top-k similarities into the overall top-k candidates in LAION.
  • compute_max_sims.py: computes the max similarity of each datapoint in one small LAION or ImageNet-train embedding chunk to a given dataset.
  • combine_max_sims.py: simply concatenates the max similarities.
  • get_paths_chunk.py: using the candidates that pass the ImageNet-train-to-ImageNet-x threshold, obtains path chunks of the sub-sampled dataset.
  • get_paths.ipynb: combines the path chunks into the paths of the pruned datasets.
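
The chunked top-k scheme above can be illustrated as follows: take the top-k most similar embeddings per chunk, concatenate across chunks, and re-select the overall top-k (whose last column is then the per-datapoint max similarity). This is a hedged toy sketch with random data; the function names and shapes are assumptions, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)
eval_emb = rng.normal(size=(5, 32))                  # toy eval-set embeddings
eval_emb /= np.linalg.norm(eval_emb, axis=1, keepdims=True)
k = 3

def topk_for_chunk(chunk_emb):
    # compute_sims.py analogue: top-k similarities for one LAION chunk.
    chunk_emb = chunk_emb / np.linalg.norm(chunk_emb, axis=1, keepdims=True)
    sims = eval_emb @ chunk_emb.T                    # (n_eval, n_chunk)
    idx = np.argsort(sims, axis=1)[:, -k:]           # per-row top-k indices
    return np.take_along_axis(sims, idx, axis=1)

# combine_sims.py analogue: merge per-chunk top-k into overall top-k.
chunk_sims = [topk_for_chunk(rng.normal(size=(100, 32))) for _ in range(4)]
all_sims = np.concatenate(chunk_sims, axis=1)        # (n_eval, 4 * k)
overall_topk = np.sort(all_sims, axis=1)[:, -k:]
max_sims = overall_topk[:, -1]                       # compute_max_sims analogue
```

Merging per-chunk top-k results is exact for the overall top-k because each chunk's true top-k is a superset of that chunk's contribution to the global top-k.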

Sampling

Run src/sampling/subsample_dataset.py with a given paths set (LAION paths in .npy format) to get the final subsampled LAION dataset.
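
The subsampling step amounts to selecting, from the full dataset index, the entries whose paths appear in the kept-paths array. A minimal sketch of that selection, with hypothetical path values standing in for a real .npy file:

```python
import numpy as np

# Full index of dataset paths and the kept paths (in practice loaded from
# .npy files; these values are illustrative).
all_paths = np.array(["shard0/0001.jpg", "shard0/0002.jpg",
                      "shard1/0003.jpg", "shard1/0004.jpg"])
kept_paths = np.array(["shard0/0002.jpg", "shard1/0003.jpg"])

mask = np.isin(all_paths, kept_paths)   # membership test per path
subsampled = all_paths[mask]            # the final subsampled dataset index
```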

Training

For training on all subsampled datasets we use https://github.com/mlfoundations/open_clip, with the total batch size changed to 33,600.

Eval

Run the src/eval scripts to evaluate a model on several evaluation datasets such as ImageNet-Sketch/Val/R/V2/A and ObjectNet.
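
The evaluation loop across these datasets can be sketched as computing zero-shot top-1 accuracy per dataset from image embeddings, the class text-embedding matrix, and labels. Random data stands in for the real embeddings here; all shapes and helper names are assumptions, not the repository's code.

```python
import numpy as np

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(10, 64))    # toy class text-embedding matrix
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

def top1_accuracy(image_emb, labels):
    # Zero-shot prediction: nearest class embedding by cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    pred = (image_emb @ class_emb.T).argmax(axis=1)
    return float((pred == labels).mean())

results = {}
for name in ["ImageNet-Val", "ImageNet-Sketch", "ImageNet-R", "ObjectNet"]:
    image_emb = rng.normal(size=(50, 64))       # stand-in for real eval data
    labels = rng.integers(0, 10, size=50)
    results[name] = top1_accuracy(image_emb, labels)
```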

Citation

If you find the insights from the paper or our code base useful, please cite:

@inproceedings{mayilvahanan2024does,
  title={Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?},
  author={Prasanna Mayilvahanan and Thadd{\"a}us Wiedemer and Evgenia Rusak and Matthias Bethge and Wieland Brendel},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=tnBaiidobu}
}

Owner

  • Name: brendel-group
  • Login: brendel-group
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this repository in your research, please cite it as below."
authors:
- family-names: "Mayilvahanan"
  given-names: "Prasanna"
  orcid: "https://orcid.org/0009-0005-6728-1307"
- family-names: "Wiedemer"
  given-names: "Thaddäus"
- family-names: "Rusak"
  given-names: "Evgenia"
- family-names: "Bethge"
  given-names: "Matthias"
- family-names: "Brendel"
  given-names: "Wieland"
title: "My Research Software"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2023.11.01
url: "https://github.com/brendel-group/clip-ood"
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "Mayilvahanan"
    given-names: "Prasanna"
    orcid: "https://orcid.org/0009-0005-6728-1307"
  - family-names: "Wiedemer"
    given-names: "Thaddäus"
  - family-names: "Rusak"
    given-names: "Evgenia"
  - family-names: "Bethge"
    given-names: "Matthias"
  - family-names: "Brendel"
    given-names: "Wieland"
  url: "openreview.net/forum?id=tnBaiidobu"
  collection-title: "The Twelfth International Conference on Learning Representations"
  title: "Does CLIP’s generalization performance mainly stem from high train-test similarity?"
  year: 2024

GitHub Events

Total
  • Watch event: 3
  • Member event: 1
Last Year
  • Watch event: 3
  • Member event: 1