https://github.com/bethgelab/datatypeidentification

Code for the ICLR'24 paper: "Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models"

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org, scholar.google)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 7.6%, to scientific vocabulary)

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bethgelab
License: MIT
Default Branch: main
Homepage:
Size: 12.7 KB

Statistics

Stars: 8
Watchers: 12
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme, License

Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

Official code for the ICLR'24 paper "Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models". Authors: Vishaal Udandarao*, Max F. Burg*, Samuel Albanie§, and Matthias Bethge§.

* Equal contribution. Author ordering decided by coin flip. § Joint senior authors.

Introduction

Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. These findings point to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding.
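To make the "basic manipulations" mentioned above concrete, here is a minimal sketch of how two such data-types (a rotation and additive Gaussian noise) could be produced from a single image with NumPy. This is not the authors' dataset pipeline: the function names, the data-type labels `left-rotated` and `gaussian-noise`, the noise level `sigma`, and the uniform gray placeholder image are all assumptions made for illustration.

```python
import numpy as np


def rotate_left(image: np.ndarray) -> np.ndarray:
    """Rotate an H x W x C image by 90 degrees counter-clockwise."""
    return np.rot90(image, k=1, axes=(0, 1))


def add_gaussian_noise(image: np.ndarray, sigma: float = 25.0, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise and clip back to the valid 8-bit range."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


# Hypothetical stand-in for an animal photo: a 4 x 6 RGB image of uniform gray.
image = np.full((4, 6, 3), 128, dtype=np.uint8)

# Each key is an illustrative data-type label; each value is the altered image.
variants = {
    "left-rotated": rotate_left(image),
    "gaussian-noise": add_gaussian_noise(image),
}

for name, altered in variants.items():
    print(name, altered.shape, altered.dtype)
```

In both variants the semantic content (the animal) is unchanged and only the data-type differs, which is the distinction the task asks a model to identify.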

Getting started

Stay tuned! Code and datasets will be updated soon!

SyntheticTypeIdent

The SyntheticTypeIdent dataset can be found here: https://huggingface.co/datasets/bethgelab/SyntheticTypeIdent
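One possible way to fetch the dataset files locally is `snapshot_download` from the `huggingface_hub` package, sketched below. The helper name `download_synthetic_typeident` is hypothetical, and the layout of the downloaded files is not described in this README, so inspect the returned folder before relying on any particular structure.

```python
from typing import Optional

# Dataset repository ID taken from the URL above.
DATASET_ID = "bethgelab/SyntheticTypeIdent"


def download_synthetic_typeident(local_dir: Optional[str] = None) -> str:
    """Download all files of the dataset repository and return the local folder path.

    Hypothetical helper; requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,  # None falls back to the default Hugging Face cache
    )


# Example usage (commented out because it needs network access):
# path = download_synthetic_typeident()
# print("Dataset files are in:", path)
```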

Owner

Name: Bethge Lab
Login: bethgelab
Kind: organization
Location: Tübingen

Website: http://bethgelab.org
Repositories: 23
Profile: https://github.com/bethgelab

Perceiving Neural Networks

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1
