rf100-vl

Code from the paper "Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models"

https://github.com/roboflow/rf100-vl

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

computer-vision multimodal-datasets object-detection object-detection-benchmarks rf100
Last synced: 6 months ago

Repository

Code from the paper "Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models"

Basic Info
  • Host: GitHub
  • Owner: roboflow
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage: https://rf100-vl.org
  • Size: 8.57 MB
Statistics
  • Stars: 76
  • Watchers: 14
  • Forks: 5
  • Open Issues: 1
  • Releases: 0
Topics
computer-vision multimodal-datasets object-detection object-detection-benchmarks rf100
Created 11 months ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

Roboflow 100-VL:
A Multi-Domain Object Detection Benchmark
for Vision-Language Models

Peter Robicheaux¹†, Matvei Popov¹†, Anish Madan², Isaac Robinson¹, Joseph Nelson¹, Deva Ramanan², Neehar Peri²

¹Roboflow   ²Carnegie Mellon University

† Equal Contribution


Introduced in the paper "Roboflow 100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models", RF100-VL is a large-scale collection of 100 multi-modal datasets with diverse concepts not commonly found in VLM pre-training.

The benchmark includes images, with corresponding annotations, from seven domains: flora and fauna, sports, industry, document processing, laboratory imaging, aerial imagery, and miscellaneous datasets covering common real-world detection use cases.

You can use RF100-VL to benchmark fully supervised, semi-supervised, and few-shot object detection models, as well as Vision-Language Models (VLMs) with localization capabilities.

Download RF100-VL

To download RF100-VL, first install the rf100vl pip package:

pip install rf100vl

RF100-VL is hosted on Roboflow Universe, the world's largest repository of annotated computer vision datasets. You will need a free Roboflow Universe API key to download the dataset. Learn how to find your API key.

Export your API key into an environment variable called ROBOFLOW_API_KEY:

export ROBOFLOW_API_KEY=YOUR_KEY
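
If you prefer to set the key from Python (for example, in a notebook), a minimal equivalent sketch, with a placeholder key value:

```python
# Set the API key from Python instead of the shell; the key value is a
# placeholder. Do this before calling any rf100vl download helpers.
import os

os.environ["ROBOFLOW_API_KEY"] = "YOUR_KEY"
```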

Several helper functions are available to download RF100-VL and its subsets. These fall into two categories: functions that retrieve Dataset objects containing the name and category of each project (these start with get_), and data downloaders (these start with download_).

| Data Loader Name           | Dataset Name  |
|----------------------------|---------------|
| get_rf100vl_fsod_projects  | RF100-VL-FSOD |
| get_rf100vl_projects       | RF100-VL      |
| get_rf20vl_fsod_projects   | RF20-VL-FSOD  |
| get_rf20vl_full_projects   | RF20-VL       |
| download_rf100vl_fsod      | RF100-VL-FSOD |
| download_rf100vl           | RF100-VL      |
| download_rf20vl_fsod       | RF20-VL-FSOD  |
| download_rf20vl_full       | RF20-VL       |

Each dataset object has its own download method.
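
For instance, the get_ helpers return project objects that can be downloaded individually. A minimal sketch, assuming each returned object exposes a download method as described above (the `location` keyword is an assumption; check the rf100vl package for the exact signature):

```python
# Hedged sketch: list RF100-VL projects and download the first one via
# its own download method. The `location` keyword is an assumption.
from rf100vl import get_rf100vl_projects

projects = get_rf100vl_projects()
print(f"{len(projects)} projects in RF100-VL")

projects[0].download(location="./rf100-vl/")
```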

Here is an example showing how to download the full dataset:

```python
from rf100vl import download_rf100vl

download_rf100vl(path="./rf100-vl/")
```

The datasets will be downloaded in COCO JSON format to a directory called rf100-vl. Every dataset will be in its own sub-folder.
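
As a quick sanity check after downloading, you can enumerate the per-dataset sub-folders and inspect an annotation file. A minimal sketch, assuming Roboflow's usual COCO export layout with an _annotations.coco.json file per split (the exact layout may vary by dataset):

```python
# Hedged sketch: list downloaded datasets and count images in each
# train split. The train/_annotations.coco.json path is an assumption
# based on Roboflow's typical COCO export layout.
import json
from pathlib import Path

root = Path("./rf100-vl")
for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    ann_file = dataset_dir / "train" / "_annotations.coco.json"
    if ann_file.exists():
        with open(ann_file) as f:
            coco = json.load(f)
        print(f"{dataset_dir.name}: {len(coco['images'])} train images")
```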

CVPR 2025 Workshop Challenge: Few-Shot Object Detection from Annotator Instructions

Organized by: Anish Madan, Neehar Peri, Deva Ramanan

Introduction

This challenge focuses on few-shot object detection (FSOD) with 10 examples of each class provided by a human annotator. Existing FSOD benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning respectively. However, these benchmarks do not reflect how FSOD is deployed in practice.

Rather than pre-training on only a small number of base categories, we argue that it is more practical to download a foundational model (e.g., a vision-language model (VLM) pretrained on web-scale data) and fine-tune it for specific applications. We propose a new FSOD benchmark protocol that evaluates detectors pre-trained on any external dataset (not including the target dataset), and fine-tuned on K-shot annotations per C target classes.
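
To make the protocol concrete, here is a minimal sketch of building a K-shot fine-tuning subset from a COCO-format annotation file. This illustrates the idea only and is not the official split-generation code; the input path is an assumption, and the official splits may sample shots differently:

```python
# Hedged illustration of the K-shot protocol: keep up to K annotations
# per category, plus the images they reference. Not the official code.
import json
import random
from collections import defaultdict

K = 10
with open("train/_annotations.coco.json") as f:  # path is an assumption
    coco = json.load(f)

anns_by_cat = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_cat[ann["category_id"]].append(ann)

kshot_anns = []
for cat_id, anns in anns_by_cat.items():
    kshot_anns.extend(random.sample(anns, min(K, len(anns))))

keep_image_ids = {a["image_id"] for a in kshot_anns}
subset = {
    "images": [im for im in coco["images"] if im["id"] in keep_image_ids],
    "annotations": kshot_anns,
    "categories": coco["categories"],
}
with open("kshot_annotations.json", "w") as f:
    json.dump(subset, f)
```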

We evaluate a subset of 20 datasets from RF100-VL. Each dataset is independently evaluated using AP. RF100-VL includes datasets that are out-of-distribution from typical internet-scale pre-training data, making it particularly challenging (even for VLMs) for Foundational FSOD.

:rotating_light: Top performing teams can win cash prizes! :rotating_light:

:1st_place_medal: 1st Place: $750

:2nd_place_medal: 2nd Place: $500

:3rd_place_medal: 3rd Place: $250

To be eligible for prizes, teams must submit a technical report, open source their code, and provide instructions on how to reproduce their results. Teams must also beat our best performing official baseline to be eligible for prizes. Many thanks to Roboflow for sponsoring prizes!

Benchmarking Protocols

Goal: Develop robust object detectors from the few annotations provided by annotator instructions. The detector should detect object instances of interest in real-world test images.

Environment for model development:
  • Pretraining: Models are allowed to pre-train on any existing datasets.
  • Fine-Tuning: Models can fine-tune on 10 shots from each of RF20-VL-FSOD's datasets.
  • Evaluation: Models are evaluated on RF20-VL-FSOD's test set. Each dataset is evaluated independently.

Evaluation metrics:
  • AP: average precision, averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
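
For local evaluation in this style, pycocotools' COCOeval reports exactly this metric. A minimal sketch, assuming a ground-truth file test_annotations.json and detections in the standard COCO results format in predictions.json (both file names are placeholders):

```python
# Hedged sketch: compute COCO-style AP@[.50:.95] with pycocotools.
# File names are placeholders; predictions.json must be a COCO results
# list of {"image_id", "category_id", "bbox", "score"} dicts.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("test_annotations.json")
coco_dt = coco_gt.loadRes("predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line printed is AP @ IoU=0.50:0.95
```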

Submission Details

Submit a zip file containing one pickle file per dataset. The name of each pickle file should match the name of its dataset. Each pickle file should use the following COCO-style format.

```json
[
  {
    "image_id": int,
    "instances": [
      {"image_id": int, "category_id": int, "bbox": [x, y, width, height], "score": float},
      {"image_id": int, "category_id": int, "bbox": [x, y, width, height], "score": float},
      ...
    ]
  },
  ...
]
```
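
A minimal sketch of writing one such pickle file, with placeholder values and a placeholder file name:

```python
# Hedged sketch: serialize predictions for one dataset in the expected
# layout. All values and the output file name are placeholders.
import pickle

predictions = [
    {
        "image_id": 0,
        "instances": [
            {
                "image_id": 0,
                "category_id": 1,
                "bbox": [10.0, 20.0, 50.0, 40.0],  # [x, y, width, height]
                "score": 0.87,
            },
        ],
    },
]

with open("dataset_name.pkl", "wb") as f:
    pickle.dump(predictions, f)
```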

We've provided a sample submission for your reference. Submissions should be uploaded to our EvalAI server.

Official Baseline

We pre-train Detic on ImageNet-21K, COCO Captions, and LVIS. We evaluate this pre-trained model zero-shot on the datasets in RF20-VL.

Our baseline code is available here.

Timeline

  • Submission opens: March 15th, 2025
  • Submission closes: June 8th, 2025, 11:59 pm Pacific Time
  • The top 3 participants on the leaderboard will be invited to give a talk at the workshop

References

  1. Madan et al. "Revisiting Few-Shot Object Detection with Vision-Language Models". Proceedings of the Conference on Neural Information Processing Systems. 2024.
  2. Zhou et al. "Detecting Twenty-Thousand Classes Using Image-Level Supervision". Proceedings of the European Conference on Computer Vision. 2022.

Acknowledgements

This work was supported in part by compute provided by NVIDIA, and the NSF GRFP (Grant No. DGE2140739).

License

The datasets that comprise RF100-VL are licensed under an Apache 2.0 license.

Citation

If you find our paper and code repository useful, please cite us:

```bib
@article{robicheaux2025roboflow100vl,
  title={Roboflow100-vl: A multi-domain object detection benchmark for vision-language models},
  author={Robicheaux, Peter and Popov, Matvei and Madan, Anish and Robinson, Isaac and Nelson, Joseph and Ramanan, Deva and Peri, Neehar},
  journal={arXiv preprint arXiv:2505.20612},
  year={2025}
}
```

Owner

  • Name: Roboflow
  • Login: roboflow
  • Kind: organization
  • Email: hello@roboflow.com
  • Location: United States of America

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Roboflow 100 VL
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Peter
    family-names: Robicheaux
    email: peter@roboflow.com
    affiliation: Roboflow
  - given-names: Matvei
    family-names: Popov
    email: matvei@roboflow.com
    affiliation: Roboflow
  - given-names: Anish
    family-names: Madan
    email: anishmad@andrew.cmu.edu
    affiliation: Carnegie Mellon University
  - given-names: Isaac
    family-names: Robinson
    email: isaac@roboflow.com
    affiliation: Roboflow
  - given-names: Deva
    family-names: Ramanan
    affiliation: Carnegie Mellon University
  - given-names: Neehar
    family-names: Peri
    email: nperi@andrew.cmu.edu
    affiliation: Carnegie Mellon University
repository-code: 'https://github.com/roboflow/rf100-vl/'
url: 'http://rf100-vl.org/'
abstract: >-
  Vision-language models (VLMs) trained on internet-scale data achieve
  remarkable zero-shot detection performance on common objects like car,
  truck, and pedestrian. However, state-of-the-art models still struggle
  to generalize to out-of-distribution tasks (e.g. material property
  estimation, defect detection, and contextual action recognition) and
  imaging modalities (e.g. X-rays, thermal-spectrum data, and aerial
  images) not typically found in their pre-training. Rather than simply
  re-training VLMs on more visual data (the dominant paradigm for
  few-shot learning), we argue that one should align VLMs to new
  concepts with annotation instructions containing a few visual examples
  and rich textual descriptions. To this end, we introduce Roboflow
  100-VL, a large-scale collection of 100 multi-modal datasets with
  diverse concepts not commonly found in VLM pre-training. Notably,
  state-of-the-art models like GroundingDINO and Qwen2.5-VL achieve less
  than 1% AP zero-shot accuracy, demonstrating the need for few-shot
  concept alignment. Our code and dataset are available on GitHub and
  Roboflow.
keywords:
  - few shot object detection
  - VLM
license: Apache-2.0

GitHub Events

Total
  • Watch event: 61
  • Delete event: 3
  • Issue comment event: 2
  • Push event: 21
  • Pull request review event: 4
  • Pull request event: 20
  • Fork event: 4
  • Create event: 4
Last Year
  • Watch event: 61
  • Delete event: 3
  • Issue comment event: 2
  • Push event: 21
  • Pull request review event: 4
  • Pull request event: 20
  • Fork event: 4
  • Create event: 4

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 55
  • Total Committers: 4
  • Avg Commits per committer: 13.75
  • Development Distribution Score (DDS): 0.545
Past Year
  • Commits: 55
  • Committers: 4
  • Avg Commits per committer: 13.75
  • Development Distribution Score (DDS): 0.545
Top Committers
  • James (j****g@j****g): 25 commits
  • Peter Robicheaux (p****r@r****m): 13 commits
  • Neehar Peri (n****i): 9 commits
  • Brad Dwyer (b****d@r****m): 8 commits
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 1
  • Total pull requests: 12
  • Average time to close issues: N/A
  • Average time to close pull requests: about 17 hours
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.17
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 12
  • Average time to close issues: N/A
  • Average time to close pull requests: about 17 hours
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.17
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • NielsRogge (1)
Pull Request Authors
  • neeharperi (14)
  • yeldarby (5)
  • probicheaux (4)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • roboflow *
setup.py pypi