git

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"

https://github.com/haiyang-w/git

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary

Keywords

foundation-models perception transformer unified vision-and-language vision-transformer
Last synced: 6 months ago

Repository

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"

Basic Info
Statistics
  • Stars: 345
  • Watchers: 6
  • Forks: 15
  • Open Issues: 5
  • Releases: 0
Topics
foundation-models perception transformer unified vision-and-language vision-transformer
Created almost 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

The first GPT-style general vision model that unifies various vision tasks with only a vanilla ViT. No negative transfer.

[![arXiv](https://img.shields.io/badge/Arxiv-2403.09394-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.09394) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Haiyang-W/GiT/blob/main/LICENSE) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FHaiyang-W%2FGiT%2Ftree%2Fmain&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com) [![GitHub issues](https://img.shields.io/github/issues/Haiyang-W/GiT?color=critical&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aopen+is%3Aissue) [![GitHub closed issues](https://img.shields.io/github/issues-closed/Haiyang-W/GiT?color=success&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aissue+is%3Aclosed) [![Twitter](https://img.shields.io/badge/Twitter-🔥%2036k%20views-b31b1b.svg?style=social&logo=twitter)](https://twitter.com/_akhaliq/status/1768484390873477480)

This repo is the official implementation of the ECCV2024 Oral paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$ - Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )

📣 News

  • [24-8-12] 🤗 Our GiT was accepted by ECCV2024 as an Oral presentation.
  • [24-7-01] 🤗 Our GiT was accepted by ECCV2024.
  • [24-3-15] 🚀 Training and inference code is released.
  • [24-3-15] 👀 GiT is released on arXiv.

💫 What we want to do

The Model Architectures across various AI domains are converging towards Multi-Layer Plain Transformers.

  • Language Modeling (GPT)
  • 2D Image Modeling (ViT)
  • 3D Point Cloud Modeling (DSVT)
  • 2D Image and 3D Point Cloud Joint Modeling (UniTR)
  • Graph Modeling (Graphormer)
  • $\cdot \cdot \cdot$

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, which inspires us to reduce the human-designed aspects of architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities, such as point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as a cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
  • 🚀 Covers all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieves multi-task ability through a unified language interface: similar to LLMs, GiT observes a task synergy effect in multi-task training. Tasks mutually enhance each other, leading to significant improvements over isolated training, with no negative transfer.
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
  • 👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style of current LLM frameworks.
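To make the unified language interface concrete, below is a schematic sketch of the core idea: image patch tokens and text/task tokens share one plain transformer, and every task is emitted as a token sequence through a language-style head. This is a toy illustration with assumed names and sizes (`ToyGeneralistViT`, `vocab_size`, a convolutional patch embedding), not the repository's actual model code; causal masking and task prompting are omitted for brevity.

```python
# Schematic sketch (NOT the repository's implementation): one plain transformer
# consumes image patch tokens plus text/task tokens and predicts a token for
# every text position, so boxes, masks, and captions can all be emitted as text.
import torch
import torch.nn as nn

class ToyGeneralistViT(nn.Module):
    def __init__(self, vocab_size=32000, dim=256, depth=6, heads=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)  # shared language-style output head

    def forward(self, image, text_tokens):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, C)
        words = self.token_embed(text_tokens)                         # (B, T, C)
        seq = torch.cat([patches, words], dim=1)                      # one joint sequence
        out = self.encoder(seq)
        return self.head(out[:, patches.size(1):])  # logits at the text positions

logits = ToyGeneralistViT()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```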

Overview

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |

Task Synergy in Multi-Tasking Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>single-task</sub> | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |

Zero-shot benchmark

| Model | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B<sub>universal</sub> | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L<sub>universal</sub> | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H<sub>universal</sub> | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |

Few-shot benchmark

| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|---------|---------|---------|--------|--------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B<sub>universal</sub> | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L<sub>universal</sub> | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H<sub>universal</sub> | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |

🛠️ Quick Start

Installation

```shell
conda create -n GiT python=3.8
conda activate GiT

# We only tested with torch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# if you face ChildFailedError, please update yapf
pip install yapf==0.40.1
```

  • Please download the pretrained text embeddings from [huggingface](https://huggingface.co/kanashi6/GiT/tree/main) and organize the downloaded files as follows:

```
GiT
|──bert_embed.pt
|——bert_embed_large.pt
|——bert_embed_huge.pt
```

  • (Optional) Install Java manually for image caption evaluation. Without Java, you can train image captioning normally, but caption evaluation will fail.
  • (Optional) Install the lvis api for the LVIS dataset:

```shell
# current path is ./GiT
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git
```
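After installation, a quick sanity check can confirm that the pinned versions are importable and that CUDA is visible. The snippet below is a convenience check, not part of the repository:

```python
# Sanity-check the pinned dependencies (convenience snippet, not part of the repo).
import torch
import torchvision
import mmengine
import mmcv
import transformers

print("torch        :", torch.__version__)         # expected 1.9.1+cu111
print("torchvision  :", torchvision.__version__)   # expected 0.10.1+cu111
print("mmengine     :", mmengine.__version__)      # expected 0.8.3
print("mmcv         :", mmcv.__version__)          # expected 2.0.1
print("transformers :", transformers.__version__)  # expected 4.31.0
print("CUDA available:", torch.cuda.is_available())
```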

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark contains coco2017 for object detection and instance segmentation, ade20k for semantic segmentation, coco caption for image captioning, and the refcoco series for visual grounding.

```
GiT
|──data
|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p
```
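Before launching multi-task training, it can help to verify that the layout above is in place. The check below is an optional helper, not provided by the repository; the paths simply mirror the tree and assume you run it from the repository root:

```python
# Optional helper: verify the multi-task dataset layout shown above.
# Run from the GiT repository root; adjust paths if your data lives elsewhere.
from pathlib import Path

expected = [
    "data/ade/ADEChallengeData2016/images/training",
    "data/ade/ADEChallengeData2016/annotations/training",
    "data/coco/annotations",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco_2014/annotations/coco_karpathy_train.json",
    "data/coco_2014/train2014",
    "data/coco_2014/val2014",
    "data/coco_2014/refcoco",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print("  ", p)
else:
    print("All expected dataset paths are present.")
```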

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.


🚨 We only list part of the commands (GiT-B) below. For more detailed commands, please refer to here.

Training

Single Task

Detection

```shell
bash tools/dist_train.sh configs/GiT/single_detection_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Multi Task

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Universal Training

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/universal_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Testing

Single Task

Detection

```shell
bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Multi Task

GiT-B

```shell
bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Zero-shot and few-shot

Please download the universal pretraining weights from huggingface and organize the files as follows:

```
GiT
|──universal_base.pth
|——universal_large.pth
|——universal_huge.pth
```
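To confirm a download is intact before testing, you can peek at the checkpoint structure, for example as below. This is an illustrative snippet, not a repository script; MMEngine-style checkpoints are typically a dict with `state_dict` and `meta` keys, but the exact layout may differ:

```python
# Illustrative check of a downloaded checkpoint (key names may differ).
import torch

ckpt = torch.load("universal_base.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state = ckpt.get("state_dict", ckpt)
else:
    state = ckpt
n_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"{len(state)} entries, ~{n_params / 1e6:.0f}M parameters")
```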

Zero-shot

```shell
bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Few-shot

```shell
bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Customize Dataset

If you want to use GiT on your own dataset, please refer here for more details.

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the resolution as shown here, where we lower the detection resolution to 672. This setting requires ~20 hours of training and reaches ~41.5 mAP.
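If you want to see which resolution-related settings a config actually defines before editing it, MMEngine can load and dump the resolved config. The snippet below only uses the generic `mmengine.Config` API and assumes the environment above is installed and you run it from the repository root; the field names inside the config are defined by this repo, so search the dumped text for the values you want to change:

```python
# Dump a resolved GiT config to locate image-size / scale settings (generic MMEngine API).
from mmengine.config import Config

cfg = Config.fromfile("configs/GiT/single_detection_base.py")
print(cfg.pretty_text)  # search this output for the resolution fields you want to lower
```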

👀 Todo

  • [x] Release the arXiv version.
  • [x] SOTA performance among generalist models on the multi-tasking benchmark.
  • [x] SOTA performance among generalist models on zero-shot and few-shot benchmarks.
  • [x] Clean up and release the inference code.
  • [x] Clean up and release the training code.
  • [ ] Engineering Optimization (faster).
  • [ ] Joint Training including Language (stronger).
  • [ ] Code refactoring (the code is still a little messy, sorry for that).

👍 Acknowledgement

  • MMDetection: the codebase we built upon. Thanks for providing such a convenient framework.
  • BLIP: we extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their efforts in open-sourcing and cleaning the dataset.

📘 Citation

Please consider citing our work as follows if it is helpful.

```
@inproceedings{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle={ECCV},
  year={2024}
}
```

✨ Star History

Star History Chart

Owner

  • Name: Haiyang Wang
  • Login: Haiyang-W
  • Kind: user
  • Location: Beijing
  • Company: PKU

Computer Vision & 3D Vision

Citation (CITATION.cff)

@article{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  journal={arXiv preprint arXiv:2403.09394},
  year={2024}
}

GitHub Events

Total
  • Issues event: 3
  • Watch event: 51
  • Issue comment event: 3
  • Push event: 1
  • Fork event: 6
Last Year
  • Issues event: 3
  • Watch event: 51
  • Issue comment event: 3
  • Push event: 1
  • Fork event: 6

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • NareshGuru77 (4)
  • EugeniaLevy (1)
  • zsh513 (1)
  • casperliuliuliu (1)
  • dream-in-night (1)
  • shiyongde (1)
  • dldldlfma (1)
  • kaigelee (1)
  • math-yyj (1)
  • Kishaan (1)
  • peiwang062 (1)
Pull Request Authors
  • ws6125 (1)
  • eltociear (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.circleci/docker/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/serve/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/serve_cn/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
requirements/albu.txt pypi
  • albumentations >=0.3.2
requirements/build.txt pypi
  • cython *
  • numpy *
requirements/docs.txt pypi
  • docutils ==0.16.0
  • myst-parser *
  • sphinx ==4.0.2
  • sphinx-copybutton *
  • sphinx_markdown_tables *
  • sphinx_rtd_theme ==0.5.2
requirements/mminstall.txt pypi
  • mmcv >=2.0.0rc4,<2.1.0
  • mmengine >=0.7.1,<1.0.0
requirements/optional.txt pypi
  • cityscapesscripts *
  • imagecorruptions *
  • nibabel *
  • scikit-learn *
requirements/readthedocs.txt pypi
  • mmcv >=2.0.0rc4,<2.1.0
  • mmengine >=0.7.1,<1.0.0
  • scipy *
  • torch *
  • torchvision *
requirements/runtime.txt pypi
  • matplotlib *
  • numpy *
  • packaging *
  • prettytable *
  • pycocoevalcap *
  • pycocotools *
  • scipy *
  • shapely *
  • six *
  • terminaltables *
  • transformers *
requirements/tests.txt pypi
  • asynctest * test
  • cityscapesscripts * test
  • codecov * test
  • flake8 * test
  • imagecorruptions * test
  • instaboostfast * test
  • interrogate * test
  • isort ==4.3.21 test
  • kwarray * test
  • memory_profiler * test
  • onnx ==1.7.0 test
  • onnxruntime >=1.8.0 test
  • parameterized * test
  • protobuf <=3.20.1 test
  • psutil * test
  • pytest * test
  • ubelt * test
  • xdoctest >=0.10.0 test
  • yapf * test
requirements.txt pypi
setup.py pypi