git

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"

https://github.com/haiyang-w/git

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.4%) to scientific vocabulary

Keywords

foundation-models perception transformer unified vision-and-language vision-transformer
Last synced: 6 months ago

Repository

[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"

Basic Info
Statistics
  • Stars: 345
  • Watchers: 6
  • Forks: 15
  • Open Issues: 5
  • Releases: 0
Topics
foundation-models perception transformer unified vision-and-language vision-transformer
Created almost 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

The first GPT-style general vision model that unifies various vision tasks with only a vanilla ViT. No negative transfer.

[![arXiv](https://img.shields.io/badge/Arxiv-2403.09394-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2403.09394) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/Haiyang-W/GiT/blob/main/LICENSE) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FHaiyang-W%2FGiT%2Ftree%2Fmain&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com) [![GitHub issues](https://img.shields.io/github/issues/Haiyang-W/GiT?color=critical&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aopen+is%3Aissue) [![GitHub closed issues](https://img.shields.io/github/issues-closed/Haiyang-W/GiT?color=success&label=Issues)](https://github.com/Haiyang-W/GiT/issues?q=is%3Aissue+is%3Aclosed) [![Twitter](https://img.shields.io/badge/Twitter-🔥%2036k%20views-b31b1b.svg?style=social&logo=twitter)](https://twitter.com/_akhaliq/status/1768484390873477480)

This repo is the official implementation of the ECCV2024 Oral paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$ - Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )

📣 News

  • [24-8-12] 🤗 Our GiT was accepted by ECCV2024 as an Oral presentation.
  • [24-7-01] 🤗 Our GiT was accepted by ECCV2024.
  • [24-3-15] 🚀 Training and inference code is released.
  • [24-3-15] 👀 GiT is released on arXiv.

💫 What we want to do

The Model Architectures across various AI domains are converging towards Multi-Layer Plain Transformers.

  • Language Modeling (GPT)
  • 2D Image Modeling (ViT)
  • 3D Point Cloud Modeling (DSVT)
  • 2D Image and 3D Point Cloud Joint Modeling (UniTR)
  • Graph Modeling (Graphormer)
  • $\cdot \cdot \cdot$

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, which inspires us to reduce the human-designed aspects of architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities, such as point clouds and graphs.

🤔 What we achieve

Building a universal computation model across all tasks stands as a cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
  • 🚀 Covers all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieves multi-task ability through a unified language interface: similar to LLMs, GiT observes a task synergy effect in multi-task training. Tasks mutually enhance each other, leading to significant improvements over isolated training, with no negative transfer.
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
  • 👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style of current LLM frameworks.
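To make the unified language interface concrete, below is a schematic sketch of the core idea: image patch tokens and text/task tokens share one plain transformer, and every task is emitted as a token sequence through a language-style head. This is a toy illustration with assumed names and sizes (`ToyGeneralistViT`, `vocab_size`, a convolutional patch embedding), not the repository's actual model code; causal masking and task prompting are omitted for brevity.

```python
# Schematic sketch (NOT the repository's implementation): one plain transformer
# consumes image patch tokens plus text/task tokens and predicts a token for
# every text position, so boxes, masks, and captions can all be emitted as text.
import torch
import torch.nn as nn

class ToyGeneralistViT(nn.Module):
    def __init__(self, vocab_size=32000, dim=256, depth=6, heads=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)  # shared language-style output head

    def forward(self, image, text_tokens):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, C)
        words = self.token_embed(text_tokens)                         # (B, T, C)
        seq = torch.cat([patches, words], dim=1)                      # one joint sequence
        out = self.encoder(seq)
        return self.head(out[:, patches.size(1):])  # logits at the text positions

logits = ToyGeneralistViT()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```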

Overview

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |

Task Synergy in Multi-Tasking Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>single-task</sub> | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |

Zero-shot benchmark

| Model | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B<sub>universal</sub> | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L<sub>universal</sub> | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H<sub>universal</sub> | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |

Few-shot benchmark

| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|---------|---------|---------|--------|--------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B<sub>universal</sub> | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L<sub>universal</sub> | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H<sub>universal</sub> | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |

🛠️ Quick Start

Installation

```shell
conda create -n GiT python=3.8
conda activate GiT

# We only tested with torch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# if you face ChildFailedError, please update yapf
pip install yapf==0.40.1
```

  • Please download the pretrained text embeddings from [huggingface](https://huggingface.co/kanashi6/GiT/tree/main) and organize the downloaded files as follows:

```
GiT
|──bert_embed.pt
|——bert_embed_large.pt
|——bert_embed_huge.pt
```

  • (Optional) Install Java manually for image caption evaluation. Without Java, you can train image captioning normally, but caption evaluation will fail.
  • (Optional) Install the lvis api for the LVIS dataset:

```shell
# current path is ./GiT
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git
```
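After installation, a quick sanity check can confirm that the pinned versions are importable and that CUDA is visible. The snippet below is a convenience check, not part of the repository:

```python
# Sanity-check the pinned dependencies (convenience snippet, not part of the repo).
import torch
import torchvision
import mmengine
import mmcv
import transformers

print("torch        :", torch.__version__)         # expected 1.9.1+cu111
print("torchvision  :", torchvision.__version__)   # expected 0.10.1+cu111
print("mmengine     :", mmengine.__version__)      # expected 0.8.3
print("mmcv         :", mmcv.__version__)          # expected 2.0.1
print("transformers :", transformers.__version__)  # expected 4.31.0
print("CUDA available:", torch.cuda.is_available())
```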

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark contains coco2017 for object detection and instance segmentation, ade20k for semantic segmentation, coco caption for image captioning, and the refcoco series for visual grounding.

```
GiT
|──data
|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p
```
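Before launching multi-task training, it can help to verify that the layout above is in place. The check below is an optional helper, not provided by the repository; the paths simply mirror the tree and assume you run it from the repository root:

```python
# Optional helper: verify the multi-task dataset layout shown above.
# Run from the GiT repository root; adjust paths if your data lives elsewhere.
from pathlib import Path

expected = [
    "data/ade/ADEChallengeData2016/images/training",
    "data/ade/ADEChallengeData2016/annotations/training",
    "data/coco/annotations",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco_2014/annotations/coco_karpathy_train.json",
    "data/coco_2014/train2014",
    "data/coco_2014/val2014",
    "data/coco_2014/refcoco",
]

missing = [p for p in expected if not Path(p).exists()]
if missing:
    print("Missing paths:")
    for p in missing:
        print("  ", p)
else:
    print("All expected dataset paths are present.")
```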

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.


🚨 We only list part of the commands (GiT-B) below. For more detailed commands, please refer to here.

Training

Single Task

Detection

```shell
bash tools/dist_train.sh configs/GiT/single_detection_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Multi Task

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Universal Training

GiT-B

```shell
bash tools/dist_train.sh configs/GiT/universal_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Testing

Single Task

Detection

```shell
bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Multi Task

GiT-B

```shell
bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Zero-shot and few-shot

Please download the universal pretraining weights from huggingface and organize the files as follows:

```
GiT
|──universal_base.pth
|——universal_large.pth
|——universal_huge.pth
```
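To confirm a download is intact before testing, you can peek at the checkpoint structure, for example as below. This is an illustrative snippet, not a repository script; MMEngine-style checkpoints are typically a dict with `state_dict` and `meta` keys, but the exact layout may differ:

```python
# Illustrative check of a downloaded checkpoint (key names may differ).
import torch

ckpt = torch.load("universal_base.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state = ckpt.get("state_dict", ckpt)
else:
    state = ckpt
n_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"{len(state)} entries, ~{n_params / 1e6:.0f}M parameters")
```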

Zero-shot

```shell
bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```

Few-shot

```shell
bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}
```

Customize Dataset

If you want to use GiT on your own dataset, please refer here for more details.

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the resolution as shown here, where we lower the detection resolution to 672. This setting requires ~20 hours of training and reaches ~41.5 mAP.
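If you want to see which resolution-related settings a config actually defines before editing it, MMEngine can load and dump the resolved config. The snippet below only uses the generic `mmengine.Config` API and assumes the environment above is installed and you run it from the repository root; the field names inside the config are defined by this repo, so search the dumped text for the values you want to change:

```python
# Dump a resolved GiT config to locate image-size / scale settings (generic MMEngine API).
from mmengine.config import Config

cfg = Config.fromfile("configs/GiT/single_detection_base.py")
print(cfg.pretty_text)  # search this output for the resolution fields you want to lower
```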

👀 Todo

  • [x] Release the arXiv version.
  • [x] SOTA performance among generalist models on the multi-tasking benchmark.
  • [x] SOTA performance among generalist models on zero-shot and few-shot benchmarks.
  • [x] Clean up and release the inference code.
  • [x] Clean up and release the training code.
  • [ ] Engineering Optimization (faster).
  • [ ] Joint Training including Language (stronger).
  • [ ] Code refactoring (the code is still a little messy, sorry for that).

👍 Acknowledgement

  • MMDetection: the codebase we built upon. Thanks for providing such a convenient framework.
  • BLIP: we extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their efforts in open-sourcing and cleaning the dataset.

📘 Citation

Please consider citing our work as follows if it is helpful.

```
@inproceedings{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle={ECCV},
  year={2024}
}
```

✨ Star History

Star History Chart

Owner

  • Name: Haiyang Wang
  • Login: Haiyang-W
  • Kind: user
  • Location: Beijing
  • Company: PKU

Computer Vision & 3D Vision

Citation (CITATION.cff)

@article{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  journal={arXiv preprint arXiv:2403.09394},
  year={2024}
}

GitHub Events

Total
  • Issues event: 3
  • Watch event: 51
  • Issue comment event: 3
  • Push event: 1
  • Fork event: 6
Last Year
  • Issues event: 3
  • Watch event: 51
  • Issue comment event: 3
  • Push event: 1
  • Fork event: 6

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • NareshGuru77 (4)
  • EugeniaLevy (1)
  • zsh513 (1)
  • casperliuliuliu (1)
  • dream-in-night (1)
  • shiyongde (1)
  • dldldlfma (1)
  • kaigelee (1)
  • math-yyj (1)
  • Kishaan (1)
  • peiwang062 (1)
Pull Request Authors
  • ws6125 (1)
  • eltociear (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.circleci/docker/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/serve/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
docker/serve_cn/Dockerfile docker
  • pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
requirements/albu.txt pypi
  • albumentations >=0.3.2
requirements/build.txt pypi
  • cython *
  • numpy *
requirements/docs.txt pypi
  • docutils ==0.16.0
  • myst-parser *
  • sphinx ==4.0.2
  • sphinx-copybutton *
  • sphinx_markdown_tables *
  • sphinx_rtd_theme ==0.5.2
requirements/mminstall.txt pypi
  • mmcv >=2.0.0rc4,<2.1.0
  • mmengine >=0.7.1,<1.0.0
requirements/optional.txt pypi
  • cityscapesscripts *
  • imagecorruptions *
  • nibabel *
  • scikit-learn *
requirements/readthedocs.txt pypi
  • mmcv >=2.0.0rc4,<2.1.0
  • mmengine >=0.7.1,<1.0.0
  • scipy *
  • torch *
  • torchvision *
requirements/runtime.txt pypi
  • matplotlib *
  • numpy *
  • packaging *
  • prettytable *
  • pycocoevalcap *
  • pycocotools *
  • scipy *
  • shapely *
  • six *
  • terminaltables *
  • transformers *
requirements/tests.txt pypi
  • asynctest * test
  • cityscapesscripts * test
  • codecov * test
  • flake8 * test
  • imagecorruptions * test
  • instaboostfast * test
  • interrogate * test
  • isort ==4.3.21 test
  • kwarray * test
  • memory_profiler * test
  • onnx ==1.7.0 test
  • onnxruntime >=1.8.0 test
  • parameterized * test
  • protobuf <=3.20.1 test
  • psutil * test
  • pytest * test
  • ubelt * test
  • xdoctest >=0.10.0 test
  • yapf * test
requirements.txt pypi
setup.py pypi