GiT
[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org, scholar.google
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.4%) to scientific vocabulary
Basic Info
- Host: GitHub
- Owner: Haiyang-W
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2403.09394
- Size: 12.5 MB
Statistics
- Stars: 345
- Watchers: 6
- Forks: 15
- Open Issues: 5
- Releases: 0
Metadata Files
README.md
GiT is the first GPT-style general vision model: it unifies various vision tasks with only a vanilla ViT and shows no negative transfer.
[arXiv](https://arxiv.org/abs/2403.09394)
[License](https://github.com/Haiyang-W/GiT/blob/main/LICENSE)
[Open Issues](https://github.com/Haiyang-W/GiT/issues?q=is%3Aopen+is%3Aissue)
[Closed Issues](https://github.com/Haiyang-W/GiT/issues?q=is%3Aissue+is%3Aclosed)
[Twitter](https://twitter.com/_akhaliq/status/1768484390873477480)
This repo is the official implementation of the ECCV2024 Oral paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang*, Hao Tang*, Li Jiang $^\dagger$, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang $^\dagger$
Primary contact: Haiyang Wang ( wanghaiyang6@stu.pku.edu.cn ), Hao Tang ( tanghao@stu.pku.edu.cn )
📣 News
- [24-8-12] 🤗 GiT was accepted by ECCV2024 as an Oral presentation.
- [24-7-01] 🤗 GiT was accepted by ECCV2024.
- [24-3-15] 🚀 Training and inference code is released.
- [24-3-15] 👀 GiT is released on arXiv.
💫 What we want to do
Model architectures across various AI domains are converging toward multi-layer plain Transformers.
- Language Modeling (GPT)
- 2D Image Modeling (ViT)
- 3D Point Cloud Modeling (DSVT)
- 2D Image and 3D Point Cloud Joint Modeling (UniTR)
- Graph Modeling (Graphormer)
- $\cdot \cdot \cdot$

Reducing Human Bias in Model Architecture Design
We aim to unify the model architectures of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, which inspires us to reduce the human-designed aspects of architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities, such as point clouds and graphs.
🤔 What we achieve
Building a universal computation model across all tasks stands as a cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:
- 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
- 🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
- 🤗 Achieving multi-task ability through a unified language interface: Similar to LLMs, GiT observes a task-synergy effect in multi-task training. It fosters mutual enhancement across tasks, leading to significant improvements over isolated training, with no negative-transfer phenomenon.
- 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
- 👍 Simple one-stage training strategy: GiT uses a very simple one-stage training strategy, fully embracing the training style of current LLM frameworks.
Overview
- 💫 What we want to do
- 🤔 What we achieve
- 🚀 Main Results
- 🛠️ Quick Start
- 👀 Todo
- 👍 Acknowledgments
- 📘 Citation
🚀 Main Results
Single-Task Benchmark
| Model | Params | Metric | Performance | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>detection</sub> | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B<sub>insseg</sub> | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B<sub>semseg</sub> | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B<sub>caption</sub> | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B<sub>grounding</sub> | 131M | Acc@0.5 | 83.3 | ckpt | log | config |
Multi-Tasking Benchmark
| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L<sub>multi-task</sub> | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H<sub>multi-task</sub> | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |
Task Synergy in Multi-Tasking Training
| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|---------|---------|---------|--------|--------|---------|---------|
| GiT-B<sub>single-task</sub> | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B<sub>multi-task</sub> | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |
Zero-shot benchmark
| Model | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|---------|---------|---------|--------|--------|---------|---------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B<sub>universal</sub> | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L<sub>universal</sub> | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H<sub>universal</sub> | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |
Few-shot benchmark
| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|---------|---------|---------|--------|--------|---------|---------|---------|
| GiT-B<sub>multi-task</sub> | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B<sub>universal</sub> | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L<sub>universal</sub> | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H<sub>universal</sub> | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |
🛠️ Quick Start
Installation
```shell
conda create -n GiT python=3.8
conda activate GiT
# We only tested with PyTorch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"
git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt
# If you face a ChildFailedError, please update yapf:
pip install yapf==0.40.1
```
- Please download the pretrained text embeddings from [huggingface](https://huggingface.co/kanashi6/GiT/tree/main) and organize the downloaded files as follows (a download sketch appears at the end of this section):
```
GiT
|──bert_embed.pt
|──bert_embed_large.pt
|──bert_embed_huge.pt
```
- (Optional) Install Java manually for image caption evaluation. Without Java, you can train image captioning models normally, but caption evaluation will fail.
- (Optional) Install the LVIS API for the LVIS dataset:
```shell
# current path is ./GiT
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git
```
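Once the environment is set up, a quick sanity check plus an embedding download might look like the sketch below. This is not part of the official instructions: it assumes the `huggingface_hub` CLI and that the embedding filenames match the layout shown above; check the huggingface page if they differ.

```shell
# Minimal sketch, assuming the environment above installed cleanly.
# Verify that the pinned packages import correctly.
python -c "import torch, mmcv, mmengine, transformers; print(torch.__version__, mmcv.__version__, mmengine.__version__)"

# Fetch the text embeddings from the huggingface repo linked above.
# Filenames are assumed from the layout shown; adjust if the repo lists others.
pip install -U "huggingface_hub[cli]"
huggingface-cli download kanashi6/GiT bert_embed.pt bert_embed_large.pt bert_embed_huge.pt --local-dir .
```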
Dataset Preparation
Multi-Tasking Dataset
The multi-tasking benchmark contains COCO 2017 for object detection and instance segmentation, ADE20K for semantic segmentation, COCO Caption for image captioning, and the RefCOCO series for visual grounding. Organize the data as follows (a download sketch follows the layout):
```
GiT
|──data
| |──ade
| | |──ADEChallengeData2016
| | | |──annotations
| | | | |──training & validation
| | | |──images
| | | | |──training & validation
| | | |──objectInfo150.txt
| | | |──sceneCategories.txt
| |──coco
| | |──annotations
| | | |──*.json
| | |──train2017
| | | |──*.jpg
| | |──val2017
| | | |──*.jpg
| |──coco_2014
| | |──annotations
| | | |──*.json
| | | |──coco_karpathy_test.json
| | | |──coco_karpathy_train.json
| | | |──coco_karpathy_val_gt.json
| | | |──coco_karpathy_val.json
| | |──train2014
| | | |──*.jpg
| | |──val2014
| | | |──*.jpg
| | |──refcoco
| | | |──*.p
```
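As one possible way to fetch the public pieces of this layout, the sketch below uses the official ADE20K and COCO download mirrors. The Karpathy-split caption JSONs and the refcoco pickles are distributed separately and are left out here; treat this as a starting point, not the repo's prescribed procedure.

```shell
# Minimal sketch matching the layout above; run from the GiT repo root.
mkdir -p data && cd data

# ADE20K (semantic segmentation)
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip -d ade

# COCO 2017 (object detection / instance segmentation)
mkdir -p coco && cd coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip train2017.zip && unzip val2017.zip && unzip annotations_trainval2017.zip
cd ..

# COCO 2014 images for captioning (Karpathy-split JSONs fetched separately)
mkdir -p coco_2014 && cd coco_2014
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip train2014.zip && unzip val2014.zip
```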
Universal Dataset
We use 27 datasets in universal training. For more details about dataset preparation, please refer to here.
🚨 We only list part of the commands (GiT-B) below. For more detailed commands, please refer to here.
Training
Single Task
Detection
```shell
bash tools/dist_train.sh configs/GiT/single_detection_base.py ${GPU_NUM} --work-dir ${work_dir}
```
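For instance, a concrete invocation on an 8-GPU node could look like the following; the GPU count and work-directory name are placeholder values, not ones the repo prescribes:

```shell
# Example values only: 8 GPUs, logs and checkpoints under work_dirs/.
bash tools/dist_train.sh configs/GiT/single_detection_base.py 8 --work-dir work_dirs/single_detection_base
```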
Multi Task
GiT-B
```shell
bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py ${GPU_NUM} --work-dir ${work_dir}
```
Universal Training
GiT-B
```shell
bash tools/dist_train.sh configs/GiT/universal_base.py ${GPU_NUM} --work-dir ${work_dir}
```
Testing
Single Task
Detection
```shell
bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```
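Analogously to training, a concrete test run might look like this; the checkpoint path is a placeholder for whichever ckpt file you downloaded or trained:

```shell
# Example values only: evaluate a downloaded or trained checkpoint on 8 GPUs.
bash tools/dist_test.sh configs/GiT/single_detection_base.py work_dirs/single_detection_base/latest.pth 8 --work-dir work_dirs/single_detection_base
```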
Multi Task
GiT-B
```shell
bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```
Zero-shot and few-shot
Please download the universal pretrained weights from huggingface and organize the files as follows (a download sketch follows the layout):
```
GiT
|──universal_base.pth
|──universal_large.pth
|──universal_huge.pth
```
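One way to fetch these, assuming they are hosted in the same huggingface repo as the text embeddings (kanashi6/GiT) and keep the filenames above, is:

```shell
# Minimal sketch; repo id and filenames are assumptions, check the huggingface page.
pip install -U "huggingface_hub[cli]"
huggingface-cli download kanashi6/GiT universal_base.pth universal_large.pth universal_huge.pth --local-dir .
```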
Zero-shot
```shell
bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}
```
Few-shot
```shell
bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}
```
Customize Dataset
If you want to use GiT on your own dataset, please refer here for more details.
🚀 Lightweight Version
If your GPU memory is insufficient, you can reduce the resolution as shown here, where we lower the detection resolution to 672. This requires ~20 hours of training and reaches ~41.5 mAP.
👀 Todo
- [x] Release the arXiv version.
- [x] SOTA performance of generalist model on multi-tasking benchmark.
- [x] SOTA performance of generalist model on zero- and few-shot benchmark.
- [x] Clean up and release the inference code.
- [x] Clean up and release the training code.
- [ ] Engineering Optimization (faster).
- [ ] Joint Training including Language (stronger).
- [ ] Code refactoring (the current code is still a little dirty; sorry about that).
👍 Acknowledgement
- MMDetection: the codebase we built upon. Thanks for providing such a convenient framework.
- BLIP: we extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their efforts in open-sourcing and cleaning the dataset.
📘 Citation
Please consider citing our work as follows if it is helpful.
```
@inproceedings{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  booktitle={ECCV},
  year={2024}
}
```
✨ Star History
Owner
- Name: Haiyang Wang
- Login: Haiyang-W
- Kind: user
- Location: Beijing
- Company: PKU
- Website: https://scholar.google.com/citations?user=R3Av3IkAAAAJ&hl=en&oi=ao
- Repositories: 6
- Profile: https://github.com/Haiyang-W
Computer Vision & 3D Vision
Citation (CITATION.cff)
```
@article{wang2024git,
  title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
  author={Wang, Haiyang and Tang, Hao and Jiang, Li and Shi, Shaoshuai and Naeem, Muhammad Ferjad and Li, Hongsheng and Schiele, Bernt and Wang, Liwei},
  journal={arXiv preprint arXiv:2403.09394},
  year={2024}
}
```
GitHub Events
Total
- Issues event: 3
- Watch event: 51
- Issue comment event: 3
- Push event: 1
- Fork event: 6
Last Year
- Issues event: 3
- Watch event: 51
- Issue comment event: 3
- Push event: 1
- Fork event: 6
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NareshGuru77 (4)
- EugeniaLevy (1)
- zsh513 (1)
- casperliuliuliu (1)
- dream-in-night (1)
- shiyongde (1)
- dldldlfma (1)
- kaigelee (1)
- math-yyj (1)
- Kishaan (1)
- peiwang062 (1)
Pull Request Authors
- ws6125 (1)
- eltociear (1)
Dependencies
- pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build
- albumentations >=0.3.2
- cython *
- numpy *
- docutils ==0.16.0
- myst-parser *
- sphinx ==4.0.2
- sphinx-copybutton *
- sphinx_markdown_tables *
- sphinx_rtd_theme ==0.5.2
- mmcv >=2.0.0rc4,<2.1.0
- mmengine >=0.7.1,<1.0.0
- cityscapesscripts *
- imagecorruptions *
- nibabel *
- scikit-learn *
- scipy *
- torch *
- torchvision *
- matplotlib *
- numpy *
- packaging *
- prettytable *
- pycocoevalcap *
- pycocotools *
- scipy *
- shapely *
- six *
- terminaltables *
- transformers *
- asynctest * test
- cityscapesscripts * test
- codecov * test
- flake8 * test
- imagecorruptions * test
- instaboostfast * test
- interrogate * test
- isort ==4.3.21 test
- kwarray * test
- memory_profiler * test
- onnx ==1.7.0 test
- onnxruntime >=1.8.0 test
- parameterized * test
- protobuf <=3.20.1 test
- psutil * test
- pytest * test
- ubelt * test
- xdoctest >=0.10.0 test
- yapf * test