yoloe

https://github.com/anthony-nila/yoloe

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: anthony-nila
License: agpl-3.0
Language: Python
Default Branch: main
Size: 37.6 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 0

Created 10 months ago · Last pushed 10 months ago

Metadata Files

Readme Contributing License Citation

YOLOE: Real-Time Seeing Anything

Official PyTorch implementation of YOLOE.

Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.

YOLOE: Real-Time Seeing Anything.
Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding

We introduce YOLOE(ye), a highly efficient, unified, and open object detection and segmentation model that, like the human eye, works under different prompt mechanisms (text prompts, visual inputs, and a prompt-free paradigm), with zero inference and transferring overhead compared with closed-set YOLOs.

Abstract

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with $3\times$ less training cost and $1.4\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 $AP^b$ and 0.4 $AP^m$ gains over closed-set YOLOv8-L with nearly $4\times$ less training time.

Performance

Zero-shot detection evaluation

Fixed AP is reported on LVIS minival set with text (T) / visual (V) prompts.
Training time is for text prompts with detection based on 8 Nvidia RTX4090 GPUs.
FPS is measured on T4 with TensorRT and iPhone 12 with CoreML, respectively.
For training data, OG denotes Objects365v1 and GoldG.
YOLOE can become YOLOs after re-parameterization with zero inference and transferring overhead.

| Model | Size | Prompt | Params | Data | Time | FPS | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 | T / V |
| YOLOE-v8-M | 640 | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 | T / V |
| YOLOE-v8-L | 640 | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 | T / V |
| YOLOE-11-S | 640 | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 | T / V |
| YOLOE-11-M | 640 | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 | T / V |
| YOLOE-11-L | 640 | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 | T / V |

Zero-shot segmentation evaluation

The model is the same as above in Zero-shot detection evaluation.
Standard AP^m is reported on LVIS val set with text (T) / visual (V) prompts.

| Model | Size | Prompt | $AP^m$ | $AP_r^m$ | $AP_c^m$ | $AP_f^m$ |
|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | T / V | 17.7 / 16.8 | 15.5 / 13.5 | 16.3 / 16.7 | 20.3 / 18.2 |
| YOLOE-v8-M | 640 | T / V | 20.8 / 20.3 | 17.2 / 17.0 | 19.2 / 20.1 | 24.2 / 22.0 |
| YOLOE-v8-L | 640 | T / V | 23.5 / 22.0 | 21.9 / 16.5 | 21.6 / 22.1 | 26.4 / 24.3 |
| YOLOE-11-S | 640 | T / V | 17.6 / 17.1 | 16.1 / 14.4 | 15.6 / 16.8 | 20.5 / 18.6 |
| YOLOE-11-M | 640 | T / V | 21.1 / 21.0 | 17.2 / 18.3 | 19.6 / 20.6 | 24.4 / 22.6 |
| YOLOE-11-L | 640 | T / V | 22.6 / 22.5 | 19.3 / 20.5 | 20.9 / 21.7 | 26.0 / 24.1 |

Prompt-free evaluation

The model is the same as above in Zero-shot detection evaluation except the specialized prompt embedding.
Fixed AP is reported on LVIS minival set and FPS is measured on Nvidia T4 GPU with PyTorch.

| Model | Size | Params | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | FPS | Log |
|---|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | PF |
| YOLOE-v8-M | 640 | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | PF |
| YOLOE-v8-L | 640 | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | PF |
| YOLOE-11-S | 640 | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | PF |
| YOLOE-11-M | 640 | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | PF |
| YOLOE-11-L | 640 | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | PF |

Downstream transfer on COCO

During transferring, YOLOE-v8 / YOLOE-11 is exactly the same as YOLOv8 / YOLO11.
For Linear probing, only the last conv in classification head is trainable.
For Full tuning, all parameters are trainable.

| Model | Size | Epochs | $AP^b$ | $AP^b_{50}$ | $AP^b_{75}$ | $AP^m$ | $AP^m_{50}$ | $AP^m_{75}$ | Log |
|---|---|---|---|---|---|---|---|---|---|
| Linear probing | | | | | | | | | |
| YOLOE-v8-S | 640 | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | LP |
| YOLOE-v8-M | 640 | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | LP |
| YOLOE-v8-L | 640 | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | LP |
| YOLOE-11-S | 640 | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | LP |
| YOLOE-11-M | 640 | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | LP |
| YOLOE-11-L | 640 | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | LP |
| Full tuning | | | | | | | | | |
| YOLOE-v8-S | 640 | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | FT |
| YOLOE-v8-M | 640 | 80 | 50.4 | 67.0 | 55.2 | 40.9 | 63.7 | 43.5 | FT |
| YOLOE-v8-L | 640 | 80 | 53.0 | 69.8 | 57.9 | 42.7 | 66.5 | 45.6 | FT |
| YOLOE-11-S | 640 | 160 | 46.2 | 62.9 | 50.0 | 37.6 | 59.3 | 40.1 | FT |
| YOLOE-11-M | 640 | 80 | 51.3 | 68.3 | 56.0 | 41.5 | 64.8 | 44.3 | FT |
| YOLOE-11-L | 640 | 80 | 52.6 | 69.7 | 57.5 | 42.4 | 66.2 | 45.2 | FT |
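To read the COCO transfer table at a glance, the gap between linear probing and full tuning can be computed directly from the reported $AP^b$ values. The snippet below is only illustrative arithmetic over numbers copied from the table; the dictionary layout and the `tuning_gap` name are made up for this sketch.

```python
# AP^b values copied from the COCO transfer table: (linear probing, full tuning).
AP_BOX = {
    "YOLOE-v8-S": (35.6, 45.0),
    "YOLOE-v8-M": (42.2, 50.4),
    "YOLOE-v8-L": (45.4, 53.0),
    "YOLOE-11-S": (37.0, 46.2),
    "YOLOE-11-M": (43.1, 51.3),
    "YOLOE-11-L": (45.1, 52.6),
}


def tuning_gap(model: str) -> float:
    """Return full-tuning AP^b minus linear-probing AP^b, rounded to one decimal."""
    linear_probe, full_tune = AP_BOX[model]
    return round(full_tune - linear_probe, 1)


if __name__ == "__main__":
    for name in AP_BOX:
        print(f"{name}: +{tuning_gap(name)} AP^b from full tuning")
```

Note that the two settings also differ in training length (10 epochs versus 80 or 160), so the gap reflects both the number of trainable parameters and the schedule.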

Installation

You could also quickly try YOLOE for prediction and transferring using colab notebooks.

conda virtual environment is recommended.
```bash
conda create -n yoloe python=3.10 -y
conda activate yoloe

# If you clone this repo, please use this
pip install -r requirements.txt

# Or you can also directly install the repo by this
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/CLIP
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/ml-mobileclip
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/lvis-api
pip install git+https://github.com/THU-MIG/yoloe.git

wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt
```

Demo

If the desired objects are not identified, please set a smaller confidence threshold, e.g., for visual prompts with handcrafted shape or cross-image prompts.
```bash
# Optional for mirror: export HF_ENDPOINT=https://hf-mirror.com
pip install gradio==4.42.0 gradio_image_prompter==0.1.0 fastapi==0.112.2 huggingface-hub==0.26.3 gradio_client==1.3.0
python app.py
# Please visit http://127.0.0.1:7860
```
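The advice about lowering the confidence threshold can be pictured with a toy filter. This is a hedged sketch: the `Detection` type, the scores, and the `filter_by_confidence` helper are hypothetical and are not part of the demo code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float


def filter_by_confidence(detections, threshold):
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up scores, chosen only to show the effect of the threshold.
CANDIDATES = [
    Detection("person", 0.82),
    Detection("dog", 0.31),
    Detection("cat", 0.12),
]

if __name__ == "__main__":
    for threshold in (0.5, 0.25, 0.1):
        kept = filter_by_confidence(CANDIDATES, threshold)
        print(threshold, [d.label for d in kept])
```

A lower threshold recovers weaker matches at the cost of more false positives, which is why it helps mainly for hard prompts such as hand-drawn shapes.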

Prediction

```bash
# Download pretrained models
# Optional for mirror: export HF_ENDPOINT=https://hf-mirror.com
# Please replace the pt file with your desired model
pip install huggingface-hub==0.26.3
huggingface-cli download jameslahm/yoloe yoloe-v8l-seg.pt --local-dir pretrain
```
For yoloe-(v8s/m/l)/(11s/m/l)-seg, models can also be automatically downloaded using `from_pretrained`.
```python
from ultralytics import YOLOE

model = YOLOE.from_pretrained("jameslahm/yoloe-v8l-seg")
```
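The same download can be scripted from Python. The sketch below assumes `huggingface_hub` is installed and that checkpoint files follow the `yoloe-<series><scale>-seg.pt` pattern implied by the names above; `checkpoint_filename` and `download_checkpoint` are hypothetical helper names, not part of this repository.

```python
def checkpoint_filename(series: str, scale: str) -> str:
    """Build a checkpoint file name such as 'yoloe-v8l-seg.pt' (assumed pattern)."""
    if series not in ("v8", "11"):
        raise ValueError(f"unknown series: {series!r}")
    if scale not in ("s", "m", "l"):
        raise ValueError(f"unknown scale: {scale!r}")
    return f"yoloe-{series}{scale}-seg.pt"


def download_checkpoint(series: str = "v8", scale: str = "l", local_dir: str = "pretrain") -> str:
    """Download one checkpoint from the Hugging Face Hub and return its local path."""
    # Imported lazily so the name helper above works without the dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="jameslahm/yoloe",
        filename=checkpoint_filename(series, scale),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Only prints the file name; call download_checkpoint() to fetch it.
    print(checkpoint_filename("v8", "l"))
```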

Text prompt

```bash
python predict_text_prompt.py \
    --source ultralytics/assets/bus.jpg \
    --checkpoint pretrain/yoloe-v8l-seg.pt \
    --names person dog cat \
    --device cuda:0
```
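For readers wiring this command into their own scripts, the flags could be parsed roughly as follows. This is an assumption about the interface based only on the command shown; the actual `predict_text_prompt.py` may define its arguments differently, and the `cpu` default is invented for the sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags in the text-prompt command (illustrative only)."""
    parser = argparse.ArgumentParser(description="Text-prompt prediction (sketch)")
    parser.add_argument("--source", required=True, help="path to the input image")
    parser.add_argument("--checkpoint", required=True, help="path to a .pt checkpoint")
    # nargs="+" collects one or more class names into a list.
    parser.add_argument("--names", nargs="+", required=True, help="class names used as text prompts")
    # The default device is an assumption, not taken from the repository.
    parser.add_argument("--device", default="cpu", help="device string such as cuda:0")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--source", "ultralytics/assets/bus.jpg",
        "--checkpoint", "pretrain/yoloe-v8l-seg.pt",
        "--names", "person", "dog", "cat",
        "--device", "cuda:0",
    ])
    print(args.names, args.device)
```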

Visual prompt

```bash
python predict_visual_prompt.py
```

Prompt free

```bash
python predict_prompt_free.py
```

Transferring

After pretraining, YOLOE-v8 / YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 / YOLO11, with zero overhead for transferring.
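The "zero overhead" claim rests on a general property of linear layers: two consecutive linear maps can be folded into one before deployment. The toy below illustrates that property only. It is not the RepRTA implementation, and all matrices and function names are invented for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Hypothetical auxiliary projection (2x3) applied to a 3-d region feature,
# followed by hypothetical prompt embeddings (2 classes x 2 dims).
PROJECTION = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
PROMPT_EMBEDDINGS = [[0.5, 1.0], [2.0, -1.0]]


def two_step_scores(feature):
    """Training-time view: project the feature, then compare with the embeddings."""
    return matvec(PROMPT_EMBEDDINGS, matvec(PROJECTION, feature))


# Folded once, offline: a single 2x3 classifier matrix.
FUSED = matmul(PROMPT_EMBEDDINGS, PROJECTION)


def fused_scores(feature):
    """Deployment-time view: one linear layer, same outputs."""
    return matvec(FUSED, feature)


if __name__ == "__main__":
    feature = [1.0, 2.0, 3.0]
    print(two_step_scores(feature), fused_scores(feature))
```

Because the fused matrix has the same shape as an ordinary classification layer, the deployed model needs no extra operations compared with a closed-set head.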

Linear probing

Only the last conv, i.e., the prompt embedding, is trainable.
```bash
python train_pe.py
```
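Linear probing amounts to freezing everything except one layer. The sketch below shows the selection logic on a stand-in list of parameter names. The names and the `trainable_prefix` value are hypothetical, and a real run would toggle `requires_grad` on framework tensors rather than build a dictionary.

```python
def select_trainable(param_names, trainable_prefix):
    """Map each parameter name to True if it should receive gradient updates."""
    return {name: name.startswith(trainable_prefix) for name in param_names}


# Hypothetical parameter names; the real model uses different ones.
PARAM_NAMES = [
    "backbone.stem.weight",
    "neck.fuse.weight",
    "head.cls.conv0.weight",
    "head.cls.last_conv.weight",
    "head.cls.last_conv.bias",
]

if __name__ == "__main__":
    flags = select_trainable(PARAM_NAMES, "head.cls.last_conv.")
    for name, trainable in flags.items():
        print(f"{name}: {'train' if trainable else 'frozen'}")
```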

Full tuning

All parameters are trainable, for better performance.
```bash
# For models with s scale, please change the epochs to 160 for longer training
python train_pe_all.py
```

Validation

Data

Please download LVIS following here or lvis.yaml.
We use this minival.txt with background images for evaluation.

```bash
# For evaluation with visual prompt, please obtain the referring data.
python tools/generate_lvis_visual_prompt_data.py
```

Zero-shot evaluation on LVIS

For text prompts, run `python val.py`.
For visual prompts, run `python val_vp.py`.

For Fixed AP, please refer to the comments in `val.py` and `val_vp.py`, and use `tools/eval_fixed_ap.py` for evaluation.

Prompt-free evaluation

```bash
python val_pe_free.py
python tools/eval_open_ended.py --json ../datasets/lvis/annotations/lvis_v1_minival.json --pred runs/detect/val/predictions.json --fixed
```

Downstream transfer on COCO

```bash
python val_coco.py
```

Training

The training includes three stages:
- YOLOE is trained with text prompts for detection and segmentation for 30 epochs.
- Only the visual prompt encoder (SAVPE) is trained with visual prompts for 2 epochs.
- Only the specialized prompt embedding for the prompt-free setting is trained for 1 epoch.
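The three-stage schedule can be written down as plain data, which makes the epoch budget explicit. The `Stage` type and its field names are made up for illustration; only the epoch counts and the trained components come from the description of the stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trained: str
    epochs: int


SCHEDULE = (
    Stage("text prompt", "full model (detection and segmentation)", 30),
    Stage("visual prompt", "SAVPE only", 2),
    Stage("prompt free", "specialized prompt embedding only", 1),
)


def total_epochs(schedule) -> int:
    """Sum the epochs over all stages."""
    return sum(stage.epochs for stage in schedule)


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(f"{stage.name}: {stage.epochs} epochs, trains {stage.trained}")
    print("total:", total_epochs(SCHEDULE))
```

Almost the whole budget goes to the first stage; the later two stages each update a small component on top of the frozen text-prompt model.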

Data

| Images | Raw Annotations | Processed Annotations |
|---|---|---|
| Objects365v1 | objects365_train.json | objects365_train_segm.json |
| GQA | final_mixed_train_no_coco.json | final_mixed_train_no_coco_segm.json |
| Flickr30k | final_flickr_separateGT_train.json | final_flickr_separateGT_train_segm.json |

For annotations, you can directly use our preprocessed ones or use the following script to obtain the processed annotations with segmentation masks.
```bash
# Generate segmentation data
conda create -n sam2 python==3.10.16
conda activate sam2
pip install -r third_party/sam2/requirements.txt
pip install -e third_party/sam2/

python tools/generate_sam_masks.py --img-path ../datasets/Objects365v1/images/train --json-path ../datasets/Objects365v1/annotations/objects365_train.json --batch
python tools/generate_sam_masks.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train.json
python tools/generate_sam_masks.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco.json

# Generate objects365v1 labels
python tools/generate_objects365v1.py
```

Then, please generate the data and embedding cache for training.
```bash
# Generate grounding segmentation cache
python tools/generate_grounding_cache.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json
python tools/generate_grounding_cache.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json

# Generate train label embeddings
python tools/generate_label_embedding.py
python tools/generate_global_neg_cat.py
```
At last, please download MobileCLIP-B(LT) for text encoder.
```bash
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt
```

Text prompt

```bash
# For models with l scale, please change the initialization by referring to the comments in Line 549 in ultralytics/nn/modules/head.py
# If you want to train YOLOE only for detection, you can use `train.py`
python train_seg.py
```

Visual prompt

```bash
# For visual prompt, because only SAVPE is trained, we can adopt the detection pipeline with less training time

# First, obtain the detection model
python tools/convert_segm2det.py
# Then, train the SAVPE module
python train_vp.py
# After training, please use tools/get_vp_segm.py to add the segmentation head
python tools/get_vp_segm.py
```

Prompt free

```bash
# Generate LVIS with single class for evaluation during training
python tools/generate_lvis_sc.py

# Similar to visual prompt, because only the specialized prompt embedding is trained, we can adopt the detection pipeline with less training time
python tools/convert_segm2det.py
python train_pe_free.py
# After training, please use tools/get_pf_free_segm.py to add the segmentation head
python tools/get_pf_free_segm.py
```

Export

After re-parameterization, YOLOE-v8 / YOLOE-11 can be exported into the identical format as YOLOv8 / YOLO11, with zero overhead for inference.
```bash
pip install onnx coremltools onnxslim
python export.py
```

Benchmark

For TensorRT, please refer to benchmark.sh.
For CoreML, please use the benchmark tool from Xcode 14.
For prompt-free setting, please refer to tools/benchmark_pf.py.
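For a rough sanity check outside those tools, throughput can be estimated with a simple timing loop. This generic sketch is not what `benchmark.sh` or `tools/benchmark_pf.py` do; `measure_fps` and the dummy workload are hypothetical, and timing on a GPU would additionally need device synchronization before reading the clock.

```python
import time


def measure_fps(run_once, warmup: int = 3, iterations: int = 20) -> float:
    """Return calls per second of `run_once`, ignoring a few warm-up calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("timer resolution too coarse for this workload")
    return iterations / elapsed


def dummy_inference() -> int:
    """Stand-in for a model forward pass."""
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    print(f"{measure_fps(dummy_inference):.1f} calls per second")
```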

Acknowledgement

The code base is built with ultralytics, YOLO-World, MobileCLIP, lvis-api, CLIP, and GenerateU.

Thanks for the great implementations!

Owner

Name: Anthony Nila Cadenas
Login: anthony-nila
Kind: user

Repositories: 1
Profile: https://github.com/anthony-nila

I am a Systems Engineer with experience as a Fullstack developer, specialized in the design, development and implementation of advanced technological solutions.

Citation (CITATION.cff)

# This CITATION.cff file was generated with https://bit.ly/cffinit

cff-version: 1.2.0
title: Ultralytics YOLO
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
license: AGPL-3.0
version: 8.0.0
date-released: '2023-01-10'

GitHub Events

Total

Push event: 2
Pull request event: 1
Create event: 3

Last Year

Push event: 2
Pull request event: 1
Create event: 3

Dependencies

examples/YOLO-Series-ONNXRuntime-Rust/Cargo.toml cargo

examples/YOLOv8-ONNXRuntime-Rust/Cargo.toml cargo

docker/Dockerfile docker

pytorch/pytorch 2.5.0-cuda12.4-cudnn9-runtime build

examples/YOLOv8-Action-Recognition/requirements.txt pypi

transformers *
ultralytics *

pyproject.toml pypi

matplotlib >=3.3.0
numpy <2.0.0; sys_platform == 'darwin'
numpy >=1.23.0
opencv-python >=4.6.0
pandas >=1.1.4
pillow >=7.1.2
psutil *
py-cpuinfo *
pyyaml >=5.3.1
requests >=2.23.0
scipy >=1.4.1
seaborn >=0.11.0
supervision >=0.25.1
torch >=1.8.0,!=2.4.0; sys_platform == 'win32'
torch >=1.8.0
torchvision >=0.9.0
tqdm >=4.64.0
ultralytics-thop >=2.0.0

requirements.txt pypi

third_party/CLIP/requirements.txt pypi

ftfy *
regex *
torch *
torchvision *
tqdm *

third_party/CLIP/setup.py pypi

for *
str *

third_party/lvis-api/requirements.txt pypi

Cython >=0.29.12
cycler >=0.10.0
kiwisolver >=1.1.0
matplotlib >=3.1.1
numpy >=1.18.2
opencv-python >=4.1.0.25
pyparsing >=2.4.0
python-dateutil >=2.8.0
six >=1.12.0

third_party/lvis-api/setup.py pypi

third_party/ml-mobileclip/requirements.txt pypi

open-clip-torch >=2.20.0
timm ==0.9.5
torch >=2.1.0

third_party/ml-mobileclip/setup.py pypi

third_party/sam2/pyproject.toml pypi

third_party/sam2/requirements.txt pypi

Jinja2 ==3.1.4
MarkupSafe ==3.0.2
PyYAML ==6.0.2
antlr4-python3-runtime ==4.9.3
blessed ==1.20.0
contourpy ==1.3.1
cycler ==0.12.1
filelock ==3.16.1
fonttools ==4.55.3
fsspec ==2024.10.0
gpustat ==1.1.1
hydra-core ==1.3.2
iopath ==0.1.10
kiwisolver ==1.4.7
matplotlib ==3.9.4
mpmath ==1.3.0
networkx ==3.4.2
numpy ==1.26.4
nvidia-cublas-cu12 ==12.4.5.8
nvidia-cuda-cupti-cu12 ==12.4.127
nvidia-cuda-nvrtc-cu12 ==12.4.127
nvidia-cuda-runtime-cu12 ==12.4.127
nvidia-cudnn-cu12 ==9.1.0.70
nvidia-cufft-cu12 ==11.2.1.3
nvidia-curand-cu12 ==10.3.5.147
nvidia-cusolver-cu12 ==11.6.1.9
nvidia-cusparse-cu12 ==12.3.1.170
nvidia-ml-py ==12.560.30
nvidia-nccl-cu12 ==2.21.5
nvidia-nvjitlink-cu12 ==12.4.127
nvidia-nvtx-cu12 ==12.4.127
omegaconf ==2.3.0
opencv-python ==4.10.0.84
packaging ==24.2
pillow ==11.0.0
portalocker ==3.0.0
psutil ==6.1.0
pycocotools ==2.0.8
pyparsing ==3.2.0
python-dateutil ==2.9.0.post0
sam2 ==1.1.0
six ==1.17.0
sympy ==1.13.1
torch ==2.5.1
torchaudio ==2.5.1
torchvision ==0.20.1
tqdm ==4.67.1
triton ==3.1.0
typing_extensions ==4.12.2
wcwidth ==0.2.13

third_party/sam2/sav_dataset/requirements.txt pypi

matplotlib *
numpy *
opencv-python *
pillow *
pycocoevalcap *
scikit-image *
tqdm *

third_party/sam2/setup.py pypi
