https://github.com/aim-uofa/omni-r1
Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google, zenodo.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.4%, to scientific vocabulary)
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: aim-uofa
- Language: Python
- Default Branch: main
- Homepage: https://aim-uofa.github.io/OmniR1/
- Size: 165 MB
Statistics
- Stars: 63
- Watchers: 3
- Forks: 3
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
Overview
Description
Video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on multimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because optimal keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination.
Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
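As background on the training objective (this is not code from the repository): GRPO samples a group of rollouts for the same query and scores each one against the group's own statistics, so no learned critic is needed. A minimal sketch of that group-relative advantage step, with illustrative names only:

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages for one query.

    `rewards` holds the scalar reward of each sampled completion in the group,
    e.g. hierarchical rewards obtained in collaboration with the Detail
    Understanding System. Each advantage is the reward normalized by the
    group's mean and standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: four rollouts for one query, each rewarded in [0, 1].
print(group_relative_advantages(torch.tensor([0.2, 0.9, 0.4, 0.9])))
```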
Plan
- [X] Release model weights and demo.
- [ ] Release the segmentation and evaluation code.
- [ ] Release the training scripts.
Getting Started
Set up Environment
```bash
git clone https://github.com/aim-uofa/Omni-R1
cd Omni-R1

# build environment
conda create -n omni python=3.10
conda activate omni

# install packages
pip install -r requirements.txt
pip install -e src/qwen-omni-utils[decord]
pip install flash-attn --no-build-isolation
pip install transformers/transformers_omni.zip

# replace the transformers Qwen2.5-Omni .py file
bash replace_omni.sh
```
This project also supports uv; if you prefer it:
```bash
uv sync --no-build-isolation-package flash-attn
source .venv/bin/activate

# replace the transformers Qwen2.5-Omni .py file
bash replace_omni.sh
```
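After either setup path, it can be worth checking that the patched packages import cleanly before training. A small sanity-check sketch (illustrative only; it assumes the patched transformers build exposes the same Qwen2.5-Omni thinker class used in the inference example below):

```python
# Illustrative sanity check: confirm the key packages resolve after setup.
import importlib

for name in ("torch", "flash_attn", "transformers", "qwen_omni_utils"):
    module = importlib.import_module(name)
    print(name, getattr(module, "__version__", "version unknown"))

# The patched transformers build should expose the Qwen2.5-Omni thinker class
# used by the inference example further down this README.
from transformers import Qwen2_5OmniThinkerForConditionalGeneration  # noqa: F401

print("Qwen2.5-Omni thinker class available")
```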
Download Datasets
Download and extract the datasets you need, then prepare src/r1-v/datasets.json following the format of src/r1-v/datasets_demo.json (a small validation sketch follows this list).
- The ReVOS and MeViS datasets are selected directly from the Sa2VA training data, which can be downloaded here. Please refer to Sa2VA for usage details.
- refCOCOg_2k_840 from Seg-Zero can be downloaded here.
- RefAVS can be downloaded here.
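The authoritative schema for datasets.json is src/r1-v/datasets_demo.json. Purely as an illustration (the object-keyed layout below is an assumption, not taken from the repo), a quick way to confirm the file parses and to list its entries:

```python
# Illustrative only: check that src/r1-v/datasets.json parses and list its entries.
# The real schema is defined by src/r1-v/datasets_demo.json; the assumption here is
# simply that the file is a JSON object keyed by dataset name.
import json
from pathlib import Path

config_path = Path("src/r1-v/datasets.json")
datasets = json.loads(config_path.read_text())

for name, entry in datasets.items():
    print(f"{name}: {entry}")
```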
Training
```bash
# for uv: source .venv/bin/activate
conda activate omni

# Start the SAM server first. If you are not training VOS, or alpha_g is set to 0.0,
# the SAM server is not necessary.
bash src/scripts/run_sam_server.sh

# Start training; by default this script does not need a SAM server.
bash src/scripts/omni_r1_run_training.sh
```
To connect to an existing SAM server, set `SAM_HOST` and `SAM_PORT` as environment variables in `src/scripts/omni_r1_run_training.sh`.
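For reference, a minimal sketch of how client code could resolve the SAM server address from those variables; the fallback host and port here are placeholders, not values taken from the training script:

```python
# Hypothetical illustration: resolve the SAM server address from the environment.
# The fallback host/port are placeholders; the real defaults live in the training script.
import os

sam_host = os.environ.get("SAM_HOST", "127.0.0.1")
sam_port = int(os.environ.get("SAM_PORT", "8000"))
print(f"Using SAM server at http://{sam_host}:{sam_port}")
```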
Inference
Inference and evaluation code are coming soon; a minimal example is shown below.
```python
import torch
from transformers import (
    Qwen2_5OmniModel,
    Qwen2_5OmniProcessor,
    GenerationConfig,
    Qwen2_5OmniThinkerForConditionalGeneration,
)
from transformers import AutoModelForCausalLM, AutoTokenizer
from qwen_omni_utils import process_mm_info, process_vision_info

omni_path = "/path/to/Omni-R1"

# Omni-R1 is a Qwen2_5OmniThinker, not a Qwen2_5OmniModel, so this inference code
# differs from the official Qwen examples.
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    omni_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).eval()
processor = Qwen2_5OmniProcessor.from_pretrained(omni_path)

generation_config = GenerationConfig(
    use_cache=True, max_new_tokens=1024, do_sample=False
)


def inference(video_path, prompt, sys_prompt):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audio_input, image_input, video_input, process_args = process_mm_info(
        messages, use_audio_in_video=False
    )
    inputs = processor(
        text=text_input,
        images=image_input,
        audios=audio_input,
        videos=video_input,
        return_tensors="pt",
        do_resize=True,
    )
    # Run generation and keep only the newly generated tokens.
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, generation_config=generation_config)
    prompt_length = inputs["input_ids"].size(1)
    completion_ids = generated_ids[:, prompt_length:]
    # Decode the generated completions
    text = processor.batch_decode(completion_ids, skip_special_tokens=True)
    return text


video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/shopping.mp4"
prompt = "How many kinds of drinks can you see in the video?"

# Use a local model to run inference.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])
```
License
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
Acknowledgements
We sincerely appreciate the contributions of the open-source community. Related projects include: Sa2VA, Video-R1, R1-V, and DeepSeek-R1.
Citation
If you find this work helpful for your research, please cite:
BibTeX
@article{zhong2025omni,
title={Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration},
author={Zhong, Hao and Zhu, Muzhi and Du, Zongze and Huang, Zheng and Zhao, Canyu and Liu, Mingyu and Wang, Wen and Chen, Hao and Shen, Chunhua},
journal={arXiv preprint arXiv:2505.20256},
year={2025}
}
Owner
- Name: Advanced Intelligent Machines (AIM)
- Login: aim-uofa
- Kind: organization
- Location: China
- Repositories: 23
- Profile: https://github.com/aim-uofa
A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...
GitHub Events
Total
- Issues event: 4
- Watch event: 45
- Issue comment event: 3
- Public event: 1
- Push event: 3
- Pull request review event: 1
- Pull request event: 2
- Fork event: 2
Last Year
- Issues event: 4
- Watch event: 45
- Issue comment event: 3
- Public event: 1
- Push event: 3
- Pull request review event: 1
- Pull request event: 2
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 5
- Total pull requests: 2
- Average time to close issues: about 23 hours
- Average time to close pull requests: 1 day
- Total issue authors: 5
- Total pull request authors: 1
- Average comments per issue: 0.2
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 5
- Pull requests: 2
- Average time to close issues: about 23 hours
- Average time to close pull requests: 1 day
- Issue authors: 5
- Pull request authors: 1
- Average comments per issue: 0.2
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- gak97 (1)
- VoyageWang (1)
- AlexiaJM (1)
- HPUhushicheng (1)
- CJack812 (1)
Pull Request Authors
- erjanmx (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- accelerate >=1.2.1
- bitsandbytes >=0.43.0
- black >=24.4.2
- datasets >=3.2.0
- deepspeed >=0.15.4
- diskcache >=5.6.3
- einops >=0.8.0
- flake8 >=6.0.0
- flash-attn >=2.7.4.post1
- hf-transfer >=0.1.4
- huggingface-hub [cli]>=0.19.2
- isort >=5.12.0
- liger-kernel >=0.5.2
- numpy >=1.26.4
- opencv-python >=4.9.0.80
- packaging >=23.0
- parameterized >=0.9.0
- peft >=0.12.0
- pillow >=10.2.0
- pycocotools >=2.0.7
- qwen-omni-utils [decord]
- rouge-score >=0.1.2
- ruff >=0.11.10
- safetensors >=0.3.3
- sam2 >=1.1.0
- scikit-image >=0.23.0
- scipy >=1.13.0
- sentencepiece >=0.1.99
- tensorboardx >=2.6.2
- torch ==2.6.0
- torchaudio ==2.6.0
- torchvision >=0.20.0
- transformers *
- trl >=0.14.0
- wandb >=0.18.3
- Pillow >=10.2.0
- accelerate >=1.2.1
- bitsandbytes >=0.43.0
- black >=24.4.2
- datasets >=3.2.0
- deepspeed >=0.15.4
- diskcache >=5.6.3
- einops >=0.8.0
- flake8 >=6.0.0
- hf_transfer >=0.1.4
- huggingface-hub >=0.19.2
- isort >=5.12.0
- liger_kernel >=0.5.2
- numpy >=1.26.4
- opencv-python >=4.9.0.80
- packaging >=23.0
- parameterized >=0.9.0
- peft >=0.12.0
- pycocotools >=2.0.7
- rouge_score >=0.1.2
- safetensors >=0.3.3
- sam2 >=1.1.0
- scikit-image >=0.23.0
- scipy >=1.13.0
- sentencepiece >=0.1.99
- tensorboardx >=2.6.2
- torch ==2.6.0
- torchaudio ==2.6.0
- torchvision >=0.17.2
- trl >=0.14.0
- wandb >=0.18.3
- av *
- librosa *
- packaging *
- pillow *
- requests *
- 156 dependencies