https://github.com/aim-uofa/active-o3

ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary

Keywords

active-perception active-vision grpo mllms o3 rl thinking-with-image
Last synced: 5 months ago

Repository

ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Basic Info
Statistics
  • Stars: 59
  • Watchers: 2
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Topics
active-perception active-vision grpo mllms o3 rl thinking-with-image
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme

README.md

# ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

1 [Zhejiang University](https://www.zju.edu.cn/english/), 2 [Ant Group](https://www.antgroup.com/en)

[📄 **Paper**](https://arxiv.org/abs/2505.21457) | [🌐 **Project Page**](https://aim-uofa.github.io/ACTIVE-o3) | [💾 **Model Weights**](https://www.modelscope.cn/models/zzzmmz/ACTIVE-o3)

🚀 Overview

SegAgent Framework

📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results show that ACTIVE-O3 significantly enhances active perception capabilities compared with Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 correctly identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
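The training scripts are not yet released, so the following is only a generic sketch of the group-relative advantage step that GRPO is commonly described as using: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration; this is not ACTIVE-O3's reward design or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of responses sampled for one prompt.

    Hypothetical helper for illustration only. Each advantage is the reward's
    distance from the group mean, divided by the group's standard deviation.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    # Population standard deviation is an assumption; some implementations
    # use the sample standard deviation instead.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline.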

🚩 Plan

  • [x] Release the weights.
  • [x] Release the inference demo.
  • [ ] Release the dataset.
  • [ ] Release the training scripts.
  • [ ] Release the evaluation scripts.

🛠️ Getting Started

📐 Set up Environment

```bash
# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
pip install qwen-omni-utils[decord]
```
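After installation, a quick sanity check can confirm that the pinned versions from the commands above are the ones actually present in the active environment. The sketch below uses only the standard library; the helper name `check_environment` and the report format are illustrative and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the setup commands above.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}
# Packages installed without a version pin.
UNPINNED = ["flash-attn", "qwen-omni-utils"]


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return a {package: status} report (hypothetical helper)."""
    report = {}
    for name in list(pinned) + list(unpinned):
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        expected = pinned.get(name)
        # Local build tags such as "+cu124" are ignored when comparing.
        if expected is None or installed.split("+")[0] == expected:
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"mismatch (installed {installed}, expected {expected})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Run it inside the `activeo3` environment. A `missing` or `mismatch` entry points to the `pip install` line that needs to be repeated.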

🔍 Demo

```bash
# run demo
python demo/activeo3demovstar.py
```
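The demo script's internals are not reproduced here. As a rough illustration of the zoom-in idea from the description (select a region, then look at it at higher resolution), the sketch below converts a normalized region proposal into a pixel crop box. The function `to_pixel_box` and the normalized `(x0, y0, x1, y1)` convention are assumptions for illustration; the coordinate format the released model actually emits may differ.

```python
def to_pixel_box(image_size, region):
    """Map a normalized (x0, y0, x1, y1) region to a pixel box.

    Hypothetical helper: clamps coordinates to [0, 1], orders the corners,
    and guarantees a box at least one pixel wide and tall.
    Returns (left, top, right, bottom) in pixels.
    """
    width, height = image_size

    def clamp(value):
        return min(max(value, 0.0), 1.0)

    x0, x1 = sorted((clamp(region[0]), clamp(region[2])))
    y0, y1 = sorted((clamp(region[1]), clamp(region[3])))

    # Keep the top-left corner strictly inside the image.
    left = min(int(x0 * width), width - 1)
    top = min(int(y0 * height), height - 1)
    # Ensure a non-empty box that does not extend past the image border.
    right = min(max(int(x1 * width), left + 1), width)
    bottom = min(max(int(y1 * height), top + 1), height)
    return left, top, right, bottom


# A region covering the lower-middle part of a 1920x1080 image.
box = to_pixel_box((1920, 1080), (0.25, 0.5, 0.5, 0.75))
```

The resulting tuple uses the same `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so the cropped region can be resized and passed back to the model as a second, zoomed-in view.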

🎫 License

For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.

🖊️ Citation

If you find this work helpful for your research, please cite:

```BibTeX
@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
```

Owner

  • Name: Advanced Intelligent Machines (AIM)
  • Login: aim-uofa
  • Kind: organization
  • Location: China

A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...

GitHub Events

Total
  • Issues event: 2
  • Watch event: 43
  • Issue comment event: 2
  • Push event: 1
  • Public event: 1
Last Year
  • Issues event: 2
  • Watch event: 43
  • Issue comment event: 2
  • Push event: 1
  • Public event: 1

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: about 18 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: about 18 hours
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • NielsRogge (1)
  • litingsjj (1)
  • HansenJohn (1)
  • caichuang0415 (1)