https://github.com/aim-uofa/active-o3
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.6%) to scientific vocabulary
Repository
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
Basic Info
- Host: GitHub
- Owner: aim-uofa
- Default Branch: main
- Homepage: https://aim-uofa.github.io/ACTIVE-o3/
- Size: 4.86 MB
Statistics
- Stars: 59
- Watchers: 2
- Forks: 1
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
🚀 Overview
📖 Description
We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope that this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
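To make the "zooming in on the relevant region" step concrete, here is a minimal sketch of how a candidate region might be turned into a crop window. This is an illustration only, not the ACTIVE-O3 implementation: the function names `expand_and_clamp` and `zoom_factor`, the `margin` parameter, and the `(left, top, right, bottom)` box convention are all assumptions made for this example.

```python
from typing import Tuple

# Hypothetical box convention: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def expand_and_clamp(box: Box, image_size: Tuple[int, int], margin: float = 0.25) -> Box:
    """Grow a candidate region by `margin` on each side and keep it inside the image."""
    left, top, right, bottom = box
    width, height = image_size
    pad_x = int((right - left) * margin)
    pad_y = int((bottom - top) * margin)
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )


def zoom_factor(box: Box, image_size: Tuple[int, int]) -> float:
    """Largest uniform magnification at which the region still fits the full image size."""
    left, top, right, bottom = box
    width, height = image_size
    return min(width / (right - left), height / (bottom - top))


if __name__ == "__main__":
    region = expand_and_clamp((100, 100, 200, 200), (1000, 800))
    print(region, zoom_factor(region, (1000, 800)))
```

The resulting window could then be cropped and resized with any image library before being passed back to the model; how the region itself is chosen is what the framework learns and is not shown here.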
🚩 Plan
- [x] Release the weights.
- [x] Release the inference demo.
- [ ] Release the dataset.
- [ ] Release the training scripts.
- [ ] Release the evaluation scripts.
🛠️ Getting Started
📐 Set up Environment
```bash
# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
pip install qwen-omni-utils[decord]
```
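After installation, it can help to confirm that the pinned versions were actually picked up. The following is a small sketch using only the standard library; the helper names `find_mismatches` and `installed_versions` are made up for this example and are not part of the repository.

```python
from importlib import metadata

# Pins copied from the install commands in this section.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}


def find_mismatches(pinned, installed):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pinned.items():
        found = installed.get(name)
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


def installed_versions(names):
    """Look up installed versions; packages that are missing map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the `activeo3` environment; a reported mismatch usually means a later `pip install` replaced one of the pinned packages.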
🔍 Demo
```bash
# run demo
python demo/activeo3demovstar.py
```
🎫 License
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
🖊️ Citation
If you find this work helpful for your research, please cite:
```BibTeX
@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
```
Owner
- Name: Advanced Intelligent Machines (AIM)
- Login: aim-uofa
- Kind: organization
- Location: China
- Repositories: 23
- Profile: https://github.com/aim-uofa
A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...
GitHub Events
Total
- Issues event: 2
- Watch event: 43
- Issue comment event: 2
- Push event: 1
- Public event: 1
Last Year
- Issues event: 2
- Watch event: 43
- Issue comment event: 2
- Push event: 1
- Public event: 1
Issues and Pull Requests
Last synced: 5 months ago
All Time
- Total issues: 4
- Total pull requests: 0
- Average time to close issues: about 18 hours
- Average time to close pull requests: N/A
- Total issue authors: 4
- Total pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: about 18 hours
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- litingsjj (1)
- HansenJohn (1)
- caichuang0415 (1)