https://github.com/aim-uofa/segagent
[CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Science Score: 23.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.7%, to scientific vocabulary)
Basic Info
Statistics
- Stars: 62
- Watchers: 7
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
🚀 Overview
📖 Description
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities in understanding images but still struggle with pixel-level tasks like segmentation. SegAgent addresses this by introducing a novel Human-Like Mask Annotation Task (HLMAT), enabling MLLMs to mimic the annotation trajectories of human experts using interactive segmentation tools.
SegAgent effectively leverages these annotation trajectories without requiring architectural modifications or additional implicit tokens. Our approach significantly enhances MLLMs' segmentation and mask refinement abilities, establishing a new paradigm for assessing fine-grained visual understanding and multi-step reasoning.
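To make the multi-step annotation idea concrete, here is a minimal sketch of the human-like annotation loop the description implies: the model proposes a click, an interactive segmenter refines the mask, and the loop stops once the mask is good enough. All names (`Click`, `AnnotationState`, `annotate`, and the callback signatures) are hypothetical illustrations, not the repository's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Click:
    x: int
    y: int
    positive: bool  # positive clicks grow the mask, negative clicks carve it back

@dataclass
class AnnotationState:
    mask_iou: float = 0.0
    clicks: list = field(default_factory=list)

def annotate(predict_click, refine_mask, max_steps=10, target_iou=0.95):
    """Imitate a human annotator: propose a click, refine the mask, repeat."""
    state = AnnotationState()
    for _ in range(max_steps):
        click = predict_click(state)         # MLLM proposes the next click
        state.clicks.append(click)
        state.mask_iou = refine_mask(state)  # interactive segmenter updates the mask
        if state.mask_iou >= target_iou:     # stop once the mask is good enough
            break
    return state
```

In the paper's setting the `predict_click` role is played by the MLLM and `refine_mask` by an interactive segmentation tool such as SimpleClick.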
🚩 Plan
- ✅ Release the weights.
- ✅ Release the inference code.
- ✅ Release the trajectory data for training and evaluation.
🚀 Getting Started
```bash
pip install -r env.txt
```
🤖 Inference
You can run inference on the validation or test set using the trained model and the provided script:
```bash
bash run_eval.sh /path/to/your/trained_model
```
This will run inference with SimpleClick as the segmentation model and SegAgent as the language grounding model. The script processes images and saves the predictions to the output directory.
To evaluate the results, run:
```bash
python eval_result_iou.py --input_json ./results/refcoco+_val_predictions.json
```
📄 For more details, refer to ./evaltools/eval.md.
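The evaluation script above scores predictions by mask IoU. As a reference point, here is a minimal, self-contained sketch of IoU over flat binary masks; it is an illustration of the metric, not the code in `eval_result_iou.py`.

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks given as 0/1 sequences.

    Both masks must be flattened to the same length. If both are empty,
    IoU is defined here as 1.0 (a perfect match of nothing).
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0
```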
🧑🏫 Training
SegAgent is trained using Human-Like Mask Annotation Trajectories (HLMAT). Follow the steps below to launch the training process:
Step 1: Prepare the Data
Ensure that the annotation trajectory data is preprocessed and saved in the appropriate format (e.g., COCO-style JSON files + click sequences).
We have uploaded the preprocessed trajectory data here:
📁 SegAgent-Data
Example structure:
```bash
tree ./data/segagent-data
├── refcoco_train.json
├── refcoco_val.json
├── refcoco+_train.json
├── ...
```
Additional image data sources:
- RefCOCO image datasets: LISA GitHub Repository
- HQ segmentation (SAM-HQ): Hugging Face SAM-HQ Data
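The exact record schema of the trajectory JSON files is not documented here, so the loader below is a sketch under assumptions: each record is taken to carry an image reference, a referring expression, and an ordered click sequence. The field names `image`, `expression`, and `clicks` are guesses for illustration.

```python
import json

def load_trajectories(path):
    """Yield annotation-trajectory records from a COCO-style JSON file.

    Assumed (not verified) schema per record:
      image      - path or identifier of the image
      expression - the referring expression to segment
      clicks     - ordered [x, y, is_positive] entries
    """
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        yield {
            "image": rec.get("image"),
            "expression": rec.get("expression"),
            "clicks": rec.get("clicks", []),
        }
```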
Step 2: Run Training
We recommend converting the trajectory data into a format supported by LLaMA-Factory, and training using their framework directly.
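One plausible conversion, sketched under assumptions, maps each trajectory to a ShareGPT-style conversation (one turn per click), which is one of the dataset formats LLaMA-Factory accepts. The input record fields and the `click(x, y, pos/neg)` target encoding are hypothetical, not the repository's actual scheme.

```python
def to_llamafactory_sharegpt(records):
    """Convert trajectory records into ShareGPT-style conversation entries.

    Assumed input per record: {"image": ..., "expression": ..., "clicks": [(x, y, positive), ...]}
    Output: entries with a "conversations" turn list and an "images" list.
    """
    converted = []
    for rec in records:
        conversations = []
        for i, (x, y, positive) in enumerate(rec["clicks"]):
            # First turn states the task; later turns ask for refinement.
            prompt = (f"<image> Segment: {rec['expression']}" if i == 0
                      else "Refine the current mask.")
            conversations.append({"from": "human", "value": prompt})
            conversations.append({
                "from": "gpt",
                "value": f"click({x}, {y}, {'pos' if positive else 'neg'})",
            })
        converted.append({"conversations": conversations, "images": [rec["image"]]})
    return converted
```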
🎫 License
For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.
🖊️ Citation
If you find this work helpful for your research, please cite:
```BibTeX
@article{zhu2025segagent,
  title={SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories},
  author={Zhu, Muzhi and Tian, Yuzhuo and Chen, Hao and Zhou, Chunluan and Guo, Qingpei and Liu, Yang and Yang, Ming and Shen, Chunhua},
  journal={arXiv preprint arXiv:2503.08625},
  year={2025},
  url={https://arxiv.org/abs/2503.08625}
}
```
Owner
- Name: Advanced Intelligent Machines (AIM)
- Login: aim-uofa
- Kind: organization
- Location: China
- Repositories: 23
- Profile: https://github.com/aim-uofa
A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...
GitHub Events
Total
- Issues event: 3
- Watch event: 51
- Issue comment event: 5
- Push event: 2
- Public event: 1
- Fork event: 1
Last Year
- Issues event: 3
- Watch event: 51
- Issue comment event: 5
- Push event: 2
- Public event: 1
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 4
- Total pull requests: 0
- Average time to close issues: 14 days
- Average time to close pull requests: N/A
- Total issue authors: 4
- Total pull request authors: 0
- Average comments per issue: 0.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: 14 days
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 0
- Average comments per issue: 0.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- harrylin-hyl (1)
- ShramanPramanick (1)
- zhitongcui (1)