https://github.com/buaadreamer/r1-v
Witness the aha moment of VLM with less than $3.
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of Deep-Agent/R1-V
Created over 1 year ago
· Last pushed over 1 year ago
https://github.com/BUAADreamer/R1-V/blob/main/
# R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3

> ### Roadmap for R1-V
> We are building a general framework for RLVR in VLM. We believe in the power of **trenches** and **longtermism**.
>
> Our Interest: General Vision-Language Intelligence & Visual/GUI Agent
>
> Our Goal: Algorithm Enhancement, Efficiency Optimization, Task Diversity, Impactful Open Source Research.
>
> Welcome Ideas and Contribution. Stay tuned!

1. We are the first to show that **Reinforcement Learning with Verifiable Rewards (RLVR)** outperforms chain-of-thought supervised fine-tuning (CoT-SFT) in both **effectiveness and out-of-distribution (OOD) robustness** for vision language models.
2. In our experiment, we **incentivize** VLMs to learn **generalizable** visual counting abilities, rather than overfitting to the training set.
3. The 2B model outperforms the 72B model in OOD tests within just **100** training steps.
4. The training was conducted on 8 A100 GPUs for **30 minutes, costing $2.62**.

**Resources:**

- [R1V Training Dataset: CLEVR-70k-Counting](https://huggingface.co/datasets/leonardPKU/clevr_cogen_a_train)
- [R1V Training Dataset: CLEVR-70k-Complex](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_70K_Complex)
- [R1V Training Dataset: GEOQA-8k](https://huggingface.co/datasets/leonardPKU/GEOQA_R1V_Train_8K)
- [R1-Distilled Visual Reasoning Dataset](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_R1)

**R1-V Team:**

[Liang Chen](https://github.com/chenllliang) · [Lei Li](https://lilei-nlp.github.io) · [Haozhe Zhao](https://haozhezhao.github.io/) · [Yifan Song](https://github.com/Yifan-Song793) · [Vinci](https://github.com/0xvincii) · [Zihao Yue](https://yuezih.github.io/)

---

### Updates

- 2025-02-12: We fixed the batched decoding error. The original RL training script is now 3x faster.
- 2025-02-12: R1-V now supports vLLM to accelerate training (`pip install vllm==0.7.2` before use) and SFT.
- 2025-02-11: R1-V now supports Qwen2.5-VL and the [GEOQA](https://arxiv.org/abs/2312.11370) task.
- 2025-02-06: We uploaded the evaluation script and polished the README. We are writing a blog post summarizing the statistics, findings and underexplored questions.
- 2025-02-03: We uploaded the training codebase.
- 2025-02-03: We curated and uploaded some verified Deepseek-R1 visual reasoning traces with some special tricks (see `R1-V/src/distill_r1/`). The current training code does not rely on them; feel free to explore.
- 2025-02-03: We released the R1-V repo.

### For contributors

- Our top development priority is addressing the issues marked with `help wanted` labels, and we welcome ideas/PRs from the community to help solve them.

---

## Setup

```bash
conda create -n r1-v python=3.11
conda activate r1-v

bash setup.sh
```

### Supported Models

1. Qwen2-VL
2. Qwen2.5-VL

### Supported Training Datasets

1. [R1V Training Dataset: CLEVR-70k-Counting](https://huggingface.co/datasets/leonardPKU/clevr_cogen_a_train): Item Counting Problems
2. [R1V Training Dataset: CLEVR-70k-Complex](https://huggingface.co/datasets/MMInstruction/Clevr_CoGenT_TrainA_70K_Complex): Number Related Reasoning
3. [R1V Training Dataset: GEOQA-8k](https://huggingface.co/datasets/leonardPKU/GEOQA_R1V_Train_8K): Geometry Reasoning

### Supported Evaluations

1. [SuperClevr-200](https://github.com/Deep-Agent/R1-V?tab=readme-ov-file#superclevr): Item Counting Problems
2. [GeoQA-Test-Direct-Answer-735](https://github.com/Deep-Agent/R1-V?tab=readme-ov-file#geoqa): Geometry Reasoning

## Training

### GRPO

```bash
cd src/r1-v

export DEBUG_MODE="true" # Enable debug mode to log model rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

# <OUTPUT_DIR> and <MODEL_NAME_OR_PATH> are placeholders: set your own values
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir <OUTPUT_DIR> \
    --model_name_or_path <MODEL_NAME_OR_PATH> \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --deepspeed local_scripts/zero3.json \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8 # number of outputs G in GRPO; reducing it gives faster training and lower memory cost but higher variance
```

> [!NOTE]
> 1. To reproduce the result, keep `per_device_train_batch_size` at 1 for now, as there is a known bug in batched training. See the [reproduction report](https://github.com/Deep-Agent/R1-V/issues/4#issuecomment-2633348354). We realize batching is important for efficiency and are working on a fix with the community.
> 2. If you hit an **OOM error**, try reducing `--num_generations`.
> 3. To speed up training with vLLM, please refer to this [script](https://github.com/Deep-Agent/R1-V/blob/main/src/scripts/run_grpo_vllm.sh); it currently supports only the Qwen2-VL model series.

### SFT

We also provide SFT code. Follow the script below and edit the config to customize the SFT task.

```bash
accelerate launch --config_file src/r1-v/configs/zero2.yaml src/r1-v/src/open_r1/sft.py --config src/r1-v/configs/qwen2vl_sft_config.yaml
```

## Evaluation

### SuperCLEVR

We provide an example script that evaluates OOD counting performance on a subset of SuperCLEVR within 1 minute. You can also modify the script and dataset to test on your own data.

```bash
cd ./src/eval
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip

# change the model path in the script
python test_qwen2vl_counting_superclevr.py

# tested scores:
# Qwen2VL-2B-Instruct: 48.0%
# Qwen2VL-2B-Instruct-GRPO-100step: 82.5%
```

### GEOQA
We provide an example script to evaluate on the test set (direct answer form) of [GEOQA](https://arxiv.org/abs/2312.11370).

```bash
# prepare images for testing
cd ./src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip

# Evaluation Script
python test_qwen2vl_geoqa.py

# tested scores:
# Qwen2VL-7B-Instruct: 30.63%
# Qwen2VL-7B-Instruct-GRPO-2epochs: 38.72%

# Qwen2.5VL-3B-Instruct: 35.41%
# Qwen2.5VL-3B-Instruct-GRPO-1epochs: 47.48%
```

For faster inference on multiple GPUs, you can also use the script in `R1-V/src/scripts/test_grpo_geoqa_multigpu.sh`:

```bash
bash src/scripts/test_grpo_geoqa_multigpu.sh
```

## Acknowledgements

We sincerely thank [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), [QwenVL](https://github.com/QwenLM/Qwen2.5-VL), [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) (our initial codebase), [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/), [SuperCLEVR](https://github.com/Lizw14/Super-CLEVR), and [G-LLAVA](https://arxiv.org/abs/2312.11370) for providing the open-source resources on which this project is built. Special thanks to [Kimi](https://kimi.moonshot.cn/) and [bAInance Labs](https://bainancelabs.com/) for supporting computation resources, and to [Yuxin Wu](https://scholar.google.com/citations?user=mJQI-gUAAAAJ&hl=en), [Xinyu Zhou](https://scholar.google.com/citations?user=Jv4LCj8AAAAJ&hl=en), and [Baobao Chang](https://scholar.google.com.au/citations?user=LaKNyhQAAAAJ&hl=en) for their valuable advice.

## Citation

```bib
@misc{chen2025r1v,
  author       = {Chen, Liang and Li, Lei and Zhao, Haozhe and Song, Yifan and Vinci},
  title        = {R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than \$3},
  howpublished = {\url{https://github.com/Deep-Agent/R1-V}},
  note         = {Accessed: 2025-02-02},
  year         = {2025}
}
```
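
## Notes on GRPO and `--num_generations`

The GRPO training command above samples `--num_generations` (G) completions per prompt and notes that a smaller G trades speed and memory for higher variance. The sketch below illustrates why, using one common formulation of GRPO: each completion's reward is normalized by the mean and standard deviation of its own group, so a smaller group gives a noisier baseline. This is not the repository's implementation. The `<answer>...</answer>` output format, the function names `counting_reward` and `group_relative_advantages`, and the `eps` value are all assumptions made for illustration.

```python
import re
import statistics

# Assumption: the model wraps its final answer in <answer>...</answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def counting_reward(completion: str, ground_truth: int) -> float:
    """Verifiable reward: 1.0 if the tagged answer equals the true count, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    try:
        return 1.0 if int(match.group(1)) == ground_truth else 0.0
    except ValueError:
        # The tagged answer was not an integer (e.g. "three").
        return 0.0


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean/std of its own group of G samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of G = 4 completions for one prompt whose true count is 3.
    completions = [
        "<think>two cubes, one ball</think><answer>3</answer>",
        "<think>I see four items</think><answer>4</answer>",
        "<answer>three</answer>",
        "no tags at all",
    ]
    rewards = [counting_reward(c, 3) for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason very small groups can slow progress even though each step is cheaper.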
Owner
- Login: BUAADreamer
- Kind: user
- Location: Beijing
- Company: Beihang University
- Website: https://buaadreamer.top/
- Repositories: 4
- Profile: https://github.com/BUAADreamer
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2