open-medical-r1
This repository is aim to reproduce the R1-Zero on medical domain.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (7.2%) to scientific vocabulary
Repository
This repository is aim to reproduce the R1-Zero on medical domain.
Basic Info
- Host: GitHub
- Owner: Qsingle
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 1.01 MB
Statistics
- Stars: 25
- Watchers: 1
- Forks: 2
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
Medical R1-Zero Reproduce
News
- 2025.04.07 Release the weights for Gemma-3-12b-it-grpo that trained on 500 samples by LoRA.
- 2025.04.02 Release the weights for Gemma-3-12b-bnb-grpo that trained on 500 samples by Q-LoRA.
- 2025.03.04 Release the codes and initial report.
Model and Dataset
Similar to the paper MED-RLVR, we have tried the Qwen2.5-1.5B, Qwen2.5-3B, and Qwen2.5-7B as the base model and train by the GRPO. However, the model can not occur the "aha moment", which is the self-validation behaviour. The model's behaviour during training is similar to MED-RLVR, so we do not report the results of these models. We conjecture the model's behaviour is related to the knowledge and the training data. Thus, we use the HuatuoGPT-o1-7B as the base model to do the experiments on the following three datasets:
- MedQA-USMLE: We randomly choose 1090 samples from the training dataset. We called it dataset1 in the following.
- MedQA-USMLE+MedXpertQA: We randomly choose 600 samples from the training dataset of MedQA-USMLE and 490 samples from MedXpertQA. We called it dataset2 in the following.
- MedXpertQA: We randomly choose 490 samples from the MedXpertQA dataset. We called it dataset3 in the following.
Experiments
Settings
The model was trained on 4 NVIDIA Tesla A100 40GB SXM GPUs using the Open-R1 framework. We removed both cosine similarity reward and reasoning process reward components. Experimental results indicate that the observed response length constraints under cosine reward mechanisms are directly attributable to insufficient domain knowledge - the model's inability to form valid answers under knowledge deficits creates premature output truncation. This knowledge-dependent length limitation represents the root cause of sampling challenges. However, the model can generate a chain of thought to solve the problem naturally, so we think the reasoning and cosine rewards are not essential.
Metric Results
We evaluated the model's performance across six medical benchmarks: MedMCQA, MedQA-USMLE, PubMedQA, MMLU-Pro Medical, GPQA Medical, and MedXpertQA, with results presented in Tab. 1. Unexpectedly, post-training performance degradation was observed compared to baseline metrics. We hypothesize that insufficient training duration and limited sample size may account for this degradation, which will be systematically investigated in follow-up experiments.
As shown in Tab. 1, the model trained on Dataset 2 demonstrates superior and more consistent performance across all benchmarks. This suggests that a balanced composition of complex and straightforward cases in training data may facilitate both performance stability and reasoning capability development.
Notably, Dataset 3 - exclusively comprising complex cases - enabled the best performance on MedMCQA and MedQA-USMLE benchmarks compared to Datasets 1 and 2. This phenomenon might be attributed to the chain-of-thought (CoT) generation capability fostered by exposure to intricate medical reasoning patterns, indicating that complex case training could enhance problem-solving strategies in standardized medical examinations.
| Model | MedMCQA | MedQA-USMLE | PubMedQA | MMLU-Pro Medical | GPQA Medical | MedXpertQA | | :-------------: | :---------: | :----------: | :------: | :--------------: | :----------: | :--------: | | QwQ-32B | 67.32 | 80.13 | 75.80 | 73.88 | 60.77 | 20.92 | | HuatuoGPT-o1-7B | 63.57 | 71.56 | 78.50 | 67.17 | 52.56 | 15.51 | | Our(Dataset1) | 51.64 | 55.53 | 74.70 | 58.04 | 41.79 | 14.89 | | Our(Dataset2) | 52.61 | 54.91 | 76.80 | 58.17 | 45.38 | 15.36 | | Our(Dataset3) | 55.01 | 56.48 | 76.50 | 50.81 | 36.66 | 12.60 |
Logs
As illustrated in Fig. 1, experimental results from Dataset 1 reveal a positive correlation between output length and reward magnitude. The observed trend demonstrates that increased response length corresponds to higher reward values, while reductions in output length lead to diminished reward signals - indicating a possible inherent length-reward dependency in the model's optimization framework.

Fig. 1 The training log for experiments on dataset1.
Furthermore, we analyzed the model's output during training and observed a phenomenon reminiscent of the "Aha Moment" described in Deepseek's technical report. As illustrated in Fig. 2, the red bounding box highlights what appears to be the model's self-validation mechanism. However, we noted instances where the model appears to re-examine problems, which we hypothesize may be attributed to the base model's behavior.
Interestingly, our experiments demonstrate that this phenomenon is mitigated when incorporating MedXpertQA dataset into the training process. We posit that exposure to complex reasoning data enhances the model's capability to achieve more consistent performance through improved generation strategy.

Fig.2 The sample for the self-validation of the model trained on dataset1.
Similar to the experiments on dataset1, the length of the competition is larger, the reward is higher, and a decrease in the length may result in a lower reward.

Fig.3 The training log for experiments on dataset2.
Similar to the experiment on the dataset2, we also found the self-validation step on the model's output, but the format for the model is more better compare the model trained on the dataset1.

Fig.4 The sample for the self-validation of the model trained on dataset2.
In Dataset 3, the model struggles to generate accurate answers while producing excessively long responses, as evidenced by Fig. 5.
Fig.5 The training log for experiments on dataset3.
Although Fig. 6 reveals the model's capacity to develop systematic reasoning pathways with iterative self-correction attempts, the final output paradoxically deviates from both the expected answer format and the logical conclusions suggested by its own cognitive trajectory.
Fig.6 The sample for the self-validation of the model trained on dataset3.
Conclusion
Our experiments demonstrate that reinforcement learning holds significant potential for addressing challenges in vertical domains such as healthcare. However, the construction of training datasets requires careful consideration of the ratio between complex and simple examples. Experimental results confirm that complex examples facilitate the formation of extended chain-of-thought reasoning. Furthermore, successful implementation in vertical domain training necessitates the development of more rational reward assignment criteria. The strategic incorporation of domain-specific knowledge also emerges as a critical factor for effective application.
Usage
Training
To prepare the environment, you can follow the steps in Open-R1. Then, run the following instructions to do the training. Note that the num_processes must equal the number of GPUs - 1.
```shell export TASK="grpo"
ACCELERATELOGLEVEL=info TRANSFORMERSVERBOSITY=info accelerate launch \ --numprocesses 3 \ --mainprocessport 6688 \ --configfile recipes/accelerateconfigs/zero3.yaml \ src/openr1/$TASK.py --config recipes/HuatuoGPT-o1/grpo/configmedxpert_usmle.yaml ```
Data prepare
We suggest using one dataset containing easy and hard samples to help the model learn better and generate the correct chain of thought. Curriculum learning and mixing the complex and easy samples in one batch may help the training. You can use our provide scripts to prepare the data from MedXpertQA dataset and MedQA-USMLE dataset.
shell
python scripts/data_prepare.py \
--medxpertqa_root /path/to/medxpertqa \
--medqa_usmle_root /path/to/medqa_usmle \
--output_dir ./output/xpert_usmle
Weights
| Base model | Quantization | LoRA | Samples | Link | | :------------: | :----------: | :----: | :-----: | :----------------------------------------------------------: | | Gemma-3-12b-it | N/A | LoRA | 500 | 🤗huggingface | | Gemma-3-12b-it | 4bit | Q-LoRA | 500 | 🤗huggingface |
TODO
- [x] Release the code.
- [x] Release the checkpoints.
- [x] Release the technical report.
- [x] Release the evaluation results & evaluation codes.
- [x] Complete the experiments on dataset3.
Acknowledge
We gratefully acknowledge the support from the Department of Computer Science and Engineering at Southern University of Science and Technology, whose High-Performance Computing (HPC) platform generously provided GPU resources that facilitated the completion of our experiments.
Future Plan
- Develop a hierarchical QA dataset using pre-trained models with explicit difficulty grading to enable systematic analysis of learning dynamics across complexity levels
- Investigate curriculum learning strategies through systematic analysis of easy-to-difficult data ratios and phased training configurations to optimize knowledge acquisition trajectories
- Implement tool-augmented reasoning frameworks that integrate model-driven knowledge retrieval from domain-specific databases (e.g., Search-R1 architecture) for enhanced decision support capabilities
- Design multimodal reinforcement learning algorithms incorporating cross-modal alignment mechanisms to address heterogeneous data integration in real-world scenarios
- Secure dedicated GPU resources to conduct large-scale RL experiments for LLM fine-tuning in specialized domains like medical diagnosis, requiring both computational intensity and domain-specific safety validation
Owner
- Login: Qsingle
- Kind: user
- Repositories: 1
- Profile: https://github.com/Qsingle
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Open-Medical-R1
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Zhongxi
family-names: Qiu
- given-names: Zhang
family-names: Zhang
- given-names: Yan
family-names: Hu
- given-names: Heng
family-names: Li
- given-names: Jiang
family-names: Liu
identifiers:
- type: url
value: 'https://github.com/Qsingle/open-medical-r1'
description: Github
repository-code: 'https://github.com/Qsingle/open-medical-r1'
abstract: >-
Exploring the GRPO(Group Relative Policy Optimization) in
the medical domain.
keywords:
- Large Language Model
- GRPO
- Smart Medical
license: MIT
version: '0.1'
GitHub Events
Total
- Issues event: 1
- Watch event: 27
- Issue comment event: 7
- Push event: 9
- Fork event: 2
- Create event: 7
Last Year
- Issues event: 1
- Watch event: 27
- Issue comment event: 7
- Push event: 9
- Fork event: 2
- Create event: 7
Dependencies
- deps *