graphg
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Basic Info
- Host: GitHub
- Owner: open-sciencelab
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://g-app-center-000704-6802-aerppvq.openxlab.space/
- Size: 13.9 MB
Statistics
- Stars: 324
- Watchers: 6
- Forks: 27
- Open Issues: 5
- Releases: 1
Metadata Files
README.md
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
📚 Table of Contents
- 📝 [What is GraphGen?](#-what-is-graphgen)
- 📌 [Latest Updates](#-latest-updates)
- 🚀 [Quick Start](#-quick-start)
- 🏗️ [System Architecture](#-system-architecture)
- 🍀 [Acknowledgements](#-acknowledgements)
- 📚 [Citation](#-citation)
- 📜 [License](#-license)
- 📅 [Star History](#-star-history)

📝 What is GraphGen?
GraphGen is a framework for synthetic data generation guided by knowledge graphs. See the paper and the best-practice guide for details.
The table below shows post-training results for a model in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| :-: | :-: | :-: | :-: |
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| | AIME25 | 22.7 | 7.2 |
GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
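The knowledge-gap step relies on expected calibration error (ECE). As a rough illustration, the sketch below computes the standard binned ECE from per-question confidence scores and correctness labels. It is not taken from the GraphGen code base: the function name, the equal-width binning, and the default of 10 bins are assumptions made for this example, and GraphGen's own computation may differ.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: the sample-weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration only; not part of GraphGen's API.
    Confidences are expected to lie in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece
```

A larger value means the model's stated confidence diverges more from its actual accuracy, which is the kind of signal that could be used to rank knowledge for QA generation.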
After data generation, you can use LLaMA-Factory and xtuner to fine-tune your LLMs.
📌 Latest Updates
- 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
- 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
- 2025.04.21: We have released the initial version of GraphGen.
🚀 Quick Start
Try GraphGen through the Web entrance or the backup Web entrance.
If you have questions, please check the FAQ, open a new issue, or join our WeChat group and ask.
Preparation
Install uv
```bash
# If you run into network issues, you can install uv with pipx or pip instead; see the uv docs for details
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Clone the repository
```bash
git clone --depth=1 https://github.com/open-sciencelab/GraphGen
cd GraphGen
```
Create a new uv environment
```bash
uv venv --python 3.10
```
Configure the dependencies
```bash
uv pip install -r requirements.txt
```
Run Gradio Demo
```bash
# -m expects a module name, so the .py suffix is omitted
python -m webui.app
```
Run from PyPI
Install GraphGen
```bash
uv pip install graphg
```
Run in CLI
```bash
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```
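Because the CLI run depends on six environment variables, it can help to check them before launching. The sketch below is a small pre-flight check written for this guide, not something GraphGen ships: `REQUIRED_VARS` and `missing_vars` are hypothetical names, and only the variable names themselves come from the command above.

```python
import os
from typing import List, Mapping

# Variable names as they appear in the CLI example.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank in `env`.

    Hypothetical helper for illustration; GraphGen may validate its
    configuration differently.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars()` with no arguments inspects the current process environment; an empty list means all six variables are present.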
Run from Source
- Configure the environment
  - Create an `.env` file in the root directory:
    ```bash
    cp .env.example .env
    ```
  - Set the following environment variables:
    ```bash
    # Synthesizer is the model used to construct KG and generate data
    SYNTHESIZER_MODEL=your_synthesizer_model_name
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
    # Trainee is the model used to train with the generated data
    TRAINEE_MODEL=your_trainee_model_name
    TRAINEE_BASE_URL=your_base_url_for_trainee_model
    TRAINEE_API_KEY=your_api_key_for_trainee_model
    ```
- (Optional) Customize generation parameters in the `graphgen/configs/` folder.
Edit the corresponding YAML file, e.g.:
```yaml
# configs/cot_config.yaml
input_data_type: raw
input_file: resources/input_examples/raw_demo.jsonl
output_data_type: cot
tokenizer: cl100k_base
# additional settings...
```
- Generate data
Pick the desired format and run the matching script:
| Format | Script to run | Notes |
| ------------ | ---------------------------------------------- |-------------------------------------------------------------------|
| cot | bash scripts/generate/generate_cot.sh | Chain-of-Thought Q&A pairs |
| atomic | bash scripts/generate/generate_atomic.sh | Atomic Q&A pairs covering basic knowledge |
| aggregated | bash scripts/generate/generate_aggregated.sh | Aggregated Q&A pairs incorporating complex, integrated knowledge |
| multi-hop | bash scripts/generate/generate_multihop.sh | Multi-hop reasoning Q&A pairs |
- Get the generated data
```bash
ls cache/data/graphgen
```
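The format-to-script table above can also be expressed as a small lookup, which is handy when generation is driven from Python. This is a sketch under stated assumptions: the script paths are copied from the table, while `GENERATE_SCRIPTS`, `build_command`, and `run_generation` are hypothetical names introduced here, and the scripts are assumed to be run from the repository root.

```python
import subprocess
from typing import List

# Script paths as listed in the table in this README.
GENERATE_SCRIPTS = {
    "cot": "scripts/generate/generate_cot.sh",
    "atomic": "scripts/generate/generate_atomic.sh",
    "aggregated": "scripts/generate/generate_aggregated.sh",
    "multi-hop": "scripts/generate/generate_multihop.sh",
}


def build_command(data_format: str) -> List[str]:
    """Return the shell command for a given output format (illustrative helper)."""
    try:
        script = GENERATE_SCRIPTS[data_format]
    except KeyError:
        raise ValueError(
            f"Unknown format {data_format!r}; expected one of {sorted(GENERATE_SCRIPTS)}"
        ) from None
    return ["bash", script]


def run_generation(data_format: str) -> None:
    """Run the matching script; assumes the current directory is the repo root."""
    subprocess.run(build_command(data_format), check=True)
```

Note that the `multi-hop` format maps to `generate_multihop.sh` (no hyphen in the file name), which is easy to mistype by hand.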
Run with Docker
- Build the Docker image
```bash
docker build -t graphgen .
```
- Run the Docker container
```bash
docker run -p 7860:7860 graphgen
```
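Once the container is running, one quick way to confirm that the published port is accepting connections is a plain TCP check. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper, and the default port 7860 is taken from the `-p 7860:7860` mapping above. It only tests that something is listening, not that the web UI has finished loading.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```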
🏗️ System Architecture
See analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
Workflow

🍀 Acknowledgements
- SiliconFlow: abundant LLM APIs, some models are free
- LightRAG: a simple and efficient graph retrieval solution
- ROGRAG: a robustly optimized GraphRAG framework
- DB-GPT: an AI-native data app development framework
📚 Citation
If you find this repository useful, please consider citing our work:
```bibtex
@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}
```
📜 License
This project is licensed under the Apache License 2.0.
📅 Star History
Owner
- Name: OpenScienceLab
- Login: open-sciencelab
- Kind: organization
- Email: OpenScienceLab@pjlab.org.cn
- Repositories: 1
- Profile: https://github.com/open-sciencelab
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chen"
    given-names: "Zihong"
  - family-names: "Jiang"
    given-names: "Wanli"
  - family-names: "Li"
    given-names: "Jingzhe"
  - family-names: "Yuan"
    given-names: "Zhonghang"
  - family-names: "Wang"
    given-names: "Chenyang"
  - family-names: "Kong"
    given-names: "Huanjun"
  - family-names: "Dong"
    given-names: "Nanqing"
title: "GraphGen"
date-released: 2025-04-21
url: "https://github.com/open-sciencelab/GraphGen"
```
GitHub Events
Total
- Create event: 23
- Release event: 1
- Issues event: 27
- Watch event: 239
- Delete event: 15
- Issue comment event: 35
- Push event: 96
- Public event: 1
- Pull request review comment event: 6
- Pull request review event: 8
- Pull request event: 46
- Fork event: 25
Last Year
- Create event: 23
- Release event: 1
- Issues event: 27
- Watch event: 239
- Delete event: 15
- Issue comment event: 35
- Push event: 96
- Public event: 1
- Pull request review comment event: 6
- Pull request review event: 8
- Pull request event: 46
- Fork event: 25
Packages
- Total packages: 1
- Total downloads:
  - pypi: 35 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: graphg
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
- Homepage: https://github.com/open-sciencelab/GraphGen
- Documentation: https://graphg.readthedocs.io/
- License: apache-2.0
- Latest release: 20250416 (published 10 months ago)
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- gradio *
- graspologic *
- jieba *
- kaleido *
- networkx *
- nltk *
- numpy *
- openai *
- pandas *
- plotly *
- pyecharts *
- python-dotenv *
- pyyaml *
- tenacity *
- tiktoken *
- torch *
- tqdm *
- transformers *
- wikipedia *
- actions/checkout v2 composite
- actions/setup-python v2 composite
- pypa/gh-action-pypi-publish release/v1 composite
- python 3.10-slim build