graphg

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

https://github.com/open-sciencelab/graphgen

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file (found CITATION.cff file)
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.2%, to scientific vocabulary)

Keywords

ai4science data-generation data-synthesis knowledge-graph llama-factory llm llm-training pretrain pretraining qa question-answering qwen sft sft-data xtuner

Last synced: 6 months ago

Repository

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Basic Info

Host: GitHub
Owner: open-sciencelab
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://g-app-center-000704-6802-aerppvq.openxlab.space/
Size: 13.9 MB

Statistics

Stars: 324
Watchers: 6
Forks: 27
Open Issues: 5
Releases: 1

Topics

ai4science data-generation data-synthesis knowledge-graph llama-factory llm llm-training pretrain pretraining qa question-answering qwen sft sft-data xtuner

Created about 1 year ago · Last pushed 6 months ago

Metadata Files

Readme License Citation

README.md

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation


📚 Table of Contents

- 📝 [What is GraphGen?](#-what-is-graphgen)
- 📌 [Latest Updates](#-latest-updates)
- 🚀 [Quick Start](#-quick-start)
- 🏗️ [System Architecture](#-system-architecture)
- 🍀 [Acknowledgements](#-acknowledgements)
- 📚 [Citation](#-citation)
- 📜 [License](#-license)
- 📅 [Star History](#-star-history)

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. See the paper and the best-practice guide for details.

The table below shows post-training results for a model in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| :-: | :-: | :-: | :-: |
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| | AIME25 | 22.7 | 7.2 |

GraphGen first constructs a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. It also uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.
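To make the knowledge-gap step more concrete, here is a minimal sketch of the standard binned expected calibration error. It is an illustration only, not GraphGen's implementation: the function name `expected_calibration_error`, the 10-bin default, and the input shape (one confidence and one correctness flag per statement) are assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    # Group sample indices into equal-width confidence bins over [0, 1].
    bins: list[list[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # min() keeps a confidence of exactly 1.0 in the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical trainee confidences for four statements taken from the graph.
confs = [0.95, 0.90, 0.55, 0.30]
labels = [True, False, True, False]
print(round(expected_calibration_error(confs, labels), 3))
```

A large gap between confidence and correctness points at knowledge the trainee model is poorly calibrated on. How GraphGen turns such a signal into sampling priorities is described in the paper, not in this sketch.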

After data generation, you can use LLaMA-Factory or xtuner to fine-tune your LLMs.
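As a rough illustration of preparing data for fine-tuning, the sketch below converts question-answer pairs into Alpaca-style records (`instruction`, `input`, `output`), one of the dataset formats LLaMA-Factory accepts. The input schema with `question` and `answer` keys and the function name `to_alpaca` are assumptions made for this example; check the files GraphGen actually writes before relying on it.

```python
import json
from typing import Iterable


def to_alpaca(qa_pairs: Iterable[dict]) -> list[dict]:
    """Map {'question', 'answer'} dicts to Alpaca-style SFT records."""
    records = []
    for pair in qa_pairs:
        question = pair.get("question", "").strip()
        answer = pair.get("answer", "").strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        records.append({"instruction": question, "input": "", "output": answer})
    return records


# Hypothetical sample; real GraphGen output may use different keys.
sample = [
    {
        "question": "What does GraphGen build from the source text?",
        "answer": "A fine-grained knowledge graph.",
    },
    {"question": "", "answer": "dropped because the question is empty"},
]
print(json.dumps(to_alpaca(sample), ensure_ascii=False, indent=2))
```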

📌 Latest Updates

2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
2025.04.21: We have released the initial version of GraphGen.

🚀 Quick Start

Try GraphGen through the Web entrance or the backup Web entrance.

If you have questions, check the FAQ, open a new issue, or join our WeChat group and ask there.

Preparation

Install uv

```bash
# If you run into network issues, try installing uv with pipx or pip instead;
# refer to the uv documentation for details.
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository

```bash
git clone --depth=1 https://github.com/open-sciencelab/GraphGen
cd GraphGen
```

Create a new uv environment

```bash
uv venv --python 3.10
```

Configure the dependencies

```bash
uv pip install -r requirements.txt
```

Run Gradio Demo

```bash
# -m expects a module name, so the .py suffix is omitted
python -m webui.app
```

Run from PyPI

Install GraphGen

```bash
uv pip install graphg
```

Run in CLI

```bash
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```
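The CLI invocation and the `.env` file used when running from source rely on the same six variables. A small pre-flight check can catch a missing one before any API call is made. The sketch below is not part of GraphGen; the helper name `missing_vars` is hypothetical, and only the variable names come from this README.

```python
import os

# Variable names as listed in this README.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"SYNTHESIZER_MODEL": "your_synthesizer_model_name"}
print(missing_vars(example_env))
```

Calling `missing_vars()` with no argument checks the current process environment instead of the example mapping.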

Run from Source

Configure the environment

- Create an .env file in the root directory

  ```bash
  cp .env.example .env
  ```

- Set the following environment variables:

  ```bash
  # Synthesizer is the model used to construct KG and generate data
  SYNTHESIZER_MODEL=your_synthesizer_model_name
  SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
  SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
  # Trainee is the model used to train with the generated data
  TRAINEE_MODEL=your_trainee_model_name
  TRAINEE_BASE_URL=your_base_url_for_trainee_model
  TRAINEE_API_KEY=your_api_key_for_trainee_model
  ```

(Optional) Customize generation parameters in the graphgen/configs/ folder.

Edit the corresponding YAML file, e.g.:

```yaml
  # configs/cot_config.yaml
  input_data_type: raw
  input_file: resources/input_examples/raw_demo.jsonl
  output_data_type: cot
  tokenizer: cl100k_base
  # additional settings...
```

Generate data

Pick the desired format and run the matching script:

| Format | Script to run | Notes |
| --- | --- | --- |
| cot | bash scripts/generate/generate_cot.sh | Chain-of-Thought Q&A pairs |
| atomic | bash scripts/generate/generate_atomic.sh | Atomic Q&A pairs covering basic knowledge |
| aggregated | bash scripts/generate/generate_aggregated.sh | Aggregated Q&A pairs incorporating complex, integrated knowledge |
| multi-hop | bash scripts/generate/generate_multihop.sh | Multi-hop reasoning Q&A pairs |

Get the generated data

```bash
ls cache/data/graphgen
```
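Beyond listing the directory, a quick record count per file can confirm that a run produced data. The sketch below assumes the output consists of `.json` and `.jsonl` files under `cache/data/graphgen`; that layout and the helper name `count_records` are assumptions, so adjust them to what your run actually writes.

```python
import json
from pathlib import Path


def count_records(output_dir: str) -> dict[str, int]:
    """Count records in every .json / .jsonl file below output_dir."""
    root = Path(output_dir)
    counts: dict[str, int] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        name = path.relative_to(root).as_posix()
        if path.suffix == ".jsonl":
            # One JSON object per non-empty line.
            with path.open(encoding="utf-8") as fh:
                counts[name] = sum(1 for line in fh if line.strip())
        elif path.suffix == ".json":
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
            # A top-level list or dict is counted by its number of entries.
            counts[name] = len(data) if isinstance(data, (list, dict)) else 1
    return counts


# Only runs if the default output directory from this README exists.
if Path("cache/data/graphgen").is_dir():
    for name, n in count_records("cache/data/graphgen").items():
        print(f"{name}: {n} records")
```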

Run with Docker

Build the Docker image

```bash
docker build -t graphgen .
```

Run the Docker container

```bash
docker run -p 7860:7860 graphgen
```

🏗️ System Architecture

See analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

(Figure: GraphGen workflow diagram)
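One building block named in this README, multi-hop neighborhood sampling, can be sketched as a breadth-first search limited to a fixed number of hops. This is a generic illustration over a toy adjacency mapping, not GraphGen's sampler: the function name `k_hop_neighborhood`, the example entities, and the plain-dict graph representation are all assumptions.

```python
from collections import deque


def k_hop_neighborhood(adjacency, seed, max_hops):
    """Return {node: hop distance} for nodes within max_hops of seed (BFS)."""
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


# Toy undirected knowledge graph as an adjacency mapping (hypothetical entities).
graph = {
    "rice": ["Oryza sativa", "drought"],
    "Oryza sativa": ["rice", "Poaceae"],
    "drought": ["rice", "abscisic acid"],
    "Poaceae": ["Oryza sativa"],
    "abscisic acid": ["drought"],
}
print(k_hop_neighborhood(graph, "rice", max_hops=1))
```

Raising `max_hops` widens the subgraph handed to the generator, which is how multi-hop context would reach a single QA pair under these assumptions.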

🍀 Acknowledgements

- SiliconFlow: abundant LLM APIs, some models are free
- LightRAG: simple and efficient graph retrieval solution
- ROGRAG: a robustly optimized GraphRAG framework
- DB-GPT: an AI-native data app development framework

📚 Citation

If you find this repository useful, please consider citing our work:

```bibtex
@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}
```

📜 License

This project is licensed under the Apache License 2.0.

📅 Star History

Owner

Name: OpenScienceLab
Login: open-sciencelab
Kind: organization
Email: OpenScienceLab@pjlab.org.cn

Repositories: 1
Profile: https://github.com/open-sciencelab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chen"
  given-names: "Zihong"
- family-names: "Jiang"
  given-names: "Wanli"
- family-names: "Li"
  given-names: "Jingzhe"
- family-names: "Yuan"
  given-names: "Zhonghang"
- family-names: "Wang"
  given-names: "Chenyang"
- family-names: "Kong"
  given-names: "Huanjun"
- family-names: "Dong"
  given-names: "Nanqing"
title: "GraphGen"
date-released: 2025-04-21
url: "https://github.com/open-sciencelab/GraphGen"

GitHub Events

Total

Create event: 23
Release event: 1
Issues event: 27
Watch event: 239
Delete event: 15
Issue comment event: 35
Push event: 96
Public event: 1
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 46
Fork event: 25

Last Year

Create event: 23
Release event: 1
Issues event: 27
Watch event: 239
Delete event: 15
Issue comment event: 35
Push event: 96
Public event: 1
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 46
Fork event: 25

Packages

Total packages: 1
Total downloads:
- pypi 35 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 1
Total maintainers: 1

pypi.org: graphg

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Homepage: https://github.com/open-sciencelab/GraphGen
Documentation: https://graphg.readthedocs.io/
License: apache-2.0
Latest release: 20250416
published 10 months ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 35 Last month

Rankings

Dependent packages count: 9.3%

Stargazers count: 20.6%

Forks count: 24.3%

Average: 26.6%

Dependent repos count: 52.1%

Maintainers (1)

tpoisonooo

Last synced: 6 months ago

Dependencies

.github/workflows/pylint.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

requirements.txt pypi

gradio *
graspologic *
jieba *
kaleido *
networkx *
nltk *
numpy *
openai *
pandas *
plotly *
pyecharts *
python-dotenv *
pyyaml *
tenacity *
tiktoken *
torch *
tqdm *
transformers *
wikipedia *

.github/workflows/workflow.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
pypa/gh-action-pypi-publish release/v1 composite

setup.py pypi

Dockerfile docker

python 3.10-slim build
