graphg

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

https://github.com/open-sciencelab/graphgen

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary

Keywords

ai4science data-generation data-synthesis knowledge-graph llama-factory llm llm-training pretrain pretraining qa question-answering qwen sft sft-data xtuner
Last synced: 6 months ago

Repository

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Basic Info
Statistics
  • Stars: 324
  • Watchers: 6
  • Forks: 27
  • Open Issues: 5
  • Releases: 1
Topics
ai4science data-generation data-synthesis knowledge-graph llama-factory llm llm-training pretrain pretraining qa question-answering qwen sft sft-data xtuner
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md


GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

English | 中文

📚 Table of Contents

  • 📝 [What is GraphGen?](#-what-is-graphgen)
  • 📌 [Latest Updates](#-latest-updates)
  • 🚀 [Quick Start](#-quick-start)
  • 🏗️ [System Architecture](#-system-architecture)
  • 🍀 [Acknowledgements](#-acknowledgements)
  • 📚 [Citation](#-citation)
  • 📜 [License](#-license)
  • 📅 [Star History](#-star-history)

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. For details, see the paper and the best-practice guide.

The table below reports post-training results for a model in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline, compared with the Qwen2.5-7B-Instruct baseline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| :-: | :-: | :-: | :-: |
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| | AIME25 | 22.7 | 7.2 |

GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
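
As a rough illustration of the expected calibration error (ECE) metric mentioned above, the sketch below implements the standard binned definition: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. This is not GraphGen's own code. The function name, its parameters, and the choice of 10 equal-width bins are assumptions made for the example, and how GraphGen obtains confidences from the trainee model is not shown here.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Per-bin sample counts, confidence sums, and numbers of correct answers.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = hit / count
        mean_conf = conf_sum / count
        ece += (count / total) * abs(accuracy - mean_conf)
    return ece
```

A well-calibrated model has an ECE near 0. A large gap between confidence and accuracy marks statements the model handles poorly, which is the kind of signal a gap-targeting pipeline could use to prioritize QA generation.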

After data generation, you can use LLaMA-Factory or xtuner to fine-tune your LLMs.

📌 Latest Updates

  • 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025.04.21: We have released the initial version of GraphGen.

🚀 Quick Start

Try GraphGen through the Web entrance or the backup Web entrance.

If you have any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

  1. Install uv

    ```bash
    # If you run into network issues, you can install uv with pipx or pip instead; see the uv docs for details.
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

  2. Clone the repository

    git clone --depth=1 https://github.com/open-sciencelab/GraphGen
    cd GraphGen

  3. Create a new uv environment

    uv venv --python 3.10

  4. Configure the dependencies

    uv pip install -r requirements.txt

Run Gradio Demo

    python -m webui.app

(Figure: screenshot of the GraphGen web UI)

Run from PyPI

  1. Install GraphGen

    uv pip install graphg

  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
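
Since the CLI run depends on six environment variables, a small pre-flight check can catch a missing one before any API call is made. The sketch below is not part of GraphGen. The helper name `missing_vars` is hypothetical, and only the variable names are taken from the command above.

```python
import os
from typing import List, Mapping

# Variable names as listed in the CLI example; everything else is illustrative.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars()` with no arguments inspects the current process environment. Passing a plain dict makes the check easy to test.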

Run from Source

  1. Configure the environment
    • Create an .env file in the root directory:

      cp .env.example .env

    • Set the following environment variables:

      # Synthesizer is the model used to construct KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model used to train with the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  2. (Optional) Customize generation parameters in the graphgen/configs/ folder by editing the corresponding YAML file, e.g.:

```yaml
  # configs/cot_config.yaml
  input_data_type: raw
  input_file: resources/input_examples/raw_demo.jsonl
  output_data_type: cot
  tokenizer: cl100k_base
  # additional settings...
```
  3. Generate data

Pick the desired format and run the matching script:

| Format | Script to run | Notes |
| ------------ | ---------------------------------------------- | ----------------------------------------------------------------- |
| `cot` | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
| `atomic` | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
| `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
| `multi-hop` | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |

  4. Get the generated data

    ls cache/data/graphgen
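
If you drive generation from Python rather than the shell, the format-to-script table can be expressed as a lookup. This is only a sketch. The dictionary mirrors the script paths listed above, while the helper name `generate_command` is hypothetical and not part of GraphGen.

```python
from typing import Dict, List

# Script paths copied from the table above.
GENERATE_SCRIPTS: Dict[str, str] = {
    "cot": "scripts/generate/generate_cot.sh",
    "atomic": "scripts/generate/generate_atomic.sh",
    "aggregated": "scripts/generate/generate_aggregated.sh",
    "multi-hop": "scripts/generate/generate_multihop.sh",
}


def generate_command(data_format: str) -> List[str]:
    """Build the argv list for the script that matches `data_format`."""
    try:
        script = GENERATE_SCRIPTS[data_format]
    except KeyError:
        known = ", ".join(sorted(GENERATE_SCRIPTS))
        raise ValueError(
            f"unknown format {data_format!r}; expected one of: {known}"
        ) from None
    return ["bash", script]
```

From the repository root, the resulting list can be passed to `subprocess.run(generate_command("cot"), check=True)` to launch the corresponding script.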

Run with Docker

  1. Build the Docker image

    docker build -t graphgen .

  2. Run the Docker container

    docker run -p 7860:7860 graphgen

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

(Figure: GraphGen workflow diagram)

🍀 Acknowledgements

  • SiliconFlow: abundant LLM APIs, some models are free
  • LightRAG: a simple and efficient graph retrieval solution
  • ROGRAG: a robustly optimized GraphRAG framework
  • DB-GPT: an AI-native data app development framework

📚 Citation

If you find this repository useful, please consider citing our work:

    @misc{chen2025graphgenenhancingsupervisedfinetuning,
      title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
      author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
      year={2025},
      eprint={2505.20416},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20416},
    }

📜 License

This project is licensed under the Apache License 2.0.

📅 Star History

(Figure: star history chart)

Owner

  • Name: OpenScienceLab
  • Login: open-sciencelab
  • Kind: organization
  • Email: OpenScienceLab@pjlab.org.cn

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chen"
  given-names: "Zihong"
- family-names: "Jiang"
  given-names: "Wanli"
- family-names: "Li"
  given-names: "Jingzhe"
- family-names: "Yuan"
  given-names: "Zhonghang"
- family-names: "Wang"
  given-names: "Chenyang"
- family-names: "Kong"
  given-names: "Huanjun"
- family-names: "Dong"
  given-names: "Nanqing"
title: "GraphGen"
date-released: 2025-04-21
url: "https://github.com/open-sciencelab/GraphGen"

GitHub Events

Total
  • Create event: 23
  • Release event: 1
  • Issues event: 27
  • Watch event: 239
  • Delete event: 15
  • Issue comment event: 35
  • Push event: 96
  • Public event: 1
  • Pull request review comment event: 6
  • Pull request review event: 8
  • Pull request event: 46
  • Fork event: 25
Last Year
  • Create event: 23
  • Release event: 1
  • Issues event: 27
  • Watch event: 239
  • Delete event: 15
  • Issue comment event: 35
  • Push event: 96
  • Public event: 1
  • Pull request review comment event: 6
  • Pull request review event: 8
  • Pull request event: 46
  • Fork event: 25

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 35 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: graphg

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 35 Last month
Rankings
Dependent packages count: 9.3%
Stargazers count: 20.6%
Forks count: 24.3%
Average: 26.6%
Dependent repos count: 52.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/pylint.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
requirements.txt pypi
  • gradio *
  • graspologic *
  • jieba *
  • kaleido *
  • networkx *
  • nltk *
  • numpy *
  • openai *
  • pandas *
  • plotly *
  • pyecharts *
  • python-dotenv *
  • pyyaml *
  • tenacity *
  • tiktoken *
  • torch *
  • tqdm *
  • transformers *
  • wikipedia *
.github/workflows/workflow.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • pypa/gh-action-pypi-publish release/v1 composite
setup.py pypi
Dockerfile docker
  • python 3.10-slim build