https://github.com/cgcl-codes/graphinstruct

The benchmark proposed in paper: GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization cgcl-codes has institutional domain (grid.hust.edu.cn)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

The benchmark proposed in paper: GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability

Basic Info
  • Host: GitHub
  • Owner: CGCL-codes
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 45.5 MB
Statistics
  • Stars: 20
  • Watchers: 3
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created almost 2 years ago · Last pushed 7 months ago
Metadata Files
Readme

README.md

GraphInstruct

This is the benchmark proposed in our paper: GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability

Dataset Generation and Evaluation

As a dynamic dataset, GraphInstruct can be generated from scratch and used for evaluation with the following steps:

Environment Install

The required packages can be installed with pip:

cd GTG
pip install -e .

[!IMPORTANT] Installation is mandatory.

Dataset Generation

We provide an example script that generates data for all tasks: GTG/script/run_all_generation.sh. You only need to set project_root in the script to your own path, then run:

bash run_all_generation.sh

Then you'll find the generated dataset in GTG/data/dataset.
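Editing project_root can also be scripted. The sketch below demonstrates the substitution on a stand-in file; the variable name project_root comes from the script above, while the file and the replacement path are placeholders:

```shell
# Demonstrate the project_root substitution on a stand-in copy of the script.
demo=$(mktemp)
printf 'project_root=/old/path\n' > "$demo"

# Point project_root at your own checkout (the path here is a placeholder).
sed -i 's|^project_root=.*|project_root=/path/to/GraphInstruct|' "$demo"
cat "$demo"   # -> project_root=/path/to/GraphInstruct
```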

Evaluation

We provide scripts for evaluation (see GTG/script/evaluation and GTG/script/run_all_evaluation.py). The input data file (i.e., the LLM's output) should be a CSV with two columns: id (sample ID) and output (the LLM's output text). For example:

id,output
12,"node 5"
9,"node 33"
33,"node 10"
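A predictions file in this format can be sanity-checked before evaluation with a few lines of standard-library Python. This is a minimal sketch, not part of the repo's scripts; only the column names id and output come from the format above:

```python
import csv
import io

# Example predictions in the two-column format expected by the evaluation scripts.
raw = """id,output
12,"node 5"
9,"node 33"
33,"node 10"
"""

# Parse into an {id: output} mapping, checking that exactly the expected
# columns are present in every row.
predictions = {}
for row in csv.DictReader(io.StringIO(raw)):
    assert set(row) == {"id", "output"}, f"unexpected columns: {set(row)}"
    predictions[row["id"]] = row["output"]

print(len(predictions))    # 3
print(predictions["12"])   # node 5
```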

Model Training

Our implementation for training GraphSolver and GraphSolver+ is mainly based on LLaMAFactory.

Dataset Preparation

  • Due to space limitations, we only provide our training json files for GraphSolver+ in LLaMAFactory/data/reasoning.

  • To obtain the full dataset files, refer to the Dataset Generation step in GTG.

Supervised Fine-tuning

One can start the model training step with the following command:

cd LLaMAFactory
bash run.sh

Note that you must adjust the experiment settings in examples/train_reasoning/llama3_lora_sft.yaml and examples/merge_reasoning/llama3_lora_sft.yaml to match your environment.
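For orientation, a LoRA SFT config in LLaMA-Factory's style typically contains fields like the fragment below. This is illustrative only: the base model, dataset name, and paths are placeholders, and the actual keys in the repo's yaml files may differ.

```yaml
# Illustrative LoRA SFT settings -- adjust to match the repo's actual yaml files.
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder base model
stage: sft
do_train: true
finetuning_type: lora
dataset: reasoning          # placeholder dataset name
template: llama3
output_dir: saves/llama3-8b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```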

[!TIP] For more details about the experimental configuration and environment setting, please refer to the readme.md in LLaMAFactory.

Citation

If you find this work helpful, please cite it as:

@article{graphinstruct,
  title={GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability},
  author={Zihan Luo and Xiran Song and Hong Huang and Jianxun Lian and Chenhao Zhang and Jinqi Jiang and Xing Xie},
  journal={CoRR},
  volume={abs/2403.04483},
  year={2024},
  url={https://doi.org/10.48550/arXiv.2403.04483},
  doi={10.48550/ARXIV.2403.04483},
  eprinttype={arXiv},
  eprint={2403.04483}
}

Acknowledgement

This repo benefits from LLaMAFactory. Thanks for their wonderful work.

Owner

  • Name: CGCL-codes
  • Login: CGCL-codes
  • Kind: organization

CGCL/SCTS/BDTS Lab

GitHub Events

Total
  • Watch event: 6
  • Member event: 1
  • Push event: 5
Last Year
  • Watch event: 6
  • Member event: 1
  • Push event: 5