https://github.com/awslabs/graphstorm

Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists.


Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

graph graphneuralnetwork machine-learning pytorch
Last synced: 6 months ago

Repository

Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists.

Basic Info
  • Host: GitHub
  • Owner: awslabs
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 9.45 MB
Statistics
  • Stars: 430
  • Watchers: 12
  • Forks: 68
  • Open Issues: 58
  • Releases: 12
Topics
graph graphneuralnetwork machine-learning pytorch
Created about 3 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Notice

README.md

GraphStorm: Enterprise graph machine learning framework for billion-scale graphs


| Documentation and Tutorial Site | GraphStorm Paper |

GraphStorm is an enterprise-grade graph machine learning (GML) framework designed for scalability and ease of use. It simplifies the development and deployment of GML models on industry-scale graphs with billions of nodes and edges.

GraphStorm provides a collection of built-in GML models, and users can train a GML model with a single command without writing any code. To help develop SOTA models, GraphStorm provides a large collection of configurations for customizing model implementations and training pipelines to improve model performance. GraphStorm also provides a programming interface for training any custom GML model in a distributed manner: users provide their own model implementations and use the GraphStorm training pipeline to scale.

Key Features

  • Single-command GML model training and inference
  • Distributed training/inference on industry-scale graphs (billions of nodes/edges)
  • Built-in model collection
  • AWS integration out-of-the-box

GraphStorm architecture

Get Started

Installation

GraphStorm is compatible with Python 3.8+. It requires PyTorch 1.13+, DGL 1.0+ and transformers 4.3.0+. For a full quick-start example see the GraphStorm documentation.

You can install and use GraphStorm locally using pip:

```bash
# If running on CPU use
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cpu
pip install dgl==2.3.0 -f https://data.dgl.ai/wheels/torch-2.3/repo.html

# Or, to run on GPU use
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
pip install dgl==2.3.0+cu121 -f https://data.dgl.ai/wheels/torch-2.3/cu121/repo.html

pip install graphstorm
```

Distributed training

To run GraphStorm in a distributed environment, we recommend using Amazon SageMaker AI to avoid having to manage cluster infrastructure. See our SageMaker AI setup documentation to get started with distributed GNN training.

Quick start

After installing GraphStorm and its requirements in your local environment as shown above, you can clone the GraphStorm repository to follow along with the quick-start examples:

```bash
git clone https://github.com/awslabs/graphstorm.git

# Switch to the graphstorm repository root
cd graphstorm
```

Node Classification on OGB arxiv graph

This example demonstrates how to train a model to classify research papers in the OGB arxiv citation network. Each node represents a paper with a 128-dimensional feature vector, and the task is to predict the paper's subject area.
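To build intuition for what the model learns here, the following is a toy sketch (not GraphStorm or DGL code) of graph-based node classification: average each paper's features with its neighbors' features (one GCN-style hop), then assign the label of the nearest class centroid. The feature values, class names, and centroids below are invented for illustration.

```python
# Toy sketch of neighborhood-aggregation node classification.
# Not GraphStorm code; all data below is made up for illustration.

def aggregate(features, adj):
    """One GCN-style hop: mean of each node's own and neighbor features."""
    out = []
    for node, neighbors in enumerate(adj):
        group = [features[node]] + [features[n] for n in neighbors]
        dim = len(features[node])
        out.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return out

def classify(vec, centroids):
    """Assign the label of the closest class centroid (squared distance)."""
    dists = {label: sum((a - b) ** 2 for a, b in zip(vec, c))
             for label, c in centroids.items()}
    return min(dists, key=dists.get)

# 4 nodes with 2-dim features (ogbn-arxiv uses 128-dim); edges 0-1 and 2-3.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adj = [[1], [0], [3], [2]]
centroids = {"cs": [1.0, 0.0], "bio": [0.0, 1.0]}
labels = [classify(h, centroids) for h in aggregate(features, adj)]
print(labels)  # ['cs', 'cs', 'bio', 'bio']
```

Real GNN layers learn weight matrices instead of a plain mean, but the neighborhood-smoothing step is the same idea.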

First, download the OGB arxiv data and process it into a DGL graph for the node classification task.

```bash
python tools/partition_graph.py \
    --dataset ogbn-arxiv \
    --filepath /tmp/ogbn-arxiv-nc/ \
    --num-parts 1 \
    --output /tmp/ogbn_arxiv_nc_1p
```
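To make the partitioning step concrete, here is a toy sketch of what a partitioner produces (this is not DGL's METIS-based algorithm; the round-robin assignment is a stand-in): nodes are split into `num_parts` disjoint buckets and each edge is stored with the partition that owns its source node, so each trainer works on its own shard.

```python
# Toy sketch of graph partitioning output. Real partitioners (e.g. METIS)
# minimize cross-partition edges; round-robin here is only for illustration.

def partition(num_nodes, edges, num_parts):
    node_part = {n: n % num_parts for n in range(num_nodes)}  # round-robin
    edge_parts = [[] for _ in range(num_parts)]
    for src, dst in edges:
        edge_parts[node_part[src]].append((src, dst))  # edge lives with src
    return node_part, edge_parts

node_part, edge_parts = partition(6, [(0, 1), (1, 2), (3, 4), (5, 0)], 2)
print(node_part)   # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
print(edge_parts)  # [[(0, 1)], [(1, 2), (3, 4), (5, 0)]]
```

With `--num-parts 1`, as in the quick start, the whole graph lands in a single partition described by the generated JSON config.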

Second, train an RGCN model to perform node classification on the partitioned arxiv graph.

```bash
# create the workspace folder
mkdir /tmp/ogbn-arxiv-nc

python -m graphstorm.run.gs_node_classification \
    --workspace /tmp/ogbn-arxiv-nc \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config /tmp/ogbn_arxiv_nc_1p/ogbn-arxiv.json \
    --cf "$(pwd)/training_scripts/gsgnn_np/arxiv_nc.yaml" \
    --save-model-path /tmp/ogbn-arxiv-nc/models
```
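For readers new to RGCN, the core idea is relation-specific message passing: each edge type gets its own transformation, and a node's update sums the per-relation messages from its neighbors. The sketch below illustrates this with scalar features and scalar per-relation weights (real RGCN uses learned weight matrices; this is not GraphStorm's implementation):

```python
# Minimal illustration of the RGCN idea: per-relation weighted messages,
# summed into the destination node. Scalars stand in for weight matrices.

def rgcn_layer(h, edges_by_rel, rel_weight):
    """h: node -> feature (scalar here). edges_by_rel: rel -> [(src, dst)]."""
    out = {n: 0.0 for n in h}
    for rel, edges in edges_by_rel.items():
        w = rel_weight[rel]  # one weight per relation type
        for src, dst in edges:
            out[dst] += w * h[src]  # message from src along this relation
    return out

h = {"p1": 1.0, "p2": 2.0, "a1": 3.0}
edges = {"cites": [("p1", "p2")], "writes": [("a1", "p1")]}
out = rgcn_layer(h, edges, {"cites": 0.5, "writes": 2.0})
print(out)  # {'p1': 6.0, 'p2': 0.5, 'a1': 0.0}
```

Because weights are tied to relations rather than shared globally, RGCN handles heterogeneous graphs like citation networks with multiple node and edge types.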

Third, run inference using the trained model.

```bash
python -m graphstorm.run.gs_node_classification \
    --inference \
    --workspace /tmp/ogbn-arxiv-nc \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config /tmp/ogbn_arxiv_nc_1p/ogbn-arxiv.json \
    --cf "$(pwd)/training_scripts/gsgnn_np/arxiv_nc.yaml" \
    --save-prediction-path /tmp/ogbn-arxiv-nc/predictions/ \
    --restore-model-path /tmp/ogbn-arxiv-nc/models/epoch-7/
```

Link Prediction on OGB arxiv graph

First, download the OGB arxiv data and process it into a DGL graph for a link prediction task. The edge type we are trying to predict is `(author, writes, paper)`.

```bash
python ./tools/partition_graph_lp.py \
    --dataset ogbn-arxiv \
    --filepath /tmp/ogbn-arxiv-lp/ \
    --num-parts 1 \
    --output /tmp/ogbn_arxiv_lp_1p/
```

Second, train an RGCN model to perform link prediction on the partitioned graph.

```bash
mkdir /tmp/ogbn-arxiv-lp

python -m graphstorm.run.gs_link_prediction \
    --workspace /tmp/ogbn-arxiv-lp \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config /tmp/ogbn_arxiv_lp_1p/ogbn-arxiv.json \
    --cf "$(pwd)/training_scripts/gsgnn_lp/arxiv_lp.yaml" \
    --save-model-path /tmp/ogbn-arxiv-lp/models \
    --num-epochs 2
```
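Conceptually, link prediction trains embeddings so that observed edges score higher than randomly corrupted ("negative") ones. A minimal sketch of that objective, using dot-product scoring (illustrative only; GraphStorm supports several score functions, and the embeddings below are invented):

```python
# Sketch of the link-prediction objective: a real (author, writes, paper)
# edge should outscore a negative sample. Embeddings are made up.

def score(u, v):
    """Dot-product score between two node embeddings."""
    return sum(a * b for a, b in zip(u, v))

emb = {
    "author0": [1.0, 0.0],
    "paper0":  [0.9, 0.1],  # actually written by author0 (positive edge)
    "paper1":  [0.0, 1.0],  # unrelated paper (negative sample)
}
pos = score(emb["author0"], emb["paper0"])
neg = score(emb["author0"], emb["paper1"])
print(pos > neg)  # True
```

Training pushes positive scores up and negative scores down, which is what makes the learned embeddings useful for similarity queries afterward.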

Third, run inference to generate node embeddings that you can use to run node similarity queries.

```bash
python -m graphstorm.run.gs_gen_node_embedding \
    --workspace /tmp/ogbn-arxiv-lp \
    --num-trainers 1 \
    --num-servers 1 \
    --part-config /tmp/ogbn_arxiv_lp_1p/ogbn-arxiv.json \
    --cf "$(pwd)/training_scripts/gsgnn_lp/arxiv_lp.yaml" \
    --save-embed-path /tmp/ogbn-arxiv-lp/embeddings/ \
    --restore-model-path /tmp/ogbn-arxiv-lp/models/epoch-1/
```
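Once embeddings are saved, a similarity query is just a nearest-neighbor search over them. A self-contained sketch (the embedding values here are invented; the files GraphStorm writes under the embed path are saved tensors, not this dict):

```python
# Cosine-similarity top-k query over node embeddings. Illustrative only;
# at scale you would use an ANN index instead of a linear scan.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, emb, k=2):
    """Return the k nodes most similar to `query`, best first."""
    others = [(cosine(emb[query], v), n) for n, v in emb.items() if n != query]
    return [n for _, n in sorted(others, reverse=True)[:k]]

emb = {"p0": [1.0, 0.0], "p1": [0.8, 0.2], "p2": [0.0, 1.0]}
print(top_k("p0", emb, k=2))  # ['p1', 'p2']
```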

For more detailed tutorials and documentation, visit our Documentation site.

Citation

If you use GraphStorm in a scientific publication, we would appreciate citations to the following paper:

```bibtex
@inproceedings{10.1145/3637528.3671603,
  author    = {Zheng, Da and Song, Xiang and Zhu, Qi and Zhang, Jian and Vasiloudis, Theodore and Ma, Runjie and Zhang, Houyu and Wang, Zichen and Adeshina, Soji and Nisa, Israt and Mottini, Alejandro and Cui, Qingjun and Rangwala, Huzefa and Zeng, Belinda and Faloutsos, Christos and Karypis, George},
  title     = {GraphStorm: All-in-one Graph Machine Learning Framework for Industry Applications},
  year      = {2024},
  url       = {https://doi.org/10.1145/3637528.3671603},
  doi       = {10.1145/3637528.3671603},
  booktitle = {Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages     = {6356--6367},
  location  = {Barcelona, Spain},
  series    = {KDD '24}
}
```

Blog posts

The GraphStorm team has published multiple blog posts with use-case examples and highlights of new GraphStorm features that can help new users apply GraphStorm in their production use-cases.

Limitations

  • Supports CPU or NVIDIA GPUs for training and inference
  • Multiple samplers only supported in PyTorch versions >= 2.1.0

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 184
  • Total pull requests: 557
  • Average time to close issues: 3 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 21
  • Total pull request authors: 19
  • Average comments per issue: 0.47
  • Average comments per pull request: 0.26
  • Merged pull requests: 391
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 64
  • Pull requests: 240
  • Average time to close issues: 26 days
  • Average time to close pull requests: 4 days
  • Issue authors: 10
  • Pull request authors: 11
  • Average comments per issue: 0.33
  • Average comments per pull request: 0.24
  • Merged pull requests: 169
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • thvasilo (45)
  • classicsong (38)
  • jalencato (33)
  • zheng-da (20)
  • zhjwy9343 (19)
  • wangz10 (4)
  • GentleZhu (3)
  • Oxfordblue7 (3)
  • YukeWang96 (3)
  • enrique-formation (2)
  • isratnisa (2)
  • prateekdesai04 (2)
  • milianru (2)
  • dombrowsky (1)
  • Diison (1)
Pull Request Authors
  • classicsong (134)
  • jalencato (117)
  • thvasilo (108)
  • zhjwy9343 (97)
  • zheng-da (23)
  • RonaldBXu (17)
  • prateekdesai04 (16)
  • Oxfordblue7 (14)
  • GentleZhu (7)
  • chang-l (7)
  • isratnisa (5)
  • dependabot[bot] (3)
  • wangz10 (2)
  • YukeWang96 (2)
  • znyzhouwl (1)
Top Labels
Issue Labels
bug (21) 0.4 (19) v0.1 (15) enhancement (14) 0.3 (11) 0.3.1 (10) gsprocessing (9) 0.4.1 (7) good first issue (7) 0.4.2 (7) sagemaker (6) v0.1.1 (5) ready (5) documentation (5) break back compatibility (4) v0.2 (3) 0.2.2 (3) dependencies (1) 0.2.1 (1) duplicate (1) 0.5 (1) help wanted (1) python (1)
Pull Request Labels
ready (308) 0.4 (85) documentation (76) 0.3 (62) 0.3.1 (46) gsprocessing (46) bug (44) 0.4.1 (41) enhancement (41) 0.5 (30) 0.2.2 (24) sagemaker (17) break back compatibility (16) 0.4.2 (12) dependencies (8) draft (7) gconstruct (4) API update (4) python (2) dist-partition (1) CI (1) good first issue (1)

Dependencies

.github/workflows/continuous-integration.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • aws-actions/configure-aws-credentials v1 composite
.github/workflows/semgrep.yml actions
  • actions/checkout v3 composite
requirements.txt pypi
  • dgl >=1.0
  • ogb >=1.3.6
  • torch >=1.13.0
  • transformers >=4.3.0
.github/workflows/gsprocessing-workflow.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • aws-actions/configure-aws-credentials v1 composite
docs/requirements.txt pypi
  • dgl ==1.0.4
  • sphinx ==7.1.2
  • sphinx-rtd-theme ==1.3.0
  • torch ==1.13.1
graphstorm-processing/pyproject.toml pypi
  • boto3 ~1.28.1
  • joblib ^1.3.1
  • pandas ^1.3.5
  • psutil ^5.9.5
  • pyarrow ~13.0.0
  • pyspark ~3.3.0
  • python ~3.9.12
  • sagemaker ^2.83.0
  • spacy 3.6.0
setup.py pypi