igb-datasets

Largest real-world open-source graph dataset. Work done under the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research.

https://github.com/illinoisgraphbenchmark/igb-datasets

Keywords

dataset gnn

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: IllinoisGraphBenchmark
License: other
Language: Python
Default Branch: main
Homepage: https://arxiv.org/abs/2302.13522
Size: 55.1 MB

Statistics

Stars: 82
Watchers: 3
Forks: 14
Open Issues: 0
Releases: 0

Topics

dataset gnn

Created almost 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

IGB-Datasets

Official IGB Leaderboard is now online!! 🎉

Head over to the leaderboard and make your submission.

Installation Guide

```bash
# Clone the repo
git clone https://github.com/IllinoisGraphBenchmark/IGB-Datasets.git

# Go to the folder root
cd IGB-Datasets

# Install the igb package
pip install .
```

Now, in order to get the dataloader, you can: `from igb import dataloader`

Get access to dataset

After installing the igb package, download igb(h)-tiny, igb(h)-small, or igb(h)-medium by following this code example.

```python
>>> from igb import download
>>> download.download_dataset(path='/root/igb_datasets', dataset_type='homogeneous', dataset_size='tiny')
Downloaded 0.36 GB: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 366/366 [00:03<00:00, 98.94it/s]
Downloaded igb_homogeneous_tiny -> md5sum verified.
Final dataset size 0.39 GB.
```

The script downloads the zipped files from AWS, runs an md5sum check, and extracts the folder to your specified path. Change `dataset_type` to `"heterogeneous"` and `dataset_size` to either `"small"` or `"medium"` to get the other datasets.
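The md5sum check mentioned above can also be repeated by hand, for example after copying an archive between machines. The following is a minimal sketch using only the Python standard library. `md5_of_file` and `verify_md5` are hypothetical helper names, not part of the `igb` package, and the expected digest has to come from a source you trust.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for archives that are many gigabytes in size.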

In the current version, to download igb(h)-large and igb260m/igbh600m, please use the bash download scripts provided. Please note that these two large datasets require more than 500 GB of disk space.
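Before starting one of the large downloads, it may be worth checking the free space on the target filesystem. This is a minimal sketch with the standard library. The 500 GB threshold is taken from the note above and read as decimal gigabytes, and `has_enough_space` is a hypothetical helper name.

```python
import shutil

# Threshold from the note above, interpreted as decimal gigabytes.
REQUIRED_BYTES = 500 * 10**9


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    target = "."  # replace with your download directory
    print(f"Enough space for the large datasets: {has_enough_space(target)}")
```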

Important update:

Note: We have updated the paper embedding file of the full dataset. If you downloaded the dataset prior to 7th November 2023, you will need to update it to get the embeddings for the last ~5M paper nodes. So that users don't have to re-download the 1TB paper embedding file, please follow these steps to update the embeddings in place.

First download the embeddings using

```bash
wget --recursive --no-parent https://igb-public.s3.us-east-2.amazonaws.com/IGBH/processed/paper/node_feat_5M_tail.npy
```

Then run this Python script:

```python
import numpy as np
from tqdm import tqdm

# Open the paper embedding file in r+ mode (read/write)
num_paper_nodes = 269346174
paper_node_features = np.memmap('/mnt/raid0/full/processed/paper/node_feat.npy', dtype='float32', mode='r+', shape=(num_paper_nodes, 1024))

# Open the extra embedding file in read mode
num_tail = 4957567
node_feat_5M_tail = np.memmap('/mnt/raid0/full/processed/paper/node_feat_5M_tail.npy', dtype='float32', mode='r', shape=(num_tail, 1024))

# Here we do it sequentially to log the progress.
# You can also do it in a single assignment: paper_node_features[offset:] = node_feat_5M_tail
offset = num_paper_nodes - num_tail
for i in tqdm(range(num_tail)):
    paper_node_features[i + offset] = node_feat_5M_tail[i]

# Flush to save to disk
paper_node_features.flush()
```
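The offset used in the script above follows directly from the two row counts. As a sanity check, the sketch below works out the first updated row and the raw payload sizes, assuming float32 values (4 bytes each) and 1024-dimensional embeddings as in the script. The sizes ignore any file header, so the files on disk may be slightly larger.

```python
NUM_PAPER_NODES = 269_346_174  # row count of the full file, from the script
NUM_TAIL = 4_957_567           # row count of the tail file, from the script
EMBEDDING_DIM = 1024
BYTES_PER_FLOAT32 = 4

row_bytes = EMBEDDING_DIM * BYTES_PER_FLOAT32  # bytes per embedding row
offset_rows = NUM_PAPER_NODES - NUM_TAIL       # first row that gets overwritten
full_bytes = NUM_PAPER_NODES * row_bytes       # raw payload of the full file
tail_bytes = NUM_TAIL * row_bytes              # raw payload of the tail file

print(f"first updated row: {offset_rows}")
print(f"full file payload: {full_bytes / 1e12:.2f} TB")
print(f"tail file payload: {tail_bytes / 1e9:.2f} GB")
```

The full file comes out at about 1.10 TB, consistent with the 1TB figure quoted above, while the tail download is about 20.31 GB.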

Abstract

Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine if the GNN model's low accuracy for unseen data is inherently due to insufficient training data or if the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors while training GNN models.

In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that the developers can use to train, scrutinize and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous real-world citation graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, the IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is designed to be flexible, enabling the study of various GNN architectures, embedding generation techniques, and analyzing system performance issues. IGB is open-sourced, supports DGL and PyG frameworks, and comes with releases of the raw text that we believe foster emerging language models and GNN research projects.

IGB Homogeneous Dataset Metrics

IGB Heterogeneous Dataset Metrics

Downloading dataset

Hosted on AWS. Early access description is provided at the top of this readme.

Usage

(1) Data loaders

We provide an easy-to-use DGLDataset dataloader and will soon add a PyTorch Geometric dataloader. The dataloader takes the path of the dataset and the number of classes (19 or 2983) as arguments. You can also choose whether to read the data into memory or to open it with mmap_mode='r' in case the dataset doesn't fit in your RAM (training becomes significantly slower when reading from disk). You can also request synthetic node embeddings for testing systems.

```python
import argparse, dgl
from dataloader import IGB260MDGLDataset

parser = argparse.ArgumentParser()
parser.add_argument('--path', type=str, default='/mnt/nvme14/IGB260M/',
    help='path containing the datasets')
parser.add_argument('--dataset_size', type=str, default='tiny',
    choices=['tiny', 'small', 'medium', 'large', 'full'],
    help='size of the datasets')
parser.add_argument('--num_classes', type=int, default=19,
    choices=[19, 2983], help='number of classes')
parser.add_argument('--in_memory', type=int, default=0,
    choices=[0, 1], help='0:read only mmap_mode=r, 1:load into memory')
parser.add_argument('--synthetic', type=int, default=0,
    choices=[0, 1], help='0:nlp-node embeddings, 1:random')
args = parser.parse_args()

dataset = IGB260MDGLDataset(args)
graph = dataset[0]
print(graph)

# Graph(num_nodes=100000, num_edges=547416,
#       ndata_schemes={'feat': Scheme(shape=(1024,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'train_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'test_mask': Scheme(shape=(), dtype=torch.bool)}
#       edata_schemes={})
```
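Since the dataset class receives the parsed `args` object, the same settings can be supplied without a command line, for instance from a notebook. The sketch below builds such an object with `argparse.Namespace`. The field names follow the underscore-separated flag names used by the training commands in this README, and the idea that the dataset class reads only these attributes is an assumption, not something the README confirms.

```python
import argparse

# Field names mirror the command-line flags; that the dataset class reads
# exactly these attributes is an assumption of this sketch.
args = argparse.Namespace(
    path="/mnt/nvme14/IGB260M/",  # directory containing the datasets
    dataset_size="tiny",          # 'tiny', 'small', 'medium', 'large' or 'full'
    num_classes=19,               # 19 or 2983
    in_memory=0,                  # 0: memory-map the files, 1: load into RAM
    synthetic=0,                  # 0: NLP node embeddings, 1: random
)

print(args.dataset_size, args.num_classes)
# The object can then be passed on, e.g. IGB260MDGLDataset(args).
```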

(2) Popular GNN Models

We have implemented Graph Convolutional Neural Net (GCN), GraphSAGE, and Graph Attention Network (GAT). These models take the input dimension, the hidden dimension, and the expected output dimension (which would be your number of classes), along with the number of layers, the dropout and, in the case of the GAT model, the number of attention heads.

```python
import torch
from models import *

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

if args.model_type == 'gcn':
    model = GCN(in_feats, args.hidden_channels, args.num_classes, args.num_layers).to(device)
if args.model_type == 'sage':
    model = SAGE(in_feats, args.hidden_channels, args.num_classes, args.num_layers).to(device)
if args.model_type == 'gat':
    model = GAT(in_feats, args.hidden_channels, args.num_classes, args.num_layers, args.num_heads).to(device)
```
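The `if` chain above leaves `model` undefined when the model type is not recognized. One possible alternative is a lookup table that fails loudly. The sketch below uses stand-in constructors so that it runs without PyTorch or DGL; in real code the table values would be the `GCN`, `SAGE`, and `GAT` classes. `MODEL_BUILDERS` and `build_model` are hypothetical names, not part of this repository.

```python
def _placeholder(name):
    """Return a stand-in constructor that records its arguments."""
    def build(*args):
        return (name, args)
    return build


# In real code the values would be the GCN, SAGE and GAT classes.
MODEL_BUILDERS = {
    "gcn": _placeholder("GCN"),
    "sage": _placeholder("SAGE"),
    "gat": _placeholder("GAT"),
}


def build_model(model_type, *args):
    """Look up the constructor for `model_type` and call it with `args`."""
    try:
        builder = MODEL_BUILDERS[model_type]
    except KeyError:
        known = ", ".join(sorted(MODEL_BUILDERS))
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of: {known}"
        ) from None
    return builder(*args)
```

With the real classes in the table, a call such as `build_model(args.model_type, in_feats, args.hidden_channels, args.num_classes, args.num_layers)` would replace the first two branches; GAT additionally needs the number of heads.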

(3) Baseline

We ran each of these models on the IGB dataset collections to get a baseline. Our goal is to enable GNN researchers to develop and test novel models using this dataset. We expect more robust models due to the presence of massive labeled data. We will release a detailed analysis of the runs and the hyperparameters, along with other relevant experiments, in our upcoming paper.

We aim to improve these baselines by testing out more hyperparameters. *Models have been trained for 3 epochs with suboptimal hyperparameters on these datasets.

(4) Multi-GPU Runs

We provide scripts to run the above models on multiple GPUs using DGL and PyTorch methods. To try it out by running GCN on IGB-tiny with the default hyperparameters, use:

    python train_multi_gpu.py --model_type gcn --dataset_size tiny --num_classes 19 --gpu_devices 0,1,2,3    # for homogeneous
    python train_multi_hetero.py --model_type rgcn --dataset_size tiny --num_classes 19 --gpu_devices 0,1,2,3    # for heterogeneous

To try a single-GPU run, use:

    python train_multi_gpu.py --model_type gcn --dataset_size tiny --num_classes 19 --gpu_devices 0    # for homogeneous
    python train_multi_hetero.py --model_type rgcn --dataset_size tiny --num_classes 19 --gpu_devices 0    # for heterogeneous

or

    python train_single_gpu.py --model_type gcn --dataset_size tiny --num_classes 19

To learn more about the hyperparameters please take a look at train/train_multi_gpu.py or train/train_multi_hetero.py.
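When sweeping over several configurations, the training commands can be assembled programmatically rather than typed by hand. The sketch below builds the argument list using only the flags shown in this section. `build_command` is a hypothetical helper, and the sketch only prints the commands; it does not launch them.

```python
import shlex


def build_command(model_type, dataset_size, num_classes, gpu_devices,
                  script="train_multi_gpu.py"):
    """Assemble the training command as an argument list."""
    return [
        "python", script,
        "--model_type", model_type,
        "--dataset_size", dataset_size,
        "--num_classes", str(num_classes),
        # GPU ids are passed as one comma-separated string, e.g. "0,1,2,3".
        "--gpu_devices", ",".join(str(gpu) for gpu in gpu_devices),
    ]


for size in ("tiny", "small"):
    print(shlex.join(build_command("gcn", size, 19, [0, 1])))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the directory that contains the training scripts.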

IGB Documentation

Please read our paper on arXiv (https://arxiv.org/abs/2302.13522).

Contributions

Please check the Contributions.md file for more details.

Questions

Please reach out to Arpandeep Khatua and Vikram Sharma Mailthody.
Please feel free to join our Slack Channel.

Future updates

We will be releasing raw text data for enabling NLP+GNN tasks.
Temporal graph datasets.

If you have additional requests, please add them in the issues.

Citations

This work was done using funds from the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research. If you use the datasets, please cite the article below.

@inproceedings{igbdatasets,
  doi = {10.48550/ARXIV.2302.13522},
  url = {https://arxiv.org/abs/2302.13522},
  author = {Khatua, Arpandeep and Mailthody, Vikram Sharma and Taleka, Bhagyashree and Ma, Tengfei and Song, Xiang and Hwu, Wen-mei},
  title = {IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research},
  year = {2023},
  booktitle = {In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23)},
  series = {KDD '23},
  copyright = {Creative Commons Attribution 4.0 International}
}

Owner

Name: IllinoisGraphBenchmark
Login: IllinoisGraphBenchmark
Kind: organization

Repositories: 3
Profile: https://github.com/IllinoisGraphBenchmark

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  IGB: An Immense Graph Dataset for Machine Learning
  Workloads
message: >-
  If you use this dataset, please cite it using the  
  metadata from this file.
type: dataset
authors:
  - given-names: Arpandeep
    family-names: Khatua
    affiliation: UIUC
  - given-names: Vikram Sharma
    family-names: Mailthody
    email: vsm2@illinois.edu
    affiliation: UIUC/NVIDIA
    orcid: 'https://orcid.org/0000-0002-9611-8075'
  - given-names: Bhagyashree
    family-names: Taleka
    affiliation: USC
  - given-names: Xiang
    family-names: Song
    affiliation: AWS
  - given-names: Tengfei
    family-names: Ma
    affiliation: IBM Research
  - given-names: Piotr
    family-names: Bigaj
    affiliation: NVIDIA
  - given-names: Wen-mei
    family-names: Hwu
    affiliation: UIUC/NVIDIA

GitHub Events

Total

Issues event: 6
Watch event: 7
Issue comment event: 5
Push event: 3
Pull request review event: 3
Pull request event: 5
Fork event: 2

Last Year

Issues event: 6
Watch event: 7
Issue comment event: 5
Push event: 3
Pull request review event: 3
Pull request event: 5
Fork event: 2