igb-datasets
Largest realworld open-source graph dataset - Worked done under IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards and in collaboration with NVIDIA Research.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 1 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (16.0%) to scientific vocabulary
Keywords
Repository
Largest realworld open-source graph dataset - Worked done under IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards and in collaboration with NVIDIA Research.
Basic Info
- Host: GitHub
- Owner: IllinoisGraphBenchmark
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2302.13522
- Size: 55.1 MB
Statistics
- Stars: 82
- Watchers: 3
- Forks: 14
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
IGB-Datasets
Official IGB Leadboard is now online!! 🎉
Head over to the leaderboard and make your submission.
Installation Guide
```python
Clone the repo
git clone https://github.com/IllinoisGraphBenchmark/IGB-Datasets.git
Go to the folder root
cd IGB-Datasets
Install the igb package
pip install . ``
Now in order to get the dataloader you can:from igb import dataloader`
Get access to dataset
After you install the igb package in order to download igb(h)-tiny, igb(h)-small, igb(h)-medium please follow this code example.
```python
from igb import download download.downloaddataset(path='/root/igbdatasets', datasettype='homogeneous', datasetsize'tiny') Downloaded 0.36 GB: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 366/366 [00:03<00:00, 98.94it/s] Downloaded igb_homogeneous_tiny -> md5sum verified. Final dataset size 0.39 GB. ``
The script downloads the zipped files from aws, does a md5sum check and extracts the folder in your specified path. Change thedatasettypeto"heterogeneous"andthe datasetsizeto either"small"or"medium"` in order to get the other datasets.
In the current version if you want the download the igb(h)-large and igb260m/igbh600m please use the bash download scripts provided. Please note these two large datasets require disk space over >500GB.
Important update:
Note: We have updated the paper embedding file of the full dataset. If you have downloaded the dataset prior to 7th November 2023 you will need to update to get the embeddings for the last ~5M paper nodes. To make the process easier so users don't have to re-download the 1TB paper embedding file please follow these steps to update the embedding in place.
First download the embeddings using
bash
wget --recursive --no-parent https://igb-public.s3.us-east-2.amazonaws.com/IGBH/processed/paper/node_feat_5M_tail.npy
Then run this python script: ```python import numpy as np from tqdm import tqdm
Open the paper embedding file in r+ mode (read/write)
numpapernodes = 269346174 papernodefeatures = np.memmap('/mnt/raid0/full/processed/paper/nodefeat.npy', dtype='float32', mode='r+', shape=(numpaper_nodes,1024))
Open the extra embedding file in read more
numtail = 4957567 nodefeat5Mtail = np.memmap('/mnt/raid0/full/processed/paper/nodefeat5M_tail.npy', dtype='float32', mode='r', shape=(4957567,1024))
Here we do it sequencially to log the progress.
You can do it in parallel by papernodefeatures[offset:] = nodefeat5M_tail
offset = numpapernodes-numtail for i in tqdm(range(numtail)): papernodefeatures[i + offset] = nodefeat5M_tail[i]
flush to save to disk
papernodefeatures.flush() ```
Abstract
Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine if the GNN model's low accuracy for unseen data is inherently due to insufficient training data or if the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors while training GNN models.
In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that the developers can use to train, scrutinize and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous real-world citation graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, the IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is designed to be flexible, enabling the study of various GNN architectures, embedding generation techniques, and analyzing system performance issues. IGB is open-sourced, supports DGL and PyG frameworks, and comes with releases of the raw text that we believe foster emerging language models and GNN research projects.
IGB Homogeneous Dataset Metrics
IGB Heterogeneous Dataset Metrics
Downloading dataset
Hosted on AWS. Early access description is provided at the top of this readme.
Usage
(1) Data loaders
We have easy to use DGLDataset dataloader and we will soon add a PyTorch Geometric dataloader. The dataloader takes in arguments for the path of the dataset, the number of classes $\in$ [19, 2983]. You can also mention whether to read the data into the memory or in mmap_mode='r' incase the dataset doesn't fit in your RAM. (Training becomes significantly slower when reading from disk). We can also choose to get a synthetic node embeddings for testing systems.
```python import argparse, dgl from dataloader import IGB260MDGLDataset
parser = argparse.ArgumentParser() parser.addargument('--path', type=str, default='/mnt/nvme14/IGB260M/', help='path containing the datasets') parser.addargument('--datasetsize', type=str, default='tiny', choices=['tiny', 'small', 'medium', 'large', 'full'], help='size of the datasets') parser.addargument('--numclasses', type=int, default=19, choices=[19, 2983], help='number of classes') parser.addargument('--inmemory', type=int, default=0, choices=[0, 1], help='0:read only mmapmode=r, 1:load into memory') parser.addargument('--synthetic', type=int, default=0, choices=[0, 1], help='0:nlp-node embeddings, 1:random') args = parser.parseargs()
dataset = IGB260MDGLDataset(args) graph = dataset[0] print(graph)
Graph(numnodes=100000, numedges=547416,
ndataschemes={'feat': Scheme(shape=(1024,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'trainmask': Scheme(shape=(), #dtype=torch.bool), 'valmask': Scheme(shape=(), dtype=torch.bool), 'testmask': Scheme(shape=(), dtype=torch.bool)}
edata_schemes={})
```
(2) Popular GNN Models
We have implmented Graph Convolutional Neural Net (GCN), GraphSAGE and Graph Attention Network (GAT). These models take in the dimension of the input, hidden dimensions and the expected output dimension (which would be your # classes) along with the number of layers, dropout and in case of the GAT model, num of attention heads.
```python import torch from models import *
device = f'cuda:0' if torch.cuda.is_available() else 'cpu'
if args.modeltype == 'gcn': model = GCN(infeats, args.hiddenchannels, args.numclasses, args.numlayers).to(device) if args.modeltype == 'sage': model = SAGE(infeats, args.hiddenchannels, args.numclasses, args.numlayers).to(device) if args.modeltype == 'gat': model = GAT(infeats, args.hiddenchannels, args.numclasses, args.numlayers, args.numheads).to(device) ```
(3) Baseline
We ran each of these models on the IGB dataset collections to get a baseline. Our goal is to enable GNN researchers to develop and test novel models using this dataset. We expect more robust models due to the presence of massive labeled data. We will released detailed analysis of the runs and the hyperparameters along with other relevant experiments in our upcoming paper.
We aim to improve these baselines by testing out more hyperparameters. *Models have been trained for 3 epochs with suboptimal hyperparameters on these datasets.
(4) Multi-GPU Runs
We provide scripts to run the above models on mulitple GPUs using DGL and PyTorch methods. To test it out by running GCN on IGB-tiny with the default hyperparameters you can test it out using:
python train_multi_gpu.py --model_type gcn --dataset_size tiny --num_classes 19 --gpu_devices 0,1,2,3 #For homogenous
python train_multi_hetero.py --model_type rgcn --dataset_size tiny --num_classes 19 --gpu_devices 0,1,2,3 #For heterogenous
To try single GPU run use:
python train_multi_gpu.py --model_type gcn --dataset_size tiny --num_classes 19 --gpu_devices 0 #For homogenous
python train_multi_hetero.py --model_type rgcn --dataset_size tiny --num_classes 19 --gpu_devices 0 #For heterogenous
or
python train_single_gpu.py --model_type gcn --dataset_size tiny --num_classes 19
To learn more about the hyperparameters please take a look at train/train_multi_gpu.py or train/train_multi_hetero.py.
IGB Documentation
Please read our paper in Arxiv.
Contributions
Please check the Contributions.md file for more details.
Questions
- Please reach out to Arpandeep Khatua and Vikram Sharma Mailthody
- Please feel free to join our Slack Channel.
Future updates
- We will be releasing raw text data for enabling NLP+GNN tasks.
- Temporal graph datasets.
If you have additional requests, please add them in the issues.
Citations
The work is done using the funds from IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards and in collaboration with NVIDIA Research. If you use datasets, please cite the below article.
@inproceedings{igbdatasets,
doi = {10.48550/ARXIV.2302.13522},
url = {https://arxiv.org/abs/2302.13522},
author = {Khatua, Arpandeep and Mailthody, Vikram Sharma and Taleka, Bhagyashree and Ma, Tengfei and Song, Xiang and Hwu, Wen-mei},
title = {IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research},
year = {2023},
booktitle = {In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23)},
series = {KDD '23}
copyright = {Creative Commons Attribution 4.0 International}
}
Owner
- Name: IllinoisGraphBenchmark
- Login: IllinoisGraphBenchmark
- Kind: organization
- Repositories: 3
- Profile: https://github.com/IllinoisGraphBenchmark
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
IGB: An Immense Graph Dataset for Machine Learning
Workloads
message: >-
If you use this dataset, please cite it using the
metadata from this file.
type: dataset
authors:
- given-names: Arpandeep
family-names: Khatua
affiliation: UIUC
- given-names: Vikram Sharma
family-names: Mailthody
email: vsm2@illinois.edu
affiliation: UIUC/NVIDIA
orcid: 'https://orcid.org/0000-0002-9611-8075'
- given-names: Bhagyashree
family-names: Taleka
affiliation: USC
- given-names: Xiang
family-names: Song
affiliation: AWS
- given-names: Tengfei
family-names: Ma
affiliation: IBM Research
- given-names: Piotr
family-names: Bigaj
affiliation: NVIDIA
- given-names: Wen-mei
family-names: Hwu
affiliation: UIUC/NVIDIA
GitHub Events
Total
- Issues event: 6
- Watch event: 7
- Issue comment event: 5
- Push event: 3
- Pull request review event: 3
- Pull request event: 5
- Fork event: 2
Last Year
- Issues event: 6
- Watch event: 7
- Issue comment event: 5
- Push event: 3
- Pull request review event: 3
- Pull request event: 5
- Fork event: 2