graph-scaling

A sampling based method for scaling graph (network) data sets.

https://github.com/amusaafir/graph-scaling

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

benchmark forest-fire-sampling graph-datasets graph-sampling graph-scaling graphs random-edge-sampling random-node-sampling sampling totally-induced-edge-sampling
Last synced: 6 months ago · JSON representation ·

Repository

A sampling based method for scaling graph (network) data sets.

Basic Info
  • Host: GitHub
  • Owner: amusaafir
  • License: gpl-3.0
  • Language: C++
  • Default Branch: master
  • Homepage:
  • Size: 841 KB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
benchmark forest-fire-sampling graph-datasets graph-sampling graph-scaling graphs random-edge-sampling random-node-sampling sampling totally-induced-edge-sampling
Created about 8 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License Citation

README.md

Graph Scaling Tool

A Sampling-based Method for Scaling Graph Datasets

This tool allows users to scale a graph up or down using sampling as a basis for both operations.

Paper: https://dl.acm.org/doi/10.1145/3358960.3379144

Scaling Guidelines and Results

  • Refer to the scaling up guidelines to target certain property changes whenever scaling up a graph.

  • The scaling-results document shows the results of several scaling operations conducted on different graph datasets.

Usage

Compilation

Dependencies (install them by running install_dependencies.sh): 1. googletest - Optional for compiling the test project. 2. snap-stanford - Required for the auto-tuner.

mkdir build cd build cmake ../ make -j$NPROC

Single thread and machine

Running graph_scaling_tool without any parameter should prompt the user several input requests accordingly, including whether a scaling up or scaling down operation should be performed.

To perform one of the two operations directly, the user must specify at least:

  • The input graph path: -i (e.g., -i /home/user/graph.csv). Note that this can be any format, as long as the input graph is an edge list (where each vertex is represented numerically)
  • The output (folder) path: -o (e.g., -o /home/user/)

To directly perform a scaling down operation, the -s parameter is required to specify the preferred sample size (based on the number of vertices). This parameter should be between 0 and 1. By default, TIES is used as sampling algorithm. This can be changed by adding the -a option with randomedge, randomnode, ties or forestfire -v {sourcevertex}. If forest fire is used, replace {sourcevertex} with the source vertex.

Example scaling down: ./graph_scaling_tool -i /home/user/graph.txt -o /home/user/graph_output -s 0.4 -a ties

To directly perform a scaling up operation, the following parameters are required:

  • Scaling factor -u (based on the number of vertices)
  • Sample size -s for each sampled graph. This value should be between 0 and 1.
  • Sampling algorithm -a either: ties, randomedge, randomnode or forestfire -v {sourcevertex}.
  • Topology -t either: star, chain, ring or fullyconnected
  • Bridging type -b, either: random or high (degrees)
  • Number of interconnections between each sample: -n
  • Force directed bridges -d. Setting this to true would add directed bridges.

Example scaling up: ./graph_scaling_tool -i /home/user/graph.txt -o /home/user/graph_output -u 3.0 -s 0.5 -a ties -d false -n 10 -b random -t star

Parallel & distributed

Switch to the 'distributed' branch; compile the project using:

make release

Run on a cluster with:

prun -np <number of compute nodes> -v -1 -reserve <reservation id> -sge-script $PRUN_ETC/prun-openmpi sample <input file> <output file> <sampling fraction> <CPU mem limit (bytes, per node)> <GPU mem limit (bytes, per node)>

Example with 2 nodes, "wiki" input file, 0.5 sampling fraction, 32 GB CPU memory and 12 GB GPU memory limits:

prun -np 2 -v -1 -reserve <reservation id> -sge-script $PRUN_ETC/prun-openmpi sample OUTPUT_PATH 0.5 34359738368 12884901888

Note that the input file must be an edge list, ordered by source vertex, and then by destination vertex. The number of vertices and edges in the graph must be denoted at the beginning of the file, on a separate rule, like: "! nVertices nEdges" (without quotes) The input graph should not have 'missing' vertices (vertices without incoming/outgoing edges and are therefore not present in the input file).

Owner

  • Name: AJ
  • Login: amusaafir
  • Kind: user
  • Location: The Netherlands

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
preferred-citation:
  type: "conference-paper"
  title: "A Sampling-Based Tool for Scaling Graph Datasets"
  conference:
    name: "International Conference on Performance Engineering"
  doi: "10.1145/3358960.3379144"
  year: "2020"
  start: "289"
  end: "300"
  authors:
  - family-names: "Musaafir"
    given-names: "Ahmed"
  - family-names: "Uta"
    given-names: "Alexandru"
  - family-names: "Dreuning"
    given-names: "Henk"
  - family-names: "Varbanescu"
    given-names: "Ana"

GitHub Events

Total
Last Year