pagerank-dynamic
Design of Dynamic PageRank algorithm for link analysis.
Science Score: 41.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.7%) to scientific vocabulary
Keywords
Repository
Design of Dynamic PageRank algorithm for link analysis.
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Design of Dynamic PageRank algorithm for link analysis.
All seven graphs (temporal) used in our experiments are stored in a plain
text file in u, v, t format, where u is the source vertex, v is the
destination vertex, and t is the UNIX epoch time in seconds. These
include: CollegeMsg, email-Eu-core-temporal, sx-mathoverflow,
sx-askubuntu, sx-superuser, wiki-talk-temporal, and sx-stackoverflow.
We obtained them from the Stanford Large Network Dataset Collection.
If initial ranks are not provided, they are set to 1/N. Error check
is done using L1 norm with static PageRank (without initial ranks). The
experiment is implemented in C++, and compiled using GCC 9 with
optimization level 3 (-O3). The system we use is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4
Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The iterations taken with each test
case is measured. 500 is the maximum iterations allowed. We print the
statistics of each test case to standard output (stdout), and redirect to a
log file, which we then process with a script to generate a CSV file,
with each row representing the details of a single test case. This CSV file
is imported into Google Sheets, and necessary tables are set up with the
help of the FILTER function to create the charts.
Comparing strategies to Update ranks
There are a number of strategies to set up the initial rank vector for the
PageRank algorithm on the updated graph, using the ranks from the old graph.
If a graph update does not end up adding any new vertices, then it is simply
a matter of running PageRank upon the updated graph with the previous ranks,
as the initial ranks. If however, some new vertices have been added, and/or
some old vertices have been removed, one of the following strategies may be
used to adjust the initial ranks: zero-fill, 1/N-fill, scaled zero-fill,
or scaled 1/N-fill. The zero-fill strategy is the simplest, and consists
of simply filling the ranks of new vertices with 0 (i.e. r₁ = 0 for new
vertices). The 1/N-fill strategy is similar, ranks of new vertices are
filled with 1/N₁ (i.e. r₁ = 1/N₁ for new vertices). The scaled zero-fill
strategy extends zero-fill, and additionally scales ranks of old vertices
with a factor of N₀/N₁ (i.e. r₁ = r₀ × N₀/N₁ for old vertices, and r₁ = 0
for new vertices). Finally, the scaled 1/N-fill strategy is a combination of
scaling old vertices, and 1/N-fill (i.e. r₁ = r₀ × N₀/N₁ for old vertices,
and r₁ = 1/N₁ for new vertices). Here, N₀ is the total number of vertices,
and r₀ is the rank of a given vertex in the old graph. On the other hand,
N₁ is the total number of vertices, and r₁ is the rank of a given vertex
in the updated graph.
For static PageRank computation of a graph, the initial ranks of all vertices
are set to 1/N, where N is the total number of vertices in the graph.
Therefore, the sum of all initial ranks equals 1, as it should for a
probability (rank) vector. It is important to note however that this is not
an essential condition, rather, it likely helps with faster convergence.
With scaled rank adjustment strategies, unlike unscaled ones, PageRank computation on unaffected vertices can be skipped , as long as the graph has no affected dead ends (including dead ends in the old graph). Scaling is necessary, because even though the importance of unaffected vertices does not change, the final rank vector is a probabilistic vector and must sum to 1. Here, affected vertices are those which are either changed vertices, or are reachable from changed vertices. Changed vertices are those which have an edge added or removed between it and another (changed) vertex.
We conduct an experiment (adjust-ranks) with each rank adjustment strategy on various temporal graphs, updating each graph with multiple batch sizes (10¹, 5×10¹, 10², ...), until the entire graph is processed. For each batch size, static PageRank is computed, along with incremental PageRank based on each of the four rank adjustment strategies, without skipping unaffected vertices. This is done in order to get an estimate of the convergence rate of each rank adjustment strategy, independent of the number of skipped vertices (which can differ based on the dead end handling strategy used).
We perform each rank adjustment strategy using a common adjustment function
that adds a value, then multiplies a value to old ranks, and sets a value
for new ranks. After ranks are adjusted, they are set as initial ranks for
PageRank computation, which is then run on all vertices (no vertices are
skipped). The PageRank algorithm used is the standard power-iteration (pull)
based that optionally accepts initial ranks. The rank of a vertex in an
iteration is calculated as c₀ + αΣrₑ/dₑ, where c₀ is the common teleport
contribution, α is the damping factor (0.85), rₑ is the previous rank
of vertex with an incoming edge, dₑ is the out-degree of the incoming-edge
vertex, and N is the total number of vertices in the graph. The common
teleport contribution c₀, calculated as (1-α)/N + αΣrₙ/N, includes the
contribution due to a teleport from any vertex in the graph due to the damping
factor (1-α)/N, and teleport from dangling vertices (with no outgoing
edges) in the graph αΣrₙ/N. This is because a random surfer jumps to a random
page upon visiting a page with no links, in order to avoid the rank-sink effect.
From the results, We observe that all of the rank adjustment strategies with
incremental PageRank perform significantly better than static PageRank, with a
small performance difference among them. However, for large batch sizes,
static PageRank is able to beat all of the rank adjustment strategies. This
is expected, since beyond a certain batch size, computing PageRank from
scratch is going to be faster than incremental/dynamic PageRank. On average, the
performance difference between the rank adjustment strategies increases
as the batch size increases. Both 1/N-fill, and scaled 1/N-fill strategies
(having similar performance curve) require somewhat fewer iterations to
converge than zero-fill, and scaled zero-fill strategies (also with similar
performance curve). This is possibly because the 1/N value for new vertices
is closer to their actual rank value, than 0.
On average, on all graphs, both the 1/N-fill , and scaled 1/N-fill
strategies seem to perform the best. Based on GM-RATIO comparison, the
relative iterations between zero-fill, 1/N-fill, scaled zero-fill, and
scaled 1/N-fill is 1.00 : 0.96 : 1.00 : 0.96 for all batch sizes. Hence,
both 1/N-fill and scaled 1/N-fill are 4% faster (1.04x) than zero-fill
and scaled zero-fill. Here, GM-RATIO is obtained by taking the geometric
mean (GM) of iterations taken at different stages of the graph on all graphs,
with each specific batch size. Then, GM is taken for all batch sizes, and a
ratio is obtained relative to the zero-fill strategy. Based on AM-RATIO
comparison, the relative iterations between zero-fill, 1/N-fill, scaled
zero-fill, and scaled 1/N-fill is 1.00 : 0.95 : 1.00 : 0.95 (all batch
sizes). Hence, both 1/N-fill and scaled 1/N-fill are 5% faster (1.05x)
than zero-fill and scaled zero-fill. AM-RATIO is obtained in a process
similar to that of GM-RATIO, except that arithmetic mean (AM) is used
instead of GM.
Among the four studied rank adjustment strategies for incremental/dynamic PageRank, both 1/N-fill and scaled 1/N-fill appear to be the best. However, only with the scaled 1/N-fill strategy (also scaled zero-fill, but it is slower), it is possible to skip PageRank computation on unaffected vertices , as long as the graph has no affected dead ends (including dead ends in the old graph). The scaled 1/N-fill rank adjustment strategy, which is commonly used for dynamic PageRank, is thus the way to go.
Comparison with Static approach
In this experiment (compare-static), we compare the performance between
finding static and dynamic PageRank of updated graph. Both approaches are
attempted on a number of temporal graphs, running each with multiple batch sizes
(1, 5, 10, 50, ...). New edges are incrementally added to the graph
batch-by-batch until the entire graph is complete. Dynamic PageRank is
clearly faster than static PageRank for most cases. All outputs are saved
in gist. Some charts are also included below, generated from sheets.
Other experiments
References
- PageRank Algorithm, Mining massive Datasets (CS246), Stanford University
- Stanford Large Network Dataset Collection
Owner
- Name: puzzlef
- Login: puzzlef
- Kind: organization
- Website: https://puzzlef.github.io/
- Repositories: 10
- Profile: https://github.com/puzzlef
A summary of experiments.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sahu
given-names: Subhajit
orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/pagerank-incremental-adjust-ranks: Comparing strategies to update ranks for incremental/dynamic PageRank"
version: 1.0.0
doi: 10.5281/zenodo.5598717
date-released: 2021-10-26
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0














