pagerank-minimize-inequality
Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (8.6%) to scientific vocabulary
Repository
Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.
Basic Info
- Host: GitHub
- Owner: puzzlef
- License: mit
- Language: C++
- Default Branch: main
- Homepage: https://arxiv.org/abs/2310.18537
- Size: 145 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.
Gini coefficient is a value which represents income/wealth inequality
within a nation or group. It ranges from 0 to 1, with 0 representing
total equality and 1 representing total inequality. It is calculated from
the Lorenz curve, which plots cumulative income/wealth against cumulative
number of households/people. PageRank is a random-walk based metric which
ranks webpages based on the idea that popular pages are pointed to by
other popular pages. This is one of the important metrics used by search
engines to order the results of a query.
This ranking affects our decision to visit a website, and thus indirectly controls the amount of traffic to a site, and by extension, the amount of money made by the site (through ads, sales, or investments). While there is nothing wrong with this (as we want to visit more useful websites), it also ends up skewing the future linking behavior of the web in favor of already popular sites. This is a positive feedback loop (rich get richer), and it further enhances the inequality that already exists on the web. High levels of inequality are not sustainable, and beyond a tipping point can lead to significant social unrest. This motivates us to study inequality in ranking (based on PageRank) on graphs, and find ways to keep it under control.
Adjusting Dead end handling strategy
In this experiment (adjust-dead-end-strategy) we study the Lorenz curve
and Gini coefficient of PageRank values on a number of graphs, and
compare between PageRank values obtained with three different dead-end
handling strategies: teleport from dead-ends (default), self-loop dead-ends
(loop), and self-loop all vertices (loopall). The PageRank values of
vertices in each graph is obtained with nvgraph.sh, which internally uses
nvGraph PageRank. The lorenz curve of ranks is obtained by sorting the
ranks in ascending order and cumulatively summing them up to obtain 100
samples. These 100 samples are then compared with the ideal (total equality)
lorenz curve to obtain the gini coefficient. Note that this is output
into YAML files by nvgraph.sh itself. This measurement process of
lorenz curve and gini coefficient is repeated for loop and loopall
variants of graphs, which are generated from the original graph using
graph-generate. Finally we process all YAML files into CSV, separately for
gini coefficient and lorenz curve, and compare the results.
Results indicate that web graphs in general (except web-NotreDame) have
high gini coefficient (i.e., high inequality) along with a social
network soc-Epinions1, and a citation network cit-Patents. Road
networks are observed to have the lowest gini coefficient (i.e., low
inequality) among all graph classes. If we take a look at the average lorenz
curve of all graphs, we observe that 50% of popularity (ranks) is owned by
~20% of the vertices. However, on the web-Stanford graph 50% of popularity
is owned by only ~3% of vertices, and on the arabic-2005 (another web graph)
is owned by only ~1% of the vertices. This would be a significantly worrying
level of inequality if each vertex represented a unique person. However, it
is quite likely that many low-ranked pages are low-effort ones and thus have
a high page-to-person ratio.
On the social network soc-Epinions1, 50% of popularity is owned by only
~7% of vertices (gini coefficient of ~0.66), but on the wiki-Talk (a
communication graph) 50% of popularity is owned by ~46% of vertices (gini
coefficient of ~0.07). This is quite interesting, given that wiki users are
usually not ranked, while search engines always rank web pages. Road
networks, such as germany_osm, are observed to have have a distribution
similar to that of wiki-Talk. Future work could focus on studying the
variation of the lorenz curve and gini coefficient of various graphs
over time.
Comparing Deterministic approaches
This experiment compares six deterministic heuristics for inequality minimization: 1. Add an edge from the vertex with highest contribution, to the one with lowest rank (edgeInsertCxrx). 2. Add an edge from the vertex with highest contribution, to the one with highest reverse rank (edgeInsertCxSx). 3. Add an edge from the vertex with highest contribution, to the one with highest reverse rank but mostly linking to low rank vertices (edgeInsertCxSr). 4. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with lowest rank (edgeInsertCRrx). 5. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with highest reverse rank (edgeInsertCRSx). 6. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with highest reverse rank but mostly linking to low rank vertices (edgeInsertCRSr).
In this experiment (approach-deterministic) we study the minimization of Gini coefficient of PageRank values on a number of graphs, using six different deterministic heuristics for adding edges to the graph. First, the PageRank of each vertex is computed in the original graph, and the original gini coefficient is obtained. A heuristic is then run to obtain the most suitable edge to be added. After this edge is added, the same heuristic is run again. For each heuristic 1000 edges are added. We plot the variation of gini coefficient with each added edge for each heuristic.
Our first heuristic called edgeInsertCxrx adds an edge between the (current)
highest contributing vertex to the lowest rank vertex. The idea behind this
heuristic is to provide the highest possible increase in rank to the lowest rank
vertex. We obtained the highest contributing vertex by finding the vertex with
highest R/d+1 value.
The second heuristic called edgeInsertCxSx is based on the idea of providing the highest possible increase in rank to a vertex which directly or indirectly links to many other vertices (so that it increases the rank of a large number of other vertices as well). This is achieved by adding an edge from the highest contributing vertex to the vertex with highest reverse PageRank. Here, the reverse PageRank of a vertex is obtained by reversing (transposing) the graph, and calculating the PageRanks.
The third heuristic called edgeInsertCxSr is an extension of
edgeInsertCxSx, and it prioritizes increasing the rank of vertices which link
(directly or indirectly) to a large number of vertices having a low PageRank
score. This is done by calculating a modified reverse PageRank, that
prioritizes contribution from vertices with low forward PageRank. Here, the
reverse rank of each vertex is calculated as rᵤ = αRᵤrᵥ/dᵥ + (1-α)/N, where
rᵤ is the reverse rank of a given vertex and Rᵤ is its forward rank
(precomputed), rᵥ is the reverse rank of a target vertex and dᵥ is its
in-degree, α is the damping factor, and N is the number of vertices in the
graph.
The remaining three heuristics edgeInsertCRrx, edgeInsertCRSx, and edgeInsertCRSr are a variation of the three heuristics mentioned above where the source vertex is chosen such that it minimizes the rank of the highest ranked vertex. That is, we choose the source vertex with highest contribution to the highest rank vertex. The idea is to reduce rank of high-ranked vertices and increase the rank of low-ranked vertices at the same time, thus reducing inequality.
It is observed that web graphs tend to have the highest inequality (gini coefficient), while road networks tend to have the lowest. Results indicate that the heuristics unsually succeed in reducing inequality on graphs with high gini coefficient (such as web graphs and social networks), but mostly fail on graphs with low gini coefficient (such as road networks and collaboration networks). It is also observed that the rate of decrease in gini coefficient decreases as more and more edges are added to graph. In general, we observe that the heuristics edgeInsertCxrx, edgeInsertCxSx, and edgeInsertCxSr perform the best, with edgeInsertCxSx, and edgeInsertCxSr performing almost identically. edgeInsertCxrx and edgeInsertCxSx heuristics would therefore be the best choices, given that edgeInsertCxSr requires a modified PageRank computation.
Based on these results, a suitable approach to minimizing inequality would be to apply both the edgeInsertCxrx and edgeInsertCxSx heuristics and choose the the best among them for each edge addition. Future research work can include exploring randomized heuristics or looking for better deterministic heuristics.
Other experiments
References
- Inequality and inequity in network‑based ranking and recommendation algorithms
- Understanding the Gini Coefficient
- Gini Coefficient and Lorenz Curve
- Deeper Inside PageRank
- PageRank Algorithm, Mining massive Datasets (CS246), Stanford University
- SuiteSparse Matrix Collection
Owner
- Name: puzzlef
- Login: puzzlef
- Kind: organization
- Website: https://puzzlef.github.io/
- Repositories: 10
- Profile: https://github.com/puzzlef
A summary of experiments.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sahu
given-names: Subhajit
orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/pagerank-inequality-minimize-deterministic: Comparison of deterministic heuristics for minimization of inequality in ranks of vertices"
version: 1.0.0
doi: 10.5281/zenodo.6806085
date-released: 2022-07-07
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0

















