pagerank-minimize-inequality

Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.

https://github.com/puzzlef/pagerank-minimize-inequality

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary
Last synced: 10 months ago · JSON representation ·

Repository

Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Comparison of heuristics for minimization of inequality in ranks of vertices obtained with the PageRank algorithm.

Gini coefficient is a value which represents income/wealth inequality within a nation or group. It ranges from 0 to 1, with 0 representing total equality and 1 representing total inequality. It is calculated from the Lorenz curve, which plots cumulative income/wealth against cumulative number of households/people. PageRank is a random-walk based metric which ranks webpages based on the idea that popular pages are pointed to by other popular pages. This is one of the important metrics used by search engines to order the results of a query.

This ranking affects our decision to visit a website, and thus indirectly controls the amount of traffic to a site, and by extension, the amount of money made by the site (through ads, sales, or investments). While there is nothing wrong with this (as we want to visit more useful websites), it also ends up skewing the future linking behavior of the web in favor of already popular sites. This is a positive feedback loop (rich get richer), and it further enhances the inequality that already exists on the web. High levels of inequality are not sustainable, and beyond a tipping point can lead to significant social unrest. This motivates us to study inequality in ranking (based on PageRank) on graphs, and find ways to keep it under control.


Adjusting Dead end handling strategy

In this experiment (adjust-dead-end-strategy) we study the Lorenz curve and Gini coefficient of PageRank values on a number of graphs, and compare between PageRank values obtained with three different dead-end handling strategies: teleport from dead-ends (default), self-loop dead-ends (loop), and self-loop all vertices (loopall). The PageRank values of vertices in each graph is obtained with nvgraph.sh, which internally uses nvGraph PageRank. The lorenz curve of ranks is obtained by sorting the ranks in ascending order and cumulatively summing them up to obtain 100 samples. These 100 samples are then compared with the ideal (total equality) lorenz curve to obtain the gini coefficient. Note that this is output into YAML files by nvgraph.sh itself. This measurement process of lorenz curve and gini coefficient is repeated for loop and loopall variants of graphs, which are generated from the original graph using graph-generate. Finally we process all YAML files into CSV, separately for gini coefficient and lorenz curve, and compare the results.

Results indicate that web graphs in general (except web-NotreDame) have high gini coefficient (i.e., high inequality) along with a social network soc-Epinions1, and a citation network cit-Patents. Road networks are observed to have the lowest gini coefficient (i.e., low inequality) among all graph classes. If we take a look at the average lorenz curve of all graphs, we observe that 50% of popularity (ranks) is owned by ~20% of the vertices. However, on the web-Stanford graph 50% of popularity is owned by only ~3% of vertices, and on the arabic-2005 (another web graph) is owned by only ~1% of the vertices. This would be a significantly worrying level of inequality if each vertex represented a unique person. However, it is quite likely that many low-ranked pages are low-effort ones and thus have a high page-to-person ratio.

On the social network soc-Epinions1, 50% of popularity is owned by only ~7% of vertices (gini coefficient of ~0.66), but on the wiki-Talk (a communication graph) 50% of popularity is owned by ~46% of vertices (gini coefficient of ~0.07). This is quite interesting, given that wiki users are usually not ranked, while search engines always rank web pages. Road networks, such as germany_osm, are observed to have have a distribution similar to that of wiki-Talk. Future work could focus on studying the variation of the lorenz curve and gini coefficient of various graphs over time.


Comparing Deterministic approaches

This experiment compares six deterministic heuristics for inequality minimization: 1. Add an edge from the vertex with highest contribution, to the one with lowest rank (edgeInsertCxrx). 2. Add an edge from the vertex with highest contribution, to the one with highest reverse rank (edgeInsertCxSx). 3. Add an edge from the vertex with highest contribution, to the one with highest reverse rank but mostly linking to low rank vertices (edgeInsertCxSr). 4. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with lowest rank (edgeInsertCRrx). 5. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with highest reverse rank (edgeInsertCRSx). 6. Add an edge from the vertex with highest contribution to the highest rank vertex, to the one with highest reverse rank but mostly linking to low rank vertices (edgeInsertCRSr).

In this experiment (approach-deterministic) we study the minimization of Gini coefficient of PageRank values on a number of graphs, using six different deterministic heuristics for adding edges to the graph. First, the PageRank of each vertex is computed in the original graph, and the original gini coefficient is obtained. A heuristic is then run to obtain the most suitable edge to be added. After this edge is added, the same heuristic is run again. For each heuristic 1000 edges are added. We plot the variation of gini coefficient with each added edge for each heuristic.

Our first heuristic called edgeInsertCxrx adds an edge between the (current) highest contributing vertex to the lowest rank vertex. The idea behind this heuristic is to provide the highest possible increase in rank to the lowest rank vertex. We obtained the highest contributing vertex by finding the vertex with highest R/d+1 value.

The second heuristic called edgeInsertCxSx is based on the idea of providing the highest possible increase in rank to a vertex which directly or indirectly links to many other vertices (so that it increases the rank of a large number of other vertices as well). This is achieved by adding an edge from the highest contributing vertex to the vertex with highest reverse PageRank. Here, the reverse PageRank of a vertex is obtained by reversing (transposing) the graph, and calculating the PageRanks.

The third heuristic called edgeInsertCxSr is an extension of edgeInsertCxSx, and it prioritizes increasing the rank of vertices which link (directly or indirectly) to a large number of vertices having a low PageRank score. This is done by calculating a modified reverse PageRank, that prioritizes contribution from vertices with low forward PageRank. Here, the reverse rank of each vertex is calculated as rᵤ = αRᵤrᵥ/dᵥ + (1-α)/N, where rᵤ is the reverse rank of a given vertex and Rᵤ is its forward rank (precomputed), rᵥ is the reverse rank of a target vertex and dᵥ is its in-degree, α is the damping factor, and N is the number of vertices in the graph.

The remaining three heuristics edgeInsertCRrx, edgeInsertCRSx, and edgeInsertCRSr are a variation of the three heuristics mentioned above where the source vertex is chosen such that it minimizes the rank of the highest ranked vertex. That is, we choose the source vertex with highest contribution to the highest rank vertex. The idea is to reduce rank of high-ranked vertices and increase the rank of low-ranked vertices at the same time, thus reducing inequality.

It is observed that web graphs tend to have the highest inequality (gini coefficient), while road networks tend to have the lowest. Results indicate that the heuristics unsually succeed in reducing inequality on graphs with high gini coefficient (such as web graphs and social networks), but mostly fail on graphs with low gini coefficient (such as road networks and collaboration networks). It is also observed that the rate of decrease in gini coefficient decreases as more and more edges are added to graph. In general, we observe that the heuristics edgeInsertCxrx, edgeInsertCxSx, and edgeInsertCxSr perform the best, with edgeInsertCxSx, and edgeInsertCxSr performing almost identically. edgeInsertCxrx and edgeInsertCxSx heuristics would therefore be the best choices, given that edgeInsertCxSr requires a modified PageRank computation.

Based on these results, a suitable approach to minimizing inequality would be to apply both the edgeInsertCxrx and edgeInsertCxSx heuristics and choose the the best among them for each edge addition. Future research work can include exploring randomized heuristics or looking for better deterministic heuristics.


Other experiments



References




ORG DOI

Owner

  • Name: puzzlef
  • Login: puzzlef
  • Kind: organization

A summary of experiments.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sahu
    given-names: Subhajit
    orcid: https://orcid.org/0000-0001-5140-6578
title: "puzzlef/pagerank-inequality-minimize-deterministic: Comparison of deterministic heuristics for minimization of inequality in ranks of vertices"
version: 1.0.0
doi: 10.5281/zenodo.6806085
date-released: 2022-07-07

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels