copra-communities

Single-threaded CPU-based Community OVerlap PRopagation Algorithm (COPRA) for community detection.

https://github.com/puzzlef/copra-communities

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, iop.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.8%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

Single-threaded CPU-based Community OVerlap PRopagation Algorithm (COPRA) for community detection.

Basic Info
  • Host: GitHub
  • Owner: puzzlef
  • License: mit
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 75.2 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 3 years ago · Last pushed about 3 years ago
Metadata Files
Readme License Citation

README.md

Single-threaded CPU-based Community OVerlap PRopagation Algorithm (COPRA) for community detection.

This is an implementation of a label-propagation based community detection algorithm called Community OVerlap PRopagation Algorithm (COPRA). Unlike RAK, this algorithm uses multiple labels per vertex, with each label having an associated belonging coefficient (which sums to 1). The algorithm is as follows.

  1. Each vertex initializes as its own community (belonging=1).
  2. Each iteration, a vertex collects labels from its neighborhood.
  3. I excludes a vertex's own labels, although not explicitly mentioned in paper.
  4. The collected labels are scaled by edge weights for weighted graph.
  5. Each vertex picks labels above a certain threshold.
  6. This threshold is inversely proportional to the max. number of labels.
  7. If all labels are below threshold, pick a random best label.
  8. I make a vertex join its own community if it has no labels (not mentioned).
  9. Selected labels are normalized such that belonging coefficient sums to 1.
  10. Repeat from 2 until convergence.

The authors of this paper mention to use a mimimum vertices per community count to detect convergence. However i do not find it to be helpful. I instead use a similar convergence condition as RAK, that is count the number of vertices that change thier best label. Once this count falls below a certain fraction (tolerance), i consider the algorithm to have converged. The authors also mention using a sychrounous version of the algorithm, where labels of each vertex is dependent only upon labels in previous iteration. However, i find asynchronous approach to be converge faster (labels of each vertex can be dependent upon labels in current iteration). As we focus of finding disjoint communities, i consider the best label of each vertex as the final result.



I vary the tolerance from 0.1 to 0.0001 (as with RAK), and adjust the max. number of labels from 1 to 32.



On average, it seems that using a single label is best in terms of time as well as modularity. This is particulary true in case of road networks, but not so in case of other classes of graphs (web graphs, social networks, collaboration networks). For example, web graphs such as web-Stanford and web-BerkStan achieve best modularity with max. labels of 8, web-Google does best with max. labels of 32, and web-NotreDame does best with max. labels of 16. Max. labels of 4-16 would be a good choice for such graphs.



In addition it seems that on average, making the tolerance tighter than 0.01 has no beneficial effect on modularity. However, tighter tolerance does not help with graphs such as web-NotreDame, coAuthorsDBLP, and social networks. It seems a tolerance of 0.01 would be a good choice on average.



Both RAK and COPRA approaches can obtain disconnected communities. This issue can be resolved by splitting such communities into separate communities in a post-processing step. I do not do that here. We may do it when comparing these approaches.

All outputs are saved in a gist and a small part of the output is listed here. Some charts are also included below, generated from sheets. The input data used for this experiment is available from the SuiteSparse Matrix Collection. This experiment was done with guidance from Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.


```bash $ g++ -std=c++17 -O3 main.cxx $ ./a.out ~/data/web-Stanford.mtx $ ./a.out ~/data/web-BerkStan.mtx $ ...

Loading graph /home/subhajit/data/web-Stanford.mtx ...

order: 281903 size: 2312497 [directed] {}

order: 281903 size: 3985272 [directed] {} (symmetricize)

[-0.000497 modularity] noop

[00206.635 ms; 0003 iters.; 0.836576819 modularity] copraSeqStatic {labels=01, tolerance=1e-01}

[00248.339 ms; 0003 iters.; 0.838712633 modularity] copraSeqStatic {labels=02, tolerance=1e-01}

[00369.730 ms; 0003 iters.; 0.875163972 modularity] copraSeqStatic {labels=04, tolerance=1e-01}

[00550.753 ms; 0003 iters.; 0.878541648 modularity] copraSeqStatic {labels=08, tolerance=1e-01}

[00675.437 ms; 0003 iters.; 0.881162941 modularity] copraSeqStatic {labels=16, tolerance=1e-01}

[01130.822 ms; 0003 iters.; 0.866800904 modularity] copraSeqStatic {labels=32, tolerance=1e-01}

[00206.794 ms; 0003 iters.; 0.836576819 modularity] copraSeqStatic {labels=01, tolerance=5e-02}

[00243.418 ms; 0003 iters.; 0.838712633 modularity] copraSeqStatic {labels=02, tolerance=5e-02}

[00475.227 ms; 0004 iters.; 0.887068212 modularity] copraSeqStatic {labels=04, tolerance=5e-02}

[00834.529 ms; 0005 iters.; 0.892077446 modularity] copraSeqStatic {labels=08, tolerance=5e-02}

[01010.437 ms; 0005 iters.; 0.897701085 modularity] copraSeqStatic {labels=16, tolerance=5e-02}

[01706.960 ms; 0005 iters.; 0.885300398 modularity] copraSeqStatic {labels=32, tolerance=5e-02}

...

```



References




ORG DOI

Owner

  • Name: puzzlef
  • Login: puzzlef
  • Kind: organization

A summary of experiments.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sahu
    given-names: Subhajit
    orcid: https://orcid.org/0000-0001-5140-6578
title: "ionicf/copra-communities-seq: Single-threaded CPU-based Community OVerlap PRopagation Algorithm (COPRA) for community detection"
version: 1.0.0
doi: 10.5281/zenodo.7538202
date-released: 2023-01-15

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels