CM++ - A Meta-method for Well-Connected Community Detection

Published in the Journal of Open Source Software (JOSS), 2024

https://github.com/illinois-or-research-analytics/cm_pipeline

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 18 DOI reference(s) in README
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    5 of 11 committers (45.5%) from academic institutions
  • Institutional organization owner
    Organization illinois-or-research-analytics has institutional domain (grainger.illinois.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Scientific Fields

Mathematics, Computer Science - 40% confidence
Last synced: 6 months ago

Repository

Pipeline that uses an improved version of CM for generating well-connected graph clusters

Basic Info
  • Host: GitHub
  • Owner: illinois-or-research-analytics
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 60.7 MB
Statistics
  • Stars: 5
  • Watchers: 3
  • Forks: 5
  • Open Issues: 10
  • Releases: 16
Created about 3 years ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

CM++ Pipeline


Customizable, modular pipeline for testing an improved version of CM for generating well-connected clusters. The image below is from the arXiv preprint Park et al. (2023). A GUI is now available at https://github.com/illinois-or-research-analytics/cm_pipeline/tree/main.

Figure: CM Pipeline Overview. The Connectivity Modifier pipeline starts with an input network and a clustering algorithm. In the first step, an initial clustering of the entire network is obtained. Next, clusters below size $B$ (default: $B=11$) and tree clusters are removed from this initial clustering. Each cluster in the filtered clustering is then processed as follows: if a cluster has a minimum edge cut below the threshold (default: $\log_{10}{n}$, where $n$ is the number of nodes in the cluster), the edge cut is removed and the two resulting pieces are re-clustered. This process repeats until all clusters are well-connected. A final step removes small clusters. Both $B$ and the connectivity threshold are user-defined, and the user has the option to not apply the recursive clustering. See the information on CC and WCC.

Documentation

For the full documentation see here

Overview

Main Features

By default, CM modifies an input clustering to ensure that each cluster is well-connected. CM does this through rounds of mincut and re-clustering. It is also possible to run CM in ways that do not recursively cluster (CC and WCC).

Default CM:

Under the default settings of $B=11$ and threshold $\log_{10}{n}$, where $n$ is the number of nodes in the cluster, CM removes tree clusters and clusters below size $B$, and ensures that each remaining cluster has a minimum edge cut size greater than the threshold.
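For intuition, the default well-connectedness test can be sketched in a few lines of Python with networkx. This is a minimal illustration only, not the pipeline's actual implementation, which uses faster mincut code:

```python
import math

import networkx as nx


def is_well_connected(cluster: nx.Graph) -> bool:
    """Default CM criterion (sketch): min edge cut size > log10(n)."""
    n = cluster.number_of_nodes()
    if n < 2 or not nx.is_connected(cluster):
        return False
    cut_value, _ = nx.stoer_wagner(cluster)  # global minimum edge cut
    return cut_value > math.log10(n)
```

For example, a clique $K_6$ passes (mincut 5 > $\log_{10}(6) \approx 0.78$), while two copies of $K_6$ joined by a single edge fail (mincut 1 < $\log_{10}(12) \approx 1.08$).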

Click to expand example command

  • command: python -m main pipeline.json
  • pipeline.json:

```json
{
  "title": <custom name for this run>,
  "name": <custom name of your network>,
  "input_file": <path to your network edgelist>,
  "output_dir": <output directory>,
  "algorithm": <clustering algorithm e.g., ikc, leiden, leiden_mod>,
  "params": [
    { <parameter name e.g., res, i>: <parameter value> }
  ],
  "stages": [
    { "name": "cleanup" },
    { "name": "clustering", "parallel_limit": 2 },
    { "name": "stats", "parallel_limit": 2 },
    {
      "name": "filtering",
      "scripts": [
        "./scripts/subset_graph_nonetworkit_treestar.R",
        "./scripts/make_cm_ready.R"
      ]
    },
    {
      "name": "connectivity_modifier",
      "memprof": <boolean for whether to profile memory e.g., true or false>,
      "threshold": <well-connectedness threshold e.g., 1log10>,
      "nprocs": <number of processors for parallelism>,
      "quiet": <boolean whether to print outputs to console e.g., true or false>
    },
    { "name": "filtering", "scripts": ["./scripts/post_cm_filter.R"] },
    { "name": "stats", "parallel_limit": 2 }
  ]
}
```

CM without removing small clusters or tree clusters

Click to expand example command

  • command: python -m main pipeline.json
  • pipeline.json:

```json
{
  "title": <custom name for this run>,
  "name": <custom name of your network>,
  "input_file": <path to your network edgelist>,
  "output_dir": <output directory>,
  "algorithm": <clustering algorithm e.g., ikc, leiden, leiden_mod>,
  "params": [
    { <parameter name e.g., res, i>: <parameter value> }
  ],
  "stages": [
    { "name": "cleanup" },
    { "name": "clustering", "parallel_limit": 2 },
    { "name": "stats", "parallel_limit": 2 },
    {
      "name": "connectivity_modifier",
      "memprof": <boolean for whether to profile memory e.g., true or false>,
      "threshold": <well-connectedness threshold e.g., 1log10>,
      "nprocs": <number of processors for parallelism>,
      "quiet": <boolean whether to print outputs to console e.g., true or false>
    },
    { "name": "stats", "parallel_limit": 2 }
  ]
}
```

WCC (Well Connected Components)

This is equivalent to running the CM pipeline with the pre- and post-filtering steps turned off and without applying clustering recursively: each cluster is repeatedly cut until it is well-connected. The user has the option of adding the filtering steps back in if desired.
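The repeated-cut loop behind WCC can be sketched as follows. This is an illustration using networkx, not the optimized code in hm01.cm; the handling of singletons and disconnected pieces here is an assumption:

```python
import math

import networkx as nx


def wcc_refine(G: nx.Graph, nodes, threshold=lambda n: math.log10(n)):
    """Repeatedly cut each cluster until every piece is well-connected."""
    kept, stack = [], [set(nodes)]
    while stack:
        part = stack.pop()
        sub = G.subgraph(part)
        if sub.number_of_nodes() < 2:
            continue  # singletons cannot be cut further (assumed handling)
        if not nx.is_connected(sub):
            stack.extend(set(c) for c in nx.connected_components(sub))
            continue
        cut_value, (side_a, side_b) = nx.stoer_wagner(sub)
        if cut_value > threshold(sub.number_of_nodes()):
            kept.append(part)                    # well-connected: keep as-is
        else:
            stack += [set(side_a), set(side_b)]  # cut and re-examine both sides
    return kept
```

On two $K_6$ cliques joined by a single edge, one pass removes the bridge and both cliques are then accepted as well-connected.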

Click to expand example command

  • command: python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c nop --threshold <threshold e.g., 1log10> --nprocs <number of processors>

CC (Connected Components)

Some clustering methods produce disconnected clusters. CC takes such a clustering and returns the connected components of its clusters. The user has the option of adding in the filtering steps if desired.
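Conceptually, the CC step amounts to the following sketch; the output cluster IDs are assigned arbitrarily here, which is an assumption of this illustration:

```python
import networkx as nx


def split_into_components(G: nx.Graph, clustering: dict) -> dict:
    """Replace each (possibly disconnected) cluster with its connected
    components. `clustering` maps a cluster id to a set of node ids."""
    out, next_id = {}, 0
    for nodes in clustering.values():
        for comp in nx.connected_components(G.subgraph(nodes)):
            out[next_id] = set(comp)
            next_id += 1
    return out
```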

Click to expand example command

  • command: python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c nop --threshold 0.1 --nprocs <number of processors>

Clustering methods that can be used with CM

CM currently enables the use of Leiden optimizing the Constant Potts Model, Leiden optimizing modularity, Iterative K-Core, Markov Clustering, Infomap, and Stochastic Block Models. An example for Stochastic Block Models is shown below.

Click to expand example command

  • command: python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c external -cargs cargs.json -cfile <clusterer file path e.g., path to hm01/clusterers/external_clusterers/sbm_wrapper.py> --threshold <threshold e.g., 1log10> --nprocs <number of processors>
  • cargs.json:

```json
{
  <param key e.g., "block_state">: <param value e.g., "non_nested_sbm", "planted_partition_model">,
  <param key 2 e.g., "degree_corrected">: <param value e.g., true, false>
}
```

CM Pipeline Overview

The CM Pipeline is a modular pipeline for community detection that contains the following modules:

  • Graph Cleaning: Removal of parallel and duplicate edges as well as self loops

  • Community Detection: Clusters an input network with Leiden, IKC, or Infomap.

  • Cluster Filtration: A pre-processing stage that allows users to filter out clusters that are trees or have size below a given threshold.

  • Community Statistics Reporting: Generates node and edge count, modularity score, Constant Potts Model score, conductance, and edge-connectivity at multiple stages.

  • Extensibility: Developers can design new stages and wire in new algorithms. Please see the documentation for instructions on how to expand the list of supported clustering algorithms as a developer.

  • CM++
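Some of the reported statistics can be reproduced with networkx. This is a sketch only: the pipeline's stats stage also reports Constant Potts Model scores and edge connectivity, which are omitted here, and this version assumes the clusters partition the node set for the modularity call:

```python
import networkx as nx


def cluster_stats(G: nx.Graph, clusters: dict):
    """Per-cluster node/edge counts and conductance, plus overall modularity."""
    rows = []
    for cid, nodes in clusters.items():
        sub = G.subgraph(nodes)
        rows.append({
            "cluster": cid,
            "n": sub.number_of_nodes(),
            "m": sub.number_of_edges(),
            # conductance is undefined if the cluster is empty or covers all of G
            "conductance": nx.conductance(G, nodes)
            if 0 < len(nodes) < G.number_of_nodes() else None,
        })
    modularity = nx.community.modularity(G, list(clusters.values()))
    return rows, modularity
```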

CM++ Overview

CM++ is a module within the CM Pipeline, having the following features:

  • Function: CM++ refines an existing graph clustering by carving its clusters into well-connected clusters with high minimum cut values.

  • Flexibility: Users can augment their own definition of a good community with well-connectedness. CM++ works with any clustering algorithm and presently provides built-in support for Leiden, IKC, and Infomap.

  • Dynamic Thresholding: Connectivity thresholds can be constants, functions of the number of nodes in the cluster, or functions of the minimum node degree of the cluster.

  • Multi-processing: For better performance, users can specify a larger number of cores to process clusters concurrently.
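The threshold specs used on the command line (e.g. `1log10`) can be interpreted roughly as follows. The exact grammar CM++ accepts is not documented here, so this parsing, including the `mcd` token standing in for the minimum-node-degree option, is an assumption for illustration:

```python
import math


def eval_threshold(spec: str, n: int, min_degree=None) -> float:
    """Turn a spec like '1log10' or '0.1' into a numeric threshold for a
    cluster with n nodes. 'mcd' is a hypothetical token for the
    minimum-node-degree option."""
    if spec.endswith("log10"):
        k = spec[: -len("log10")]          # leading multiplier, default 1
        return (float(k) if k else 1.0) * math.log10(n)
    if spec == "mcd":
        return float(min_degree)
    return float(spec)                     # plain constant threshold
```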

Requirements

  • MacOS or Linux operating system
  • python3.9 or higher (UPDATE: We are looking into adding support for python3.12. For the time being, please use 3.9<=python<=3.11)
  • cmake 3.2.0 or higher
  • gcc@10 or higher
  • R with the following packages:
    • data.table
    • feather

Installation and Setup

There are several ways to install the pipeline.

Installation via Cloning

  • Clone the cm_pipeline repository
  • Activate the venv which has the necessary packages
  • Run pip install -r requirements.txt && pip install .
  • Make sure everything installed properly by running cd tests && pytest

Installation via pip install

Simply run pip install git+https://github.com/illinois-or-research-analytics/cm_pipeline. This installs CM++, but to use the full pipeline functionality, please set up via cloning.

Example Commands

CM++

  • python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c leiden -g 0.5 --threshold 1log10 --nprocs 4 --quiet
    • Runs CM++ on a Leiden clustering with resolution 0.5, using connectivity threshold $\log_{10}(n)$ (every cluster whose minimum edge cut exceeds the log of its number of nodes is considered "well-connected")
  • python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c ikc -k 10 --threshold 1log10 --nprocs 4 --quiet
    • Similar idea but with IKC having hyperparameter $k=10$.
  • python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c nop --threshold 0.1 --nprocs 4 --quiet
    • Similar idea but with a nop clusterer, meaning it won't recursively cluster and will only ensure that every cluster is connected. This is sometimes called CM-cc, or just -cc for short.
  • python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c nop --threshold 1log10 --nprocs 4 --quiet
    • Similar idea but with a nop clusterer, meaning it won't recursively cluster and will only ensure that every cluster is well-connected by performing repeated mincuts. This is sometimes called CM-wcc, or just -wcc for short.

CM Pipeline

  • Suppose you have a pipeline like the one here. Call it pipeline.json
  • Then from the root of this repository run:
    • python -m main pipeline.json

For Developers

Loading a Developer Environment

To quickly set up a developer environment for the CM++ Pipeline, simply run the following commands. (NOTE: Make sure you have Conda installed)

```bash
conda env create -f environment.yml
conda activate
```

Customizing the Pipeline

  • The CM++ Pipeline also allows for users to add their own pipeline stages and clustering methods.
  • Please refer to the customization documentation on how to modify the code to allow for your own pipeline stages and clustering methods.

Manuscript Data

The data used to generate the speedup curve in the manuscript can be found in the Illinois Databank. In all runs, we used the following command:

```bash
python3 -m hm01.cm -i (network file) -e (clustering file) -t 1log10 -g 0.001 -c leiden -q -n (1|2|4|8|16|32)
```

  • CEN: cen_pipeline.tar.gz
    • Network: cencmquietpipeline/cen-cm-new-pp-output-20230227-22:50:52/S1cen_cleaned.tsv
    • Clustering: cencmquietpipeline/cen-cm-new-pp-output-20230227-22:50:52/res-0.001/S2cen_leiden.0.001.tsv
  • CITPatents: citpatents_networks.tar.gz
    • Network: citpatentsprocessedcm/citpatents_cleaned.tsv
    • Clustering: citpatentsprocessedcm/citpatents_leiden.001.tsv
  • Orkut:

Output Files

  • The commands executed during the workflow are captured in {output_dir}/{run_name}-{timestamp}/commands.sh. This is the shell script generated by the pipeline that is run to generate outputs.
  • The output files generated during the workflow are stored in the folder {output_dir}/{run_name}-{timestamp}/
  • The descriptive analysis files can be found in the folder {output_dir}/{run_name}-{timestamp}/analysis with the *.csv file for each of the resolution values.


Citations

If you use CM++ in your research, please use the following citation:

```bibtex
@article{Ramavarapu2024,
  doi       = {10.21105/joss.06073},
  url       = {https://doi.org/10.21105/joss.06073},
  year      = {2024},
  publisher = {The Open Journal},
  volume    = {9},
  number    = {93},
  pages     = {6073},
  author    = {Vikram Ramavarapu and Fábio Jose Ayres and Minhyuk Park and Vidya Kamath Pailodi and João Alfredo Cardoso Lamy and Tandy Warnow and George Chacko},
  title     = {CM++ - A Meta-method for Well-Connected Community Detection},
  journal   = {Journal of Open Source Software}
}
```

Other Citations

```bibtex
@article{Park2024,
  title     = {Well-connectedness and community detection},
  volume    = {1},
  ISSN      = {2837-8830},
  url       = {http://dx.doi.org/10.1371/journal.pcsy.0000009},
  DOI       = {10.1371/journal.pcsy.0000009},
  number    = {3},
  journal   = {PLOS Complex Systems},
  publisher = {Public Library of Science (PLoS)},
  author    = {Park, Minhyuk and Tabatabaee, Yasamin and Ramavarapu, Vikram and Liu, Baqiao and Pailodi, Vidya Kamath and Ramachandran, Rajiv and Korobskiy, Dmitriy and Ayres, Fabio and Chacko, George and Warnow, Tandy},
  editor    = {Albert, Réka},
  year      = {2024},
  month     = nov,
  pages     = {e0000009}
}

@software{vikramramavarapu2024_10501118,
  author    = {Vikram Ramavarapu and Fabio Jose Ayres and Minhyuk Park and Vidya Kamath P and João Alfredo Cardoso Lamy and Tandy Warnow and George Chacko},
  title     = {{CM++ - A Meta-method for Well-Connected Community Detection}},
  month     = jan,
  year      = {2024},
  publisher = {Zenodo},
  version   = {v4.0.1},
  doi       = {10.5281/zenodo.10501118},
  url       = {https://doi.org/10.5281/zenodo.10501118}
}

@misc{park2023wellconnected,
  title         = {Well-Connected Communities in Real-World and Synthetic Networks},
  author        = {Minhyuk Park and Yasamin Tabatabaee and Vikram Ramavarapu and Baqiao Liu and Vidya Kamath Pailodi and Rajiv Ramachandran and Dmitriy Korobskiy and Fabio Ayres and George Chacko and Tandy Warnow},
  year          = {2023},
  eprint        = {2303.02813},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SI}
}
```

Owner

  • Name: OR_Research_Analytics
  • Login: illinois-or-research-analytics
  • Kind: organization
  • Location: United States of America

GitHub Events

Total
  • Watch event: 1
  • Push event: 18
  • Pull request event: 4
  • Create event: 1
Last Year
  • Watch event: 1
  • Push event: 18
  • Pull request event: 4
  • Create event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 466
  • Total Committers: 11
  • Avg Commits per committer: 42.364
  • Development Distribution Score (DDS): 0.382
Past Year
  • Commits: 12
  • Committers: 2
  • Avg Commits per committer: 6.0
  • Development Distribution Score (DDS): 0.167
Top Committers
Name Email Commits
Vikram Ramavarapu v****2@i****u 288
Vidya Kamath p****h@g****m 63
vidyak2uiuc 1****c 35
Minhyuk Park m****5@g****m 32
Fabio Ayres f****a@i****r 15
George Chacko c****e 13
Vidya Kamath Pailodi v****2@c****u 10
Vikram Ramavarapu v****2@c****u 6
George Chacko c****e@i****u 2
alessitomas t****i@g****m 1
Joao Alfredo Cardoso Lamy t****y@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 29
  • Total pull requests: 34
  • Average time to close issues: 30 days
  • Average time to close pull requests: 8 days
  • Total issue authors: 8
  • Total pull request authors: 6
  • Average comments per issue: 1.38
  • Average comments per pull request: 0.06
  • Merged pull requests: 32
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: about 1 hour
  • Average time to close pull requests: 1 minute
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • chackoge (11)
  • MinhyukPark (6)
  • vidyak2uiuc (3)
  • alfredjynx (2)
  • LuisScoccola (1)
  • alessitomas (1)
  • RuneBlaze (1)
Pull Request Authors
  • vikramr2 (19)
  • MinhyukPark (9)
  • vidyak2uiuc (5)
  • alessitomas (2)
  • alfredjynx (2)
  • IanChenUIUC (1)

Dependencies

requirements.txt pypi
  • HeapDict ==1.0.1
  • attrs ==22.2.0
  • click ==8.1.3
  • colorama ==0.4.6
  • coloredlogs ==15.0.1
  • connectivity-modifier ==0.1.0b13
  • exceptiongroup ==1.1.1
  • graphviz ==0.20.1
  • humanfriendly ==10.0
  • igraph ==0.10.4
  • iniconfig ==2.0.0
  • jsonpickle ==2.2.0
  • leidenalg ==0.9.1
  • networkit ==10.1
  • networkx ==3.1
  • numpy ==1.24.2
  • packaging ==23.0
  • pandas ==1.5.3
  • pip ==20.2.4
  • pluggy ==1.0.0
  • psutil ==5.9.5
  • pytest ==7.2.2
  • python-dateutil ==2.8.2
  • pytz ==2023.2
  • scipy ==1.10.1
  • setuptools ==50.3.2
  • six ==1.16.0
  • structlog ==22.3.0
  • texttable ==1.6.7
  • tomli ==2.0.1
  • treeswift ==1.1.33
  • typer ==0.6.1
  • typing-extensions ==4.5.0
.github/workflows/documentation.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • peaceiris/actions-gh-pages v3 composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
setup.py pypi
environment.yml pypi
  • HeapDict ==1.0.1
  • attrs ==22.2.0
  • click ==8.1.3
  • colorama ==0.4.6
  • coloredlogs ==15.0.1
  • connectivity-modifier ==0.1.0b13
  • exceptiongroup ==1.1.1
  • graphviz ==0.20.1
  • humanfriendly ==10.0
  • igraph ==0.10.4
  • infomap ==2.7
  • iniconfig ==2.0.0
  • jsonpickle ==2.2.0
  • leidenalg ==0.9.1
  • networkit ==10.1
  • networkx ==3.1
  • numpy ==1.24.2
  • packaging ==23.0
  • pandas ==1.5.3
  • pip ==20.2.4
  • pluggy ==1.0.0
  • psutil ==5.9.5
  • pytest ==7.2.2
  • python-dateutil ==2.8.2
  • pytz ==2023.2
  • scipy ==1.10.1
  • setuptools ==50.3.2
  • six ==1.16.0
  • structlog ==22.3.0
  • texttable ==1.6.7
  • tomli ==2.0.1
  • treeswift ==1.1.33
  • typer ==0.6.1
  • typing-extensions ==4.5.0