CM++ - A Meta-method for Well-Connected Community Detection
Published in JOSS (2024)
https://github.com/illinois-or-research-analytics/cm_pipeline
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 18 DOI reference(s) in README
- ✓ Academic publication links: joss.theoj.org, zenodo.org
- ✓ Committers with academic emails: 5 of 11 committers (45.5%) from academic institutions
- ✓ Institutional organization owner: organization illinois-or-research-analytics has institutional domain (grainger.illinois.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.0%) to scientific vocabulary
Repository
Pipeline that uses an improved version of CM for generating well-connected graph clusters
Basic Info
Statistics
- Stars: 5
- Watchers: 3
- Forks: 5
- Open Issues: 10
- Releases: 16
Metadata Files
README.md
CM++ Pipeline
Customizable, modular pipeline for testing an improved version of CM for generating well-connected clusters. The image below is from the arXiv preprint Park et al. (2023). A GUI is now available at https://github.com/illinois-or-research-analytics/cm_pipeline/tree/main.
Figure: CM Pipeline Overview. The Connectivity Modifier pipeline starts with an input network and a clustering algorithm. In the first step, an initial clustering of the entire network is obtained. Clusters below size $B$ (default: $B=11$) and tree clusters are then removed from this initial clustering. Each remaining cluster is processed as follows: if the cluster has an edge cut below the threshold (default: $\log_{10}{n}$, where $n$ is the number of nodes in the cluster), the edge cut is removed and the two pieces are re-clustered. This repeats until all clusters are well-connected. A final processing step removes small clusters. Both $B$ and the connectivity threshold are user-defined, and the user may choose not to apply the recursive clustering; see the information on CC and WCC.
Documentation
For the full documentation see here
Overview
Main Features
By default, CM modifies an input clustering to ensure that each cluster is well-connected. It does this by alternating rounds of minimum cuts and re-clustering. CM can also be run without recursive clustering (CC and WCC).
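The cut-and-recluster loop can be sketched as follows. This is a toy, stdlib-only illustration under simplifying assumptions (brute-force minimum cut, and splitting along the cut in place of re-running a clustering algorithm), not the actual CM++ implementation:

```python
import math
from itertools import combinations

def min_cut(nodes, edges):
    """Brute-force global minimum edge cut: try every bipartition.

    Exponential in |nodes| -- only suitable for tiny illustrative graphs.
    Returns (cut_size, one_side_of_the_cut).
    """
    nodes = list(nodes)
    best = (len(edges) + 1, None)
    for r in range(1, len(nodes) // 2 + 1):
        for side in combinations(nodes, r):
            s = set(side)
            cut = sum(1 for u, v in edges if (u in s) != (v in s))
            if cut < best[0]:
                best = (cut, s)
    return best

def cm_split(nodes, edges, threshold=lambda n: math.log10(n)):
    """Repeatedly cut a cluster until every piece is well-connected,
    i.e. its minimum edge cut exceeds threshold(n)."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return [nodes]
    cut, side = min_cut(nodes, edges)
    if cut > threshold(len(nodes)):  # already well-connected: keep as-is
        return [nodes]
    pieces = []
    for part in (set(side), nodes - set(side)):
        sub = [(u, v) for u, v in edges if u in part and v in part]
        pieces.extend(cm_split(part, sub, threshold))
    return pieces

# Two 5-cliques joined by a single bridge edge: the bridge is cut
# (min cut 1 <= log10(10) = 1), and each clique survives (min cut 4).
clique = lambda lo: [(u, v) for u in range(lo, lo + 5) for v in range(u + 1, lo + 5)]
edges = clique(0) + clique(5) + [(4, 5)]
parts = cm_split(range(10), edges)
```

In the real pipeline, the two sides of a cut are handed back to the chosen clustering algorithm rather than kept whole, and min cuts are computed with an efficient solver.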
Default CM:
CM under the default settings of $B=11$ and threshold $= \log_{10}{n}$, where $n$ is the number of nodes in the cluster. This removes tree clusters and clusters below size $B$, and ensures that each remaining cluster has a minimum edge cut larger than the threshold.
Example command:
- command:
python -m main pipeline.json
- pipeline.json:
{
"title": <custom name for this run>,
"name": <custom name of your network>,
"input_file": <path to your network edgelist>,
"output_dir": <output directory>,
"algorithm": <clustering algorithm e.g., ikc, leiden, leiden_mod>,
"params": [
{
<parameter name e.g., res, i>: <parameter value>
}
],
"stages": [
{
"name": "cleanup"
},
{
"name": "clustering",
"parallel_limit": 2
},
{
"name": "stats",
"parallel_limit": 2
},
{
"name": "filtering",
"scripts": [
"./scripts/subset_graph_nonetworkit_treestar.R",
"./scripts/make_cm_ready.R"
]
},
{
"name": "connectivity_modifier",
"memprof": <boolean for whether to profile memory e.g., true or false>,
"threshold": <well-connectedness threshold e.g., 1log10>,
"nprocs": <number of processors for parallelism>,
"quiet": <boolean whether to print outputs to console e.g., true or false>
},
{
"name": "filtering",
"scripts": [
"./scripts/post_cm_filter.R"
]
},
{
"name": "stats",
"parallel_limit": 2
}
]
}
CM without removing small clusters or tree clusters
Example command:
- command:
python -m main pipeline.json
- pipeline.json:
{
    "title": <custom name for this run>,
    "name": <custom name of your network>,
    "input_file": <path to your network edgelist>,
    "output_dir": <output directory>,
    "algorithm": <clustering algorithm e.g., ikc, leiden, leiden_mod>,
    "params": [
        {
            <parameter name e.g., res, i>: <parameter value>
        }
    ],
    "stages": [
        {
            "name": "cleanup"
        },
        {
            "name": "clustering",
            "parallel_limit": 2
        },
        {
            "name": "stats",
            "parallel_limit": 2
        },
        {
            "name": "connectivity_modifier",
            "memprof": <boolean for whether to profile memory e.g., true or false>,
            "threshold": <well-connectedness threshold e.g., 1log10>,
            "nprocs": <number of processors for parallelism>,
            "quiet": <boolean whether to print outputs to console e.g., true or false>
        },
        {
            "name": "stats",
            "parallel_limit": 2
        }
    ]
}
WCC (Well Connected Components)
This is equivalent to running the CM pipeline with the pre- and post-filtering steps turned off and without applying clustering recursively: each cluster is repeatedly cut until it is well-connected. The user may add the filtering steps if desired.
Example command:
- command:
python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c nop --threshold <threshold e.g., 1log10> --nprocs <number of processors>
CC (Connected Components)
Some clustering methods produce disconnected clusters. CC takes such a clustering and returns the connected components of its clusters. The user may add the filtering steps if desired.
Example command:
- command:
python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c nop --threshold 0.1 --nprocs <number of processors>
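What CC computes per cluster is just the connected components of the cluster's induced subgraph. A minimal stdlib sketch of that idea (not the `hm01.cm` implementation):

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Split one cluster into its connected components via BFS."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# A "cluster" whose induced subgraph is disconnected -> two components.
parts = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```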
Clustering methods that can be used with CM
CM currently supports Leiden optimizing the constant Potts model, Leiden optimizing modularity, Iterative K-Core (IKC), Markov Clustering, Infomap, and stochastic block models. An example for the stochastic block model is shown below.
Example command:
- command:
python3 -m hm01.cm -i <input network edgelist path> -e <input existing clustering path> -o <output filepath> -c external -cargs cargs.json -cfile <clusterer file path e.g., path to hm01/clusterers/external_clusterers/sbm_wrapper.py> --threshold <threshold e.g., 1log10> --nprocs <number of processors>
- cargs.json:
{
    <param key e.g., "block_state">: <param value e.g., "non_nested_sbm", "planted_partition_model">,
    <param key 2 e.g., "degree_corrected">: <param value e.g., true, false>
}
CM Pipeline Overview
The CM Pipeline is a modular pipeline for community detection that contains the following modules:
- Graph Cleaning: removal of parallel and duplicate edges as well as self-loops.
- Community Detection: clusters an input network with one of Leiden, IKC, or Infomap.
- Cluster Filtration: a pre-processing stage that lets users filter out clusters that are trees or fall below a given size threshold.
- Community Statistics Reporting: reports node and edge counts, modularity, constant Potts model score, conductance, and edge connectivity at multiple stages.
- Extensibility: developers can design new stages and wire in new algorithms. Please see the following document for instructions on how to expand the list of supported clustering algorithms as a developer.
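As an illustration of the statistics stage, conductance of a cluster is the number of edges crossing its boundary divided by the smaller of the two side volumes (volume = sum of degrees). A stdlib sketch, not the pipeline's stats code:

```python
def conductance(cluster, edges):
    """cut(S, V \\ S) / min(vol(S), vol(V \\ S)), with vol = sum of degrees."""
    s = set(cluster)
    cut = sum(1 for u, v in edges if (u in s) != (v in s))
    vol_s = sum((u in s) + (v in s) for u, v in edges)  # degrees inside S
    vol_rest = 2 * len(edges) - vol_s
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 0.0

# Path graph 1-2-3-4, cluster {1, 2}: one crossing edge, vol(S) = 3.
phi = conductance([1, 2], [(1, 2), (2, 3), (3, 4)])
```

Lower conductance means the cluster is better separated from the rest of the network.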
CM++
CM++ Overview
CM++ is a module within the CM Pipeline, having the following features:
Function: CM++ refines an existing graph clustering by carving its clusters into well-connected clusters with high minimum cut values.
Flexibility: users can combine their own definition of a good community with well-connectedness. CM++ works with any clustering algorithm and presently provides built-in support for Leiden, IKC, and Infomap.
Dynamic Thresholding: connectivity thresholds can be constants, or functions of the number of nodes in the cluster or of the cluster's minimum node degree.
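A spec such as "1log10" (seen in the commands elsewhere in this README) might be resolved per cluster roughly as below. This is a hypothetical helper under assumed syntax; the exact grammar CM++ accepts, including the min-degree option, may differ:

```python
import math

def resolve_threshold(spec, n_nodes):
    """Turn a spec like '1log10' (k * log10(n)) or '0.1' (constant)
    into a numeric bound for a cluster with n_nodes nodes.
    Hypothetical sketch -- not CM++'s actual parser."""
    if spec.endswith("log10"):
        k = float(spec[: -len("log10")] or 1)  # '1log10' -> k = 1
        return k * math.log10(n_nodes)
    return float(spec)  # plain constant threshold, e.g. '0.1'
```

A cluster then counts as well-connected when its minimum edge cut exceeds the resolved value.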
Multi-processing: for better performance, users can specify a larger number of cores so that clusters are processed concurrently.
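Since clusters are independent units of work, the multi-processing feature amounts to mapping them over a worker pool. Sketched here with a thread-backed pool for portability (CM++ itself uses process-level parallelism, and `process_cluster` is a hypothetical stand-in for the per-cluster work):

```python
from multiprocessing.dummy import Pool  # thread-backed pool; stands in for process workers

def process_cluster(cluster):
    # Stand-in for the real per-cluster cut/re-cluster work.
    return len(cluster)

clusters = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
with Pool(4) as pool:  # cf. --nprocs 4
    sizes = pool.map(process_cluster, clusters)  # order of results matches input
```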
Requirements
- macOS or Linux operating system
- Python 3.9 or higher (UPDATE: we are looking into adding support for Python 3.12; for the time being, please use 3.9 <= Python <= 3.11)
- CMake 3.2.0 or higher
- gcc 10 or higher
- R with the following packages:
  - data.table
  - feather
Installation and Setup
There are several strategies for installation.
Installation via Cloning
- Clone the cm_pipeline repository.
- Activate the venv which has the necessary packages.
- Run pip install -r requirements.txt && pip install .
- Make sure everything installed properly by running cd tests && pytest
Installation via pip install
Simply run pip install git+https://github.com/illinois-or-research-analytics/cm_pipeline. This installs CM++; to use the pipeline functionality, please set up via cloning instead.
Example Commands
CM++
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c leiden -g 0.5 --threshold 1log10 --nprocs 4 --quiet
  Runs CM++ on a Leiden clustering with resolution 0.5 and connectivity threshold $\log_{10}(n)$ (every cluster whose minimum cut exceeds the log of its number of nodes is considered "well-connected").
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c ikc -k 10 --threshold 1log10 --nprocs 4 --quiet
  Same idea, but with IKC using hyperparameter $k=10$.
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c nop --threshold 0.1 --nprocs 4 --quiet
  Same idea, but with a nop clusterer: no recursive clustering, only ensuring that every cluster is connected. This is sometimes called CM-cc, or -cc for short.
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c nop --threshold 1log10 --nprocs 4 --quiet
  Same idea, but with a nop clusterer: no recursive clustering, only ensuring that every cluster is well-connected by performing repeated minimum cuts. This is sometimes called CM-wcc, or -wcc for short.
CM Pipeline
- Suppose you have a pipeline configuration like the one here; call it pipeline.json.
- Then, from the root of this repository, run: python -m main pipeline.json
For Developers
Loading a Developer Environment
To quickly set up a developer environment for the CM++ Pipeline, simply run the following commands. (NOTE: Make sure you have Conda installed)
```bash
conda env create -f environment.yml
conda activate
```
Customizing the Pipeline
- The CM++ Pipeline also allows users to add their own pipeline stages and clustering methods.
- Please refer to the customization documentation on how to modify the code to support your own pipeline stages and clustering methods.
Manuscript Data
The data used to generate the speedup curve in the manuscript can be found in the Illinois Databank. All runs used the following command:
python3 -m hm01.cm -i (network file) -e (clustering file) -t 1log10 -g 0.001 -c leiden -q -n (1|2|4|8|16|32)
- CEN: cen_pipeline.tar.gz
- Network: cencmquietpipeline/cen-cm-new-pp-output-20230227-22:50:52/S1cen_cleaned.tsv
- Clustering: cencmquietpipeline/cen-cm-new-pp-output-20230227-22:50:52/res-0.001/S2cen_leiden.0.001.tsv
- CITPatents: citpatents_networks.tar.gz
- Network: citpatentsprocessedcm/citpatents_cleaned.tsv
- Clustering: citpatentsprocessedcm/citpatents_leiden.001.tsv
- Orkut:
Output Files
- The commands executed during the workflow are captured in {output_dir}/{run_name}-{timestamp}/commands.sh, the shell script generated and run by the pipeline to produce the outputs.
- The output files generated during the workflow are stored in the folder {output_dir}/{run_name}-{timestamp}/
- The descriptive analysis files can be found in the folder {output_dir}/{run_name}-{timestamp}/analysis, with a *.csv file for each of the resolution values.
Citations
If you use CM++ in your research, please use the following citation:
```bibtex
@article{Ramavarapu2024,
  doi = {10.21105/joss.06073},
  url = {https://doi.org/10.21105/joss.06073},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {93},
  pages = {6073},
  author = {Vikram Ramavarapu and Fábio Jose Ayres and Minhyuk Park and Vidya Kamath Pailodi and João Alfredo Cardoso Lamy and Tandy Warnow and George Chacko},
  title = {CM++ - A Meta-method for Well-Connected Community Detection},
  journal = {Journal of Open Source Software}
}
```
Other Citations
```bibtex
@article{Park2024,
  title = {Well-connectedness and community detection},
  volume = {1},
  ISSN = {2837-8830},
  url = {http://dx.doi.org/10.1371/journal.pcsy.0000009},
  DOI = {10.1371/journal.pcsy.0000009},
  number = {3},
  journal = {PLOS Complex Systems},
  publisher = {Public Library of Science (PLoS)},
  author = {Park, Minhyuk and Tabatabaee, Yasamin and Ramavarapu, Vikram and Liu, Baqiao and Pailodi, Vidya Kamath and Ramachandran, Rajiv and Korobskiy, Dmitriy and Ayres, Fabio and Chacko, George and Warnow, Tandy},
  editor = {Albert, Réka},
  year = {2024},
  month = nov,
  pages = {e0000009}
}

@software{vikramramavarapu2024_10501118,
  author = {Vikram Ramavarapu and Fabio Jose Ayres and Minhyuk Park and Vidya Kamath P and João Alfredo Cardoso Lamy and Tandy Warnow and George Chacko},
  title = {{CM++ - A Meta-method for Well-Connected Community Detection}},
  month = jan,
  year = 2024,
  publisher = {Zenodo},
  version = {v4.0.1},
  doi = {10.5281/zenodo.10501118},
  url = {https://doi.org/10.5281/zenodo.10501118}
}

@misc{park2023wellconnected,
  title = {Well-Connected Communities in Real-World and Synthetic Networks},
  author = {Minhyuk Park and Yasamin Tabatabaee and Vikram Ramavarapu and Baqiao Liu and Vidya Kamath Pailodi and Rajiv Ramachandran and Dmitriy Korobskiy and Fabio Ayres and George Chacko and Tandy Warnow},
  year = {2023},
  eprint = {2303.02813},
  archivePrefix = {arXiv},
  primaryClass = {cs.SI}
}
```
Owner
- Name: OR_Research_Analytics
- Login: illinois-or-research-analytics
- Kind: organization
- Location: United States of America
- Website: https://grainger.illinois.edu/research
- Repositories: 4
- Profile: https://github.com/illinois-or-research-analytics
GitHub Events
Total
- Watch event: 1
- Push event: 18
- Pull request event: 4
- Create event: 1
Last Year
- Watch event: 1
- Push event: 18
- Pull request event: 4
- Create event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Vikram Ramavarapu | v****2@i****u | 288 |
| Vidya Kamath | p****h@g****m | 63 |
| vidyak2uiuc | 1****c | 35 |
| Minhyuk Park | m****5@g****m | 32 |
| Fabio Ayres | f****a@i****r | 15 |
| George Chacko | c****e | 13 |
| Vidya Kamath Pailodi | v****2@c****u | 10 |
| Vikram Ramavarapu | v****2@c****u | 6 |
| George Chacko | c****e@i****u | 2 |
| alessitomas | t****i@g****m | 1 |
| Joao Alfredo Cardoso Lamy | t****y@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 29
- Total pull requests: 34
- Average time to close issues: 30 days
- Average time to close pull requests: 8 days
- Total issue authors: 8
- Total pull request authors: 6
- Average comments per issue: 1.38
- Average comments per pull request: 0.06
- Merged pull requests: 32
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 2
- Average time to close issues: about 1 hour
- Average time to close pull requests: 1 minute
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.33
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- chackoge (11)
- MinhyukPark (6)
- vidyak2uiuc (3)
- alfredjynx (2)
- LuisScoccola (1)
- alessitomas (1)
- RuneBlaze (1)
Pull Request Authors
- vikramr2 (19)
- MinhyukPark (9)
- vidyak2uiuc (5)
- alessitomas (2)
- alfredjynx (2)
- IanChenUIUC (1)
Dependencies
- HeapDict ==1.0.1
- attrs ==22.2.0
- click ==8.1.3
- colorama ==0.4.6
- coloredlogs ==15.0.1
- connectivity-modifier ==0.1.0b13
- exceptiongroup ==1.1.1
- graphviz ==0.20.1
- humanfriendly ==10.0
- igraph ==0.10.4
- iniconfig ==2.0.0
- jsonpickle ==2.2.0
- leidenalg ==0.9.1
- networkit ==10.1
- networkx ==3.1
- numpy ==1.24.2
- packaging ==23.0
- pandas ==1.5.3
- pip ==20.2.4
- pluggy ==1.0.0
- psutil ==5.9.5
- pytest ==7.2.2
- python-dateutil ==2.8.2
- pytz ==2023.2
- scipy ==1.10.1
- setuptools ==50.3.2
- six ==1.16.0
- structlog ==22.3.0
- texttable ==1.6.7
- tomli ==2.0.1
- treeswift ==1.1.33
- typer ==0.6.1
- typing-extensions ==4.5.0
- actions/checkout v2 composite
- actions/setup-python v2 composite
- peaceiris/actions-gh-pages v3 composite
- actions/checkout v3 composite
- actions/upload-artifact v1 composite
- openjournals/openjournals-draft-action master composite
- HeapDict ==1.0.1
- attrs ==22.2.0
- click ==8.1.3
- colorama ==0.4.6
- coloredlogs ==15.0.1
- connectivity-modifier ==0.1.0b13
- exceptiongroup ==1.1.1
- graphviz ==0.20.1
- humanfriendly ==10.0
- igraph ==0.10.4
- infomap ==2.7
- iniconfig ==2.0.0
- jsonpickle ==2.2.0
- leidenalg ==0.9.1
- networkit ==10.1
- networkx ==3.1
- numpy ==1.24.2
- packaging ==23.0
- pandas ==1.5.3
- pip ==20.2.4
- pluggy ==1.0.0
- psutil ==5.9.5
- pytest ==7.2.2
- python-dateutil ==2.8.2
- pytz ==2023.2
- scipy ==1.10.1
- setuptools ==50.3.2
- six ==1.16.0
- structlog ==22.3.0
- texttable ==1.6.7
- tomli ==2.0.1
- treeswift ==1.1.33
- typer ==0.6.1
- typing-extensions ==4.5.0