https://github.com/aggrathon/rslc

Robust Single-Linkage Clustering

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.7%) to scientific vocabulary

Keywords

clustering data-science hierarchical-clustering machine-learning machine-learning-algorithms python r robust-machine-learning rust single-linkage

Last synced: 5 months ago · JSON representation

Repository

Robust Single-Linkage Clustering

Basic Info

Host: GitHub
Owner: Aggrathon
License: mit
Language: Rust
Default Branch: master
Homepage:
Size: 207 KB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 2

Topics

clustering data-science hierarchical-clustering machine-learning machine-learning-algorithms python r robust-machine-learning rust single-linkage

Created over 5 years ago · Last pushed over 5 years ago

Metadata Files

Readme License

Robust Single-Linkage Clustering

Robust Single-Linkage Clustering is a robust variant of hierarchical clustering with a single-linkage merging function. The single-linkage strategy merges two clusters based on the minimum distance between any item in one cluster to any item in the other cluster. The advantage of single-linkage is that it is able to capture non-spherical clusters. The main downside is that it is really prone to capture outliers, i.e., items (or small groups of items) so far away from all other items that they create their own clusters, which stops single-linkage for finding the desired clusters, unless the number of clusters is quite large. RSLC mitigates this issue with small outlier clusters by restricting the minimum size of the clusters.

To do this the clusters are created "top-down", by splitting clusters, rather than "bottom-up", by merging clusters. To do this we first need to recognise the connection between single-linkage and minimal spanning trees in graphs. If you remove the k-1 largest edges in a minimal spanning tree you get k disconnected subtrees, or clusters. Thus, by turning the distances between the items into a fully connected graph the "Reverse-delete algorithm" for finding a minimal spanning tree can be used for single-linkage clustering, as long as we modify the algorithm to allow disconnected subtrees.

Furthermore, to make the algorithm more robust towards outliers we want to avoid too small clusters. After every edge removal we check if a new cluster is created. If the new cluster is smaller than a threshold then we can mark the items in that cluster as outliers and re-add the edge (just like how reverse-delete usually avoids new clusters).

Pseudocode

{pseudocode} Create an undirected fully connected graph G<N, E>, where the weights of the edges in E are the distances between the items (nodes) in N. Sort E in descending order. For each edge e in E: Remove e from E. If this splits a subgraph into two disconnected subgraphs G1 and G2: If |G1| < min_size: Add e to E. mark every node n in G1 as an outlier. If |G2| < min_size: Add e to E. mark every node n in G2 as an outlier. If the desired number of disconnected subgraphs in G is reached: Return the disconnected subgraphs in G as clusters.

Complexity

There are |E| = n * (n + 1) / 2 edges in the graph, where n is the number of items, so sorting the edges is O(n^2 log(n)). Looping through the edges takes O(n^2) and checking for subgraphs takes at most O(n^2). This suggest a total complexity of O(n^4) which is quite horrible, but in practise it will probably be closer to O(n^3) since the whole graph doesn't have to be evaluated in order to check for disconnected subgraphs.

By running the algorithm on different sized datasets and measuring the time we can fit polynomials to approximate the complexity. Here we can see that the third and fourth degree polynomials are almost identical and almost perfectly match the measured times, which suggests that O(n^3) might indeed be the real complexity (although I currently don't have a proof for that).

Empirical Complexity

This also shows that, on a modern machine, RSLC is suitable for interactive use on datasets up to a size of 5000. Note that the time measured here only considers RSLC, things like calculating the distance matrix takes additional time.

Installation

To use the algorithm in Rust add rslc = { git = "https://github.com/Aggrathon/rslc/" } under [dependencies] in your Cago.toml.

There are also API bindings for Python and R in respective subdirectories.

Owner

Name: Anton Björklund
Login: Aggrathon
Kind: user
Company: @edahelsinki

Repositories: 3
Profile: https://github.com/Aggrathon

GitHub Events

Total

Last Year

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/aggrathon/rslc

Science Score: 13.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Robust Single-Linkage Clustering

Pseudocode

Complexity

Installation

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels