https://github.com/aggrathon/rslc

Robust Single-Linkage Clustering

https://github.com/aggrathon/rslc

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

clustering data-science hierarchical-clustering machine-learning machine-learning-algorithms python r robust-machine-learning rust single-linkage
Last synced: 5 months ago · JSON representation

Repository

Robust Single-Linkage Clustering

Basic Info
  • Host: GitHub
  • Owner: Aggrathon
  • License: mit
  • Language: Rust
  • Default Branch: master
  • Homepage:
  • Size: 207 KB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
clustering data-science hierarchical-clustering machine-learning machine-learning-algorithms python r robust-machine-learning rust single-linkage
Created over 5 years ago · Last pushed over 5 years ago
Metadata Files
Readme License

README.md

Robust Single-Linkage Clustering

Robust Single-Linkage Clustering is a robust variant of hierarchical clustering with a single-linkage merging function. The single-linkage strategy merges two clusters based on the minimum distance between any item in one cluster to any item in the other cluster. The advantage of single-linkage is that it is able to capture non-spherical clusters. The main downside is that it is really prone to capture outliers, i.e., items (or small groups of items) so far away from all other items that they create their own clusters, which stops single-linkage for finding the desired clusters, unless the number of clusters is quite large. RSLC mitigates this issue with small outlier clusters by restricting the minimum size of the clusters.

To do this the clusters are created "top-down", by splitting clusters, rather than "bottom-up", by merging clusters. To do this we first need to recognise the connection between single-linkage and minimal spanning trees in graphs. If you remove the k-1 largest edges in a minimal spanning tree you get k disconnected subtrees, or clusters. Thus, by turning the distances between the items into a fully connected graph the "Reverse-delete algorithm" for finding a minimal spanning tree can be used for single-linkage clustering, as long as we modify the algorithm to allow disconnected subtrees.

Furthermore, to make the algorithm more robust towards outliers we want to avoid too small clusters. After every edge removal we check if a new cluster is created. If the new cluster is smaller than a threshold then we can mark the items in that cluster as outliers and re-add the edge (just like how reverse-delete usually avoids new clusters).

Pseudocode

{pseudocode} Create an undirected fully connected graph G<N, E>, where the weights of the edges in E are the distances between the items (nodes) in N. Sort E in descending order. For each edge e in E: Remove e from E. If this splits a subgraph into two disconnected subgraphs G1 and G2: If |G1| < min_size: Add e to E. mark every node n in G1 as an outlier. If |G2| < min_size: Add e to E. mark every node n in G2 as an outlier. If the desired number of disconnected subgraphs in G is reached: Return the disconnected subgraphs in G as clusters.

Complexity

There are |E| = n * (n + 1) / 2 edges in the graph, where n is the number of items, so sorting the edges is O(n^2 log(n)). Looping through the edges takes O(n^2) and checking for subgraphs takes at most O(n^2). This suggest a total complexity of O(n^4) which is quite horrible, but in practise it will probably be closer to O(n^3) since the whole graph doesn't have to be evaluated in order to check for disconnected subgraphs.

By running the algorithm on different sized datasets and measuring the time we can fit polynomials to approximate the complexity. Here we can see that the third and fourth degree polynomials are almost identical and almost perfectly match the measured times, which suggests that O(n^3) might indeed be the real complexity (although I currently don't have a proof for that).

Empirical Complexity

This also shows that, on a modern machine, RSLC is suitable for interactive use on datasets up to a size of 5000. Note that the time measured here only considers RSLC, things like calculating the distance matrix takes additional time.

Installation

To use the algorithm in Rust add rslc = { git = "https://github.com/Aggrathon/rslc/" } under [dependencies] in your Cago.toml.

There are also API bindings for Python and R in respective subdirectories.

Owner

  • Name: Anton Björklund
  • Login: Aggrathon
  • Kind: user
  • Company: @edahelsinki

GitHub Events

Total
Last Year

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels