fastcluster.jl

Julia wrapper for the fastcluster library for hierarchical clustering

https://github.com/jmboehm/fastcluster.jl

Repository

Basic Info
  • Host: GitHub
  • Owner: jmboehm
  • License: mit
  • Language: Julia
  • Default Branch: master
  • Homepage:
  • Size: 146 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created about 8 years ago · Last pushed over 5 years ago

README.md

Fastcluster.jl

Julia wrapper to Daniel Muellner's fastcluster library for hierarchical clustering.

Installation

```julia
Pkg.clone("http://github.com/jmboehm/Fastcluster.jl.git")
```

Usage

The main function is

```julia
linkage(d::Array{T,2}, method::Symbol) where {T<:Real}
```

which returns a tuple `m, h` that contains the dendrogram information. The input arguments are:

- `d::Array{Float64,2}` is the dissimilarity matrix between the points to cluster. You can use the Distances.jl package to generate the dissimilarity matrix (see example below).
- `method::Symbol` is one of the following: `:single`, `:complete`, `:average`, `:weighted`, `:ward`, `:centroid`, `:median`. These clustering methods are described in the documentation of fastcluster. Note that the behavior of `:ward` differs from that of the R and Python interfaces (see below).

The function

```julia
linkage!(d::Array{T,2}, method::Symbol) where {T<:Real}
```

is a memory-saving alternative that allows fastcluster to overwrite some of the content of `d` instead of allocating more memory for the computations.

Finally, you cut the dendrogram at a particular height to get a specified number of clusters k with the function

```julia
function cutree(m::Vector{Int32}, nobs::Int64, k::Int64)
```

where

- `m::Vector{Int32}` is the `m` component of the dendrogram returned by `linkage()`.
- `nobs::Int64` is the number of original observations. By default, that is `(length(m)>>1)+1`.
- `k::Int64` is the desired number of clusters.

The behavior of this function is very similar to that of its counterparts in R and Python.
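The default for `nobs` can be read as a small piece of arithmetic. The sketch below assumes that `m` stores two entries per merge step, which is what the expression `(length(m)>>1)+1` suggests: a dendrogram over `n` observations has `n - 1` merges, so `m` would have `2(n - 1)` entries. The helper name `nobs_from_merge` is hypothetical and not part of Fastcluster.jl.

```julia
# Hypothetical helper (not part of Fastcluster.jl): recover the number of
# observations from the length of the merge vector, assuming two entries
# per merge step.
nobs_from_merge(m::AbstractVector) = (length(m) >> 1) + 1

# Stand-in merge vector for 150 observations: 149 merges, 2 entries each.
m = zeros(Int32, 2 * (150 - 1))

nobs = nobs_from_merge(m)
```

Here `length(m)` is 298, so the shift gives 149 and `nobs` is 150, the number of rows in the iris data used in the example.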

Example

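```julia
using RDatasets, Distances, Fastcluster

# Load the iris data and keep two numeric columns as a Float64 matrix
df = dataset("datasets", "iris")
points = convert(Array{Float64,2}, df[:, [:SepalWidth, :SepalLength]])

# Pairwise Euclidean distances between observations (one per row)
d = pairwise(Euclidean(), points, dims=1)

# Single-linkage clustering, then cut the dendrogram into 3 clusters
m, h = linkage(d, :single)
cut = cutree(m, (length(m)>>1)+1, 3)
```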

Important Caveat for Ward Linkage

NOTE: For the methods `:ward`, `:centroid`, and `:median`, the function assumes that the distance metric used is the squared Euclidean distance (e.g. `SqEuclidean()` in Distances.jl). This differs from the R interface of fastcluster, which, for the `ward.D2` method, operates on the squares of the distances that are passed to the `hclust` function. (The Python interface operates on the squares of the distances passed to its `linkage` function for all three methods: ward, centroid, and median.) This design was chosen to save memory.

Hence, the following two snippets produce the same output:

```julia
using RDatasets, Distances, Fastcluster

df = dataset("datasets", "iris")
points = convert(Array{Float64,2}, df[:, [:SepalWidth, :SepalLength]])

# Squared Euclidean distances, as linkage assumes for :ward
d = pairwise(SqEuclidean(), points, dims=1)
m, h = linkage(d, :ward)  # :ward corresponds to the ward.D2 call in the R snippet
cut = Fastcluster.cutree(m, (length(m)>>1)+1, 3)
```

```julia
using RDatasets, Distances, Fastcluster
using RCall

df = dataset("datasets", "iris")
points = convert(Array{Float64,2}, df[:, [:SepalWidth, :SepalLength]])

# Plain Euclidean distances; ward.D2 in R squares them internally
d = pairwise(Euclidean(), points, dims=1)
@rput d  # copy d into the R session
R"library('fastcluster')"
R"clusters <- hclust(as.dist(d), \"ward.D2\")"
R"clusterCut <- cutree(clusters, k = 3)"
@rget clusterCut  # copy the cluster assignments back to Julia
```
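The relationship behind this caveat can be checked without either package. The following is a minimal sketch in plain Julia, using made-up points and hand-rolled distance functions as stand-ins for `Euclidean()` and `SqEuclidean()`; the names `sqdist`, `d`, and `d_sq` are illustrative only. It shows that squaring a Euclidean distance matrix elementwise gives the squared Euclidean matrix that `linkage` assumes for `:ward`, `:centroid`, and `:median`.

```julia
# Made-up points for illustration, one observation per row
points = [0.0 0.0; 3.0 4.0; 6.0 8.0]
n = size(points, 1)

# Hand-rolled squared Euclidean distance between rows i and j
# (a stand-in for SqEuclidean() from Distances.jl)
sqdist(i, j) = sum(abs2, points[i, :] .- points[j, :])

# Squared Euclidean and plain Euclidean distance matrices
d_sq = [sqdist(i, j) for i in 1:n, j in 1:n]
d = sqrt.(d_sq)

# Squaring the Euclidean matrix elementwise recovers the squared one
same = all(isapprox.(d .^ 2, d_sq))
```

So if only a Euclidean matrix `d` is available, passing `d .^ 2` to `linkage` should match what the R interface computes from `d` with `ward.D2`, at the cost of allocating one extra matrix.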

Owner

  • Name: Johannes Boehm
  • Login: jmboehm
  • Kind: user
  • Location: Paris, France

Citation (CITATION.txt)

To cite fastcluster in publications, please use:

Daniel Müllner, fastcluster: Fast Hierarchical, Agglomerative Clustering
Routines for R and Python, Journal of Statistical Software, 53 (2013), no. 9,
1–18, http://www.jstatsoft.org/v53/i09/.
