singlelinkagepartitions.jl

Minimalist algorithm for generating nested single-linkage partitions.

https://github.com/rwalgorithms/singlelinkagepartitions.jl

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.4%) to scientific vocabulary
Last synced: 7 months ago

Repository

Minimalist algorithm for generating nested single-linkage partitions.

Basic Info
  • Host: GitHub
  • Owner: RWAlgorithms
  • License: mpl-2.0
  • Language: Julia
  • Default Branch: main
  • Homepage:
  • Size: 586 KB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 3 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

SingleLinkagePartitions

Minimalist single-linkage clustering.

Documentation

See the documentation for an overview and examples.

Quick-start

```julia
import Random
Random.seed!(25)

import SingleLinkagePartitions as SL
using LinearAlgebra

T = Float64

# Single-linkage clustering.

# Generate data randomly:
D = 3
N = 10
X = collect( randn(D) for _ = 1:N )

# Or, use this default set of points.
X = [
    [-0.40243248293794137, 0.8540414903329187, -0.6651248667822778],
    [-0.9754736537655263, -0.341379056315598, -1.0410555755312705],
    [-1.0496381529964869, -1.0289732912432703, -0.4305269991795779],
    [0.7262044632757468, 0.8138894370909177, 0.6104189261116074],
    [2.0501294946950264, 0.18095967976913974, 0.9406747232875855],
    [1.0407043988018494, -0.14776493165237461, -0.8737149501414327],
    [1.0484097740458416, 0.7379871044247477, -0.02494318852134621],
    [-0.32639477363891256, -1.45405586112584, 0.5104603413606108],
    [-0.6283556049853254, 0.35921840490046464, -1.1166717373759707],
    [1.2421363428579315, 0.47437350434528236, -0.5869506255304089],
]
N = length(X)

metric = SL.geteuclideanmetric()
pt = SL.computesl(metric, X)

distance_set = SL.getdistances(pt)
partition_set = SL.generateallpartitions(pt)
```

The output:

```
julia> distance_set = SL.getdistances(pt)
9-element Vector{Float64}:
 0.6502879922496316
 0.7069552456099949
 0.7140482053065044
 0.7164233686190572
 0.7855224671411318
 0.9225136028232729
 1.2606479889562476
 1.4987127483818177
 1.5900454258666015

julia> partition_set = SL.generateallpartitions(pt)
10-element Vector{Vector{Vector{Int64}}}:
 [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
 [[1], [2], [3], [4], [5], [6], [7, 10], [8], [9]]
 [[1, 9], [2], [3], [4], [5], [6], [7, 10], [8]]
 [[1, 9], [2], [3], [4], [5], [6, 7, 10], [8]]
 [[1, 9], [2], [3], [4, 6, 7, 10], [5], [8]]
 [[1, 2, 9], [3], [4, 6, 7, 10], [5], [8]]
 [[1, 2, 3, 9], [4, 6, 7, 10], [5], [8]]
 [[1, 2, 3, 8, 9], [4, 6, 7, 10], [5]]
 [[1, 2, 3, 8, 9], [4, 5, 6, 7, 10]]
 [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```
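To illustrate what these nested partitions represent, here is a minimal, self-contained sketch of single-linkage agglomeration that is independent of this package (a naive illustration, not the package's implementation): start from singleton parts and repeatedly merge the two parts whose minimum pairwise member distance is smallest, recording each intermediate partition.

```julia
# Naive single-linkage sketch (illustration only, not the package's code):
# start with singletons, repeatedly merge the two closest parts, and record
# each intermediate partition. Part-to-part distance is the minimum pairwise
# Euclidean distance between their members.
using LinearAlgebra

function single_linkage_partitions(X::Vector{Vector{Float64}})
    parts = [[i] for i in eachindex(X)]
    partitions = [deepcopy(parts)]
    while length(parts) > 1
        best = (Inf, 1, 2)
        for a in 1:length(parts)-1, b in a+1:length(parts)
            d = minimum(norm(X[i] - X[j]) for i in parts[a], j in parts[b])
            d < best[1] && (best = (d, a, b))
        end
        _, a, b = best
        parts[a] = sort(vcat(parts[a], parts[b]))  # merge part b into part a
        deleteat!(parts, b)
        push!(partitions, deepcopy(parts))
    end
    return partitions
end

X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0]]
ps = single_linkage_partitions(X)
# ps[1] is all singletons; ps[end] is the single part covering all indices.
```

Each merge step reduces the number of parts by one, so N points yield a chain of N nested partitions, matching the 10-element `partition_set` above.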

Reduce points

We can use the single-linkage partition tree to combine a set of points X into the parts (i.e., clusters) of a partition. The partition is selected given a suggested tolerance, atol (which won't always be respected by the output), and a zero tolerance, zero_tol, for the distance between duplicate points.

level_trait determines how we pick a level in the partition tree, which corresponds to picking a partition.

Each part in the selected partition, partition_r, generates a point in the output reduced point set Xc, together with an estimated variance vs. center_trait determines how we create a representative point from the points in a part; we do this for each part in the partition to get the reduced point set.

```julia
atol = convert(T, 1.0)
level_trait = SL.UseSLDistance(atol) # atol is compared against distance_set to pick a level.

# Another trait:
level_trait = SL.UseCumulativeSLDistance() # atol is compared against cumsum(distance_set) to pick a level.

# For each part, pick the point with the lowest corresponding score among the
# points in the part as the representative for that part.
score = randn(T, N)
center_trait = SL.UseScore(SL.UseMinimum(), score) # use SL.UseMaximum() to select based on the highest score.

# For each part, pick the point in the part that has the smallest Euclidean
# distance to the mean of all the points in the part.
center_trait = SL.UseProximitytoMean() # when Xc must be a subset of X

# For each part, assign the mean over all the points in the part as the
# representative. The representative is then not a point from the original
# point set X, so Xc is not a subset of X.
center_trait = SL.UseMean() # when Xc need not be a subset of X

Xc, vs, partition_r = SL.reducepts(
    level_trait,
    center_trait,
    metric,
    X,
)
```

We can also pass in an accompanying scalar value for each point in X. These scalar values make up y, and each can be a complex-valued or real-valued floating-point number whose type need not match that of the points in X. Whatever scheme we use to pick a representative of each part for Xc, we also apply to yc. Similarly, we get a set of variances vs_y for yc and a set of variances vs_X for Xc.

```julia
y = randn(Complex{Float32}, N)
Xc, vs_X, yc, vs_y, partition_r = SL.reducepts(
    level_trait,
    center_trait,
    metric,
    X,
    y,
)
```

For each part in a given partition, we can also get the maximum magnitude deviation from the mean for each dimension:

```julia
ds_X = SL.computedeviationdims(X, partition_r)
```
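As a plain-Julia sketch of this computation, assuming it returns, for each part, the per-dimension maximum absolute deviation from that part's mean (the function name `deviation_dims` below is hypothetical, not the package's API):

```julia
# Hypothetical sketch: per-part, per-dimension maximum absolute deviation
# from the part mean. Not the package's implementation.
function deviation_dims(X::Vector{Vector{Float64}}, partition::Vector{Vector{Int}})
    out = Vector{Vector{Float64}}()
    for part in partition
        m = sum(X[i] for i in part) / length(part)  # mean of the part
        d = [maximum(abs(X[i][k] - m[k]) for i in part) for k in eachindex(m)]
        push!(out, d)
    end
    return out
end

X = [[0.0, 0.0], [2.0, 0.0], [10.0, 5.0]]
ds = deviation_dims(X, [[1, 2], [3]])
# Part [1, 2] has mean [1.0, 0.0], so its max deviations are [1.0, 0.0];
# the singleton part [3] deviates [0.0, 0.0] from its own mean.
```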

Merge duplicates

Suppose we have a dataset with duplicate locations but different values; this might occur with historical weather station data from different sources, for example. To merge the entries that share the same location (X) by averaging the station measurements (y), we can use avgduplicates.

For example, let's generate station positions X0:

```julia
import SingleLinkagePartitions as SL
import Random
Random.seed!(25)

N_stations = 5
T = Float64

# Generate station positions.
X0 = collect( randn(T, 2) for _ = 1:N_stations )

# Generate station measurements.
y0 = randn(T, N_stations)

# Create a duplicate station location, but with a different measurement.
push!(X0, X0[begin])
push!(y0, randn(T))

# Average the duplicates.
X, y = SL.avgduplicates(X0, y0, eps(T)*10)
```

Let's see the input and result locations:

```julia
display(X)  # result (merged) locations
display(X0) # input locations
```

```
julia> display(X)
5-element Vector{Vector{Float64}}:
 [-0.40243248293794137, 0.8540414903329187]
 [-0.6651248667822778, -0.9754736537655263]
 [-0.341379056315598, -1.0410555755312705]
 [-1.0496381529964869, -1.0289732912432703]
 [-0.4305269991795779, 0.7262044632757468]

julia> display(X0)
6-element Vector{Vector{Float64}}:
 [-0.40243248293794137, 0.8540414903329187]
 [-0.6651248667822778, -0.9754736537655263]
 [-0.341379056315598, -1.0410555755312705]
 [-1.0496381529964869, -1.0289732912432703]
 [-0.4305269991795779, 0.7262044632757468]
 [-0.40243248293794137, 0.8540414903329187]
```

The first and last entries of `X0` were merged. Let's check whether the measurements were merged by averaging:

```julia
display(y0)
display(y)
y1rec = (y0[begin] + y0[end]) / 2
display(abs(y1rec - y[begin]))
```

Indeed!

```
julia> display(y0)
6-element Vector{Float64}:
 0.8138894370909177
 0.6104189261116074
 2.0501294946950264
 0.18095967976913974
 0.9406747232875855
 1.0407043988018494

julia> display(y)
5-element Vector{Float64}:
 0.9272969179463835
 0.6104189261116074
 2.0501294946950264
 0.18095967976913974
 0.9406747232875855

julia> y1rec = (y0[begin]+y0[end])/2
0.9272969179463835

julia> display( abs(y1rec - y[begin]))
0.0
```
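The merging behavior above can be sketched in plain Julia, assuming a simple scheme: group points whose distance to an already-seen location is below a tolerance, and keep a running average of the grouped measurements (the function `avg_duplicates_sketch` is hypothetical, not the package's implementation):

```julia
# Hypothetical sketch of duplicate averaging: merge points within `tol` of
# an already-seen location and average their measurements. Illustration only.
using LinearAlgebra

function avg_duplicates_sketch(X0, y0, tol)
    X = Vector{Vector{Float64}}()
    y = Float64[]
    counts = Int[]
    for (x, v) in zip(X0, y0)
        k = findfirst(c -> norm(c - x) <= tol, X)
        if k === nothing
            push!(X, copy(x)); push!(y, v); push!(counts, 1)
        else
            counts[k] += 1
            y[k] += (v - y[k]) / counts[k]  # running mean of merged measurements
        end
    end
    return X, y
end

X0 = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y0 = [1.0, 2.0, 3.0]
X, y = avg_duplicates_sketch(X0, y0, 1e-9)
# Two unique locations remain; the duplicated location's measurements
# 1.0 and 3.0 average to 2.0.
```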

Citation

You can use the Cite This Repository button below the About section on the GitHub webpage of this repository.

Owner

  • Name: RWAlgorithms
  • Login: RWAlgorithms
  • Kind: organization
  • Location: Canada

Algorithms implemented in Julia that have minimal dependencies

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Wang"
  given-names: "Roy Chih Chung"
  orcid: "https://orcid.org/0000-0002-1391-4536"
title: "RoyCCWang/SingleLinkagePartitions.jl"
version: 2.1.0
date-released: 2024-01-10
url: "https://github.com/RWAlgorithms/SingleLinkagePartitions.jl/"

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1