seriation

Infrastructure for Ordering using Seriation - R Package

https://github.com/mhahsler/seriation

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 15 DOI reference(s) in README
○
Academic publication links
✓
Committers with academic emails
1 of 7 committers (14.3%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (10.2%) to scientific vocabulary

Keywords

combinatorial-optimization cran ordination r seriation

Last synced: 6 months ago · JSON representation

Repository

Infrastructure for Ordering using Seriation - R Package

Basic Info

Host: GitHub
Owner: mhahsler
License: gpl-3.0
Language: R
Default Branch: master
Size: 28.2 MB

Statistics

Stars: 82
Watchers: 7
Forks: 17
Open Issues: 2
Releases: 18

Topics

combinatorial-optimization cran ordination r seriation

Created over 10 years ago · Last pushed 7 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---

```{r echo=FALSE, results = 'asis'}
pkg <- 'seriation'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-seriation", stackoverflow = "seriation+r")
```

## Introduction

Seriation arranges a set of objects into a linear order given available data with the goal of revealing structural information. This package provides the infrastructure for ordering objects 
with an implementation of many
[seriation](https://en.wikipedia.org/wiki/Seriation_(archaeology))/[ordination](https://en.wikipedia.org/wiki/Ordination_(statistics)) techniques to reorder data matrices, dissimilarity
matrices, correlation matrices, and dendrograms (see below for a complete list). 
The package provides several visualizations (grid and ggplot2) to reveal structural information,
including permuted image plots, reordered heatmaps, Bertin plots, clustering visualizations like dissimilarity plots, and
visual assessment of cluster tendency plots (VAT and iVAT).

Here are some quick guides on applications of seriation:

* [Introduction the R package seriation](https://cran.r-project.org/package=seriation/vignettes/seriation.pdf)
* [How to reorder heatmaps](https://mhahsler.github.io/seriation/heatmaps.html)
* [How to reorder correlation matrices](https://mhahsler.github.io/seriation/correlation_matrix.html)
* [How to evaluate clusters using dissimilarity plots](https://mhahsler.github.io/seriation/clustering.html)

Implemented seriation methods and criteria:

* [Documentation of the implemented seriation methods](https://mhahsler.github.io/seriation/seriation_methods.html)
* [Documentation of the implemented seriation criteria](https://mhahsler.github.io/seriation/seriation_criteria.html)
* [A visual comparison between seriation methods](https://mhahsler.github.io/seriation/comparison.html)

```{r echo=FALSE, results = 'asis'}
pkg_usage(pkg)
pkg_citation(pkg, 2L)
```

## Available seriation methods to reorder dissimilarity data

Seriation methods for dissimilarity data reorder the set of objects in the data. 
The methods fall into several groups based on the criteria they 
try to optimize, constraints (like dendrograms), and the algorithmic approach.

### Dendrogram leaf order

These methods create a dendrogram using hierarchical clustering and then derive 
the seriation order from the leaf order in the dendrogram. 
Leaf reordering may be applied.
 
 *  **DendSer** - Dendrogram seriation heuristic to optimize various criteria
 *  **GW** - Hierarchical clustering reordered by the Gruvaeus and Wainer heuristic 
 *  **HC** - Hierarchical clustering (single link, avg. link, complete link) 
 *  **OLO** - Hierarchical clustering with optimal leaf ordering 
 
### Dimensionality reduction  

Find a seriation order by reducing the dimensionality to 1 dimension. This is typically done
by minimizing a stress measure or the reconstruction error.

 *  **MDS** - classical metric multidimensional scaling
 *  **MDS_angle** - order by the angular order in the 2D MDS projection space split by the larges gap
 *  **isoMDS** -  1D Krusakl's non-metric multidimensional scaling
 *  **isomap** -  1D isometric feature mapping ordination
 *  **monoMDS** - order along 1D global and local non-metric multidimensional scaling using monotone regression (NMDS)
 *  **metaMDS** - 1D non-metric multidimensional scaling (NMDS) with stable solution from random starts
 *  **Sammon** - Order along the 1D Sammon's non-linear mapping
 *  **smacof** - 1D MDS using majorization (ratio MDS, interval MDS, ordinal MDS)
 *  **TSNE** - Order along the 1D t-distributed stochastic neighbor embedding (t-SNE)
 *  **UMAP** - Order along the 1D embedding produced by uniform manifold approximation and projection

### Optimization

 These methods try to optimize a seriation criterion directly, typically using a heuristic approach.
 
 *  **ARSA** - optimize the linear seriation criterion using simulated annealing   
 *  **Branch-and-bound** to minimize the unweighted/weighted column gradient 
 *  **GA** - Genetic algorithm with warm start to optimize any seriation criteria
 *  **GSA** - General simulated annealing to optimize any seriation criteria
 *  **SGD** - stochastic gradient descent to find a local optimum given an initial order and 
          a seriation criterion.
 *  **QAP** - Quadratic assignment problem heuristic (optimizes 2-SUM, linear seriation, 
          inertia, banded anti-Robinson form)
 *  **Spectral** seriation to optimize the 2-SUM criterion (unnormalized, normalized) 
 *  **TSP** - Traveling salesperson solver to minimize the Hamiltonian path length 
 
### Other Methods

 *  **Identity** permutation 
 *  **OPTICS** - Order of ordering points to identify the clustering structure
 *  **R2E** - Rank-two ellipse seriation 
 *  **Random** permutation
 *  **Reverse** order
 *  **SPIN** - Sorting points into neighborhoods (neighborhood algorithm, side-to-site algorithm)
 *  **VAT** - Order of the visual assessment of clustering tendency
  
A detailed comparison of the most popular methods is available in the paper 
[An experimental comparison of seriation methods for one-mode two-way data.](http://dx.doi.org/10.1016/j.ejor.2016.08.066) (read the [preprint](https://michael.hahsler.net/research/paper/EJOR_seriation_2016.pdf)).
  

## Available seriation methods to reorder data matrices, count tables, and data.frames  

For matrices, rows and columns are reordered. 

### Seriating rows and columns simultaneously

Row and column order influence each other.

 *  **BEA** - Bond Energy Algorithm to maximize the measure of effectiveness (ME) 
 *  **BEA_TSP** - TSP to optimize the measure of effectiveness
 *  **BK_unconstrained** - Algorithm by Brower and Kyle (1988) to arrange binary matrices. 
 *  **CA** - calculates a correspondence analysis of a matrix of frequencies (count table)
      and reorders according to the scores on a correspondence analysis dimension

### Seriating rows and columns separately using dissimilarities

 *  **Heatmap** - reorders rows and columns independently by calculating row/column distances 
      and then applying a seriation method for dissimilarities (see above)

### Seriate rows in a data matrix

These methods need access to the data matrix instead of dissimilarities
to reorder objects (rows). The same approach can be applied to columns.

 *  **PCA_angle** - order by the angular order in the 2D PCA projection space split by the larges gap
 *  **LLE** reorder along a 1D locally linear embedding
 *  **Means** - reorders using row means
 *  **PCA** - orders along the first principal component
 *  **TSNE** - Order along the 1D t-distributed stochastic neighbor embedding (t-SNE)
 *  **UMAP** - Order along the 1D embedding produced by uniform manifold approximation and projection

### Other methods
 
 *  **AOE** - order by the angular order of the first two eigenvectors for correlation matrices.
 *  **Identity** permutation 
 *  **Random** permutation 
 *  **Reverse** order


```{r echo=FALSE, results = 'asis'}
pkg_install(pkg)
```

## Usage

The used example dataset contains the joint probability of disagreement 
between Supreme Court Judges from 1995 to 2002. The goal
is to reveal structural information in this data.
We load the library, read the data, convert the data to a distance matrix, 
and then use the default seriation method to reorder the objects. 
```{r}
library(seriation)
data("SupremeCourt")

d <- as.dist(SupremeCourt)
d

order <- seriate(d)
order
```

Here is the resulting permutation vector.
```{r}
get_order(order)
```

Next, we visualize the original and permuted distance matrix.

```{r seriation, fig.show="hold", out.width="50%"}
pimage(d, main = "Judges (original alphabetical order)")
pimage(d, order, main = "Judges (reordered by seriation)")
```

Darker squares around the main diagonal indicate groups of similar objects.
After seriation, two groups are visible.

We can compare the available seriation criteria. Seriation improves all measures. 
Note that some measures are merit measures while others represent cost. 
See the manual page for details.
```{r}
rbind(
 alphabetical = criterion(d),
 seriated = criterion(d, order)
)
```

Some seriation methods also return a linear configuration where more similar objects are
located closer to each other. 
```{r configuration, fig.align="center", fig.height = 3}
get_config(order)

plot_config(order)
```

We can see a clear divide between the two groups in the configuration. 

## References

* Michael Hahsler, Kurt Hornik and Christian Buchta, [Getting Things in Order: An Introduction to the R Package seriation,](http://dx.doi.org/10.18637/jss.v025.i03) _Journal of Statistical Software,_ 25(3), 2008.
DOI: 10.18637/jss.v025.i03
* Michael Hahsler. [An experimental comparison of seriation methods for one-mode two-way data.](http://dx.doi.org/10.1016/j.ejor.2016.08.066) _European Journal of Operational Research,_ 257:133-143, 2017. 
DOI: 10.1016/j.ejor.2016.08.066
(read the [preprint](https://michael.hahsler.net/research/paper/EJOR_seriation_2016.pdf))
* Hahsler, M. and Hornik, K. (2011): [Dissimilarity plots: A visual exploration tool for partitional clustering.](http://dx.doi.org/10.1198/jcgs.2010.09139) _Journal of Computational and Graphical Statistics,_ **10**(2):335–354. DOI: 10.1198/jcgs.2010.09139 (read the
[preprint](https://michael.hahsler.net/research/paper/dissplot_JCGS2011_preprint.pdf); [code examples](https://mhahsler.github.io/seriation/seriation_cluster_evaluation.html))
* [Reference manual for package seriation.](https://mhahsler.r-universe.dev/seriation/doc/manual.html#seriation-package)

Owner

Name: Michael Hahsler
Login: mhahsler
Kind: user
Location: Dallas, TX
Company: SMU

Website: http://michael.hahsler.net
Repositories: 32
Profile: https://github.com/mhahsler

I develop packages for AI, ML, and Data Science.

GitHub Events

Total

Create event: 4
Release event: 1
Issues event: 4
Watch event: 5
Delete event: 2
Issue comment event: 6
Push event: 4
Pull request event: 3
Fork event: 1

Last Year

Create event: 4
Release event: 1
Issues event: 4
Watch event: 5
Delete event: 2
Issue comment event: 6
Push event: 4
Pull request event: 3
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 325
Total Committers: 7
Avg Commits per committer: 46.429
Development Distribution Score (DDS): 0.025

Past Year

Commits: 17
Committers: 3
Avg Commits per committer: 5.667
Development Distribution Score (DDS): 0.176

Top Committers

Name	Email	Commits
Michael Hahsler	m**l@h**t	317
kbvernon	k**n@g**m	2
David Barnett	d**t@m**l	2
Mia Dai	5****i	1
Matthew DeMaere	c****s	1
Dirk Seidensticker	d**r@g**m	1
Ben Marwick	b**k@h**m	1

Committer Domains (Top 20 + Academic)

maastrichtuniversity.nl: 1 hahsler.net: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 18
Total pull requests: 8
Average time to close issues: 14 days
Average time to close pull requests: 1 day
Total issue authors: 17
Total pull request authors: 7
Average comments per issue: 2.83
Average comments per pull request: 1.5
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 3
Average time to close issues: 18 days
Average time to close pull requests: 1 day
Issue authors: 2
Pull request authors: 3
Average comments per issue: 2.5
Average comments per pull request: 0.67
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

Derek-Jones (2)
NicolasHipp (1)
muniheart (1)
thomasp85 (1)
cysouw (1)
friendly (1)
ggrothendieck (1)
RichardKav (1)
crazyhottommy (1)
benmarwick (1)
slowkow (1)
psyguy (1)
heikostark (1)
bryanhanson (1)
Tomer-Tsaban (1)

Pull Request Authors

benmarwick (2)
david-barnett (2)
kbvernon (2)
mhahsler (2)
cerebis (1)
dirkseidensticker (1)
Xinming-Dai (1)

Top Labels

Issue Labels

question (5) enhancement (4) bug (2) help wanted (2)

Pull Request Labels

bug (1)

Packages

Total packages: 2
Total downloads:
- cran 23,354 last-month
Total docker downloads: 150,563

Total dependent packages: 32
(may contain duplicates)
Total dependent repositories: 78
(may contain duplicates)
Total versions: 64
Total maintainers: 1

cran.r-project.org: seriation

Infrastructure for Ordering Objects Using Seriation

Homepage: https://github.com/mhahsler/seriation
Documentation: http://cran.r-project.org/web/packages/seriation/seriation.pdf
License: GPL-3
Latest release: 1.5.8
published 6 months ago

Versions: 52
Dependent Packages: 29
Dependent Repositories: 78
Downloads: 23,354 Last month
Docker Downloads: 150,563

Rankings

Downloads: 2.0%

Dependent packages count: 2.6%

Dependent repos count: 2.6%

Forks count: 4.6%

Stargazers count: 5.0%

Average: 6.3%

Docker downloads count: 21.1%

Maintainers (1)

mhahsler@lyle.smu.edu

Last synced: 6 months ago

conda-forge.org: r-seriation

Homepage: https://github.com/mhahsler/seriation
License: GPL-3.0-only
Latest release: 1.4.0
published over 3 years ago

Versions: 12
Dependent Packages: 3
Dependent Repositories: 0

Rankings

Dependent packages count: 15.6%

Average: 30.3%

Dependent repos count: 34.0%

Stargazers count: 34.3%

Forks count: 37.3%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 2.14.0 depends
MASS * imports
TSP * imports
cluster * imports
colorspace * imports
gclus * imports
grDevices * imports
grid * imports
qap * imports
registry * imports
stats * imports
DendSer * suggests
GA * suggests
Rtsne * suggests
dbscan * suggests
dendextend * suggests
ggplot2 * suggests
scales * suggests
testthat * suggests
umap * suggests