anticlust

Subset partitioning via anticlustering

https://github.com/m-py/anticlust

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 36 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.4%) to scientific vocabulary

Last synced: 9 months ago

Repository

Subset partitioning via anticlustering

Basic Info

Host: GitHub
Owner: m-Py
License: other
Language: R
Default Branch: main
Homepage:
Size: 4.87 MB

Statistics

Stars: 36
Watchers: 5
Forks: 8
Open Issues: 9
Releases: 35

Created over 7 years ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License

anticlust

Anticlustering partitions a pool of elements into clusters (or anticlusters) with the goal of achieving high between-cluster similarity and high within-cluster heterogeneity. This is accomplished by maximizing instead of minimizing a clustering objective function, such as the intra-cluster variance (used in k-means clustering) or the sum of pairwise distances within clusters. The package anticlust implements anticlustering methods as described in Papenberg and Klau (2021; https://doi.org/10.1037/met0000301), Brusco et al. (2020; https://doi.org/10.1111/bmsp.12186), Papenberg (2024; https://doi.org/10.1111/bmsp.12315), and Papenberg et al. (2025; https://doi.org/10.1101/2025.03.03.641320).
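To make the idea of maximizing a clustering objective concrete, here is a minimal base R sketch. The helper diversity() is a hypothetical name written only for this illustration (it is not the package's implementation); it computes the sum of pairwise Euclidean distances within groups. Grouping the iris plants by species yields a cluster-like partition with low diversity, whereas groups that mix all species score much higher.

```r
# Hypothetical helper for illustration: sum of pairwise Euclidean
# distances between elements that share a group
diversity <- function(x, groups) {
  d <- as.matrix(dist(x))
  same_group <- outer(groups, groups, "==")
  # Count each pair once by restricting to the upper triangle
  sum(d[same_group & upper.tri(d)])
}

x <- iris[, -5]

# Cluster-like partition: each group holds one species (similar plants)
clustered <- as.integer(iris$Species)

# Anticluster-like partition: each group mixes plants of all species
interleaved <- rep(1:3, length.out = nrow(x))

diversity(x, clustered)
diversity(x, interleaved)
```

Anticlustering searches for partitions at the high end of such an objective instead of settling for a simple interleaving.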

Installation

The stable release of anticlust is available from CRAN and can be installed via:

install.packages("anticlust")

A (potentially more recent) version of anticlust can also be installed via R Universe:

install.packages('anticlust', repos = c('https://m-py.r-universe.dev', 'https://cloud.r-project.org'))

or directly via GitHub:

library("remotes") # if not available: install.packages("remotes")
install_github("m-Py/anticlust")

Citation

If you use anticlust in your research, please cite the following reference:

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301

Depending on which anticlust functions you are using, including other references may also be fair. Here you can find out in detail how to cite anticlust.

Another great way of showing your appreciation of anticlust is to leave a star on this GitHub repository.

How do I learn about anticlust?

This README contains some basic information on the R package anticlust. More information is available via the following sources:

- Up until now, we have published three papers describing the theoretical background of anticlust:
  - The initial presentation of the anticlust package is given in Papenberg and Klau (2021) (https://doi.org/10.1037/met0000301; Preprint).
  - The k-plus anticlustering method is described in Papenberg (2024) (https://doi.org/10.1111/bmsp.12315; Preprint).
  - A more recent paper describes the must-link feature and provides additional comparisons to alternative methods, focusing on categorical variables (Papenberg et al., 2025; https://doi.org/10.1101/2025.03.03.641320).
- The R documentation of the main functions is detailed and up to date, so consult it when using the anticlust package. The most important background is provided in ?anticlustering.
- A video in German is available in which I illustrate the main functionalities of the anticlustering() function. I plan to make a similar video in English in the future.
- The package website presents all documentation in one place. It currently also has four package vignettes, and additional vignettes are planned.

A quick start

In this initial example, I use the main function anticlustering() to create five similar sets of plants using the classical iris data set:

First, load the package via

library("anticlust")

Call the anticlustering() method:

anticlusters <- anticlustering(
  iris[, -5],
  K = 5,
  objective = "kplus",
  method = "local-maximum",
  repetitions = 10
)

The output is a vector that assigns a group (i.e., a number between 1 and K) to each input element:

anticlusters
#>   [1] 1 2 4 5 3 4 2 3 2 2 1 5 1 2 4 1 2 3 2 5 1 5 4 5 1 1 3 4 5 5 5 4 5 2 1 1 3
#>  [38] 4 3 3 4 2 3 5 2 5 3 4 3 1 2 2 5 1 2 3 3 4 4 1 5 1 2 3 3 1 2 4 4 4 4 1 3 4
#>  [75] 2 4 5 2 5 2 3 3 1 5 4 1 5 3 2 1 2 5 3 4 1 4 1 2 4 5 2 2 3 1 4 1 3 4 4 5 3
#> [112] 2 3 1 5 2 5 3 1 5 4 1 2 5 1 2 3 1 3 3 5 1 2 5 5 4 3 5 4 3 5 5 1 4 4 1 3 4
#> [149] 2 2

By default, each group has the same number of elements (but the argument K can be adjusted to request different group sizes):

table(anticlusters)
#> anticlusters
#>  1  2  3  4  5 
#> 30 30 30 30 30
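As a sketch of how different group sizes can be requested, the call below passes a vector of sizes to K instead of a single number. The sizes 100, 25, and 25 are arbitrary values chosen for this illustration.

```r
library(anticlust)

# Request three groups of unequal size by passing the sizes to K
# (the sizes are illustrative and must sum to the number of elements)
groups <- anticlustering(iris[, -5], K = c(100, 25, 25))

# The group sizes follow the requested sizes
table(groups)
```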

Last, let’s compare the features’ means and standard deviations across groups to find out if the five groups are similar to each other:

knitr::kable(mean_sd_tab(iris[, -5], anticlusters), row.names = TRUE)

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
1	5.84 (0.84)	3.06 (0.44)	3.76 (1.79)	1.20 (0.77)
2	5.84 (0.84)	3.06 (0.45)	3.76 (1.79)	1.20 (0.77)
3	5.84 (0.84)	3.06 (0.44)	3.75 (1.79)	1.20 (0.77)
4	5.85 (0.84)	3.05 (0.45)	3.76 (1.79)	1.21 (0.77)
5	5.84 (0.84)	3.06 (0.44)	3.76 (1.79)	1.19 (0.78)

As the example illustrates, the function anticlustering() can create similar groups of plants. Here, “similar” primarily means that the means and standard deviations (in parentheses) of the variables are nearly identical across the five groups.

The function anticlustering() takes as input a data table describing the elements that should be assigned to sets. In the data table, each row represents an element (here a plant, but it can be anything, for example a person, a word, or a photo). Each column is a numeric variable describing one of the elements’ features. The number of groups is specified through the argument K. The argument objective specifies how between-group similarity is quantified; the argument method specifies the algorithm by which this measure is optimized. See the documentation ?anticlustering for more details.

Five anticlustering objectives are natively supported in anticlustering():

- the “diversity” objective, setting objective = "diversity" (default)
- the “average-diversity” objective, setting objective = "average-diversity", which normalizes the diversity by cluster size
- the k-means objective (i.e., the “variance”), setting objective = "variance"
- the “k-plus” objective, an extension of the k-means variance criterion, setting objective = "kplus"
- the “dispersion” objective (the minimum distance between any two elements within the same cluster), setting objective = "dispersion"

The anticlustering objectives are described in detail in the documentation (?anticlustering, ?diversity_objective, ?variance_objective, ?kplus_anticlustering, ?dispersion_objective) and the references therein. It is also possible to optimize user-defined objectives, as described in the documentation (?anticlustering).
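The objective functions named above can also be called directly to evaluate any grouping. The following sketch compares an optimized partition with a purely random one. The seed, the helper evaluate(), and the variable names are assumptions made for this illustration only.

```r
library(anticlust)

features <- iris[, -5]
set.seed(123)  # arbitrary seed, used only to make this sketch repeatable

# One optimized partition and one purely random partition into equal-sized groups
optimized <- anticlustering(features, K = 5, objective = "diversity")
random <- sample(rep(1:5, length.out = nrow(features)))

# Hypothetical helper: evaluate three of the package's objective functions
evaluate <- function(groups) {
  c(
    diversity = diversity_objective(features, groups),
    variance = variance_objective(features, groups),
    dispersion = dispersion_objective(features, groups)
  )
}

# One row per partition, one column per objective
objectives <- rbind(optimized = evaluate(optimized), random = evaluate(random))
objectives
```

Because the first partition was optimized for the diversity, its diversity value should be at least as high as that of the random partition; the other two columns show how the same groupings fare on objectives that were not optimized.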

Categorical variables

Sometimes sets should not only be similar with regard to some numeric variables, but should also contain an equal number of elements of a certain category. Coming back to the iris data set, we may want each set to contain a balanced number of plants of the three iris species. To this end, we can use the argument categories as follows:

anticlusters <- anticlustering(
  iris[, -5],
  K = 3,
  categories = iris$Species
)

## The species are as balanced as possible across anticlusters:
table(anticlusters, iris$Species)
#>             
#> anticlusters setosa versicolor virginica
#>            1     17         17        16
#>            2     17         16        17
#>            3     16         17        17
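The balance can also be checked programmatically rather than by reading the table. In this sketch, which repeats the call from above, the names counts and spread are chosen for illustration. For each species, it computes the difference between the largest and smallest count across groups.

```r
library(anticlust)

groups <- anticlustering(iris[, -5], K = 3, categories = iris$Species)
counts <- table(groups, iris$Species)

# Per species: largest minus smallest count across the three groups
spread <- apply(counts, 2, function(n) max(n) - min(n))
spread

# "As balanced as possible" here means the counts differ by at most one
all(spread <= 1)
```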

Matching and clustering

Anticlustering creates sets of dissimilar elements; the heterogeneity within anticlusters is maximized. This is the opposite of clustering problems, which strive for high within-cluster similarity and good separation between clusters. The anticlust package also provides functions for “classical” clustering applications: balanced_clustering() creates sets of elements that are similar while ensuring that clusters are of equal size. This is an example:

# Generate random data, cluster the data set and visualize results
N <- 1400
lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))
cl <- balanced_clustering(lds, K = 7)
plot_clusters(lds, clusters = cl, show_axes = TRUE)
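To see the contrast between clustering and anticlustering in numbers, the sketch below evaluates the k-means variance objective for both kinds of partition on the same data. The smaller N, the seed, and the variable ac are assumptions made to keep the illustration quick and repeatable.

```r
library(anticlust)

set.seed(1)  # arbitrary seed for this sketch
N <- 210
lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))

# Clustering: similar elements end up in the same group
cl <- balanced_clustering(lds, K = 7)

# Anticlustering: every group should resemble the whole data set
ac <- anticlustering(lds, K = 7, objective = "variance")

# Both partitions have equal-sized groups
table(cl)
table(ac)

# The within-group variance is low for clusters and high for anticlusters
variance_objective(lds, cl)
variance_objective(lds, ac)
```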

The function matching() is very similar, but is usually used to find small groups of similar elements, e.g., triplets as in this example:

# Generate random data and find triplets of similar elements:
N <- 120
lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))
triplets <- matching(lds, p = 3)
plot_clusters(
  lds,
  clusters = triplets,
  within_connection = TRUE,
  show_axes = TRUE
)
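The result of matching() can be inspected without plotting. This sketch regenerates data as above (the seed is an arbitrary addition for repeatability) and checks that the 120 elements form 40 groups of three.

```r
library(anticlust)

set.seed(2)  # arbitrary seed for this sketch
N <- 120
lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))
triplets <- matching(lds, p = 3)

# Number of matched groups and their sizes
length(unique(triplets))
all(table(triplets) == 3)

# Look at the members of one triplet
lds[triplets == 1, ]
```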

Questions and suggestions

If you have any questions about the anticlust package or find a bug, I encourage you to open an issue on the GitHub repository.

Owner

Name: Martin Papenberg
Login: m-Py
Kind: user
Location: Düsseldorf, Germany
Company: Department of Experimental Psychology, University of Düsseldorf

Website: https://m-py.github.io/about
Repositories: 17
Profile: https://github.com/m-Py

Post-doctoral researcher at the University of Duesseldorf.

GitHub Events

Total

Create event: 10
Release event: 5
Issues event: 11
Watch event: 7
Delete event: 9
Issue comment event: 17
Push event: 38
Pull request event: 2
Fork event: 3

Last Year

Create event: 10
Release event: 5
Issues event: 11
Watch event: 7
Delete event: 9
Issue comment event: 17
Push event: 38
Pull request event: 2
Fork event: 3

Committers

Last synced: over 2 years ago

All Time

Total Commits: 1,298
Total Committers: 4
Avg Commits per committer: 324.5
Development Distribution Score (DDS): 0.005

Past Year

Commits: 135
Committers: 1
Avg Commits per committer: 135.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Martin Papenberg	m**g@h**e	1,291
einGlasRotwein	j**z@g**m	4
manalama	m**r@u**e	2
unDocUMeantIt	1****t	1

Committer Domains (Top 20 + Academic)

uni-duesseldorf.de: 1
hhu.de: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 30
Total pull requests: 40
Average time to close issues: 2 months
Average time to close pull requests: 23 days
Total issue authors: 5
Total pull request authors: 8
Average comments per issue: 2.17
Average comments per pull request: 0.43
Merged pull requests: 34
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 8
Pull requests: 4
Average time to close issues: 30 days
Average time to close pull requests: about 1 month
Issue authors: 2
Pull request authors: 3
Average comments per issue: 1.38
Average comments per pull request: 2.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

m-Py (23)
A-Pai (4)
rbcavanaugh (1)
uhkeller (1)
viv-analytics (1)

Pull Request Authors

m-Py (31)
HanneyAI (2)
unDocUMeantIt (2)
ManaLama (2)
Hanney100 (2)
olivroy (2)
einGlasRotwein (1)
Dimitry-Wintermantel (1)

Top Labels

Issue Labels

bug (7) cleanup (4) Enhancement (4) maybe (3) wontfix (2) invalid (1) help wanted (1) good first issue (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 1,059 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 18
Total maintainers: 1

cran.r-project.org: anticlust

Subset Partitioning via Anticlustering

Homepage: https://github.com/m-Py/anticlust
Documentation: http://cran.r-project.org/web/packages/anticlust/anticlust.pdf
License: MIT + file LICENSE
Latest release: 0.8.10 (published over 1 year ago)

Versions: 18
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 1,059 Last month

Rankings

Stargazers count: 10.2%

Forks count: 12.2%

Downloads: 15.7%

Average: 18.1%

Dependent repos count: 23.8%

Dependent packages count: 28.6%

Maintainers (1)

martin.papenberg@hhu.de

Last synced: 10 months ago

Dependencies

DESCRIPTION cran

R >= 3.6.0 depends
Matrix * imports
RANN >= 2.6.0 imports
Rglpk * suggests
knitr * suggests
rmarkdown * suggests
testthat * suggests
