dbscan

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package

https://github.com/mhahsler/dbscan


Keywords

clustering cran dbscan density-based-clustering hdbscan lof optics r

Repository


Basic Info
  • Host: GitHub
  • Owner: mhahsler
  • License: gpl-3.0
  • Language: C++
  • Default Branch: master
  • Size: 9.4 MB
Statistics
  • Stars: 335
  • Watchers: 13
  • Forks: 63
  • Open Issues: 4
  • Releases: 17
Created over 10 years ago · Last pushed 7 months ago

README.Rmd

---
output: github_document
bibliography: vignettes/dbscan.bib
link-citations: yes
---

```{r echo=FALSE, results = 'asis'}
pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")
```

## Introduction

This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data.
The package includes: 
 
__Clustering__

- __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density].
- __Jarvis-Patrick Clustering:__ Clustering using a similarity measure based
on shared near neighbors [@jarvis1973].
- __SNN Clustering:__ Shared nearest neighbor clustering [@erdoz2003].
- __HDBSCAN:__ Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].
- __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of a hierarchical cluster tree [@campello2013density].
- __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods
  [@ankerst1999optics].

__Outlier Detection__

- __LOF:__ Local outlier factor algorithm [@breunig2000lof]. 
- __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical]. 

__Cluster Evaluation__

- __DBCV:__ Density-based clustering validation [@moulavi2014].

__Fast Nearest-Neighbor Search (using kd-trees)__

- __kNN search__
- __Fixed-radius NN search__


The implementations use the kd-tree data structure (from the ANN library) for faster k-nearest neighbor search.
For Euclidean distance, they are typically faster than the native R implementations (e.g., dbscan in package `fpc`) and the
implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/), and [Python's scikit-learn](https://scikit-learn.org/).
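
The two nearest-neighbor searches listed above can also be called directly. The following is a minimal sketch, assuming the iris measurements as example data; `k = 5` and `eps = 0.5` are illustrative values, not recommendations.

```{r}
library("dbscan")

# Assumption: the four numeric iris columns serve as example data.
x <- as.matrix(iris[, 1:4])

# k-nearest neighbor search: the 5 nearest neighbors of every point.
nn <- kNN(x, k = 5)
dim(nn$id)    # matrix of neighbor indices, one row per point
dim(nn$dist)  # matrix of the matching distances

# Fixed-radius search: all neighbors within eps = 0.5 of every point.
fr <- frNN(x, eps = 0.5)
lengths(fr$id)[1:5]  # number of neighbors found for the first five points
```

`kNN()` returns fixed-size matrices because every point has exactly `k` neighbors, while `frNN()` returns lists because the number of neighbors inside the radius varies by point.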

```{r echo=FALSE, results = 'asis'}
pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)
```

## Usage

Load the package and extract the numeric variables of the iris dataset as a matrix.
```{r}
library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])
```

DBSCAN
```{r}
db <- dbscan(x, eps = .42, minPts = 5)
db
```

Visualize the resulting clustering (noise points are shown in black).
```{r dbscan}
pairs(x, col = db$cluster + 1L)
```
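
The value of `eps` has to be chosen by the user. One common heuristic, sketched here under the assumption that `minPts = 5` is kept, is to plot the sorted distances to the (minPts - 1)-th nearest neighbor and look for a knee in the curve. The red reference line is only an illustration that marks the `eps` used in the DBSCAN call above.

```{r}
library("dbscan")
x <- as.matrix(iris[, 1:4])

# Sorted distance to the 4th nearest neighbor (k = minPts - 1 for minPts = 5).
kNNdistplot(x, k = 4)

# Mark the candidate eps; 0.42 is the value used in the DBSCAN call above.
abline(h = 0.42, col = "red", lty = 2)
```

Points to the right of the knee have unusually distant neighbors and are likely to be labeled as noise for an `eps` near the knee.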


OPTICS
```{r}
opt <- optics(x, eps = 1, minPts = 4)
opt
```

Extract a DBSCAN-like clustering from OPTICS
and create a reachability plot (the DBSCAN clusters extracted at `eps_cl = .4` are colored).
```{r OPTICS_extractDBSCAN, fig.height=3}
opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
```
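
The OPTICSXi extraction method listed in the introduction can be applied to the same ordering. This is a minimal sketch; `xi = 0.05` is an assumed, illustrative steepness threshold and has not been tuned for the iris data.

```{r}
library("dbscan")
x <- as.matrix(iris[, 1:4])

opt <- optics(x, eps = 1, minPts = 4)

# Assumption: xi = 0.05 is an illustrative threshold, not a tuned value.
opt_xi <- extractXi(opt, xi = 0.05)
opt_xi

# Reachability plot with the extracted clusters marked.
plot(opt_xi)
```

Unlike `extractDBSCAN()`, which cuts the reachability plot at a single global threshold, `extractXi()` looks for relative changes in reachability, so the extracted clusters may differ in density.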

HDBSCAN

```{r}
hdb <- hdbscan(x, minPts = 4)
hdb
```

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

```{r hdbscan, fig.height=4}
plot(hdb, show_flat = TRUE)
```
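
The outlier detection methods listed in the introduction can be tried on the same data. The sketch below reads the GLOSH scores from the HDBSCAN result and computes LOF scores for comparison. It assumes a recent package version in which the neighborhood-size argument of `lof()` is named `minPts`; the value `minPts = 5` is illustrative.

```{r}
library("dbscan")
x <- as.matrix(iris[, 1:4])

hdb <- hdbscan(x, minPts = 4)

# GLOSH scores are stored in the HDBSCAN result; larger values are more outlying.
head(sort(hdb$outlier_scores, decreasing = TRUE))

# LOF scores; values well above 1 suggest a point lies in a sparser region
# than its neighbors.
# Assumption: a recent package version where the argument is named minPts.
lof_scores <- lof(x, minPts = 5)
head(order(lof_scores, decreasing = TRUE))  # indices of the highest-scoring points
```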

## Using dbscan with tidyverse

`dbscan` provides `tidy()`, `augment()`, and `glance()` methods for all clustering algorithms, so they can
be used easily with tidyverse, ggplot2, and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).

```{r tidyverse, message=FALSE, warning=FALSE}
library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)
```

Get cluster statistics as a tibble

```{r tidyverse2}
tidy(db)
```

Visualize the clustering with ggplot2 (use an x for noise points)
```{r tidyverse3}
augment(db, x) %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(color = .cluster, shape = noise)) +
    scale_shape_manual(values=c(19, 4))

```




## Using dbscan from Python
R, the R package `dbscan`, and the Python package `rpy2` need to be installed.

```{python, eval = FALSE, python.reticulate = FALSE}
import pandas as pd
import numpy as np

### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                   header = None, 
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, minPts = 5)
print(db)
```

```
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
## 
##  0  1  2 
## 17 49 84 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints
```

```{python, eval = FALSE, python.reticulate = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
```

```
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)
```

## License 
The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission from the original author, Erich Schubert.

## Changes
* List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md)

## References

Owner

  • Name: Michael Hahsler
  • Login: mhahsler
  • Kind: user
  • Location: Dallas, TX
  • Company: SMU

I develop packages for AI, ML, and Data Science.

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Watch event: 28
  • Issue comment event: 3
  • Push event: 10
  • Fork event: 2
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 6
  • Watch event: 28
  • Issue comment event: 3
  • Push event: 10
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 311
  • Total Committers: 9
  • Avg Commits per committer: 34.556
  • Development Distribution Score (DDS): 0.174
Past Year
  • Commits: 49
  • Committers: 3
  • Avg Commits per committer: 16.333
  • Development Distribution Score (DDS): 0.388
Top Committers
| Name | Email | Commits |
|---|---|---|
| mhahsler | m****l@h****t | 257 |
| Matt Piekenbrock | m****k@g****m | 26 |
| Maximilian Muecke | m****n@g****m | 18 |
| Matt Piekenbrock | m****m | 5 |
| cmalzer | 1****r | 1 |
| Zach Schuster | z****r@g****m | 1 |
| Taekyun Kim | t****m@h****m | 1 |
| James Lamb | j****0@g****m | 1 |
| Erich Schubert | k****0 | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 51
  • Total pull requests: 28
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 3 months
  • Total issue authors: 47
  • Total pull request authors: 10
  • Average comments per issue: 3.65
  • Average comments per pull request: 1.5
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 0.67
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • m-muecke (3)
  • mhahsler (2)
  • moredatapls (2)
  • joeroe (2)
  • kmzapp (1)
  • zachmayer (1)
  • jwijffels (1)
  • kageazusa (1)
  • ZhenyiWangTHU (1)
  • akarlinsky (1)
  • Sandy4321 (1)
  • sverchkov (1)
  • ankhnesmerira (1)
  • GearFear (1)
  • elbamos (1)
Pull Request Authors
  • m-muecke (35)
  • moredatapls (2)
  • cmalzer (2)
  • peekxc (2)
  • mhahsler (2)
  • kno10 (2)
  • eduardokapp (1)
  • zschuster (1)
  • taekyunk (1)
  • jameslamb (1)
Top Labels
Issue Labels
enhancement (12) bug (10) question (7) invalid (2) help wanted (2)
Pull Request Labels
bug (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 29,201 last-month
  • Total docker downloads: 47,892
  • Total dependent packages: 60
    (may contain duplicates)
  • Total dependent repositories: 126
    (may contain duplicates)
  • Total versions: 36
  • Total maintainers: 1
cran.r-project.org: dbscan

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

  • Versions: 27
  • Dependent Packages: 58
  • Dependent Repositories: 123
  • Downloads: 29,201 Last month
  • Docker Downloads: 47,892
Rankings
Forks count: 1.3%
Stargazers count: 1.7%
Dependent packages count: 1.7%
Dependent repos count: 1.9%
Downloads: 2.9%
Average: 3.1%
Docker downloads count: 9.2%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-dbscan
  • Versions: 9
  • Dependent Packages: 2
  • Dependent Repositories: 3
Rankings
Dependent repos count: 17.9%
Dependent packages count: 19.6%
Average: 21.5%
Forks count: 23.8%
Stargazers count: 24.5%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • Rcpp >= 1.0.0 imports
  • graphics * imports
  • stats * imports
  • dendextend * suggests
  • fpc * suggests
  • igraph * suggests
  • knitr * suggests
  • microbenchmark * suggests
  • rmarkdown * suggests
  • testthat * suggests