dbscan
Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
Keywords
clustering
cran
dbscan
density-based-clustering
hdbscan
lof
optics
r
Repository
Basic Info
- Host: GitHub
- Owner: mhahsler
- License: gpl-3.0
- Language: C++
- Default Branch: master
- Size: 9.4 MB
Statistics
- Stars: 335
- Watchers: 13
- Forks: 63
- Open Issues: 4
- Releases: 17
Created over 10 years ago · Last pushed 7 months ago
README.Rmd
---
output: github_document
bibliography: vignettes/dbscan.bib
link-citations: yes
---
```{r echo=FALSE, results = 'asis'}
pkg <- 'dbscan'
source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")
```
## Introduction
This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data.
The package includes:
__Clustering__
- __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density].
- __Jarvis-Patrick Clustering__: Clustering using a similarity measure based
on shared near neighbors [@jarvis1973].
- __SNN Clustering__: Shared nearest neighbor clustering [@erdoz2003].
- __HDBSCAN:__ Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].
- __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of a hierarchical cluster tree [@campello2013density].
- __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods
[@ankerst1999optics].
__Outlier Detection__
- __LOF:__ Local outlier factor algorithm [@breunig2000lof].
- __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical].
__Cluster Evaluation__
- __DBCV:__ Density-based clustering validation [@moulavi2014].
__Fast Nearest-Neighbor Search (using kd-trees)__
- __kNN search__
- __Fixed-radius NN search__
The implementations use the kd-tree data structure (from the ANN library) for faster k-nearest neighbor search.
For Euclidean distance, they are typically faster than the native R implementations (e.g., dbscan in package `fpc`) and the
implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/), and [Python's scikit-learn](https://scikit-learn.org/).
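To make the roles of `eps` and `minPts` concrete, here is a minimal, illustrative Python sketch of the DBSCAN idea. It is not the package's C++ code: it swaps the kd-tree for a brute-force fixed-radius search, and the names `region_query` and `dbscan_sketch` are hypothetical. The sketch assumes that label 0 marks noise and that a point counts toward its own neighborhood.

```python
import math
from collections import deque


def region_query(points, i, eps):
    """Brute-force fixed-radius search (stand-in for a kd-tree query).

    Returns the indices of all points within eps of points[i], including i.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan_sketch(points, eps, min_pts):
    """Label each point: 0 = noise, 1, 2, ... = cluster id (assumed convention)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = 0  # provisional noise; may become a border point later
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == 0:
                labels[j] = cluster_id  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point too: keep expanding
    return labels


# Two tight groups of four points and one isolated point
points = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
          (5, 5), (5, 5.1), (5.1, 5), (5.1, 5.1),
          (10, 10)]
print(dbscan_sketch(points, eps=0.2, min_pts=3))
```

The brute-force search makes each query linear in the number of points, which is the cost a kd-tree is meant to reduce.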
```{r echo=FALSE, results = 'asis'}
pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)
```
## Usage
Load the package and use the numeric variables in the iris dataset.
```{r}
library("dbscan")
data("iris")
x <- as.matrix(iris[, 1:4])
```
DBSCAN
```{r}
db <- dbscan(x, eps = .42, minPts = 5)
db
```
Visualize the resulting clustering (noise points are shown in black).
```{r dbscan}
pairs(x, col = db$cluster + 1L)
```
OPTICS
```{r}
opt <- optics(x, eps = 1, minPts = 4)
opt
```
Extract a DBSCAN-like clustering from OPTICS
and create a reachability plot (the DBSCAN clusters extracted at `eps_cl = .4` are colored).
```{r OPTICS_extractDBSCAN, fig.height=3}
opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
```
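For intuition, the extraction step can be sketched in a few lines of Python. This is a simplified rendering of the ExtractDBSCAN procedure described in the OPTICS paper, not the package's implementation. The function name `extract_dbscan` and the input values are hypothetical; it assumes the reachability and core distances are given in OPTICS order, with `inf` for undefined reachability and 0 for noise.

```python
import math


def extract_dbscan(reachability, core_dist, eps_cl):
    """Assign labels along the OPTICS ordering (0 = noise, 1, 2, ... = clusters)."""
    labels = []
    current = 0      # id of the cluster currently being filled
    n_clusters = 0
    for reach, core in zip(reachability, core_dist):
        if reach > eps_cl:
            # Not density-reachable at eps_cl from the points seen so far
            if core <= eps_cl:
                n_clusters += 1          # core point at eps_cl: open a new cluster
                current = n_clusters
                labels.append(current)
            else:
                labels.append(0)         # neither reachable nor core: noise
        else:
            labels.append(current)       # reachable: joins the current cluster
    return labels


# Made-up values in OPTICS order, for illustration only
reachability = [math.inf, 0.2, 0.3, 0.9, 0.25, 0.2, 1.5]
core_dist = [0.2, 0.2, 0.3, 0.25, 0.25, 0.2, 1.2]
print(extract_dbscan(reachability, core_dist, eps_cl=0.4))
```

In a reachability plot, each cluster found this way corresponds to a valley that lies below the horizontal line at `eps_cl`.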
HDBSCAN
```{r}
hdb <- hdbscan(x, minPts = 4)
hdb
```
Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.
```{r hdbscan, fig.height=4}
plot(hdb, show_flat = TRUE)
```
## Using dbscan with tidyverse
`dbscan` provides `tidy()`, `augment()`, and `glance()` for all clustering algorithms, so they can
be used easily with tidyverse, ggplot2, and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).
```{r tidyverse, message=FALSE, warning=FALSE}
library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)
```
Get cluster statistics as a tibble
```{r tidyverse2}
tidy(db)
```
Visualize the clustering with ggplot2 (use an x for noise points)
```{r tidyverse3}
augment(db, x) %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = .cluster, shape = noise)) +
  scale_shape_manual(values = c(19, 4))
```
## Using dbscan from Python
R, the R package `dbscan`, and the Python package `rpy2` need to be installed.
```{python, eval = FALSE, python.reticulate = FALSE}
import pandas as pd
import numpy as np

# prepare data
iris = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    names=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

# run DBSCAN; the R argument is spelled minPts, as in the R examples above
db = dbscan.dbscan(iris_numeric, eps = 0.5, minPts = 5)
print(db)
```
```
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
##
## 0 1 2
## 17 49 84
##
## Available fields: cluster, eps, minPts, dist, borderPoints
```
```{python, eval = FALSE, python.reticulate = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
```
```
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
## dtype=int32)
```
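The label array above has shape 1 x n, and the printed summary indicates that label 0 marks noise. The following sketch flattens such an array and counts the cluster sizes using only the standard library. It uses a shortened, made-up label list rather than the real result, and `summarize_labels` is an illustrative helper, not part of `rpy2` or `dbscan`.

```python
from collections import Counter


def summarize_labels(label_matrix):
    """Flatten a 1 x n label matrix and count noise points and cluster sizes.

    Assumes label 0 marks noise, as in the printed DBSCAN summary.
    """
    flat = [int(v) for row in label_matrix for v in row]
    counts = Counter(flat)
    n_noise = counts.get(0, 0)
    cluster_sizes = {c: n for c, n in sorted(counts.items()) if c != 0}
    return n_noise, cluster_sizes


# Hypothetical, shortened labels with the same nesting as the array above
labels = [[1, 1, 1, 0, 2, 2, 0, 2]]
n_noise, cluster_sizes = summarize_labels(labels)
print(n_noise, cluster_sizes)
```

With the real result, `labels.tolist()` would give the nested list this helper expects.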
## License
The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission of the original author, Erich Schubert.
## Changes
* List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md)
## References
Owner
- Name: Michael Hahsler
- Login: mhahsler
- Kind: user
- Location: Dallas, TX
- Company: SMU
- Website: http://michael.hahsler.net
- Repositories: 32
- Profile: https://github.com/mhahsler
I develop packages for AI, ML, and Data Science.
GitHub Events
Total
- Create event: 1
- Release event: 1
- Issues event: 6
- Watch event: 28
- Issue comment event: 3
- Push event: 10
- Fork event: 2
Last Year
- Create event: 1
- Release event: 1
- Issues event: 6
- Watch event: 28
- Issue comment event: 3
- Push event: 10
- Fork event: 2
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| mhahsler | m****l@h****t | 257 |
| Matt Piekenbrock | m****k@g****m | 26 |
| Maximilian Muecke | m****n@g****m | 18 |
| Matt Piekenbrock | m****m | 5 |
| cmalzer | 1****r | 1 |
| Zach Schuster | z****r@g****m | 1 |
| Taekyun Kim | t****m@h****m | 1 |
| James Lamb | j****0@g****m | 1 |
| Erich Schubert | k****0 | 1 |
Committer Domains (Top 20 + Academic)
hahsler.net: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 51
- Total pull requests: 28
- Average time to close issues: about 1 year
- Average time to close pull requests: 3 months
- Total issue authors: 47
- Total pull request authors: 10
- Average comments per issue: 3.65
- Average comments per pull request: 1.5
- Merged pull requests: 24
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 0.67
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- m-muecke (3)
- mhahsler (2)
- moredatapls (2)
- joeroe (2)
- kmzapp (1)
- zachmayer (1)
- jwijffels (1)
- kageazusa (1)
- ZhenyiWangTHU (1)
- akarlinsky (1)
- Sandy4321 (1)
- sverchkov (1)
- ankhnesmerira (1)
- GearFear (1)
- elbamos (1)
Pull Request Authors
- m-muecke (35)
- moredatapls (2)
- cmalzer (2)
- peekxc (2)
- mhahsler (2)
- kno10 (2)
- eduardokapp (1)
- zschuster (1)
- taekyunk (1)
- jameslamb (1)
Top Labels
Issue Labels
enhancement (12)
bug (10)
question (7)
invalid (2)
help wanted (2)
Pull Request Labels
bug (1)
Packages
- Total packages: 2
- Total downloads:
  - cran: 29,201 last-month
- Total docker downloads: 47,892
- Total dependent packages: 60 (may contain duplicates)
- Total dependent repositories: 126 (may contain duplicates)
- Total versions: 36
- Total maintainers: 1
cran.r-project.org: dbscan
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms
- Homepage: https://github.com/mhahsler/dbscan
- Documentation: http://cran.r-project.org/web/packages/dbscan/dbscan.pdf
- License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
- Latest release: 1.2.3 (published 6 months ago)
Rankings
Forks count: 1.3%
Stargazers count: 1.7%
Dependent packages count: 1.7%
Dependent repos count: 1.9%
Downloads: 2.9%
Average: 3.1%
Docker downloads count: 9.2%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-dbscan
- Homepage: https://github.com/mhahsler/dbscan
- License: GPL-2.0-or-later
- Latest release: 1.1_11 (published over 3 years ago)
Rankings
Dependent repos count: 17.9%
Dependent packages count: 19.6%
Average: 21.5%
Forks count: 23.8%
Stargazers count: 24.5%
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- Rcpp >= 1.0.0 imports
- graphics * imports
- stats * imports
- dendextend * suggests
- fpc * suggests
- igraph * suggests
- knitr * suggests
- microbenchmark * suggests
- rmarkdown * suggests
- testthat * suggests