dbscan

Theoretically Efficient and Practical Parallel DBSCAN

https://github.com/wangyiqiu/dbscan-python

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, acm.org
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.1%) to scientific vocabulary

Keywords

clustering data-clustering data-mining dbscan dbscan-clustering machine-learning parallel-algorithm python
Last synced: 6 months ago

Repository

Theoretically Efficient and Practical Parallel DBSCAN

Basic Info
Statistics
  • Stars: 89
  • Watchers: 2
  • Forks: 18
  • Open Issues: 4
  • Releases: 0
Topics
clustering data-clustering data-mining dbscan dbscan-clustering machine-learning parallel-algorithm python
Created about 5 years ago · Last pushed 8 months ago
Metadata Files
Readme · License

README.md

Theoretically-Efficient and Practical Parallel DBSCAN


Overview

This repository hosts fast parallel DBSCAN clustering code for low-dimensional Euclidean space. The code automatically uses the available threads on a parallel shared-memory machine to speed up DBSCAN clustering. It stems from a paper presented at SIGMOD'20: Theoretically Efficient and Practical Parallel DBSCAN.

Our software on 1 thread is on par with all serial state-of-the-art DBSCAN packages, and provides additional speedup via multi-threading. Below, we show a simple benchmark comparing our code with the DBSCAN implementation of Sklearn, tested on a 6-core computer with 2-way hyperthreading using a 2-dimensional data set with 50,000 data points, where both implementations use all available threads. Our implementation is more than 32x faster. We also show a visualization of the clustering result on a smaller data set.
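
A comparison along these lines can be reproduced with the minimal sketch below; it assumes both the dbscan and scikit-learn packages are installed, and the eps/min_samples values are illustrative rather than the exact benchmark settings.

```python
# Minimal timing sketch (assumes: pip3 install dbscan scikit-learn).
# Parameter values are illustrative; absolute timings depend on hardware.
import time

from sklearn.cluster import DBSCAN as SkDBSCAN
from sklearn.datasets import make_blobs

from dbscan import DBSCAN

X, _ = make_blobs(n_samples=50000, centers=3, n_features=2, random_state=0)

t0 = time.perf_counter()
SkDBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X)  # sklearn, all cores
t1 = time.perf_counter()
DBSCAN(X, eps=0.3, min_samples=10)                   # this package, all cores
t2 = time.perf_counter()

print(f"sklearn: {t1 - t0:.2f}s, parallel dbscan: {t2 - t1:.2f}s")
```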

Data sets with dimensionality 2 - 20 are supported by default; this range can be changed by modifying DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS in the source code.
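
For example, the constraint can be checked on the Python side before calling in; a small sketch (the wrapper function below is our own convenience code, not part of the package):

```python
import numpy as np
from dbscan import DBSCAN

# Defaults per the README; change DBSCAN_MIN_DIMS/DBSCAN_MAX_DIMS in the
# C++ source and rebuild if you need a different range.
MIN_DIMS, MAX_DIMS = 2, 20

def cluster(X, eps=0.5, min_samples=5):
    """Hypothetical convenience wrapper that validates dimensionality."""
    X = np.ascontiguousarray(X, dtype=np.float64)
    if not MIN_DIMS <= X.shape[1] <= MAX_DIMS:
        raise ValueError(f"{X.shape[1]}-dimensional data is outside the "
                         f"supported range [{MIN_DIMS}, {MAX_DIMS}]")
    return DBSCAN(X, eps=eps, min_samples=min_samples)
```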

[Figure: timing example]

Tutorial

Option 1: Use the Python binding

There are two ways to install it:

  • Install it using PyPI: pip3 install --user dbscan (you can find the wheels here).
  • To build from scratch for testing: pip3 install -e . from the project root directory.

An example of using the Python module is provided in example.py; it generates the clustering example above.

Python API

```python
from dbscan import DBSCAN

labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)
```

Input

  • X: A 2-D Numpy array containing the input data points. The first dimension of X is the number of data points n, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
  • eps: The epsilon parameter (default 0.5).
  • min_samples: The minPts parameter (default 5).

Output

  • labels: A length n Numpy array (dtype=np.int32) containing cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of -1.
  • core_samples_mask: A length n Numpy array (dtype=bool) masking the core points, in the same ordering as the input data. A short post-processing sketch follows this list.
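
Typical post-processing of these two outputs uses only standard NumPy operations; a small sketch (nothing package-specific beyond the call itself):

```python
import numpy as np
from dbscan import DBSCAN

X = np.random.rand(1000, 2)  # toy input
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # skip noise ID
n_noise = int(np.count_nonzero(labels == -1))
core_points = X[core_samples_mask]                       # core samples
border_points = X[(labels != -1) & ~core_samples_mask]   # clustered, non-core
print(n_clusters, n_noise, len(core_points), len(border_points))
```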

We provide a complete example below that generates a toy data set, computes the DBSCAN clustering, and visualizes the result as shown in the plot above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
# direct call of the C API:
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

# OR calling our sklearn API:
# from dbscan import sklDBSCAN as DBSCAN
# db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
# labels = db.labels_

# Plot result
import matplotlib.pyplot as plt

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
```

Option 2: Use the binary executable

Compile and run the program:

```bash
mkdir build
cd build
cmake ..
cd executable
make -j # this will take a while
./dbscan -eps 0.1 -minpts 10 -o clusters.txt <data-file>
```

The <data-file> can be any CSV-like point data file, where each line contains a data point -- see an example here. The data file can be with or without a header. The cluster output clusters.txt will contain a cluster ID on each line (other than the first-line header), giving a cluster assignment in the same ordering as the input file. A noise point will have a cluster ID of -1.
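
For instance, a compatible data file can be written with NumPy; a minimal sketch (the file name points.csv and the header row are arbitrary choices):

```python
import numpy as np

# Write 1000 random 2-d points, one point per line; the header is optional.
X = np.random.rand(1000, 2)
np.savetxt("points.csv", X, delimiter=",", header="x,y", comments="")
```

The resulting file can then be passed as the <data-file> argument above.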

Option 3: Include directly in your own C++ program

Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".

dbscan.h:
```c++
template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

// equivalent to
// int DBSCAN(intT n, floatT PF[n][dim], double epsilon, intT minPts, bool coreFlagOut[n], intT coreFlag[n], intT cluster[n])
// if C++ syntax was a little more flexible

template<> int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);
```

dbscan.cpp:
```c++
#include "dbscan/algo.h"
#include "dbscan.h"
```

Calling the instantiated function:
```c++
int n = ...; // number of data points
double data[n][3] = ...; // data points
int labels[n]; // label ids get saved here
bool core_samples[n]; // a flag for whether the sample is a core sample is saved here
{
  int ignore[n];
  DBSCAN<3>(n, (double*)data, 70, 100, core_samples, ignore, labels);
}
```

Doing this will only compile the function for the number of dimensions that you want, which saves on compilation time.

You can also include "dbscan/capi.h" and define your own DBSCAN_MIN_DIMS and DBSCAN_MAX_DIMS macros the same way the Python extension uses them. The exported function has the following signature.
```c++
extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);
```
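
As an illustration of that signature, the exported symbol could be driven from Python via ctypes; this is a hedged sketch that assumes you have compiled the C API into a shared library yourself (the name libdbscan.so is hypothetical, and a 32-bit C int is assumed):

```python
import ctypes
import numpy as np

lib = ctypes.CDLL("./libdbscan.so")  # hypothetical library name/path
lib.DBSCAN.restype = ctypes.c_int
lib.DBSCAN.argtypes = [
    ctypes.c_int,                     # dim
    ctypes.c_int,                     # n
    np.ctypeslib.ndpointer(dtype=np.float64, ndim=2, flags="C_CONTIGUOUS"),
    ctypes.c_double,                  # epsilon
    ctypes.c_int,                     # minPts
    np.ctypeslib.ndpointer(dtype=np.bool_, ndim=1, flags="C_CONTIGUOUS"),
    np.ctypeslib.ndpointer(dtype=np.int32, ndim=1, flags="C_CONTIGUOUS"),
]

X = np.random.rand(1000, 3)
core_flag = np.zeros(len(X), dtype=np.bool_)
cluster = np.zeros(len(X), dtype=np.int32)
lib.DBSCAN(X.shape[1], len(X), X, 0.1, 10, core_flag, cluster)
```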

Right now, the only two files that are guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h" and the functions named DBSCAN within.

Citation

If you use our work in a publication, we would appreciate citations:

@inproceedings{wang2020theoretically,
  author = {Wang, Yiqiu and Gu, Yan and Shun, Julian},
  title = {Theoretically-Efficient and Practical Parallel DBSCAN},
  year = {2020},
  isbn = {9781450367356},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3318464.3380582},
  doi = {10.1145/3318464.3380582},
  booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  pages = {2555–2571},
  numpages = {17},
  keywords = {parallel algorithms, spatial clustering, DBScan},
  location = {Portland, OR, USA},
  series = {SIGMOD ’20}
}

Owner

  • Name: Yiqiu Wang
  • Login: wangyiqiu
  • Kind: user
  • Company: MIT

GitHub Events

Total
  • Create event: 6
  • Issues event: 6
  • Watch event: 17
  • Delete event: 6
  • Member event: 1
  • Issue comment event: 8
  • Push event: 17
  • Pull request event: 4
  • Pull request review event: 5
  • Pull request review comment event: 4
  • Fork event: 2
Last Year
  • Create event: 6
  • Issues event: 6
  • Watch event: 17
  • Delete event: 6
  • Member event: 1
  • Issue comment event: 8
  • Push event: 17
  • Pull request event: 4
  • Pull request review event: 5
  • Pull request review comment event: 4
  • Fork event: 2

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 76
  • Total Committers: 6
  • Avg Commits per committer: 12.667
  • Development Distribution Score (DDS): 0.579
Past Year
  • Commits: 30
  • Committers: 2
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.133
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Yiqiu Wang | y****g@i****m | 32 |
| Yiqiu Wang | y****k@g****m | 26 |
| anivegesana | a****a@g****m | 9 |
| Iurii Ivanov | i****i@p****o | 6 |
| anivegesana | a****a@b****m | 2 |
| Yiqiu Wang | y****w@m****u | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 12
  • Average time to close issues: 7 months
  • Average time to close pull requests: 7 days
  • Total issue authors: 15
  • Total pull request authors: 4
  • Average comments per issue: 3.24
  • Average comments per pull request: 3.83
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 7
  • Average time to close issues: 7 months
  • Average time to close pull requests: 2 days
  • Issue authors: 4
  • Pull request authors: 2
  • Average comments per issue: 0.25
  • Average comments per pull request: 0.43
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • william-silversmith (2)
  • amir-keramatian (2)
  • chi2liu (1)
  • sleeplessTLV (1)
  • nikisalli (1)
  • ShJacub (1)
  • zp-yang (1)
  • kevinjohncutler (1)
  • adamwawrzynski (1)
  • shteren1 (1)
  • mcmingchang (1)
  • sun11 (1)
  • detrin (1)
  • jacksongoode (1)
  • mohammedsalah98 (1)
Pull Request Authors
  • John-194 (5)
  • anivegesana (4)
  • skbill84 (2)
  • iurii-provizio (1)

Packages

  • Total packages: 1
  • Total downloads: 11,944 last month (pypi)
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 9
  • Total maintainers: 2
pypi.org: dbscan

Theoretically efficient and practical parallel DBSCAN

  • Versions: 9
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 11,944 last month
Rankings
  • Downloads: 6.0%
  • Stargazers count: 9.6%
  • Average: 9.7%
  • Dependent packages count: 10.0%
  • Forks count: 11.4%
  • Dependent repos count: 11.6%
Maintainers (2)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy >=
.github/workflows/build_wheels.yml actions
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/upload-artifact v3 composite
  • docker/setup-qemu-action v2 composite
  • pypa/cibuildwheel v2.11.2 composite
  • pypa/gh-action-pypi-publish v1.5.0 composite
pyproject.toml pypi