https://github.com/cn-tu/pysdoclust-stream

incremental stream clustering algorithm based on SDO

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CN-TU
  • License: gpl-3.0
  • Language: C++
  • Default Branch: main
  • Size: 296 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

pysdoclust-stream

SDOstreamclust

Incremental stream clustering (and outlier detection) algorithm based on Sparse Data Observers (SDO).

SDOstreamclust is suitable for large, multi-dimensional datasets where clusters are statistically well represented.


Dependencies

SDOstreamclust requires numpy.


Installation

SDOstreamclust can be installed from the main branch:

    pip3 install git+https://github.com/CN-TU/pysdoclust-stream.git@main

or simply download the main branch and run:

    pip3 install pysdoclust-stream-main.zip  


Folder Structure and Evaluation Experiments

The [cpp] folder contains the code for the C++ core algorithms, which might be used directly in C++ projects.

When using SDOstreamclust from Python, the C++ algorithms are wrapped by the interfaces in the [swig] folder. The main purpose of these wrapper functions is to provide an interface that SWIG can parse easily; SWIG then translates them into a Python interface.

The [python] folder contains the user-facing Python interface, which invokes the Python interface generated by SWIG.

Finally, the complete experiments, datasets, scripts and results for the paper "Stream Clustering Robust to Concept Drift" are provided in the [evaluation_tests] folder of the "evaluation" branch. They have been tested with Python v3.8.14.

A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust


Example

SDOstreamclust is a straightforward algorithm and very easy to configure. The main parameters are the number of observers k, which determines the size of the model, and the parameter T, which defines the memory of the algorithm.

Setting the right k (default=300) depends on the variability of the data and the expected number of clusters, but it is a robust parameter that performs well with values in the range [200, 500] in most scenarios. T (default=500), on the other hand, sets the model dynamics and inertia. Intuitively, it is the number of processed points after which the model has been fully replaced (on average). A low T is recommended when the data show very fast dynamics, whereas T should be set high if the data evolve slowly and retaining old clusters is desired.
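The relationship between k and T can be made concrete with a little arithmetic. The sketch below is not part of SDOstreamclust. It only assumes, following the intuition above, that a model of k observers is replaced once on average every T points, which implies an average of k/T observer replacements per processed point. The function names are hypothetical, and T=100 is an arbitrary value chosen for comparison.

```python
def replacement_rate(k: int, T: int) -> float:
    """Average observer replacements per processed point (assumed to be k / T)."""
    if k <= 0 or T <= 0:
        raise ValueError("k and T must be positive")
    return k / T


def expected_replacements(n_points: int, k: int, T: int) -> float:
    """Expected number of replacements after n_points, under the same assumption."""
    return n_points * replacement_rate(k, T)


# Defaults quoted in the text: k=300, T=500.
default_rate = replacement_rate(300, 500)
# After T points, the expected number of replacements equals the model size k.
after_T = expected_replacements(500, 300, 500)
# Same k with a shorter memory (illustrative T=100): faster model turnover.
fast_rate = replacement_rate(300, 100)

print(f"default: {default_rate:.2f} replacements per point")
print(f"after T points: {after_T:.0f} replacements")
print(f"T=100: {fast_rate:.2f} replacements per point")
```

Under this reading, lowering T while keeping k fixed makes the model turn over faster, which matches the recommendation to use a low T for fast-changing data.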

Additionally, input_buffer (default=0) establishes how many points are necessary for the observers to update the internal clustering. This strongly affects the processing speed. Most scenarios tolerate high input_buffer values without a significant loss of accuracy. Beyond these, the remaining parameters are inherited from SDOclust and SDOstream and do not usually require adjustment. They are described in the python/clustering.py file.

The following example code loads a data stream and initializes SDOstreamclust.

```python
from SDOstreamclust import clustering
import numpy as np
import pandas as pd

df = pd.read_csv('example/dataset.csv')
t = df['timestamp'].to_numpy()
x = df[['f0','f1']].to_numpy()
y = df['label'].to_numpy()

k = 200 # Model size
T = 400 # Time Horizon
ibuff = 10 # input buffer
classifier = clustering.SDOstreamclust(k=k, T=T, input_buffer=ibuff)
```

In the code below, the stream data is processed point by point. SDOstreamclust provides a clustering label and an outlierness score per point. It can also perform outlier thresholding internally by assigning the label -1 to outliers. To do this, outlier_handling=True must be set and the outlier_threshold (default=5) adjusted.

```python
all_predic = []
all_scores = []

block_size = 1 # per-point processing
for i in range(0, x.shape[0], block_size):
    chunk = x[i:i + block_size, :]
    chunk_time = t[i:i + block_size]
    labels, outlier_scores = classifier.fit_predict(chunk, chunk_time)
    all_predic.append(labels)
    all_scores.append(outlier_scores)
p = np.concatenate(all_predic) # clustering labels
s = np.concatenate(all_scores) # outlierness scores
s = -1/(s+1) # norm. to avoid inf scores

# Thresholding top outliers based on Chebyshev's inequality (88.9%)
th = np.mean(s)+3*np.std(s)
p[s>th]=-1

# Evaluation metrics
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import roc_auc_score
print("Adjusted Rand Index (clustering):", adjusted_rand_score(y,p))
print("ROC AUC score (outlier/anomaly detection):", roc_auc_score(y<0,s))
```

This gives ARI=0.97 and ROC-AUC=0.99. Note how SDOstreamclust assigns high outlierness scores to the first points of emerging clusters.
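The score normalization and the Chebyshev-based threshold used in the example can also be tried in isolation. The following is a minimal sketch on synthetic scores, not output from SDOstreamclust: the raw scores are randomly generated for illustration only, and the helper names are hypothetical. It applies the same transform s = -1/(s+1), which maps non-negative scores (including inf) monotonically into [-1, 0], and the same mean + 3*std threshold. Chebyshev's inequality guarantees that at least 1 - 1/3^2 (about 88.9%) of the values lie within three standard deviations of the mean, so at most about 11.1% of the points can be flagged.

```python
import numpy as np


def normalize_scores(raw: np.ndarray) -> np.ndarray:
    """Map non-negative scores into [-1, 0]; an inf score maps to 0."""
    return -1.0 / (raw + 1.0)


def chebyshev_threshold(scores: np.ndarray, n_std: float = 3.0) -> float:
    """Threshold at mean + n_std * std, as in the README example."""
    return float(np.mean(scores) + n_std * np.std(scores))


rng = np.random.default_rng(0)
# Synthetic raw scores (illustrative only): 1000 small values plus
# three injected large ones, one of them infinite.
raw = np.concatenate([rng.uniform(0.0, 0.2, 1000), [50.0, 200.0, np.inf]])

s = normalize_scores(raw)   # finite values in [-1, 0]
th = chebyshev_threshold(s)
flagged = s > th            # boolean mask of thresholded outliers

# Chebyshev bound: at most 1 / n_std**2 of the points can exceed the threshold.
bound = 1.0 / 3.0**2
print("threshold:", th)
print("flagged points:", int(flagged.sum()))
print("flagged fraction:", flagged.mean(), "bound:", bound)
```

Because the transform is monotonic, it preserves the ranking of the scores, so rank-based metrics such as ROC AUC are unaffected by it; it only makes the mean and standard deviation well defined when infinite scores occur.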

Owner

  • Name: CN Group, Institute of Telecommunications, TU Wien
  • Login: CN-TU
  • Kind: organization
  • Location: Vienna, Austria

Communication Networks Group, TU Wien

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 6 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: pysdoclust-stream

SDOstreamclust is an algorithm for clustering data streams

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 6 Last month
Rankings
Dependent packages count: 10.0%
Average: 33.2%
Dependent repos count: 56.3%
Maintainers (1)
Last synced: 10 months ago

Dependencies

pyproject.toml pypi
setup.py pypi
  • numpy *