dsalmon

dSalmon is a framework for analyzing data streams. CN contact: Alexander Hartl

https://github.com/cn-tu/dsalmon


Repository

Basic Info
  • Host: GitHub
  • Owner: CN-TU
  • License: lgpl-3.0
  • Language: C++
  • Default Branch: master
  • Size: 1.76 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 6 years ago · Last pushed almost 2 years ago

README.rst

dSalmon
=======

.. image:: https://img.shields.io/github/license/CN-TU/dSalmon.svg
   :target: https://github.com/CN-TU/dSalmon/blob/master/LICENSE
   :alt: License
   
.. image:: https://readthedocs.org/projects/dsalmon/badge/?version=latest
   :target: https://dsalmon.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

dSalmon (**D**\ ata **S**\ tream **A**\ nalysis A\ **l**\ gorith\ **m**\ s f\ **o**\ r the Impatie\ **n**\ t) is a framework for analyzing data streams. The core algorithms are implemented in C++ for high processing speed, so that even vast amounts of data can be processed. Python bindings are provided for seamless integration into data science workflows.

Installation
------------
dSalmon can be installed from PyPI using

.. code-block:: sh

    pip3 install dSalmon

or directly from our `GitHub repository <https://github.com/CN-TU/dSalmon>`_:

.. code-block:: sh

    pip3 install git+https://github.com/CN-TU/dSalmon


Outlier Detectors
-----------------
dSalmon provides several algorithms for detecting outliers in data streams. The easiest way to use them is the Python interface, which resembles the interface of the algorithms in scikit-learn. The following example performs outlier detection with a window size of 100 samples.

.. code-block:: python

    from dSalmon import outlier
    import numpy as np
    from sklearn.datasets import fetch_kddcup99
    from sklearn.preprocessing import minmax_scale
    
    # Let scikit-learn fetch and preprocess some stream data
    kddcup = fetch_kddcup99()
    X = minmax_scale(np.delete(kddcup.data, (1,2,3), 1))

    # Perform outlier detection using a Robust Random Cut Forest
    detector = outlier.SWRRCT(window=100)
    outlier_scores = detector.fit_predict(X)
    print ('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])

Individual rows of the passed data are processed sequentially. Hence, while being substantially faster, the above code is equivalent to the following example.

.. code-block:: python

    from dSalmon import outlier
    import numpy as np
    from sklearn.datasets import fetch_kddcup99
    from sklearn.preprocessing import minmax_scale
    
    kddcup = fetch_kddcup99()
    X = minmax_scale(np.delete(kddcup.data, (1,2,3), 1))

    detector = outlier.SWRRCT(window=100)
    # Feed the rows one at a time and flatten the per-row results into a 1-D array
    outlier_scores = np.ravel([ detector.fit_predict(row) for row in X ])
    print ('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])

For an overview of provided outlier detection models, consult `dSalmon's documentation <https://dsalmon.readthedocs.io/en/latest/>`_.
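
The equivalence between batch and row-by-row processing follows from the detector keeping its sliding window as internal state. The following minimal sketch illustrates this with a hypothetical ``ToyWindowDetector`` (not part of dSalmon), which scores each sample by its Euclidean distance to the mean of the preceding ``window`` samples. Only the calling pattern is meant to mirror the examples above; the scoring rule is an assumption made for illustration.

.. code-block:: python

    from collections import deque
    import math

    class ToyWindowDetector:
        """Hypothetical stand-in for a sliding-window detector (not dSalmon code)."""

        def __init__(self, window):
            self.buffer = deque(maxlen=window)  # holds the most recent samples

        def fit_predict(self, rows):
            scores = []
            for row in rows:
                if self.buffer:
                    mean = [sum(p[d] for p in self.buffer) / len(self.buffer)
                            for d in range(len(row))]
                    scores.append(math.dist(row, mean))
                else:
                    scores.append(0.0)  # no reference data yet
                self.buffer.append(list(row))  # the oldest sample drops out
            return scores

    # A slowly drifting stream followed by one obvious outlier
    X = [[0.1 * i, 1.0] for i in range(20)] + [[9.0, 9.0]]

    # Batch call: all rows at once
    batch_scores = ToyWindowDetector(window=5).fit_predict(X)

    # Row-by-row calls on a fresh detector give the same scores
    detector = ToyWindowDetector(window=5)
    row_scores = [detector.fit_predict([row])[0] for row in X]

    top_index = max(range(len(X)), key=batch_scores.__getitem__)
    print('Scores identical:', batch_scores == row_scores)
    print('Index of top outlier:', top_index)

Because each call only appends to and reads from the same window, splitting the input into smaller chunks does not change the scores, only the call overhead.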


Obtaining Sliding-Window Statistics
-----------------------------------
Concept drift frequently requires computing statistics based on the most recently observed `N` data samples, since earlier portions of the stream are no longer relevant for the current point in time.

dSalmon provides a ``StatisticsTree``, which computes sliding-window statistics efficiently. The following listing computes the average and the 75th percentile of the data observed in a sliding window of length 100:

.. code-block:: python

    from dSalmon.trees import StatisticsTree
    import numpy as np

    data = np.random.rand(1000,2)

    tree = StatisticsTree(window=100, what=['average'], quantiles=[0.75])
    stats, sw_counts = tree.fit_query(data)
    print ('Averages:', stats[:,0,:])
    print ('75% percentiles:', stats[:,1,:])

``StatisticsTree`` allows querying several statistics simultaneously. By relying on tree-based methods, time complexity is logarithmic in window length, paving the way for analyzing streams with large memory lengths.
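
To make explicit what is being computed, here is a naive pure-Python reference for a sliding-window average and quantile over a one-dimensional stream. It is only a sketch: the function name ``sliding_stats`` is hypothetical, the window is assumed to include the current sample, and the quantile uses a simple nearest-rank rule that may differ from what ``StatisticsTree`` returns. Since it keeps a sorted list of the window contents, each update costs time proportional to the window length, which is the cost a tree-based structure avoids.

.. code-block:: python

    from bisect import bisect_left, insort
    from collections import deque
    import math

    def sliding_stats(stream, window, quantile):
        """Return (average, quantile) pairs over the last `window` samples."""
        recent = deque()   # samples in arrival order
        ordered = []       # the same samples, kept sorted
        total = 0.0
        results = []
        for x in stream:
            recent.append(x)
            insort(ordered, x)
            total += x
            if len(recent) > window:
                # Evict the oldest sample from both views of the window
                old = recent.popleft()
                del ordered[bisect_left(ordered, old)]
                total -= old
            # Nearest-rank quantile (an assumption, see the text above)
            rank = max(math.ceil(quantile * len(ordered)) - 1, 0)
            results.append((total / len(recent), ordered[rank]))
        return results

    stats = sliding_stats([1, 2, 3, 4, 5, 6], window=4, quantile=0.75)
    print('Last window (average, 75th percentile):', stats[-1])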

Stream Scaling
--------------
Traditional scaling is unrealistic for streaming data, since in a practical scenario it would require data observed in the future. Furthermore, due to concept drift, preprocessing and postprocessing of stream data frequently require scaling values with regard to recently observed values. dSalmon provides tools for these tasks, which perform z-score scaling and quantile scaling based on statistics observed in a sliding window. The following example performs outlier detection as demonstrated above, but uses sliding-window z-score scaling for preprocessing:

.. code-block:: python

    from dSalmon import outlier
    from dSalmon.scalers import SWZScoreScaler
    import numpy as np
    from sklearn.datasets import fetch_kddcup99
    
    # Let scikit-learn fetch and preprocess some stream data
    kddcup = fetch_kddcup99()

    scaler = SWZScoreScaler(window=1000)
    X = scaler.transform(np.delete(kddcup.data, (1,2,3), 1))

    # Omit the first `window` points to avoid transient effects
    X = X[1000:]

    # Perform outlier detection using a Robust Random Cut Forest
    detector = outlier.SWRRCT(window=100)
    outlier_scores = detector.fit_predict(X)
    print ('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])
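
The idea behind sliding-window z-score scaling can be sketched in a few lines of plain Python. This is an illustration, not dSalmon's implementation: ``sliding_zscore`` is a hypothetical helper, and it assumes that the window includes the current sample, that the population standard deviation is used, and that a zero deviation maps to 0. ``SWZScoreScaler`` may handle these details differently.

.. code-block:: python

    from collections import deque
    import statistics

    def sliding_zscore(stream, window):
        """Scale each value by the mean and std of the last `window` values."""
        recent = deque(maxlen=window)  # the oldest value is evicted automatically
        scaled = []
        for x in stream:
            recent.append(x)
            mean = statistics.fmean(recent)
            std = statistics.pstdev(recent)
            # Assumption: a constant window yields a scaled value of 0
            scaled.append((x - mean) / std if std > 0 else 0.0)
        return scaled

    # A level shift from about 0 to about 10. Each value is judged only
    # against the few values that precede it, never against future data.
    stream = [0.0, 0.1, -0.1, 0.0, 10.0, 10.1, 9.9, 10.0, 10.0, 10.1]
    scaled = sliding_zscore(stream, window=4)
    for value, z in zip(stream, scaled):
        print(f'{value:5.1f} -> {z:+.2f}')

The first few outputs are based on a partially filled window, which is the transient effect that the example above avoids by discarding the first ``window`` points.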

Efficient Nearest-Neighbor Queries
----------------------------------
dSalmon uses an M-Tree for several of its algorithms. An M-Tree is a spatial indexing data structure for metric spaces, allowing fast nearest-neighbor and range queries. The benefit of an M-Tree compared to, e.g., a KD-Tree or Ball-Tree is that insertion, updating and removal of points remain fast after the tree has been built.

For the development of custom algorithms, an M-Tree interface is provided for Python.
A point within a tree can be accessed either via ``tree[k]`` using the point's key ``k``, or via ``tree.ix[i]`` using the point's index ``i``. Keys can be arbitrary integers and are returned by ``insert()``, ``knn()`` and
``neighbors()``. Indices are integers in the range ``0...len(tree)``, sorted according to the points' keys in ascending order.

KNN queries can be performed using the ``knn()`` function and range queries can be performed using the ``neighbors()`` function.

The following example shows how to modify points within a tree and how to find nearest neighbors.

.. code-block:: python

    from dSalmon.trees import MTree
    import numpy as np

    tree = MTree()

    # insert a point [1,2,3,4] with key 5
    tree[5] = [1,2,3,4]

    # insert some random test data
    X = np.random.rand(1000,4)
    inserted_keys = tree.insert(X)

    # delete every second point
    del tree.ix[::2]

    # Set the coordinates of the point with the lowest key
    tree.ix[0] = [0,0,0,0]

    # find the 3 nearest neighbors to [0.5, 0.5, 0.5, 0.5]
    neighbor_keys, neighbor_distances, _ = tree.knn([.5,.5,.5,.5], k=3)
    print ('Neighbor keys:', neighbor_keys)
    print ('Neighbor distances:', neighbor_distances)

    # find all neighbors to [0.5, 0.5, 0.5, 0.5] within a radius of 0.2
    neighbor_keys, neighbor_distances, _ = tree.neighbors([.5,.5,.5,.5], radius=0.2)
    print ('Neighbor keys:', neighbor_keys)
    print ('Neighbor distances:', neighbor_distances)
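
When developing custom algorithms on top of the tree, it can help to check results against a brute-force reference. The sketch below stores points in a plain ``dict`` keyed by integer and answers the same two query types by scanning all points. The helper names ``brute_knn`` and ``brute_neighbors`` are hypothetical and not part of dSalmon, and both the Euclidean metric and the inclusive radius are assumptions. Unlike an M-Tree, every query here costs time linear in the number of stored points.

.. code-block:: python

    import heapq
    import math

    def brute_knn(points, query, k):
        """Return the k (key, distance) pairs closest to `query`."""
        distances = ((key, math.dist(p, query)) for key, p in points.items())
        return heapq.nsmallest(k, distances, key=lambda item: item[1])

    def brute_neighbors(points, query, radius):
        """Return all (key, distance) pairs within `radius` of `query`."""
        result = []
        for key, p in points.items():
            d = math.dist(p, query)
            if d <= radius:  # assumption: the boundary is included
                result.append((key, d))
        return result

    # Points addressed by arbitrary integer keys, as in the example above
    points = {5: [1, 2, 3, 4], 0: [0, 0, 0, 0], 7: [0.5, 0.5, 0.5, 0.6]}
    query = [0.5, 0.5, 0.5, 0.5]

    knn_result = brute_knn(points, query, k=2)
    range_result = brute_neighbors(points, query, radius=0.2)
    print('Nearest keys:', [key for key, _ in knn_result])
    print('Keys within radius:', [key for key, _ in range_result])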


Extending dSalmon
-----------------

dSalmon uses SWIG for generating wrapper code for the C++ core algorithms and instantiates single and double precision floating point variants of each algorithm.

Architecture
^^^^^^^^^^^^

The ``cpp`` folder contains the code for the C++ core algorithms, which might be used directly by C++ projects.

When using dSalmon from Python, the C++ algorithms are wrapped by the interfaces in the SWIG folder. These wrapper functions are translated to a Python interface and have the main purpose of providing an interface which can easily be parsed by SWIG.

Finally, the ``python`` folder contains the user-facing Python interface, which invokes the Python interface generated by SWIG.

Rebuilding
^^^^^^^^^^

When adding new algorithms or modifying the interface, the SWIG wrappers have to be rebuilt. To do so, install SWIG, then build and install a ``pip`` package using

.. code-block:: sh

    make && pip3 install dSalmon.tar.xz

Acknowledgements
----------------
This work was supported by the project MALware cOmmunication in cRitical Infrastructures (MALORI), funded by the Austrian security research program KIRAS of the Federal Ministry for Agriculture, Regions and Tourism (BMLRT) under grant no. 873511.

Owner

  • Name: CN Group, Institute of Telecommunications, TU Wien
  • Login: CN-TU
  • Kind: organization
  • Location: Vienna, Austria

Communication Networks Group, TU Wien

Citation (CITATION.cff)

cff-version: "1.1.0"
title: dSalmon
authors: 
  - family-names: Hartl
    given-names: Alexander
    affiliation: "TU Wien"
    orcid: "https://orcid.org/0000-0003-4376-9605"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/CN-TU/dSalmon"

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 76
  • Total Committers: 2
  • Avg Commits per committer: 38.0
  • Development Distribution Score (DDS): 0.013
Top Committers
Name Email Commits
Alexander Hartl a****l@t****t 75
Alexander Hartl a****l@g****t 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 8 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 1
  • Total maintainers: 1
pypi.org: dsalmon

dSalmon is a framework for analyzing data streams

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 8 Last month
Rankings
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Forks count: 29.8%
Stargazers count: 31.9%
Average: 32.3%
Downloads: 68.0%
Maintainers (1)
Last synced: 8 months ago

Dependencies

docs/requirements.txt pypi
  • numpy *
  • sphinx_rtd_theme ==1.0.0
  • sphinxcontrib-bibtex ==2.4.1
setup.py pypi
  • numpy *
pyproject.toml pypi