dsalmon
dSalmon is a framework for analyzing data streams. CN contact: Alexander Hartl
Repository
Basic Info
- Host: GitHub
- Owner: CN-TU
- License: lgpl-3.0
- Language: C++
- Default Branch: master
- Size: 1.76 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created almost 6 years ago
· Last pushed almost 2 years ago
README.rst
dSalmon
=======
.. image:: https://img.shields.io/github/license/CN-TU/dSalmon.svg
   :target: https://github.com/CN-TU/dSalmon/blob/master/LICENSE
   :alt: License

.. image:: https://readthedocs.org/projects/dsalmon/badge/?version=latest
   :target: https://dsalmon.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

dSalmon (**D**\ ata **S**\ tream **A**\ nalysis A\ **l**\ gorith\ **m**\ s f\ **o**\ r the Impatie\ **n**\ t) is a framework for analyzing data streams. The core algorithms are implemented in C++ with a focus on processing speed, so that even vast amounts of data can be processed. Python bindings are provided to allow seamless integration into data science workflows.
Installation
------------
dSalmon can be installed from PyPI using

.. code-block:: sh

    pip3 install dSalmon

or directly from our `GitHub repository <https://github.com/CN-TU/dSalmon>`_:

.. code-block:: sh

    pip3 install git+https://github.com/CN-TU/dSalmon

Outlier Detectors
-----------------
dSalmon provides several algorithms for detecting outliers in data streams. The easiest way to use them is the Python interface, which resembles the interface of the algorithms in scikit-learn. The following example performs outlier detection with a window size of 100 samples.
.. code-block:: python

    from dSalmon import outlier
    import numpy as np
    from sklearn.datasets import fetch_kddcup99
    from sklearn.preprocessing import minmax_scale

    # Let scikit-learn fetch and preprocess some stream data
    kddcup = fetch_kddcup99()
    X = minmax_scale(np.delete(kddcup.data, (1,2,3), 1))

    # Perform outlier detection using a Robust Random Cut Forest
    detector = outlier.SWRRCT(window=100)
    outlier_scores = detector.fit_predict(X)
    print('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])

Individual rows of the passed data are processed sequentially. Hence, while being substantially faster, the above code is equivalent to the following example.
.. code-block:: python

    from dSalmon import outlier
    import numpy as np
    from sklearn.datasets import fetch_kddcup99
    from sklearn.preprocessing import minmax_scale

    kddcup = fetch_kddcup99()
    X = minmax_scale(np.delete(kddcup.data, (1,2,3), 1))

    detector = outlier.SWRRCT(window=100)
    # Feed the rows one at a time; the detector keeps its window between calls
    outlier_scores = [ detector.fit_predict(row) for row in X ]
    print('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])

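The equivalence described above follows from the detector keeping its sliding window between calls. The following sketch illustrates the idea with a hypothetical toy detector written in plain Python. ``ToyWindowDetector`` and its scoring rule (distance to the window mean) are assumptions made for illustration only and are not part of dSalmon.

```python
from collections import deque
import math


class ToyWindowDetector:
    """Hypothetical stand-in for a sliding-window detector (not a dSalmon class)."""

    def __init__(self, window):
        # Keeps only the most recent `window` values across calls
        self.buffer = deque(maxlen=window)

    def fit_predict(self, values):
        scores = []
        for value in values:
            if self.buffer:
                mean = sum(self.buffer) / len(self.buffer)
                scores.append(abs(value - mean))
            else:
                scores.append(0.0)  # nothing to compare against yet
            self.buffer.append(value)
        return scores


stream = [math.sin(0.1 * i) for i in range(500)]
stream[250] = 5.0  # inject an obvious outlier

# Variant 1: pass the whole stream at once
batch_scores = ToyWindowDetector(window=100).fit_predict(stream)

# Variant 2: pass the stream in chunks of 50 samples to a single detector
chunked = ToyWindowDetector(window=100)
chunk_scores = []
for start in range(0, len(stream), 50):
    chunk_scores.extend(chunked.fit_predict(stream[start:start + 50]))

top_index = max(range(len(batch_scores)), key=batch_scores.__getitem__)
print('Scores identical:', batch_scores == chunk_scores)
print('Highest-scoring sample:', top_index)
```

Both variants process the samples in the same order against the same window contents, so the scores match regardless of how the stream is split.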
For an overview of provided outlier detection models, consult `dSalmon's documentation <https://dsalmon.readthedocs.io>`_.
Obtaining Sliding-Window Statistics
-----------------------------------
Concept drift frequently requires computing statistics based on the most recently observed `N` data samples, since earlier portions of the stream are no longer relevant for the current point in time.
dSalmon provides a ``StatisticsTree``, which allows sliding-window statistics to be computed efficiently. The following listing computes the average and the 75th percentile of the data observed in a sliding window of length 100:
.. code-block:: python

    from dSalmon.trees import StatisticsTree
    import numpy as np

    data = np.random.rand(1000,2)
    tree = StatisticsTree(window=100, what=['average'], quantiles=[0.75])
    stats, sw_counts = tree.fit_query(data)
    print('Averages:', stats[:,0,:])
    print('75% percentiles:', stats[:,1,:])

``StatisticsTree`` allows various statistics to be queried simultaneously. Because it relies on tree-based methods, time complexity is logarithmic in window length, paving the way for analyzing streams with large memory lengths.
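To make the meaning of these sliding-window statistics concrete, here is a naive reference sketch in plain Python. It is not how ``StatisticsTree`` is implemented: it re-sorts the window for every sample, which is the repeated work a tree-based structure is meant to avoid. The function name, the inclusion of the current sample in the window, and the linear interpolation used for the quantile are assumptions.

```python
from collections import deque


def sliding_window_stats(samples, window, quantile):
    """Naive reference: average and quantile over the last `window` samples."""
    buffer = deque(maxlen=window)
    results = []
    for sample in samples:
        buffer.append(sample)  # the current sample is part of the window (assumption)
        ordered = sorted(buffer)  # re-sorting costs O(N log N) per sample
        average = sum(ordered) / len(ordered)
        # Linear interpolation between the two closest ranks (assumption)
        position = quantile * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        q_value = ordered[lower] + fraction * (ordered[upper] - ordered[lower])
        results.append((average, q_value))
    return results


stats = sliding_window_stats([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=4, quantile=0.75)
for average, q75 in stats:
    print(f'average={average:.2f}  75th percentile={q75:.2f}')
```

For the last sample the window holds 3, 4, 5 and 6, so the sketch reports an average of 4.5 and a 75th percentile of 5.25.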
Stream Scaling
--------------
Traditional scaling is unrealistic for streaming data, since in a practical scenario it would require using data observed in the future. Furthermore, due to concept drift, preprocessing and postprocessing of stream data frequently require scaling values with regard to recently observed values. dSalmon provides tools for these tasks, which perform z-score scaling and quantile scaling based on statistics observed in a sliding window. The following example performs outlier detection as demonstrated above, but uses sliding-window z-score scaling for preprocessing:
.. code-block:: python

    from dSalmon import outlier
    from dSalmon.scalers import SWZScoreScaler
    import numpy as np
    from sklearn.datasets import fetch_kddcup99

    # Let scikit-learn fetch some stream data and scale it in a sliding window
    kddcup = fetch_kddcup99()
    scaler = SWZScoreScaler(window=1000)
    X = scaler.transform(np.delete(kddcup.data, (1,2,3), 1))

    # Omit the first `window` points to avoid transient effects
    X = X[1000:]

    # Perform outlier detection using a Robust Random Cut Forest
    detector = outlier.SWRRCT(window=100)
    outlier_scores = detector.fit_predict(X)
    print('Top 10 outliers: ', np.argsort(outlier_scores)[-10:])

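The idea behind sliding-window z-score scaling can be sketched in a few lines of plain Python. This is an illustrative reference only, not dSalmon's implementation: the function name, the use of the population variance, and the inclusion of the current sample in the window are assumptions.

```python
from collections import deque
import math


def sliding_zscore(samples, window):
    """Scale each sample by the mean and standard deviation of the last `window` samples."""
    buffer = deque(maxlen=window)
    scaled = []
    for sample in samples:
        buffer.append(sample)  # window includes the current sample (assumption)
        mean = sum(buffer) / len(buffer)
        # Population variance over the window (assumption)
        variance = sum((v - mean) ** 2 for v in buffer) / len(buffer)
        std = math.sqrt(variance)
        scaled.append((sample - mean) / std if std > 0 else 0.0)
    return scaled


# A stream whose level shifts from around 0 to around 100 (concept drift)
stream = [0.0, 1.0] * 10 + [100.0, 101.0] * 10
scaled = sliding_zscore(stream, window=4)
print('Before the shift:', scaled[18:20])
print('After the shift: ', scaled[38:40])
```

Only samples that have already been observed enter the statistics, and once the window has moved past the level shift, the scaled values look the same as before it.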
Efficient Nearest-Neighbor Queries
----------------------------------
dSalmon uses an M-Tree for several of its algorithms. An M-Tree is a spatial indexing data structure for metric spaces, allowing fast nearest-neighbor and range queries. The benefit of an M-Tree compared to, e.g., a KD-Tree or Ball-Tree is that insertion, updating and removal of points remain fast after the tree has been built.
For the development of custom algorithms, an M-Tree interface is provided for Python.
A point within a tree can be accessed either via ``tree[k]`` using the point's key ``k``, or via ``tree.ix[i]`` using the point's index ``i``. Keys can be arbitrary integers and are returned by ``insert()``, ``knn()`` and ``neighbors()``. Indices are integers from ``0`` to ``len(tree)-1`` that enumerate the points in ascending order of their keys.
KNN queries can be performed using the ``knn()`` function and range queries can be performed using the ``neighbors()`` function.
The following example shows how to modify points within a tree and how to find nearest neighbors.
.. code-block:: python

    from dSalmon.trees import MTree
    import numpy as np

    tree = MTree()

    # insert a point [1,2,3,4] with key 5
    tree[5] = [1,2,3,4]

    # insert some random test data
    X = np.random.rand(1000,4)
    inserted_keys = tree.insert(X)

    # delete every second point
    del tree.ix[::2]

    # Set the coordinates of the point with the lowest key
    tree.ix[0] = [0,0,0,0]

    # find the 3 nearest neighbors to [0.5, 0.5, 0.5, 0.5]
    neighbor_keys, neighbor_distances, _ = tree.knn([.5,.5,.5,.5], k=3)
    print('Neighbor keys:', neighbor_keys)
    print('Neighbor distances:', neighbor_distances)

    # find all neighbors to [0.5, 0.5, 0.5, 0.5] within a radius of 0.2
    neighbor_keys, neighbor_distances, _ = tree.neighbors([.5,.5,.5,.5], radius=0.2)
    print('Neighbor keys:', neighbor_keys)
    print('Neighbor distances:', neighbor_distances)

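The difference between keys and indices, and the meaning of KNN and range queries, can be illustrated with a brute-force sketch in plain Python. It does not use an M-Tree and does not reproduce dSalmon's return values. The helper names, the Euclidean metric, and the inclusive radius are assumptions.

```python
import math

# Hypothetical key -> point mapping standing in for the contents of a tree
points = {
    5: [1.0, 2.0],
    9: [0.0, 0.0],
    2: [0.5, 0.5],
    7: [0.4, 0.6],
}


def key_at_index(index):
    """Indices enumerate the points in ascending key order."""
    return sorted(points)[index]


def brute_force_knn(query, k):
    """Return (key, distance) pairs of the k points closest to `query`."""
    ranked = sorted((math.dist(query, p), key) for key, p in points.items())
    return [(key, dist) for dist, key in ranked[:k]]


def brute_force_neighbors(query, radius):
    """Return (key, distance) pairs of all points within `radius` of `query`."""
    everything = brute_force_knn(query, len(points))
    return [(key, dist) for key, dist in everything if dist <= radius]


print('Key at index 0:', key_at_index(0))
print('2 nearest neighbors of [0.5, 0.5]:', brute_force_knn([0.5, 0.5], k=2))
print('Neighbors within 0.2:', brute_force_neighbors([0.5, 0.5], radius=0.2))
```

A brute-force scan computes the distance to every stored point for each query, which is the cost that a spatial index avoids.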
Extending dSalmon
-----------------
dSalmon uses SWIG for generating wrapper code for the C++ core algorithms and instantiates single and double precision floating point variants of each algorithm.
Architecture
^^^^^^^^^^^^
The ``cpp`` folder contains the code for the C++ core algorithms, which might be used directly by C++ projects.
When using dSalmon from Python, the C++ algorithms are wrapped by the interfaces in the ``swig`` folder. The main purpose of these wrapper functions is to provide an interface that SWIG can parse easily; SWIG then translates them to a low-level Python interface.
Finally, the ``python`` folder contains the user-facing Python interface, which invokes the interface generated by SWIG.
Rebuilding
^^^^^^^^^^
When adding new algorithms or modifying the interface, the SWIG wrappers have to be rebuilt. This requires SWIG to be installed. A ``pip`` package can then be created and installed using

.. code-block:: sh

    make && pip3 install dSalmon.tar.xz

Acknowledgements
----------------
This work was supported by the project MALware cOmmunication in cRitical Infrastructures (MALORI), funded by the Austrian security research program KIRAS of the Federal Ministry for Agriculture, Regions and Tourism (BMLRT) under grant no. 873511.
Owner
- Name: CN Group, Institute of Telecommunications, TU Wien
- Login: CN-TU
- Kind: organization
- Location: Vienna, Austria
- Repositories: 16
- Profile: https://github.com/CN-TU
Communication Networks Group, TU Wien
Citation (CITATION.cff)
cff-version: "1.1.0"
title: dSalmon
authors:
- family-names: Hartl
given-names: Alexander
affiliation: "TU Wien"
orcid: "https://orcid.org/0000-0003-4376-9605"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/CN-TU/dSalmon"
Committers
Last synced: about 3 years ago
All Time
- Total Commits: 76
- Total Committers: 2
- Avg Commits per committer: 38.0
- Development Distribution Score (DDS): 0.013
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alexander Hartl | a****l@t****t | 75 |
| Alexander Hartl | a****l@g****t | 1 |
Committer Domains (Top 20 + Academic)
gmx.at: 1
tuwien.ac.at: 1
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: 8 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 1
- Total maintainers: 1
pypi.org: dsalmon
dSalmon is a framework for analyzing data streams
- Homepage: https://github.com/CN-TU/dSalmon
- Documentation: https://dSalmon.readthedocs.io
- License: LGPL-3.0
- Latest release: 0.1 (published over 4 years ago)
Rankings
Dependent packages count: 10.1%
Dependent repos count: 21.6%
Forks count: 29.8%
Stargazers count: 31.9%
Average: 32.3%
Downloads: 68.0%
Maintainers (1)
Last synced: 8 months ago
Dependencies
docs/requirements.txt
pypi
- numpy *
- sphinx_rtd_theme ==1.0.0
- sphinxcontrib-bibtex ==2.4.1
setup.py
pypi
- numpy *
pyproject.toml
pypi