rrcf

rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams - Published in JOSS (2019)

https://github.com/klabum/rrcf

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    3 of 7 committers (42.9%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

anomaly-detection detect-outliers machine-learning outliers python random-forest robust-random-cut-forest streaming-data tree

Scientific Fields

Engineering Computer Science - 40% confidence
Last synced: 4 months ago · JSON representation

Repository

🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams

Basic Info
Statistics
  • Stars: 515
  • Watchers: 18
  • Forks: 113
  • Open Issues: 31
  • Releases: 7
Topics
anomaly-detection detect-outliers machine-learning outliers python random-forest robust-random-cut-forest streaming-data tree
Created about 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

rrcf 🌲🌲🌲

Build Status Coverage Status Python 3.6 GitHub status

Implementation of the Robust Random Cut Forest Algorithm for anomaly detection by Guha et al. (2016).

S. Guha, N. Mishra, G. Roy, & O. Schrijvers, Robust random cut forest based anomaly detection on streams, in Proceedings of the 33rd International conference on machine learning, New York, NY, 2016 (pp. 2712-2721).

About

The Robust Random Cut Forest (RRCF) algorithm is an ensemble method for detecting outliers in streaming data. RRCF offers a number of features that many competing anomaly detection algorithms lack. Specifically, RRCF:

  • Is designed to handle streaming data.
  • Performs well on high-dimensional data.
  • Reduces the influence of irrelevant dimensions.
  • Gracefully handles duplicates and near-duplicates that could otherwise mask the presence of outliers.
  • Features an anomaly-scoring algorithm with a clear underlying statistical meaning.

This repository provides an open-source implementation of the RRCF algorithm and its core data structures for the purposes of facilitating experimentation and enabling future extensions of the RRCF algorithm.

Documentation

Read the docs here 📖.

Installation

Use pip to install rrcf via pypi:

shell $ pip install rrcf

Currently, only Python 3 is supported.

Dependencies

The following dependencies are required to install and use rrcf:

The following optional dependencies are required to run the examples shown in the documentation:

Listed version numbers have been tested and are known to work (this does not necessarily preclude older versions).

Robust random cut trees

A robust random cut tree (RRCT) is a binary search tree that can be used to detect outliers in a point set. A RRCT can be instantiated from a point set. Points can also be added and removed from an RRCT.

Creating the tree

```python import numpy as np import rrcf

A (robust) random cut tree can be instantiated from a point set (n x d)

X = np.random.randn(100, 2) tree = rrcf.RCTree(X)

A random cut tree can also be instantiated with no points

tree = rrcf.RCTree() ```

Inserting points

```python tree = rrcf.RCTree()

for i in range(6): x = np.random.randn(2) tree.insert_point(x, index=i) ```

─+ ├───+ │ ├───+ │ │ ├──(0) │ │ └───+ │ │ ├──(5) │ │ └──(4) │ └───+ │ ├──(2) │ └──(3) └──(1)

Deleting points

tree.forget_point(2)

─+ ├───+ │ ├───+ │ │ ├──(0) │ │ └───+ │ │ ├──(5) │ │ └──(4) │ └──(3) └──(1)

Anomaly score

The likelihood that a point is an outlier is measured by its collusive displacement (CoDisp): if including a new point significantly changes the model complexity (i.e. bit depth), then that point is more likely to be an outlier.

```python

Seed tree with zero-mean, normally distributed data

X = np.random.randn(100,2) tree = rrcf.RCTree(X)

Generate an inlier and outlier point

inlier = np.array([0, 0]) outlier = np.array([4, 4])

Insert into tree

tree.insertpoint(inlier, index='inlier') tree.insertpoint(outlier, index='outlier') ```

```python tree.codisp('inlier')

1.75 ```

```python tree.codisp('outlier')

39.0 ```

Batch anomaly detection

This example shows how a robust random cut forest can be used to detect outliers in a batch setting. Outliers correspond to large CoDisp.

```python import numpy as np import pandas as pd import rrcf

Set parameters

np.random.seed(0) n = 2010 d = 3 numtrees = 100 treesize = 256

Generate data

X = np.zeros((n, d)) X[:1000,0] = 5 X[1000:2000,0] = -5 X += 0.01np.random.randn(X.shape)

Construct forest

forest = [] while len(forest) < numtrees: # Select random subsets of points uniformly from point set ixs = np.random.choice(n, size=(n // treesize, treesize), replace=False) # Add sampled trees to forest trees = [rrcf.RCTree(X[ix], indexlabels=ix) for ix in ixs] forest.extend(trees)

Compute average CoDisp

avgcodisp = pd.Series(0.0, index=np.arange(n)) index = np.zeros(n) for tree in forest: codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves}) avgcodisp[codisp.index] += codisp np.add.at(index, codisp.index.values, 1) avg_codisp /= index ```

Image

Streaming anomaly detection

This example shows how the algorithm can be used to detect anomalies in streaming time series data.

```python import numpy as np import rrcf

Generate data

n = 730 A = 50 center = 100 phi = 30 T = 2np.pi/100 t = np.arange(n) sin = Anp.sin(Tt-phiT) + center sin[235:255] = 80

Set tree parameters

numtrees = 40 shinglesize = 4 tree_size = 256

Create a forest of empty trees

forest = [] for _ in range(num_trees): tree = rrcf.RCTree() forest.append(tree)

Use the "shingle" generator to create rolling window

points = rrcf.shingle(sin, size=shingle_size)

Create a dict to store anomaly score of each point

avg_codisp = {}

For each shingle...

for index, point in enumerate(points): # For each tree in the forest... for tree in forest: # If tree is above permitted size, drop the oldest point (FIFO) if len(tree.leaves) > treesize: tree.forgetpoint(index - treesize) # Insert the new point into the tree tree.insertpoint(point, index=index) # Compute codisp on the new point and take the average among all trees if not index in avgcodisp: avgcodisp[index] = 0 avgcodisp[index] += tree.codisp(index) / numtrees ```

Image

Obtain feature importance

This example shows how to estimate the feature importance using the dimension of cut obtained during the calculation of the CoDisp.

```python import numpy as np import pandas as pd import rrcf

Set parameters

np.random.seed(0) n = 2010 d = 3 numtrees = 100 treesize = 256

Generate data

X = np.zeros((n, d)) X[:1000,0] = 5 X[1000:2000,0] = -5 X += 0.01np.random.randn(X.shape)

Construct forest

forest = [] while len(forest) < numtrees: # Select random subsets of points uniformly from point set ixs = np.random.choice(n, size=(n // treesize, treesize), replace=False) # Add sampled trees to forest trees = [rrcf.RCTree(X[ix], indexlabels=ix) for ix in ixs] forest.extend(trees)

Compute average CoDisp with the cut dimension for each point

dimcodisp = np.zeros([n,d],dtype=float) index = np.zeros(n) for tree in forest: for leaf in tree.leaves: codisp,cutdim = tree.codispwithcutdimension(leaf)

    dim_codisp[leaf,cutdim] += codisp 

    index[leaf] += 1

avgcodisp = dimcodisp.sum(axis=1)/index

codisp anomaly threshold and calculate the mean over each feature

featureimportanceanomaly = np.mean(dimcodisp[avgcodisp>50,:],axis=0)

create a dataframe with the feature importance

dffeatureimportance = pd.DataFrame(featureimportanceanomaly,columns=['featureimportance']) dffeature_importance ``` Image

Contributing

We welcome contributions to the rrcf repo. To contribute, submit a pull request to the dev branch.

Types of contributions

Some suggested types of contributions include:

  • Bug fixes
  • Documentation improvements
  • Performance enhancements
  • Extensions to the algorithm

Check the issue tracker for any specific issues that need help. If you encounter a problem using rrcf, or have an idea for an extension, feel free to raise an issue.

Guidelines for contributors

Please consider the following guidelines when contributing to the codebase:

  • Ensure that any new methods, functions or classes include docstrings. Docstrings should include a description of the code, as well as descriptions of the inputs (arguments) and outputs (returns). Providing an example use case is recommended (see existing methods for examples).
  • Write unit tests for any new code and ensure that all tests are passing with no warnings. Please ensure that overall code coverage does not drop below 80%.

Running unit tests

To run unit tests, first ensure that pytest and pytest-cov are installed:

$ pip install pytest pytest-cov

To run the tests, navigate to the root directory of the repo and run:

$ pytest --cov=rrcf/

Citing

If you have used this codebase in a publication and wish to cite it, please use the Journal of Open Source Software article.

M. Bartos, A. Mullapudi, & S. Troutman, rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams, in: Journal of Open Source Software, The Open Journal, Volume 4, Number 35. 2019

bibtex @article{bartos_2019_rrcf, title={{rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams}}, authors={Matthew Bartos and Abhiram Mullapudi and Sara Troutman}, journal={{The Journal of Open Source Software}}, volume={4}, number={35}, pages={1336}, year={2019} }

Owner

  • Name: Real-time water systems lab
  • Login: kLabUM
  • Kind: organization

JOSS Publication

rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
Published
March 29, 2019
Volume 4, Issue 35, Page 1336
Authors
Matthew D. Bartos ORCID
Department of Civil and Environmental Engineering, University of Michigan
Abhiram Mullapudi ORCID
Department of Civil and Environmental Engineering, University of Michigan
Sara C. Troutman ORCID
Department of Civil and Environmental Engineering, University of Michigan
Editor
Viviane Pons ORCID
Tags
outlier detection machine learning ensemble methods random forests

GitHub Events

Total
  • Issues event: 1
  • Watch event: 18
  • Issue comment event: 3
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Issues event: 1
  • Watch event: 18
  • Issue comment event: 3
  • Pull request event: 1
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 227
  • Total Committers: 7
  • Avg Commits per committer: 32.429
  • Development Distribution Score (DDS): 0.154
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Matt Bartos m****s@u****u 192
Sara Troutman s****m@u****u 15
Abhiram a****m@u****u 12
javier.vera j****a@p****m 5
arrrrrmin i****o@d****o 1
Matthew Whitworth m****t@t****m 1
Krishna Sangeeth k****h@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 47
  • Total pull requests: 57
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 3 days
  • Total issue authors: 34
  • Total pull request authors: 12
  • Average comments per issue: 2.13
  • Average comments per pull request: 0.32
  • Merged pull requests: 49
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mdbartos (10)
  • TrigonaMinima (3)
  • stianvale (2)
  • shaye059 (2)
  • qiezikuangge (1)
  • shihyulee (1)
  • futtery (1)
  • liuguiyangnwpu (1)
  • adavoli91 (1)
  • vinura (1)
  • duarteharris (1)
  • zcsh (1)
  • lalitsomnathe (1)
  • ashujainml (1)
  • colinkyle (1)
Pull Request Authors
  • mdbartos (42)
  • yasirroni (3)
  • mwhitworth (3)
  • onixlas (2)
  • abhiramm7 (2)
  • vc1492a (1)
  • inregards2pluto (1)
  • whiletruelearn (1)
  • JavierVeraOlmos (1)
  • nikosgavalas (1)
  • arrrrrmin (1)
  • liuguiyangnwpu (1)
Top Labels
Issue Labels
bug (2)
Pull Request Labels

Dependencies

setup.py pypi
  • numpy *