rrcf
rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams - Published in JOSS (2019)
Science Score: 95.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README and JOSS metadata -
✓Academic publication links
Links to: joss.theoj.org -
✓Committers with academic emails
3 of 7 committers (42.9%) from academic institutions -
○Institutional organization owner
-
✓JOSS paper metadata
Published in Journal of Open Source Software
Keywords
Scientific Fields
Repository
🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
Basic Info
- Host: GitHub
- Owner: kLabUM
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://klabum.github.io/rrcf/
- Size: 4.45 MB
Statistics
- Stars: 515
- Watchers: 18
- Forks: 113
- Open Issues: 31
- Releases: 7
Topics
Metadata Files
README.md
rrcf 🌲🌲🌲
Implementation of the Robust Random Cut Forest Algorithm for anomaly detection by Guha et al. (2016).
S. Guha, N. Mishra, G. Roy, & O. Schrijvers, Robust random cut forest based anomaly detection on streams, in Proceedings of the 33rd International conference on machine learning, New York, NY, 2016 (pp. 2712-2721).
About
The Robust Random Cut Forest (RRCF) algorithm is an ensemble method for detecting outliers in streaming data. RRCF offers a number of features that many competing anomaly detection algorithms lack. Specifically, RRCF:
- Is designed to handle streaming data.
- Performs well on high-dimensional data.
- Reduces the influence of irrelevant dimensions.
- Gracefully handles duplicates and near-duplicates that could otherwise mask the presence of outliers.
- Features an anomaly-scoring algorithm with a clear underlying statistical meaning.
This repository provides an open-source implementation of the RRCF algorithm and its core data structures for the purposes of facilitating experimentation and enabling future extensions of the RRCF algorithm.
Documentation
Read the docs here 📖.
Installation
Use pip to install rrcf via pypi:
shell
$ pip install rrcf
Currently, only Python 3 is supported.
Dependencies
The following dependencies are required to install and use rrcf:
- numpy (>= 1.15)
The following optional dependencies are required to run the examples shown in the documentation:
- pandas (>= 0.23)
- scipy (>= 1.2)
- scikit-learn (>= 0.20)
- matplotlib (>= 3.0)
Listed version numbers have been tested and are known to work (this does not necessarily preclude older versions).
Robust random cut trees
A robust random cut tree (RRCT) is a binary search tree that can be used to detect outliers in a point set. A RRCT can be instantiated from a point set. Points can also be added and removed from an RRCT.
Creating the tree
```python import numpy as np import rrcf
A (robust) random cut tree can be instantiated from a point set (n x d)
X = np.random.randn(100, 2) tree = rrcf.RCTree(X)
A random cut tree can also be instantiated with no points
tree = rrcf.RCTree() ```
Inserting points
```python tree = rrcf.RCTree()
for i in range(6): x = np.random.randn(2) tree.insert_point(x, index=i) ```
─+
├───+
│ ├───+
│ │ ├──(0)
│ │ └───+
│ │ ├──(5)
│ │ └──(4)
│ └───+
│ ├──(2)
│ └──(3)
└──(1)
Deleting points
tree.forget_point(2)
─+
├───+
│ ├───+
│ │ ├──(0)
│ │ └───+
│ │ ├──(5)
│ │ └──(4)
│ └──(3)
└──(1)
Anomaly score
The likelihood that a point is an outlier is measured by its collusive displacement (CoDisp): if including a new point significantly changes the model complexity (i.e. bit depth), then that point is more likely to be an outlier.
```python
Seed tree with zero-mean, normally distributed data
X = np.random.randn(100,2) tree = rrcf.RCTree(X)
Generate an inlier and outlier point
inlier = np.array([0, 0]) outlier = np.array([4, 4])
Insert into tree
tree.insertpoint(inlier, index='inlier') tree.insertpoint(outlier, index='outlier') ```
```python tree.codisp('inlier')
1.75 ```
```python tree.codisp('outlier')
39.0 ```
Batch anomaly detection
This example shows how a robust random cut forest can be used to detect outliers in a batch setting. Outliers correspond to large CoDisp.
```python import numpy as np import pandas as pd import rrcf
Set parameters
np.random.seed(0) n = 2010 d = 3 numtrees = 100 treesize = 256
Generate data
X = np.zeros((n, d)) X[:1000,0] = 5 X[1000:2000,0] = -5 X += 0.01np.random.randn(X.shape)
Construct forest
forest = [] while len(forest) < numtrees: # Select random subsets of points uniformly from point set ixs = np.random.choice(n, size=(n // treesize, treesize), replace=False) # Add sampled trees to forest trees = [rrcf.RCTree(X[ix], indexlabels=ix) for ix in ixs] forest.extend(trees)
Compute average CoDisp
avgcodisp = pd.Series(0.0, index=np.arange(n)) index = np.zeros(n) for tree in forest: codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves}) avgcodisp[codisp.index] += codisp np.add.at(index, codisp.index.values, 1) avg_codisp /= index ```

Streaming anomaly detection
This example shows how the algorithm can be used to detect anomalies in streaming time series data.
```python import numpy as np import rrcf
Generate data
n = 730 A = 50 center = 100 phi = 30 T = 2np.pi/100 t = np.arange(n) sin = Anp.sin(Tt-phiT) + center sin[235:255] = 80
Set tree parameters
numtrees = 40 shinglesize = 4 tree_size = 256
Create a forest of empty trees
forest = [] for _ in range(num_trees): tree = rrcf.RCTree() forest.append(tree)
Use the "shingle" generator to create rolling window
points = rrcf.shingle(sin, size=shingle_size)
Create a dict to store anomaly score of each point
avg_codisp = {}
For each shingle...
for index, point in enumerate(points): # For each tree in the forest... for tree in forest: # If tree is above permitted size, drop the oldest point (FIFO) if len(tree.leaves) > treesize: tree.forgetpoint(index - treesize) # Insert the new point into the tree tree.insertpoint(point, index=index) # Compute codisp on the new point and take the average among all trees if not index in avgcodisp: avgcodisp[index] = 0 avgcodisp[index] += tree.codisp(index) / numtrees ```

Obtain feature importance
This example shows how to estimate the feature importance using the dimension of cut obtained during the calculation of the CoDisp.
```python import numpy as np import pandas as pd import rrcf
Set parameters
np.random.seed(0) n = 2010 d = 3 numtrees = 100 treesize = 256
Generate data
X = np.zeros((n, d)) X[:1000,0] = 5 X[1000:2000,0] = -5 X += 0.01np.random.randn(X.shape)
Construct forest
forest = [] while len(forest) < numtrees: # Select random subsets of points uniformly from point set ixs = np.random.choice(n, size=(n // treesize, treesize), replace=False) # Add sampled trees to forest trees = [rrcf.RCTree(X[ix], indexlabels=ix) for ix in ixs] forest.extend(trees)
Compute average CoDisp with the cut dimension for each point
dimcodisp = np.zeros([n,d],dtype=float) index = np.zeros(n) for tree in forest: for leaf in tree.leaves: codisp,cutdim = tree.codispwithcutdimension(leaf)
dim_codisp[leaf,cutdim] += codisp
index[leaf] += 1
avgcodisp = dimcodisp.sum(axis=1)/index
codisp anomaly threshold and calculate the mean over each feature
featureimportanceanomaly = np.mean(dimcodisp[avgcodisp>50,:],axis=0)
create a dataframe with the feature importance
dffeatureimportance = pd.DataFrame(featureimportanceanomaly,columns=['featureimportance'])
dffeature_importance
```

Contributing
We welcome contributions to the rrcf repo. To contribute, submit a pull request to the dev branch.
Types of contributions
Some suggested types of contributions include:
- Bug fixes
- Documentation improvements
- Performance enhancements
- Extensions to the algorithm
Check the issue tracker for any specific issues that need help. If you encounter a problem using rrcf, or have an idea for an extension, feel free to raise an issue.
Guidelines for contributors
Please consider the following guidelines when contributing to the codebase:
- Ensure that any new methods, functions or classes include docstrings. Docstrings should include a description of the code, as well as descriptions of the inputs (arguments) and outputs (returns). Providing an example use case is recommended (see existing methods for examples).
- Write unit tests for any new code and ensure that all tests are passing with no warnings. Please ensure that overall code coverage does not drop below 80%.
Running unit tests
To run unit tests, first ensure that pytest and pytest-cov are installed:
$ pip install pytest pytest-cov
To run the tests, navigate to the root directory of the repo and run:
$ pytest --cov=rrcf/
Citing
If you have used this codebase in a publication and wish to cite it, please use the Journal of Open Source Software article.
M. Bartos, A. Mullapudi, & S. Troutman, rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams, in: Journal of Open Source Software, The Open Journal, Volume 4, Number 35. 2019
bibtex
@article{bartos_2019_rrcf,
title={{rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams}},
authors={Matthew Bartos and Abhiram Mullapudi and Sara Troutman},
journal={{The Journal of Open Source Software}},
volume={4},
number={35},
pages={1336},
year={2019}
}
Owner
- Name: Real-time water systems lab
- Login: kLabUM
- Kind: organization
- Repositories: 28
- Profile: https://github.com/kLabUM
JOSS Publication
rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams
Authors
Tags
outlier detection machine learning ensemble methods random forestsGitHub Events
Total
- Issues event: 1
- Watch event: 18
- Issue comment event: 3
- Pull request event: 1
- Fork event: 1
Last Year
- Issues event: 1
- Watch event: 18
- Issue comment event: 3
- Pull request event: 1
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Commits | |
|---|---|---|
| Matt Bartos | m****s@u****u | 192 |
| Sara Troutman | s****m@u****u | 15 |
| Abhiram | a****m@u****u | 12 |
| javier.vera | j****a@p****m | 5 |
| arrrrrmin | i****o@d****o | 1 |
| Matthew Whitworth | m****t@t****m | 1 |
| Krishna Sangeeth | k****h@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 47
- Total pull requests: 57
- Average time to close issues: about 1 month
- Average time to close pull requests: 3 days
- Total issue authors: 34
- Total pull request authors: 12
- Average comments per issue: 2.13
- Average comments per pull request: 0.32
- Merged pull requests: 49
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mdbartos (10)
- TrigonaMinima (3)
- stianvale (2)
- shaye059 (2)
- qiezikuangge (1)
- shihyulee (1)
- futtery (1)
- liuguiyangnwpu (1)
- adavoli91 (1)
- vinura (1)
- duarteharris (1)
- zcsh (1)
- lalitsomnathe (1)
- ashujainml (1)
- colinkyle (1)
Pull Request Authors
- mdbartos (42)
- yasirroni (3)
- mwhitworth (3)
- onixlas (2)
- abhiramm7 (2)
- vc1492a (1)
- inregards2pluto (1)
- whiletruelearn (1)
- JavierVeraOlmos (1)
- nikosgavalas (1)
- arrrrrmin (1)
- liuguiyangnwpu (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- numpy *
