PyNomaly

PyNomaly: Anomaly detection using Local Outlier Probabilities (LoOP). - Published in JOSS (2018)

Keywords

anomalies anomaly-detection machine-learning nearest-neighbors outlier-detection outlier-scores outliers probability

Last synced: 6 months ago · JSON representation

Repository

Anomaly detection using LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Basic Info

Host: GitHub
Owner: vc1492a
License: other
Language: Python
Default Branch: main
Homepage:
Size: 34.1 MB

Statistics

Stars: 327
Watchers: 24
Forks: 38
Open Issues: 10
Releases: 1

Topics

anomalies anomaly-detection machine-learning nearest-neighbors outlier-detection outlier-scores outliers probability

Created almost 9 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Funding License

PyNomaly

PyNomaly is a Python 3 implementation of LoOP (Local Outlier Probabilities). LoOP is a local density based outlier detection method by Kriegel, Kröger, Schubert, and Zimek which provides outlier scores in the range of [0,1] that are directly interpretable as the probability of a sample being an outlier.

PyNomaly is a core library of deepchecks, OmniDocBench and pysad.

The outlier score of each sample is called the Local Outlier Probability. It measures the local deviation of density of a given sample with respect to its neighbors as Local Outlier Factor (LOF), but provides normalized outlier scores in the range [0,1]. These outlier scores are directly interpretable as a probability of an object being an outlier. Since Local Outlier Probabilities provides scores in the range [0,1], practitioners are free to interpret the results according to the application.

Like LOF, it is local in that the anomaly score depends on how isolated the sample is with respect to the surrounding neighborhood. Locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that lie in regions of lower density compared to their neighbors and thus identify samples that may be outliers according to their Local Outlier Probability.

The authors' 2009 paper detailing LoOP's theory, formulation, and application is provided by Ludwig-Maximilians University Munich - Institute for Informatics; LoOP: Local Outlier Probabilities.

Implementation

This Python 3 implementation uses Numpy and the formulas outlined in LoOP: Local Outlier Probabilities to calculate the Local Outlier Probability of each sample.

Dependencies

Python 3.8 - 3.13
numpy >= 1.16.3
python-utils >= 2.3.0
(optional) numba >= 0.45.1

Numba just-in-time (JIT) compiles the function with calculates the Euclidean distance between observations, providing a reduction in computation time (significantly when a large number of observations are scored). Numba is not a requirement and PyNomaly may still be used solely with numpy if desired (details below).

Quick Start

First install the package from the Python Package Index:

shell pip install PyNomaly # or pip3 install ... if you're using both Python 3 and 2.

Alternatively, you can use conda to install the package from conda-forge:

shell conda install conda-forge::pynomaly Then you can do something like this:

python from PyNomaly import loop m = loop.LocalOutlierProbability(data).fit() scores = m.local_outlier_probabilities print(scores) where data is a NxM (N rows, M columns; 2-dimensional) set of data as either a Pandas DataFrame or Numpy array.

LocalOutlierProbability sets the extent (in integer in value of 1, 2, or 3) and n_neighbors (must be greater than 0) parameters with the default values of 3 and 10, respectively. You're free to set these parameters on your own as below:

python from PyNomaly import loop m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20).fit() scores = m.local_outlier_probabilities print(scores)

This implementation of LoOP also includes an optional cluster_labels parameter. This is useful in cases where regions of varying density occur within the same set of data. When using cluster_labels, the Local Outlier Probability of a sample is calculated with respect to its cluster assignment.

python from PyNomaly import loop from sklearn.cluster import DBSCAN db = DBSCAN(eps=0.6, min_samples=50).fit(data) m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, cluster_labels=list(db.labels_)).fit() scores = m.local_outlier_probabilities print(scores)

NOTE: Unless your data is all the same scale, it may be a good idea to normalize your data with z-scores or another normalization scheme prior to using LoOP, especially when working with multiple dimensions of varying scale. Users must also appropriately handle missing values prior to using LoOP, as LoOP does not support Pandas DataFrames or Numpy arrays with missing values.

Utilizing Numba and Progress Bars

It may be helpful to use just-in-time (JIT) compilation in the cases where a lot of observations are scored. Numba, a JIT compiler for Python, may be used with PyNomaly by setting use_numba=True:

python from PyNomaly import loop m = loop.LocalOutlierProbability(data, extent=2, n_neighbors=20, use_numba=True, progress_bar=True).fit() scores = m.local_outlier_probabilities print(scores)

Numba must be installed if the above to use JIT compilation and improve the speed of multiple calls to LocalOutlierProbability(), and PyNomaly has been tested with Numba version 0.45.1. An example of the speed difference that can be realized with using Numba is avaialble in examples/numba_speed_diff.py.

You may also choose to print progress bars with our without the use of numba by passing progress_bar=True to the LocalOutlierProbability() method as above.

Choosing Parameters

The extent parameter controls the sensitivity of the scoring in practice. The parameter corresponds to the statistical notion of an outlier defined as an object deviating more than a given lambda (extent) times the standard deviation from the mean. A value of 2 implies outliers deviating more than 2 standard deviations from the mean, and corresponds to 95.0% in the empirical "three-sigma" rule. The appropriate parameter should be selected according to the level of sensitivity needed for the input data and application. The question to ask is whether it is more reasonable to assume outliers in your data are 1, 2, or 3 standard deviations from the mean, and select the value likely most appropriate to your data and application.

The n_neighbors parameter defines the number of neighbors to consider about each sample (neighborhood size) when determining its Local Outlier Probability with respect to the density of the sample's defined neighborhood. The idea number of neighbors to consider is dependent on the input data. However, the notion of an outlier implies it would be considered as such regardless of the number of neighbors considered. One potential approach is to use a number of different neighborhood sizes and average the results for reach observation. Those observations which rank highly with varying neighborhood sizes are more than likely outliers. This is one potential approach of selecting the neighborhood size. Another is to select a value proportional to the number of observations, such an odd-valued integer close to the square root of the number of observations in your data (sqrt(n_observations).

Iris Data Example

We'll be using the well-known Iris dataset to show LoOP's capabilities. There's a few things you'll need for this example beyond the standard prerequisites listed above: - matplotlib 2.0.0 or greater - PyDataset 0.2.0 or greater - scikit-learn 0.18.1 or greater

First, let's import the packages and libraries we will need for this example.

python from PyNomaly import loop import pandas as pd from pydataset import data import numpy as np from sklearn.cluster import DBSCAN import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D

Now let's create two sets of Iris data for scoring; one with clustering and the other without.

```python

import the data and remove any non-numeric columns

iris = pd.DataFrame(data('iris').drop(columns=['Species'])) ```

Next, let's cluster the data using DBSCAN and generate two sets of scores. On both cases, we will use the default values for both extent (0.997) and n_neighbors (10).

python db = DBSCAN(eps=0.9, min_samples=10).fit(iris) m = loop.LocalOutlierProbability(iris).fit() scores_noclust = m.local_outlier_probabilities m_clust = loop.LocalOutlierProbability(iris, cluster_labels=list(db.labels_)).fit() scores_clust = m_clust.local_outlier_probabilities

Organize the data into two separate Pandas DataFrames.

python iris_clust = pd.DataFrame(iris.copy()) iris_clust['scores'] = scores_clust iris_clust['labels'] = db.labels_ iris['scores'] = scores_noclust

And finally, let's visualize the scores provided by LoOP in both cases (with and without clustering).

```python fig = plt.figure(figsize=(7, 7)) ax = fig.addsubplot(111, projection='3d') ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'], c=iris['scores'], cmap='seismic', s=50) ax.setxlabel('Sepal.Width') ax.setylabel('Petal.Width') ax.setzlabel('Sepal.Length') plt.show() plt.clf() plt.cla() plt.close()

fig = plt.figure(figsize=(7, 7)) ax = fig.addsubplot(111, projection='3d') ax.scatter(irisclust['Sepal.Width'], irisclust['Petal.Width'], irisclust['Sepal.Length'], c=irisclust['scores'], cmap='seismic', s=50) ax.setxlabel('Sepal.Width') ax.setylabel('Petal.Width') ax.setzlabel('Sepal.Length') plt.show() plt.clf() plt.cla() plt.close()

fig = plt.figure(figsize=(7, 7)) ax = fig.addsubplot(111, projection='3d') ax.scatter(irisclust['Sepal.Width'], irisclust['Petal.Width'], irisclust['Sepal.Length'], c=irisclust['labels'], cmap='Set1', s=50) ax.setxlabel('Sepal.Width') ax.setylabel('Petal.Width') ax.setzlabel('Sepal.Length') plt.show() plt.clf() plt.cla() plt.close() ```

Your results should look like the following:

LoOP Scores without Clustering

LoOP Scores with Clustering

DBSCAN Cluster Assignments

Note the differences between using LocalOutlierProbability with and without clustering. In the example without clustering, samples are scored according to the distribution of the entire data set. In the example with clustering, each sample is scored according to the distribution of each cluster. Which approach is suitable depends on the use case.

NOTE: Data was not normalized in this example, but it's probably a good idea to do so in practice.

Using Numpy

When using numpy, make sure to use 2-dimensional arrays in tabular format:

```python data = np.array([ [43.3, 30.2, 90.2], [62.9, 58.3, 49.3], [55.2, 56.2, 134.2], [48.6, 80.3, 50.3], [67.1, 60.0, 55.9], [421.5, 90.3, 50.0] ])

scores = loop.LocalOutlierProbability(data, nneighbors=3).fit().localoutlier_probabilities print(scores)

```

The shape of the input array shape corresponds to the rows (observations) and columns (features) in the data:

```python print(data.shape)

(6,3), which matches number of observations and features in the above example

```

Similar to the above:

python data = np.random.rand(100, 5) scores = loop.LocalOutlierProbability(data).fit().local_outlier_probabilities print(scores)

Specifying a Distance Matrix

PyNomaly provides the ability to specify a distance matrix so that any distance metric can be used (a neighbor index matrix must also be provided). This can be useful when wanting to use a distance other than the euclidean.

Note that in order to maintain alignment with the LoOP definition of closest neighbors, an additional neighbor is added when using scikit-learn's NearestNeighbors since NearestNeighbors includes the point itself when calculating the cloest neighbors (whereas the LoOP method does not include distances to point itself).

```python import numpy as np from sklearn.neighbors import NearestNeighbors

data = np.array([ [43.3, 30.2, 90.2], [62.9, 58.3, 49.3], [55.2, 56.2, 134.2], [48.6, 80.3, 50.3], [67.1, 60.0, 55.9], [421.5, 90.3, 50.0] ])

Generate distance and neighbor matrices

nneighbors = 3 # the number of neighbors according to the LoOP definition neigh = NearestNeighbors(nneighbors=nneighbors+1, metric='hamming') neigh.fit(data) d, idx = neigh.kneighbors(data, returndistance=True)

Remove self-distances - you MUST do this to preserve the same results as intended by the definition of LoOP

indices = np.delete(indices, 0, 1) distances = np.delete(distances, 0, 1)

Fit and return scores

m = loop.LocalOutlierProbability(distancematrix=d, neighbormatrix=idx, nneighbors=nneighbors+1).fit() scores = m.localoutlierprobabilities ```

The below visualization shows the results by a few known distance metrics:

LoOP Scores by Distance Metric DBSCAN Cluster Assignments

Streaming Data

PyNomaly also contains an implementation of Hamlet et. al.'s modifications to the original LoOP approach [4], which may be used for applications involving streaming data or where rapid calculations may be necessary. First, the standard LoOP algorithm is used on "training" data, with certain attributes of the fitted data stored from the original LoOP approach. Then, as new points are considered, these fitted attributes are called when calculating the score of the incoming streaming data due to the use of averages from the initial fit, such as the use of a global value for the expected value of the probabilistic distance. Despite the potential for increased error when compared to the standard approach, it may be effective in streaming applications where refitting the standard approach over all points could be computationally expensive.

While the iris dataset is not streaming data, we'll use it in this example by taking the first 120 observations as training data and take the remaining 30 observations as a stream, scoring each observation individually.

Split the data. python iris = iris.sample(frac=1) # shuffle data iris_train = iris.iloc[:, 0:4].head(120) iris_test = iris.iloc[:, 0:4].tail(30)

Fit to each set. ```python m = loop.LocalOutlierProbability(iris).fit() scoresnoclust = m.localoutlierprobabilities iris['scores'] = scoresnoclust

mtrain = loop.LocalOutlierProbability(iristrain, nneighbors=10) mtrain.fit() iristrainscores = mtrain.localoutlier_probabilities ```

python iris_test_scores = [] for index, row in iris_test.iterrows(): array = np.array([row['Sepal.Length'], row['Sepal.Width'], row['Petal.Length'], row['Petal.Width']]) iris_test_scores.append(m_train.stream(array)) iris_test_scores = np.array(iris_test_scores)

Concatenate the scores and assess.

```python iris['streamscores'] = np.hstack((iristrainscores, iristest_scores))

iris['scores'] from earlier example

rmse = np.sqrt(((iris['scores'] - iris['stream_scores']) ** 2).mean(axis=None)) print(rmse) ```

The root mean squared error (RMSE) between the two approaches is approximately 0.199 (your scores will vary depending on the data and specification). The plot below shows the scores from the stream approach.

python fig = plt.figure(figsize=(7, 7)) ax = fig.add_subplot(111, projection='3d') ax.scatter(iris['Sepal.Width'], iris['Petal.Width'], iris['Sepal.Length'], c=iris['stream_scores'], cmap='seismic', s=50) ax.set_xlabel('Sepal.Width') ax.set_ylabel('Petal.Width') ax.set_zlabel('Sepal.Length') plt.show() plt.clf() plt.cla() plt.close()

LoOP Scores using Stream Approach with n=10

Notes

When calculating the LoOP score of incoming data, the original fitted scores are not updated. In some applications, it may be beneficial to refit the data periodically. The stream functionality also assumes that either data or a distance matrix (or value) will be used across in both fitting and streaming, with no changes in specification between steps.

Contributing

Please use the issue tracker to report any erroneous behavior or desired feature requests.

If you would like to contribute to development, please fork the repository and make any changes to a branch which corresponds to an open issue. Hot fixes and bug fixes can be represented by branches with the prefix fix/ versus feature/ for new capabilities or code improvements. Pull requests will then be made from these branches into the repository's dev branch prior to being pulled into main.

Commit Messages and Releases

Your commit messages are important - here's why.

PyNomaly leverages release-please to help automate the release process using the Conventional Commits specification. When pull requests are opened to the main branch, release-please will collate the git commit messages and prepare an organized changelog and release notes. This process can be completed because of the Conventional Commits specification.

Conventional Commits provides an easy set of rules for creating an explicit commit history; which makes it easier to write automated tools on top of. This convention dovetails with SemVer, by describing the features, fixes, and breaking changes made in commit messages. You can check out examples here. Make a best effort to use the specification when contributing to Infactory code as it dramatically eases the documentation around releases and their features, breaking changes, bug fixes and documentation updates.

Tests

When contributing, please ensure to run unit tests and add additional tests as necessary if adding new functionality. To run the unit tests, use pytest:

python3 -m pytest --cov=PyNomaly -s -v

To run the tests with Numba enabled, simply set the flag NUMBA in test_loop.py to True. Note that a drop in coverage is expected due to portions of the code being compiled upon code execution.

Versioning

Semantic versioning is used for this project. If contributing, please conform to semantic versioning guidelines when submitting a pull request.

License

This project is licensed under the Apache 2.0 license.

Research

If citing PyNomaly, use the following:

@article{Constantinou2018, doi = {10.21105/joss.00845}, url = {https://doi.org/10.21105/joss.00845}, year = {2018}, month = {oct}, publisher = {The Open Journal}, volume = {3}, number = {30}, pages = {845}, author = {Valentino Constantinou}, title = {{PyNomaly}: Anomaly detection using Local Outlier Probabilities ({LoOP}).}, journal = {Journal of Open Source Software} }

References

Breunig M., Kriegel H.-P., Ng R., Sander, J. LOF: Identifying Density-based Local Outliers. ACM SIGMOD International Conference on Management of Data (2000). PDF.
Kriegel H., Kröger P., Schubert E., Zimek A. LoOP: Local Outlier Probabilities. 18th ACM conference on Information and knowledge management, CIKM (2009). PDF.
Goldstein M., Uchida S. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. PLoS ONE 11(4): e0152173 (2016).
Hamlet C., Straub J., Russell M., Kerlin S. An incremental and approximate local outlier probability algorithm for intrusion detection and its evaluation. Journal of Cyber Security Technology (2016). DOI.

Acknowledgements

The authors of LoOP (Local Outlier Probabilities)
- Hans-Peter Kriegel
- Peer Kröger
- Erich Schubert
- Arthur Zimek
NASA Jet Propulsion Laboratory
- Kyle Hundman
- Ian Colwell

Owner

Name: Valentino Constantinou
Login: vc1492a
Kind: user
Location: Pasadena, California
Company: Terran Orbital

Website: https://www.valentino.io/
Repositories: 40
Profile: https://github.com/vc1492a

Senior Data Scientist @ Terran Orbital Open Source Software Enthusiast

GitHub Events

Total

Create event: 7
Release event: 1
Issues event: 9
Watch event: 15
Delete event: 6
Issue comment event: 9
Push event: 21
Pull request review comment event: 1
Pull request event: 29
Pull request review event: 4
Fork event: 1

Last Year

Create event: 7
Release event: 1
Issues event: 9
Watch event: 15
Delete event: 6
Issue comment event: 9
Push event: 21
Pull request review comment event: 1
Pull request event: 29
Pull request review event: 4
Fork event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 167
Total Committers: 8
Avg Commits per committer: 20.875
Development Distribution Score (DDS): 0.168

Past Year

Commits: 21
Committers: 2
Avg Commits per committer: 10.5
Development Distribution Score (DDS): 0.476

Top Committers

Name	Email	Commits
vc1492a	v**a@g**m	139
IroNEDR	t**n@g**m	18
Michael Schreier	m**r@g**e	5
Robin	r**e@g**m	1
Lini Mestar	l****o	1
Blake Bambico	b**o@c**m	1
vc1492a	v**u@j**v	1
Joe Jevnik	j**e@q**m	1

Committer Domains (Top 20 + Academic)

quantopian.com: 1 jpl.nasa.gov: 1 codeauthority.com: 1 gmx.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 48
Total pull requests: 53
Average time to close issues: 8 months
Average time to close pull requests: 7 days
Total issue authors: 20
Total pull request authors: 10
Average comments per issue: 1.98
Average comments per pull request: 0.87
Merged pull requests: 37
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 31
Average time to close issues: 12 days
Average time to close pull requests: 2 days
Issue authors: 3
Pull request authors: 3
Average comments per issue: 0.25
Average comments per pull request: 0.1
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

vc1492a (24)
IroNEDR (3)
ali-Eskandarian (2)
lisongs1995 (2)
maxcw (2)
NicoleRR (1)
TSFelg (1)
RockLeeStudio (1)
zhaohan-xi (1)
MichaelSchreier (1)
nghiadanh26 (1)
atthom (1)
jnpsk (1)
ghost (1)
mdruiter (1)

Pull Request Authors

vc1492a (32)
IroNEDR (12)
MichaelSchreier (3)
romajain1 (2)
lmEshoo (1)
paddymul (1)
nghiadanh26 (1)
llllllllll (1)
robmarkcole (1)
ghost (1)

Top Labels

Issue Labels

enhancement (23) bug (10) packaging (6) documentation (5) in progress (5) low priority (3) on hold (3) high priority (2) review (2) good first issue (1) test (1)

Pull Request Labels

documentation (15) autorelease: pending (10) bug (8) review (6) enhancement (3) low priority (3) test (2) in progress (1) help wanted (1) on hold (1)

Packages

Total packages: 2
Total downloads:
- pypi 62,453 last-month

Total dependent packages: 2
(may contain duplicates)
Total dependent repositories: 26
(may contain duplicates)
Total versions: 22
Total maintainers: 2

pypi.org: pynomaly

A Python 3 implementation of LoOP: Local Outlier Probabilities, a local density based outlier detection method providing an outlier score in the range of [0,1].

Homepage: https://github.com/vc1492a/PyNomaly
Documentation: https://pynomaly.readthedocs.io/
License: Apache License, Version 2.0
Latest release: 0.3.4
published over 1 year ago

Versions: 21
Dependent Packages: 1
Dependent Repositories: 26
Downloads: 62,453 Last month

Rankings

Downloads: 1.6%

Dependent repos count: 2.8%

Stargazers count: 3.7%

Average: 3.9%

Dependent packages count: 4.8%

Forks count: 6.7%

Maintainers (2)

vc1492a IroNEDR

Last synced: 6 months ago

conda-forge.org: pynomaly

Homepage: https://github.com/vc1492a/PyNomaly
License: Apache-2.0
Latest release: 0.3.3
published almost 4 years ago

Versions: 1
Dependent Packages: 1
Dependent Repositories: 0

Rankings

Stargazers count: 20.5%

Forks count: 27.1%

Average: 27.6%

Dependent packages count: 28.8%

Dependent repos count: 34.0%

Last synced: 6 months ago

PyNomaly

Science Score: 59.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

readme.md

PyNomaly

Implementation

Dependencies

Quick Start

Utilizing Numba and Progress Bars

Choosing Parameters

Iris Data Example

import the data and remove any non-numeric columns

Using Numpy

(6,3), which matches number of observations and features in the above example

Specifying a Distance Matrix

Generate distance and neighbor matrices

Remove self-distances - you MUST do this to preserve the same results as intended by the definition of LoOP

Fit and return scores

Streaming Data

iris['scores'] from earlier example

Notes

Contributing

Commit Messages and Releases

Tests

Versioning

License

Research

References

Acknowledgements

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: pynomaly

Rankings

Maintainers (2)

conda-forge.org: pynomaly

Rankings

Dependencies