Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: fgrivet
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 26.5 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created 7 months ago · Last pushed 7 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Outlier Detection for Data Streams (ODDS)

ODDS Optimized

This repository is a fork of the original ODDS. Several functions have been optimized for better performance.

Polynomial Basis

The optimized polynomial_basis functions are:

- monomials: optimized_monomials
- Chebyshev T1: chebyshev_t_1_matrix

Incremental Updates

To update the matrix (incr_opt), sherman has been optimized for rank-1 updates and woodbury has been added for efficient rank-$K$ updates when $K < \frac{\text{matrix\_size}}{3}$. If $K$ is larger, use incr_opt = "inverse".
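As an illustration, the two update formulas can be sketched with NumPy. This is a minimal sketch assuming a symmetric moments matrix, not the repository's actual implementation:

```python
import numpy as np

def sherman_morrison_update(M_inv, v):
    """Rank-1 update: returns (M + v v^T)^{-1} given M^{-1} (M symmetric)."""
    Mv = M_inv @ v
    return M_inv - np.outer(Mv, Mv) / (1.0 + v @ Mv)

def woodbury_update(M_inv, U):
    """Rank-K update: returns (M + U U^T)^{-1} given M^{-1} (M symmetric).

    U has shape (n, K); only a K x K matrix is inverted.
    """
    K = U.shape[1]
    MU = M_inv @ U
    return M_inv - MU @ np.linalg.inv(np.eye(K) + U.T @ MU) @ MU.T
```

The Woodbury identity only pays off while $K$ stays small relative to the matrix size, which matches the rule above for switching back to a full inversion.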

Inversion Method

The fpd_inv method has been modified so that it works even on ill-conditioned matrices.

Other Optimizations

The DyCF.fit and DyCF.score_samples methods have been optimized for better performance and now use $M = \frac{1}{N} X_n^T X_n$ instead of $M = \frac{1}{N} \sum_{i=0}^{N} \mathbf{v}(\mathbf{x}_i) \mathbf{v}(\mathbf{x}_i)^T$.
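The two formulations are mathematically equivalent: the Gram-matrix form batches the sum of outer products into one matrix product, which vectorizes much better. A quick NumPy check, with V standing in for the stacked basis vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(100, 6))  # row i stands in for the basis vector v(x_i)
N = V.shape[0]

M_outer = sum(np.outer(v, v) for v in V) / N  # sum of outer products
M_gram = V.T @ V / N                          # single matrix product

assert np.allclose(M_outer, M_gram)
```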

The PolynomialsBasis.generate_combinations method has been optimized using a recursive approach to generate combinations of polynomial degrees, which is more efficient than the previous iterative method.
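A minimal sketch of such a recursive generator (an illustration, not the library's actual PolynomialsBasis.generate_combinations): for p variables and maximum total degree d, it enumerates every exponent tuple whose degrees sum to at most d, giving comb(p + d, d) combinations, the same count used by the "comb" regularization below.

```python
from math import comb

def generate_combinations(d, p):
    """All exponent tuples of length p with total degree <= d."""
    if p == 0:
        return [()]
    return [(k,) + rest
            for k in range(d + 1)
            for rest in generate_combinations(d - k, p - 1)]

combos = generate_combinations(2, 2)
assert len(combos) == comb(2 + 2, 2)  # 6 combinations for d=2, p=2
```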

Description

This code is associated with a scientific article and allows retrieving its results. It can also be used as a framework for detecting outliers in data streams.

How to install

With Python 3.11 installed, you can either run the following command from within the repository folder (useful if you want to modify parts of the code):

```shell
pip install -e .
```

or install from git:

```shell
pip install git+https://github.com/fgrivet/odds_optimized.git
```

Possibly outdated instructions

The following instructions may be outdated; please refer to the examples folder for up-to-date information.

How to retrieve the experimental results presented in our article

Three scripts were used to produce the presented results:

- scripts/comparing_cd_kde.py: plots the graph comparing the Christoffel Function and KDE as scoring functions,
- scripts/christoffel_article_expe.py: computes the outlier scores and performance metrics on the datasets from the data directory,
- scripts/christoffel_article_plots.py: plots the results based on a csv file generated by christoffel_article_expe.py.

The required datasets are stored in the data folder.

Comparing Christoffel Function and KDE

Executing this script generates two files: cf_vs_kde.csv, containing the AUROC and AP results, and cf_vs_kde.png, containing the graph from the article.

Computing outlier scores and performance metrics

This script executes all the tests required for the experiment section of the article. During processing, a temp folder is created to hold partial results; this is useful because the full process takes several days.

The final results are saved in two csv files:

- results_conveyors_experiment.csv for conveyor data streams,
- results_synthetics_experiment.csv for synthetic data streams.

Plotting results

The script to plot the results needs to be executed after scripts/christoffel_article_expe.py, since it requires the resulting csv files.

This script generates one png file per "equipment" (barplot) and another png file per experiment (table with mean and standard deviation results).

How to use as a framework

If you want to use this repository as a framework, note that you can use newer versions of Python or of the other libraries by changing the requirements in setup.py.

Implemented methods

Here we describe the different methods already implemented in this framework and their parameters.

Statistical methods

KDE

Kernel Density Estimation with sliding windows

Attributes
----------
threshold: float
    the threshold on the pdf, if the pdf computed for a point is greater than the threshold then the point is considered normal
win_size: int
    size of the window of kernel centers to keep in memory
kernel: str, optional
    the type of kernel to use, can be either "gaussian" or "epanechnikov" (default is "gaussian")
bandwidth: str, optional
    rule of thumb to compute the bandwidth, "scott" is the only one implemented for now (default is "scott")
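For reference, Scott's rule of thumb is commonly written as $h = n^{-1/(d+4)} \, \sigma$ per dimension. A sketch of how such a bandwidth could be computed (an assumption about what the "scott" option corresponds to, not the library's exact code):

```python
import numpy as np

def scott_bandwidth(x):
    """Scott's rule of thumb: h_j = n**(-1/(d+4)) * std_j for each dimension j."""
    n, d = x.shape
    return n ** (-1.0 / (d + 4)) * x.std(axis=0, ddof=1)
```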
SmartSifter

Smart Sifter reduced to continuous domain only with its Sequentially Discounting Expectation and Maximizing (SDEM) algorithm (see https://github.com/sk1010k/SmartSifter and https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)

Attributes
----------
threshold: float
    the threshold on the pdf, if the pdf computed for a point is greater than the threshold then the point is considered normal
k: int
    number of gaussian mixture components ("n_components" from sklearn.mixture.GaussianMixture)
r: float
    discounting parameter for the SDEM algorithm ("r" from smartsifter.SDEM)
alpha: float
    stability parameter for the weights of gaussian mixture components ("alpha" from smartsifter.SDEM)
scoring_function: str
    scoring function used, either "logloss" for logarithmic loss or "hellinger" for hellinger score, both proposed by the original article,
    or "likelihood" for the likelihood that a point is issued from the learned mixture (default is "likelihood")
DyCF

Dynamical Christoffel Function

Attributes
----------
d: int
    the degree of polynomials, usually set between 2 and 8
incr_opt: str, optional
    can be "inverse" to invert the moments matrix at each iteration, "sherman" to use the Sherman-Morrison formula, or "woodbury" for rank-K updates (default is "inverse")
polynomial_basis: str, optional
    polynomial basis used to compute moment matrix, either "monomials", "chebyshev_t_1", "chebyshev_t_2", "chebyshev_u" or "legendre", 
    varying this parameter can bring stability to the score in some cases (default is "monomials")
regularization: str, optional
    one of "vu" (score divided by d^{3p/2}), "vu_C" (score divided by d^{3p/2}/C), "comb" (score divided by comb(p+d, d)) or "none" (no regularization), "none" is used for cf vs mkde comparison (default is "vu_C")
C: float, optional
    defines a threshold on the score when used with regularization="vu_C", usually C<=1 (default is 1)
inv: str, optional
    inversion method, one of "inv" for classical matrix inversion or "pinv" for Moore-Penrose pseudo-inversion (default is "inv")
DyCG

Dynamical Christoffel Growth

Attributes
----------
degrees: ndarray, optional
    the degrees of at least two DyCF models inside (default is np.array([2, 8]))
dycf_kwargs:
    see DyCF arguments other than d

Distance-based methods

DBOKDE

Distance-Based Outliers by Kernel Density Estimation

Attributes
----------
k: int
    a threshold on the number of neighbours needed to consider the point as normal
R: float or str
    the distance defining the neighborhood around a point, can be computed dynamically, in this case set R="dynamic"
win_size: int
    the number of points in the sliding window used in neighbours count
sample_size: int
    the number of points used as kernel centers for the KDE, if sample_size=-1 then sample_size is set to win_size (default is -1)

Density-based methods

ILOF

Incremental Local Outlier Factor

Attributes
----------
k: int
    the number of neighbors to compute the LOF score on
threshold: float
    a threshold on the LOF score to separate normal points and outliers
win_size: int
    number of points in the sliding window used in kNN search
min_size: int (optional)
    minimal number of points in a node, it is mandatory that 2 <= min_size <= max_size / 2 (default is 3)
max_size: int (optional)
    maximal number of points in a node, it is mandatory that 2 <= min_size <= max_size / 2 (default is 12)
p_reinsert_tol: int (optional)
    tolerance on reinsertion, used to deal with overflow in a node (default is 4)
reinsert_strategy: str (optional)
    either "close" or "far", tells if we try to reinsert in the closest rectangles first or in the farthest (default is "close")

How to use outlier detection methods

The outlier detection methods implement the BaseDetector abstract class with the following methods:

Methods
-------
fit(x)
    Generates a model that fits dataset x.
update(x)
    Updates the current model with instances in x.
score_samples(x)
    Makes the model compute the outlier score of samples in x (the higher the value, the more outlying the sample).
decision_function(x)
    This is similar to score_samples(x) except outliers have negative score and inliers positive ones.
predict(x)
    Returns the sign (-1 or 1) of decision_function(x).
eval_update(x)
    Computes decision_function of each sample and then updates the model with this sample.
predict_update(x)
    Same as eval_update(x) but returns the prediction instead of the decision function.
method_name()
    Returns the name of the method.
save_model()
    Returns a dict of all the model attributes, allowing the model to be saved in databases such as MongoDB.
load_model(model_dict)
    Reloads a previously saved model based on the output of the save_model() method.
copy()
    Returns a copy of the model.
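To make the contract concrete, here is a toy detector following the same interface (a hypothetical stand-in, not one of the library's methods): it scores points by their distance to a running mean.

```python
import numpy as np

class MeanDistanceDetector:
    """Toy BaseDetector-style detector: outlier score = distance to the mean."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.center = None

    def fit(self, x):
        self.center = x.mean(axis=0)
        return self

    def update(self, x):
        # naive running update: average the old center with the new batch mean
        self.center = (self.center + x.mean(axis=0)) / 2
        return self

    def score_samples(self, x):
        return np.linalg.norm(x - self.center, axis=1)  # higher = more outlying

    def decision_function(self, x):
        return self.threshold - self.score_samples(x)   # negative = outlier

    def predict(self, x):
        return np.sign(self.decision_function(x))       # -1 outlier, 1 inlier
```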

How to compare methods on labelled datasets

Datasets need to be csv files with the first line as a header, the first column as indexes, the following columns as variables, and the last column as labels (1 for inliers and -1 for outliers).

For instance, a dataset containing two instances:

- the vector [0, 0] being an inlier,
- the vector [1, 1] being an outlier,

could be written as:

```csv
0,x1,x2,y
0,0,0,1
1,1,1,-1
```
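The example above can also be generated programmatically; a sketch with the standard csv module (the file name is arbitrary):

```python
import csv

# Two-instance example: header row, then index, variables, and label per row.
rows = [
    ["0", "x1", "x2", "y"],
    [0, 0, 0, 1],    # inlier [0, 0]
    [1, 1, 1, -1],   # outlier [1, 1]
]
with open("toy_dataset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```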

Define methods and parameterizations

Methods used for evaluation should be defined, following the example in scripts/evaluating_methods_example.py, in a METHODS variable as a list of dictionaries.

Each dictionary contains the following fields:

- name: the complete name of the method,
- method: the class name,
- params: another dictionary where the fields are the parameter names and the values are their values,
- short_name: a shorter name for the method, used in the plot and as file names for saved results.
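Following those fields, a METHODS definition might look like this (the parameter values, and whether method holds the class object or its name, are illustrative assumptions; scripts/evaluating_methods_example.py is the authoritative reference):

```python
# Hypothetical METHODS list; values are for illustration only.
METHODS = [
    {
        "name": "Dynamical Christoffel Function (d=6)",
        "method": "DyCF",                                    # class name
        "params": {"d": 6, "polynomial_basis": "monomials"},
        "short_name": "dycf_6",
    },
    {
        "name": "Kernel Density Estimation",
        "method": "KDE",                                     # class name
        "params": {"threshold": 0.1, "win_size": 200},
        "short_name": "kde",
    },
]
```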

Set the datasets

Datasets should be set in a data_dict dictionary.

In order to do this, it is recommended to first define the split position between the training and testing parts of the dataset as an int in a split_pos variable.

Then, load the dataset into a data variable by calling the utils.load_dataset method with the path to the dataset (theoretically, it should be ../res/dataset_name.csv).

Finally, you just need to set the dataset into data_dict as in the following example (note that each dataset in data_dict needs a different name):

```python
from odds_optimized.utils import load_dataset, split_data

data = load_dataset("path/to/file.csv")
split_pos = len(data) // 2  # for instance, if you want to split the dataset in two almost equal parts
x_train, y_train, x_test, y_test = split_data(data, split_pos)

data_dict = dict()
data_dict["dataset_name"] = {
    "x_test": x_test,
    "x_train": x_train,
    "y_test": y_test,
}
```

Run evaluation

You just have to call utils.compute_and_save and utils.plot_and_save_results with the variables defined earlier:

```python
from odds_optimized.utils import compute_and_save, plot_and_save_results

METHODS = [
    # list of method dicts as explained in the "Define methods and parameterizations" section
]
data_dict = {
    # datasets as explained in the "Set the datasets" section
}
# "/path/to/res" is used as a header for the files containing pickle models
compute_and_save(METHODS, data_dict, "/path/to/res")
# "/path/to/res" is used as a header for the files containing graphs and metrics
plot_and_save_results(METHODS, data_dict, "/path/to/res")
```

This will generate the resulting files in the /path/to folder. The pickle files save the results in order to avoid computing the model again each time. If you want to recompute a model, the associated pickle file has to be deleted manually. One csv file is generated for each dataset, containing the results for each method. One png file is generated, showing the results in a plot.

Owner

  • Login: fgrivet
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
title: ODDS
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Kevin
    family-names: Ducharlet
    email: kevin.ducharlet@berger-levrault.com
    affiliation: Carl Berger-Levrault
    orcid: 'https://orcid.org/0000-0003-0053-8874'
repository-code: 'https://github.com/kyducharlet/odds'
license: GPL-3.0

GitHub Events

Total
  • Push event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Create event: 1

Dependencies

setup.py pypi
  • matplotlib ==3.8.1
  • numpy ==1.26.1
  • openpyxl ==3.2.0b1
  • pandas ==2.1.2
  • pympler ==1.0.1
  • scikit-learn ==1.3.2
  • scipy ==1.11.3
  • seaborn ==0.13.0
  • smartsifter ==0.1.1.dev1
  • tqdm ==4.66.1