odds_optimized
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: fgrivet
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 26.5 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Outlier Detection for Data Streams (ODDS)
ODDS Optimized
This repository is a fork of the original ODDS. Several functions have been optimized for better performance.
Polynomial Basis
The optimized polynomial_basis options are:
- monomials: optimized_monomials
- Chebyshev T1: chebyshev_t_1_matrix
Incremental Updates
To update the moment matrix (incr_opt), sherman has been optimized for rank-1 updates, and woodbury has been added for efficient rank-$K$ updates when $K < \text{matrix\_size}/3$. If $K$ is larger, use incr_opt = "inverse".
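As an illustration of the rank-1 case, the Sherman-Morrison identity updates a stored inverse without re-inverting the whole matrix. This is a minimal numpy sketch of the idea, not the library's actual code:

```python
import numpy as np

# Sherman-Morrison: (A + u u^T)^{-1} = A^{-1} - (A^{-1} u u^T A^{-1}) / (1 + u^T A^{-1} u)
rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
A = A @ A.T                      # symmetric positive definite, like a moment matrix
A_inv = np.linalg.inv(A)         # inverse maintained incrementally
u = rng.standard_normal((4, 1))  # rank-1 update direction

num = A_inv @ u @ u.T @ A_inv
den = 1.0 + (u.T @ A_inv @ u).item()
updated_inv = A_inv - num / den  # O(n^2) instead of O(n^3) re-inversion

exact_inv = np.linalg.inv(A + u @ u.T)  # matches updated_inv up to rounding
```

The Woodbury identity generalizes this to rank-K updates; it pays off as long as K stays small relative to the matrix size, which matches the K < matrix_size / 3 rule of thumb above.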
Inversion Method
The fpd_inv method has been modified so that it performs well even on ill-conditioned matrices.
Other Optimizations
The DyCF.fit and DyCF.score_samples methods have been optimized for better performance and now use $M = \frac{1}{N} X_n^T X_n$ instead of $M = \frac{1}{N} \sum_{i=0}^{N} \mathbf{v}(\mathbf{x}_i) \mathbf{v}(\mathbf{x}_i)^T$.
The PolynomialsBasis.generate_combinations method has been optimized using a recursive approach to generate combinations of polynomial degrees, which is more efficient than the previous iterative method.
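The two moment-matrix formulations are mathematically equivalent when the rows of $X_n$ are the basis vectors $\mathbf{v}(\mathbf{x}_i)$; the matrix product simply vectorizes the sum of outer products. A small numpy check (illustrative only, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 5
X_n = rng.standard_normal((N, p))  # row i plays the role of v(x_i)

# Vectorized form: one BLAS call
M_vectorized = X_n.T @ X_n / N

# Summed form: N explicit outer products
M_summed = sum(np.outer(v, v) for v in X_n) / N

# Both give the same moment matrix up to rounding
```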
Description
This code is associated with a scientific article and allows retrieving its results. It can also be used as a framework for detecting outliers in data streams.
How to install
With Python 3.11 installed, you can either run the following command within the repository folder (useful if you want to modify some parts):
```shell
pip install -e .
```
or install from git:
```shell
pip install git+https://github.com/fgrivet/odds_optimized.git
```
Possibly outdated instructions
The following instructions may be outdated; please refer to the examples folder for up-to-date information.
How to retrieve the experimental results presented in our article
Three scripts were used to obtain the presented results:
- scripts/comparing_cd_kde.py: used to plot the graph comparing Christoffel Function and KDE as scoring functions,
- scripts/christoffel_article_expe.py: used to compute the outlier scores and performance metrics on the datasets from data directory,
- scripts/christoffel_article_plots.py: used to plot the results based on a csv file generated by christoffel_article_expe.py.
The required datasets are stored in the data folder.
Comparing Christoffel Function and KDE
Executing this script generates two files: cf_vs_kde.csv, containing the AUROC and AP results, and cf_vs_kde.png, containing the graph from the article.
Computing outlier scores and performance metrics
This script executes all the tests required for the experiment section of the article. During processing, a temp folder is created to hold partial results.
This is helpful because the full process takes several days.
The final results are saved in two csv files:
- results_conveyors_experiment.csv for conveyor data streams,
- results_synthetics_experiment.csv for synthetic data streams.
Plotting results
The script to plot the results needs to be executed after scripts/christoffel_article_expe.py, since it requires the resulting csv files.
This script generates one png file per "equipment" (barplot) and another png file per experiment (table with mean and standard deviation results).
How to use as a framework
If you want to use this repository as a framework, note that you can use newer versions of Python or of the other libraries by changing the requirements in setup.py.
Implemented methods
Here we describe the different methods already implemented in this framework and their parameters.
Statistical methods
KDE
Kernel Density Estimation with sliding windows
Attributes
----------
threshold: float
the threshold on the pdf; if the pdf computed for a point is greater than the threshold, the point is considered normal
win_size: int
size of the window of kernel centers to keep in memory
kernel: str, optional
the type of kernel to use, can be either "gaussian" or "epanechnikov" (default is "gaussian")
bandwidth: str, optional
rule of thumb to compute the bandwidth, "scott" is the only one implemented for now (default is "scott")
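The attributes above can be made concrete with a hypothetical sketch of the decision rule (the actual class and its signatures may differ). Scott's rule of thumb sets the bandwidth to $n^{-1/(d+4)}$ for standardized data:

```python
import numpy as np

def kde_pdf(point, centers, bandwidth):
    """Gaussian KDE estimate of the pdf at `point` from a window of kernel centers."""
    d = centers.shape[1]
    diffs = (point - centers) / bandwidth
    norm = (2 * np.pi) ** (-d / 2) * bandwidth ** (-d)
    return norm * np.exp(-0.5 * (diffs ** 2).sum(axis=1)).mean()

rng = np.random.default_rng(2)
centers = rng.standard_normal((256, 2))   # sliding window of kernel centers (win_size=256)
h = len(centers) ** (-1 / (2 + 4))        # Scott's rule for d=2, standardized data
threshold = 1e-3

inlier = kde_pdf(np.zeros(2), centers, h) > threshold       # dense region: pdf above threshold
outlier = kde_pdf(np.full(2, 8.0), centers, h) > threshold  # far from the data: pdf below threshold
```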
SmartSifter
Smart Sifter reduced to continuous domain only with its Sequentially Discounting Expectation and Maximizing (SDEM) algorithm (see https://github.com/sk1010k/SmartSifter and https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html)
Attributes
----------
threshold: float
the threshold on the pdf; if the pdf computed for a point is greater than the threshold, the point is considered normal
k: int
number of gaussian mixture components ("n_components" from sklearn.mixture.GaussianMixture)
r: float
discounting parameter for the SDEM algorithm ("r" from smartsifter.SDEM)
alpha: float
stability parameter for the weights of gaussian mixture components ("alpha" from smartsifter.SDEM)
scoring_function: str
scoring function used, either "logloss" for logarithmic loss or "hellinger" for hellinger score, both proposed by the original article,
or "likelihood" for the likelihood that a point is issued from the learned mixture (default is "likelihood")
DyCF
Dynamical Christoffel Function
Attributes
----------
d: int
the degree of polynomials, usually set between 2 and 8
incr_opt: str, optional
can be either "inverse" to invert the moment matrix at each iteration or "sherman" to use the Sherman-Morrison formula (default is "inverse")
polynomial_basis: str, optional
polynomial basis used to compute moment matrix, either "monomials", "chebyshev_t_1", "chebyshev_t_2", "chebyshev_u" or "legendre",
varying this parameter can bring stability to the score in some cases (default is "monomials")
regularization: str, optional
one of "vu" (score divided by $d^{3p/2}$), "vu_C" (score divided by $d^{3p/2}/C$), "comb" (score divided by comb(p+d, d)) or "none" (no regularization); "none" is used for the cf vs mkde comparison (default is "vu_C")
C: float, optional
defines the threshold on the score when used with regularization="vu_C"; usually C<=1 (default is 1)
inv: str, optional
inversion method, one of "inv" for classical matrix inversion or "pinv" for Moore-Penrose pseudo-inversion (default is "inv")
DyCG
Dynamical Christoffel Growth
Attributes
----------
degrees: ndarray, optional
the degrees of at least two DyCF models inside (default is np.array([2, 8]))
dycf_kwargs:
see the DyCF arguments other than d
Distance-based methods
DBOKDE
Distance-Based Outliers by Kernel Density Estimation
Attributes
----------
k: int
a threshold on the number of neighbours needed to consider the point as normal
R: float or str
the distance defining the neighborhood around a point; it can be computed dynamically, in which case set R="dynamic"
win_size: int
the number of points in the sliding window used in neighbours count
sample_size: int
the number of points used as kernel centers for the KDE, if sample_size=-1 then sample_size is set to win_size (default is -1)
Density-based methods
ILOF
Incremental Local Outlier Factor
Attributes
----------
k: int
the number of neighbors to compute the LOF score on
threshold: float
a threshold on the LOF score to separate normal points and outliers
win_size: int
number of points in the sliding window used in kNN search
min_size: int (optional)
minimal number of points in a node; it is mandatory that 2 <= min_size <= max_size / 2 (default is 3)
max_size: int (optional)
maximal number of points in a node; it is mandatory that 2 <= min_size <= max_size / 2 (default is 12)
p_reinsert_tol: int (optional)
tolerance on reinsertion, used to deal with overflow in a node (default is 4)
reinsert_strategy: str (optional)
either "close" or "far", tells whether we try to reinsert into the closest rectangles first or into the farthest (default is "close")
How to use outlier detection methods
The outlier detection methods implement the BaseDetector abstract class with the following methods:
Methods
-------
fit(x)
Generates a model that fits dataset x.
update(x)
Updates the current model with instances in x.
score_samples(x)
Makes the model compute the outlier score of samples in x (the higher the value, the more outlying the sample).
decision_function(x)
This is similar to score_samples(x) except outliers have negative score and inliers positive ones.
predict(x)
Returns the sign (-1 or 1) of decision_function(x).
eval_update(x)
Computes the decision_function of each sample and then updates the model with this sample.
predict_update(x)
Same as eval_update(x) but returns the prediction instead of the decision function.
method_name()
Returns the name of the method.
save_model()
Returns a dict of all the model attributes, allowing the model to be saved in databases such as MongoDB.
load_model(model_dict)
Reloads a previously saved model based on the output of the save_model() method.
copy()
Returns a copy of the model.
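To illustrate the contract (with a toy scoring rule, not one of the library's methods), a detector honoring the sign conventions above might look like this sketch:

```python
import numpy as np

class MeanDistanceDetector:
    """Toy detector: outlier score = distance to the training mean."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        return self

    def update(self, x):
        # naive running update; the library's methods update incrementally too
        self.mean_ = (self.mean_ + x.mean(axis=0)) / 2
        return self

    def score_samples(self, x):
        # higher value = more outlying, as in the BaseDetector contract
        return np.linalg.norm(x - self.mean_, axis=1)

    def decision_function(self, x):
        # negative for outliers, positive for inliers
        return self.threshold - self.score_samples(x)

    def predict(self, x):
        # sign (-1 or 1) of decision_function
        return np.sign(self.decision_function(x))

det = MeanDistanceDetector(threshold=3.0).fit(np.zeros((10, 2)))
preds = det.predict(np.array([[0.1, 0.1], [10.0, 10.0]]))  # inlier then outlier
```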
How to compare methods on labelled datasets
Datasets need to be csv files with the first line as a header, the first column as indexes, the following columns as variables, and the last one as labels (1 for inliers, -1 for outliers).
For instance, a dataset containing two instances:
* the vector [0, 0] being an inlier,
* the vector [1, 1] being an outlier,
could be written as:
```csv
0,x1,x2,y
0,0,0,1
1,1,1,-1
```
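Such a file can be read back with pandas; this is an illustrative sketch of the layout (the library's own utils.load_dataset presumably handles this for you):

```python
import io
import pandas as pd

# Same two-instance dataset as above: index column, features x1/x2, label y
csv_text = "0,x1,x2,y\n0,0,0,1\n1,1,1,-1\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

features = df.iloc[:, :-1].to_numpy()  # all columns but the last: x1, x2
labels = df.iloc[:, -1].to_numpy()     # last column: 1 = inlier, -1 = outlier
```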
Define methods and parameterizations
Methods used for evaluation should be defined, following the example in scripts/evaluating_methods_example.py, in a METHODS variable as a list of dictionaries.
The dictionary elements contain the following fields:
- name: the complete name to give the method,
- method: the class name,
- params: another dictionary where the fields are the parameter names and the values are their values,
- short_name: a shorter name for the method, used in the plot and as file names for saved results.
Set the datasets
Datasets should be set in a data_dict dictionary.
To do this, it is recommended to first define the split position between the training and testing parts of the dataset as an int, split_pos.
Then, load the dataset into a data variable by calling the utils.load_dataset method with the path to the dataset (theoretically, it should be ../res/dataset_name.csv).
Finally, set the dataset into data_dict as in the following example (note that each dataset in data_dict needs a different name):
```python
from odds_optimized.utils import load_dataset, split_data

data = load_dataset("path/to/file.csv")
split_pos = len(data) // 2  # for instance, if you want to split the dataset in two almost equal parts
x_train, y_train, x_test, y_test = split_data(data, split_pos)

data_dict = dict()
data_dict["dataset_name"] = {
    "x_test": x_test,
    "x_train": x_train,
    "y_test": y_test,
}
```
Run evaluation
You just have to call utils.compute_and_save and utils.plot_and_save_results with the variables defined earlier:
```python
from odds_optimized.utils import compute_and_save, plot_and_save_results

METHODS = [
    # list of methods dicts as explained in "Define methods and parameterizations" section
]
data_dict = {
    # datasets as explained in "Set the datasets" section
}

compute_and_save(METHODS, data_dict, "/path/to/res")  # "/path/to/res" is used as a prefix for the files containing pickled models
plot_and_save_results(METHODS, data_dict, "/path/to/res")  # "/path/to/res" is used as a prefix for the files containing graphs and metrics
```
This will generate the resulting files in the /path/to folder. The pickle files save the results in order to avoid computing the model again each time. If you want to recompute a model, the associated pickle file has to be deleted manually. One csv file is generated for each dataset, containing the results for each method. One png file is generated, showing the results in a plot.
Owner
- Login: fgrivet
- Kind: user
- Repositories: 1
- Profile: https://github.com/fgrivet
Citation (CITATION.cff)
cff-version: 1.2.0
title: ODDS
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Kevin
family-names: Ducharlet
email: kevin.ducharlet@berger-levrault.com
affiliation: Carl Berger-Levrault
orcid: 'https://orcid.org/0000-0003-0053-8874'
repository-code: 'https://github.com/kyducharlet/odds'
license: GPL-3.0
GitHub Events
Total
- Push event: 1
- Create event: 1
Last Year
- Push event: 1
- Create event: 1
Dependencies
- matplotlib ==3.8.1
- numpy ==1.26.1
- openpyxl ==3.2.0b1
- pandas ==2.1.2
- pympler ==1.0.1
- scikit-learn ==1.3.2
- scipy ==1.11.3
- seaborn ==0.13.0
- smartsifter ==0.1.1.dev1
- tqdm ==4.66.1