segmentae

SegmentAE: A Python Library for Anomaly Detection Optimization

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary

Keywords

anomaly-detection autoencoder-segmentation autoencoders clustering data-processing data-science deep-learning fraud-detection machine-learning neural-networks novelty-detection

Last synced: 11 months ago

Repository

SegmentAE: A Python Library for Anomaly Detection Optimization

Basic Info

Host: GitHub
Owner: TsLu1s
License: MIT
Language: Python
Default Branch: main
Homepage:
Size: 111 KB

Statistics

Stars: 7
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

SegmentAE: A Python Library for Anomaly Detection Optimization

Framework Overview

SegmentAE is designed to improve anomaly detection performance by optimizing reconstruction error through the combination of clustering methods with tabular autoencoders. It provides a versatile, scalable and robust solution for anomaly detection in domains such as financial fraud detection and network security, with extensive customization and optimization capabilities.

Key Features and Capabilities

1. General Applicability on Tabular Datasets

SegmentAE is engineered to handle a wide range of tabular datasets, making it suitable for anomaly detection tasks across different use cases. It can be integrated into diverse applications, which gives it broad utility and adaptability.

2. Optimization and Customization

The framework offers complete configurability for each component of the anomaly detection pipeline, including data preprocessing and clustering algorithms, and supports both the customization of baseline autoencoders and the integration of fully developed models. Each component can therefore be fine-tuned for optimal performance on a specific use case.

3. Enhanced Detection Performance

By combining clustering algorithms with autoencoder-based anomaly detection, SegmentAE aims to improve the accuracy and reliability of anomaly detection. Integrating tabular autoencoders with clustering lets the framework capture different patterns in the input data and optimize the reconstruction error separately for each cluster, thereby enhancing predictive performance.
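The effect of per-cluster thresholds can be illustrated with a minimal plain-Python sketch. It assumes that each cluster's threshold is its mean training reconstruction error multiplied by a ratio. That rule, the function names and the toy numbers are illustrative assumptions, not SegmentAE's actual implementation.

```python
from collections import defaultdict
from statistics import mean

def cluster_thresholds(train_errors, train_clusters, ratio=2.0):
    """Per-cluster threshold = ratio * mean training error (assumed rule)."""
    by_cluster = defaultdict(list)
    for err, cl in zip(train_errors, train_clusters):
        by_cluster[cl].append(err)
    return {cl: ratio * mean(errs) for cl, errs in by_cluster.items()}

def flag_anomalies(errors, clusters, thresholds):
    """Flag a row when its error exceeds the threshold of its own cluster."""
    return [int(err > thresholds[cl]) for err, cl in zip(errors, clusters)]

# Toy training errors: cluster 0 reconstructs well, cluster 1 is noisier.
train_errors = [0.01, 0.02, 0.03, 0.20, 0.30, 0.40]
train_clusters = [0, 0, 0, 1, 1, 1]
thresholds = cluster_thresholds(train_errors, train_clusters)

# Two new rows with the same error but assigned to different clusters.
new_errors = [0.10, 0.10]
new_clusters = [0, 1]
flags = flag_anomalies(new_errors, new_clusters, thresholds)

# For comparison: one global threshold computed over all training rows.
global_threshold = 2.0 * mean(train_errors)
global_flags = [int(err > global_threshold) for err in new_errors]
```

In this toy data an error of 0.10 is unusual for cluster 0 but ordinary for cluster 1, so only the per-cluster rule flags the first row, while the global threshold flags neither.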

Main Development Tools

Major frameworks used to build this project:

Where to get it

The binary installer for the latest released version is available at the Python Package Index (PyPI).

Installation

To install this package from the PyPI repository, run the following command:

pip install segmentae

SegmentAE - Technical Components and Pipeline Structure

The SegmentAE framework consists of several integrated components, each playing a critical role in the optimization of anomaly detection through clustering and tabular autoencoders. The pipeline is structured for seamless data flow and modular customization, so each component can be adapted to the specific needs of a use case.

1. Data Preprocessing

Proper preprocessing is crucial for the quality and consistency of the data fed into the subsequent stages of the pipeline. The data preprocessing module is responsible for preparing raw data for predictive applications. This includes:

Missing Value Imputation: Multiple supervised algorithmic imputation options to handle and impute missing data points.
Normalization: Scaling features to ensure they have comparable magnitudes, essential for the performance of many machine learning algorithms.
Categorical Encoding: Transforming categorical variables into numerical representations suitable for machine learning algorithms, using methods such as label encoding, InverseFrequency encoding and one-hot encoding.
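The encoding and normalization steps can be sketched in a few lines of plain Python. The sketch assumes that inverse-frequency encoding maps each category to the reciprocal of its relative frequency. That definition, the helper names and the sample column are assumptions made for illustration and may differ from what the library does internally.

```python
from collections import Counter

def inverse_frequency_encode(values):
    """Map each category to 1 / relative frequency (assumed definition)."""
    counts = Counter(values)
    n = len(values)
    mapping = {cat: n / cnt for cat, cnt in counts.items()}
    return [mapping[v] for v in values], mapping

def min_max_scale(column):
    """Rescale a numeric column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    span = hi - lo
    # A constant column has no spread, so map it to zeros.
    return [0.0 if span == 0 else (x - lo) / span for x in column]

# Hypothetical categorical column from a network-traffic table.
protocols = ["tcp", "tcp", "udp", "tcp", "icmp", "udp"]
encoded, mapping = inverse_frequency_encode(protocols)
scaled = min_max_scale(encoded)
```

Rare categories receive larger encoded values, and scaling afterwards keeps the encoded column on the same [0, 1] range as the other features.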

2. Clustering

Clustering forms the backbone of the SegmentAE framework, providing the capability to segment data into meaningful, distinct groups. This segmentation helps in understanding the underlying structure of the input data and provides a basis for improving the reconstruction error used in anomaly detection.

Clustering Algorithms: Support and customization for a variety of algorithm options such as K-Means, MiniBatchKMeans, GaussianMixture, and Agglomerative clustering, allowing the framework to adapt to different data structures and distribution patterns.
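As a rough illustration of what the segmentation step produces, the following toy k-means is written from scratch in plain Python. It is a sketch for intuition only: the function names, the fixed starting centroids and the sample points are assumptions, and it is not how SegmentAE implements clustering.

```python
import math

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its points (keep it if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The resulting cluster labels are what allow reconstruction errors to be grouped and compared segment by segment rather than over the whole dataset.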

3. Anomaly Detection - Baseline Autoencoders

The core of the SegmentAE framework is its anomaly detection optimization module, which uses tabular autoencoders to identify anomalies. Autoencoders are neural networks designed to learn efficient representations of input data, enabling the detection of anomalies by measuring reconstruction errors. The framework includes three baseline autoencoders (Dense, Batch Norm and Ensemble), each of which can be customized, including the network architecture, training epochs, activation layers and other settings.
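To show what measuring a reconstruction error means in practice, here is a minimal sketch that scores one row against its reconstruction using four common error metrics. The function name, the metric keys and the sample values are illustrative assumptions. In a real pipeline the reconstruction would come from a trained autoencoder.

```python
import math

def reconstruction_error(row, reconstructed, metric="mse"):
    """Per-row error between an input row and its reconstruction."""
    diffs = [x - r for x, r in zip(row, reconstructed)]
    if metric == "mse":
        return sum(d * d for d in diffs) / len(diffs)
    if metric == "mae":
        return sum(abs(d) for d in diffs) / len(diffs)
    if metric == "rmse":
        return math.sqrt(sum(d * d for d in diffs) / len(diffs))
    if metric == "max_error":
        return max(abs(d) for d in diffs)
    raise ValueError(f"unknown metric: {metric}")

row = [0.0, 0.5, 1.0, 0.5]
reconstructed = [0.0, 0.5, 0.0, 0.5]  # third feature is poorly reconstructed
errors = {m: reconstruction_error(row, reconstructed, m)
          for m in ("mse", "mae", "rmse", "max_error")}
```

Averaging metrics such as the mean squared error dilute a single badly reconstructed feature across the row, whereas the maximum error reports it directly, so the choice of metric changes which rows stand out.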

You can also build your own Keras-based autoencoder model and integrate it into the SegmentAE pipeline.

The project also provides an application example for fully unlabeled data.

SegmentAE - Predictive Application

To demonstrate the usage of SegmentAE, a DenseAutoencoder is trained and integrated with KMeans clustering (with 3 clusters). The following script outlines the entire process: data loading, preprocessing, clustering, autoencoder training, integration with clustering for anomaly detection, performance evaluation and the prediction of future anomalies.

```py
import pandas as pd
from segmentae.data_sources.examples import load_dataset
from segmentae.anomaly_detection import (SegmentAE,
                                         Preprocessing,
                                         Clustering,
                                         DenseAutoencoder,
)
from sklearn.model_selection import train_test_split

## Data Loading

train, test, target = load_dataset(dataset_selection = 'network_intrusions', # Options | 'network_intrusions', 'default_credit_card',
                                   split_ratio = 0.75)                       #         | 'htru2_dataset', 'shuttle_148'

test, future_data = train_test_split(test, train_size = 0.9, random_state = 5)

# Resetting Index is Required
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
future_data = future_data.reset_index(drop=True)

X_train, y_train = train.drop(columns=[target]).copy(), train[target].astype(int)
X_test, y_test = test.drop(columns=[target]).copy(), test[target].astype(int)
X_future_data = future_data.drop(columns=[target]).copy()

## Preprocessing

pr = Preprocessing(encoder = "IFrequencyEncoder", # Options | "IFrequencyEncoder", "LabelEncoder", "OneHotEncoder", None
                   scaler = "MinMaxScaler",       # Options | "MinMaxScaler", "StandardScaler", "RobustScaler", None
                   imputer = None)                # Options | "Simple","RandomForest","ExtraTrees","GBR","KNN",
                                                  #         | "XGBoost","Lightgbm","Catboost", None

pr.fit(X = X_train)
X_train = pr.transform(X = X_train)
X_test = pr.transform(X = X_test)
X_future_data = pr.transform(X = X_future_data)

## Clustering Implementation

cl_model = Clustering(cluster_model = ["KMeans"], # Options | KMeans, MiniBatchKMeans, GMM, Agglomerative
                      n_clusters = 3)
cl_model.clustering_fit(X = X_train)

## Autoencoder Implementation

denseAutoencoder = DenseAutoencoder(hidden_dims = [16, 12, 8, 4],
                                    encoder_activation = 'relu',
                                    decoder_activation = 'relu',
                                    optimizer = 'adam',
                                    learning_rate = 0.001,
                                    epochs = 150,
                                    val_size = 0.15,
                                    stopping_patient = 20,
                                    dropout_rate = 0.1,
                                    batch_size = None)
denseAutoencoder.fit(input_data = X_train)
denseAutoencoder.summary()

## Autoencoder + Clustering Integration

sg = SegmentAE(ae_model = denseAutoencoder,
               cl_model = cl_model)

## Train Reconstruction

sg.reconstruction(input_data = X_train,
                  threshold_metric = 'mse') # Options | mse, mae, rmse, max_error

## Reconstruction Performance (Assuming y_test existence)

results = sg.evaluation(input_data = X_test,
                        target_col = y_test,
                        threshold_ratio = 2.0) # Selected Threshold Reconstruction Error Multiplier

preds_test, recon_metrics_test = sg.preds_test, sg.reconstruction_test # Test Metadata by Cluster

## Anomaly Detection Predictions

predictions = sg.detections(input_data = X_future_data,
                            threshold_ratio = 2.0)
```

👉 Grid Search Optimizer

SegmentAE provides a comprehensive optimization and evaluation methodology through its SegmentAE_Optimization pipeline to assess and enhance its anomaly detection capabilities. This approach uses a grid search optimization strategy designed for extensive experimental ensembles, aiming to systematically identify the optimal combination of configurations, including:

Different autoencoders
Multiple clustering algorithms
A range of cluster numbers

Furthermore, the impact of different reconstruction error threshold ratios is also analysed, giving a more detailed view of the model's performance across multiple scenarios and identifying areas for potential improvement. With this optimization strategy, SegmentAE can be fine-tuned to deliver better anomaly detection results across diverse datasets and use cases, supporting data-driven selection of the most effective models for specific applications.
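The search over configurations can be pictured with a generic grid search sketch. It does not use the SegmentAE_Optimization API: the function names, the option labels and the `toy_score` stand-in (which replaces a real train-and-evaluate step with a made-up formula) are all assumptions for illustration.

```python
from itertools import product

def grid_search(autoencoders, clusterers, n_clusters_options,
                threshold_ratios, score_fn):
    """Score every combination and return the results, best first."""
    results = []
    for ae, cl, k, ratio in product(autoencoders, clusterers,
                                    n_clusters_options, threshold_ratios):
        results.append({"autoencoder": ae,
                        "clustering": cl,
                        "n_clusters": k,
                        "threshold_ratio": ratio,
                        "score": score_fn(ae, cl, k, ratio)})
    return sorted(results, key=lambda r: r["score"], reverse=True)

def toy_score(ae, cl, k, ratio):
    """Hypothetical stand-in for training and evaluating one configuration."""
    bonus = 0.05 if ae == "Ensemble" else 0.0
    return round(0.9 - 0.1 * abs(k - 3) - 0.05 * abs(ratio - 2.0) + bonus, 4)

results = grid_search(autoencoders=["Dense", "BatchNorm", "Ensemble"],
                      clusterers=["KMeans", "GMM"],
                      n_clusters_options=[2, 3, 4],
                      threshold_ratios=[1.0, 2.0, 3.0],
                      score_fn=toy_score)
best = results[0]
```

In practice the scoring function would fit the chosen autoencoder and clustering model and return an evaluation metric on labelled data, and the sorted results would show how sensitive performance is to the number of clusters and the threshold ratio.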

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Luis Santos - LinkedIn

Owner

Name: Luís Santos
Login: TsLu1s
Kind: user
Location: Braga, Portugal

Repositories: 4
Profile: https://github.com/TsLu1s

GitHub Events

Total

Watch event: 3
Fork event: 1

Last Year

Watch event: 3
Fork event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Packages

Total packages: 1
Total downloads: 21 last month (PyPI)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 9
Total maintainers: 1

pypi.org: segmentae

SegmentAE: A Python Library for Anomaly Detection Optimization

Homepage: https://github.com/TsLu1s/SegmentAE
Documentation: https://segmentae.readthedocs.io/
License: MIT
Latest release: 1.0.27 (published over 1 year ago)

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 21 Last month

Rankings

Dependent packages count: 10.8%

Average: 35.7%

Dependent repos count: 60.6%

Maintainers (1)

TsLu1s

Last synced: 11 months ago

Dependencies

requirements.txt pypi

atlantic >=1.1.67
mlimputer >=1.0.67
numpy >=1.19.5
pandas >=1.2.0
scipy >=1.11.4
tensorflow >=2.10.0
ucimlrepo >=0.0.7

setup.py pypi
