deepfastmlu

Machine learning utilities to help speed up the prototyping process.

https://github.com/fabprezja/deep-fast-machine-learning-utils

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

deep-learning feature-selection machine-learning neural-network utilities
Last synced: 6 months ago

Repository

Machine learning utilities to help speed up the prototyping process.

Basic Info
Statistics
  • Stars: 7
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Topics
deep-learning feature-selection machine-learning neural-network utilities
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Deep Fast Machine Learning Utils


Welcome to Deep Fast Machine Learning Utils! This library is designed to streamline and expedite your machine learning prototyping process. It offers tools for model search and feature selection that are not found in other ML libraries, and aims to complement established libraries such as TensorFlow, Keras, and scikit-learn. It also provides extra tools for dataset management and for visualizing training outcomes.

Documentation at: https://fabprezja.github.io/deep-fast-machine-learning-utils/

Note: This library is in the early stages of development.

Installation

You can install the library directly using pip:

```bash
pip install deepfastmlu
```

Citation

If you find this library useful in your research, please consider citing:

```bibtex
@misc{fabprezja_2023_dfmlu,
  author = {Fabi Prezja},
  title = {Deep Fast Machine Learning Utils},
  month = sep,
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/fabprezja/deep-fast-machine-learning-utils}},
  doi = {10.5281/zenodo.8374468},
  url = {https://doi.org/10.5281/zenodo.8374468}
}
```

If you used the Adaptive Variance Threshold (AVT) class, please cite the following article: Paper + Citation Coming Soon

Note: When referencing, please consider additional attributions to TensorFlow, scikit-learn, and Keras, as the library is built around them.

Table of Contents

  1. Model Search
  2. Feature Selection
  3. Extra Tools

Model Search

Principal Component Cascade Dense Neural Architecture Search (PCCDNAS)

PCCDNAS provides an automated method for designing dense neural networks. Using PCA (Principal Component Analysis), it systematically sets the number of neurons in each layer of the network. After applying PCA to the initial data, the neuron count for the first layer is determined based on the principal components (PCs) for a given variance threshold. Subsequently, the cascade mechanism ensures that the activations from each trained layer undergo PCA again. This process, in turn, determines the neuron count for the subsequent layers using the same principal component variance threshold criteria.

```text
PCCDNAS Core Pseudo-Algorithm (Paper Coming Soon):

1. Initialize:
   - Create an empty neural network model.
   - Create an empty list to store the number of neurons for each layer.
2. Data Initialization:
   - Accept training data and labels.
   - Center or normalize the data if required.
3. Initialize Model Search:
   - Set hyperparameters (e.g., number of layers, PCA variance threshold, etc.).
4. Build the Neural Network Model:
   - While the desired number of layers has not been reached:
     a. If at the first layer build stage, use the original training data.
     b. For subsequent layer build stages:
        - Train the model.
        - Extract the activations from the last layer (for each data point).
     c. Perform PCA on the data (original or activations).
     d. Determine the number of principal components that meet the variance threshold.
     e. Set the number of neurons in the next layer based on the determined principal component count.
     f. Add the layer to the model.
```

Usage:

```python
from deepfastmlu.model_search import PCCDNAS
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the PCCDNAS object
pccdnas = PCCDNAS()

# Initialize data for the model search
pccdnas.data_init(X_train=X_train, y_train=y_train, validation=(X_val, y_val),
                  normalize=True, unit=True)

# Initialize model search hyperparameters
pccdnas.initialize_model_search(
    epochs=10,                        # Number of training epochs
    layers=3,                         # Number of layers in the neural network
    activation='relu',                # Activation function for the layers
    pca_variance=[0.95, 0.84, 0.63],  # Desired explained variance for PCA for each layer
    loss='binary_crossentropy',       # Loss function for the model
    optimizer='adam',                 # Optimizer for the model
    metrics=['accuracy'],             # List of metrics to be evaluated during training
    output_neurons=1,                 # Number of neurons in the output layer
    out_activation='sigmoid',         # Activation function for the output layer
    stop_criteria='val_loss',         # Criteria for early stopping
    es_mode='min',                    # Mode for early stopping ('min' minimizes the stop criteria)
    dropout=0.2,                      # Dropout rate for dropout layers
    regularize=('l2', 0.01),          # Regularization type ('l2') and value (0.01)
    batch_size=32,                    # Batch size for training
    kernel_initializer='he_normal',   # Kernel initializer for the dense layers
    batch_norm=True,                  # Whether to include batch normalization layers
    es_patience=5,                    # Number of epochs with no improvement for early stopping
    verbose=1,                        # Verbosity mode (1 = progress bar)
    learn_rate=0.001                  # Learning rate for the optimizer
)

# Build the model
model, num_neurons = pccdnas.build()
print("Number of neurons in each layer:", num_neurons)
```

Feature Selection

Adaptive Variance Threshold (AVT)

Adaptive Variance Threshold is a feature selector that dynamically determines a variance threshold from a given percentile of the feature variances. Features with a variance below this threshold are dropped. Traditional (non-zero) variance-based feature selection relies on a manually chosen, dataset-dependent threshold, which does not transfer well across datasets.
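Conceptually, the adaptive threshold is just a percentile of the per-feature variances. The sketch below illustrates that idea with plain NumPy; it is a simplification for exposition, and the actual class follows the scikit-learn fit/transform API shown in the usage example:

```python
import numpy as np

def adaptive_variance_mask(X, percentile=1.5):
    """Keep features whose variance exceeds the given percentile
    of all feature variances (computed on training data only)."""
    variances = X.var(axis=0)
    threshold = np.percentile(variances, percentile)
    return variances > threshold

# Example: X_reduced = X[:, adaptive_variance_mask(X, percentile=1.5)]
```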

Usage:

```python
from sklearn.model_selection import train_test_split
from deepfastmlu.feature_select import AdaptiveVarianceThreshold

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaptiveVarianceThreshold
avt = AdaptiveVarianceThreshold(percentile=1.5, verbose=True)

# Fit AVT to the training data
avt.fit(X_train)

# Transform both training and validation data
X_train_new = avt.transform(X_train)
X_val_new = avt.transform(X_val)
```

Rank Aggregated Feature Selection

RankAggregatedFS is a feature selector that aggregates the rankings of features from multiple feature selection methods. It combines the scores or rankings of features from different methods to provide a unified ranking of features. This approach can be useful when there's uncertainty about which feature selection method to use, as it combines the strengths of multiple methods.
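A common aggregation scheme is to average each feature's rank across methods and keep the k best overall. The sketch below shows that idea in plain NumPy; it is a hypothetical simplification, and the library's exact aggregation may differ:

```python
import numpy as np

def aggregate_ranks(score_lists, k):
    """Average each feature's rank across several scorers and
    return the indices of the k best features overall."""
    ranks = []
    for scores in score_lists:
        order = np.argsort(-np.asarray(scores))  # higher score = better
        rank = np.empty_like(order)
        rank[order] = np.arange(len(order))      # rank 0 is best
        ranks.append(rank)
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:k]

# Example: selected = aggregate_ranks([mi_scores, f_scores], k=10)
```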

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import StandardScaler
from deepfastmlu.feature_select import RankAggregatedFS

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
variance_selector = VarianceThreshold(threshold=0.0)
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
f_classif_selector = SelectKBest(score_func=f_classif, k=10)

# Initialize RankAggregatedFS with multiple methods (excluding VarianceThreshold)
rank_aggregated_fs = RankAggregatedFS(methods=[mi_selector, f_classif_selector], k=10)

pipeline = Pipeline([
    ('scaler', StandardScaler()),               # Normalize the data
    ('variance_threshold', variance_selector),  # Apply VarianceThreshold
    ('rank_aggregated_fs', rank_aggregated_fs)  # Apply RankAggregatedFS
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Chained Feature Selection

ChainedFS is a feature selector that sequentially applies a list of feature selection methods. This class allows for the chaining of multiple feature selection methods, where the output of one method becomes the input for the next. This can be particularly useful when one wants to combine the strengths of different feature selection techniques or when a sequence of operations is required to refine the feature set.
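The chaining logic amounts to fitting each selector on the output of the previous one, roughly as in this simplified sketch (not the library's actual implementation):

```python
def chained_fit_transform(selectors, X, y=None):
    """Apply each feature selector in sequence, feeding the reduced
    feature matrix from one selector into the next."""
    for selector in selectors:
        X = selector.fit(X, y).transform(X)
    return X
```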

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from deepfastmlu.feature_select import ChainedFS

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
variance_selector = VarianceThreshold(threshold=0.0)
k_best_selector = SelectKBest(score_func=mutual_info_classif, k=10)

# Initialize ChainedFS and create a pipeline
chained_fs = ChainedFS([variance_selector, k_best_selector])
pipeline = Pipeline([('feature_selection', chained_fs)])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Mixing Feature Selection Approaches

In this example, we mix the feature selection methods shown previously.

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import StandardScaler
from deepfastmlu.feature_select import RankAggregatedFS, AdaptiveVarianceThreshold

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
adaptive_variance_selector = AdaptiveVarianceThreshold(percentile=1.5)
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
f_classif_selector = SelectKBest(score_func=f_classif, k=10)

# Initialize RankAggregatedFS with multiple methods
rank_aggregated_fs = RankAggregatedFS(methods=[mi_selector, f_classif_selector], k=10)

pipeline = Pipeline([
    ('scaler', StandardScaler()),                                 # Normalize the data
    ('adaptive_variance_threshold', adaptive_variance_selector),  # Apply AdaptiveVarianceThreshold
    ('rank_aggregated_fs', rank_aggregated_fs)                    # Apply RankAggregatedFS
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Extra Tools

Data Management

Data Splitter

A class to split any folder-based dataset into partitions (train, val, and one or more test sets). The splits are stratified.

```python
from deepfastmlu.extra.data_helpers import DatasetSplitter

# Define the paths to the original dataset and the destination directory for the split datasets
data_dir = 'path/to/original/dataset'
destination_dir = 'path/to/destination/directory'

# Instantiate the DatasetSplitter class with the desired train, validation, and test set ratios
splitter = DatasetSplitter(data_dir, destination_dir, train_ratio=0.7, val_ratio=0.10,
                           test_ratio=0.10, test_ratio_2=0.10, seed=42)

# Split the dataset into train, validation, and test sets
splitter.run()
```
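Stratified here means the split ratios are applied within each class folder, so every partition keeps the original class proportions. A rough sketch of that logic, assuming a `data_dir/<class>/<file>` layout (standard library only, not the library's actual code):

```python
import os
import random

def stratified_file_split(data_dir, train_ratio=0.7, val_ratio=0.1, seed=42):
    """Return {class_name: (train_files, val_files, test_files)},
    splitting within each class folder to preserve class balance."""
    rng = random.Random(seed)
    splits = {}
    for cls in sorted(os.listdir(data_dir)):
        files = sorted(os.listdir(os.path.join(data_dir, cls)))
        rng.shuffle(files)
        n_train = int(len(files) * train_ratio)
        n_val = int(len(files) * val_ratio)
        splits[cls] = (files[:n_train],
                       files[n_train:n_train + n_val],
                       files[n_train + n_val:])
    return splits
```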

Data Sub Sampler (miniaturize)

A class to sub-sample (miniaturize) any folder-based dataset given a ratio:

```python
from deepfastmlu.extra.data_helpers import DataSubSampler

# Define the paths to the original dataset and the destination directory for the sub-sampled dataset
data_dir = 'path/to/original/dataset'
subsampled_destination_dir = 'path/to/subsampled/dataset'

# Instantiate the DataSubSampler class with the desired fraction of files to sample
subsampler = DataSubSampler(data_dir, subsampled_destination_dir, fraction=0.5, seed=42)

# Create a smaller dataset by randomly sampling a fraction (here, 50%) of files from the original dataset
subsampler.create_miniature_dataset()
```

Visualizing Results

Plot Validation Curves

Leverage the `plot_history_curves` function to visualize training and validation metrics across epochs. This function not only displays the evolution of your model's performance but also highlights the minimum loss and maximum metric values to make insights clearer.

```python
from deepfastmlu.extra.plot_helpers import plot_history_curves

# Train the model
history = model.fit(X_train, y_train_one_hot, validation_data=(X_val, y_val_one_hot),
                    epochs=25, batch_size=32)

# Visualize the training history
plot_history_curves(history, show_min_max_plot=True, user_metric='accuracy')
```

Example result: (training and validation curve plot in the original README)

Plot Generator Confusion Matrix

Utilize the `plot_confusion_matrix` function to effortlessly generate a confusion matrix for your model's predictions. Designed specifically for Keras image generators, it automatically identifies class names, offering a straightforward way to gauge classification performance.

```python
from deepfastmlu.extra.plot_helpers import plot_confusion_matrix

# Create the confusion matrix for validation data:
#   model: A trained Keras model.
#   val_generator: A Keras ImageDataGenerator used for validation.
#   "Validation Data": Name of the generator, used in the plot title.
#   "binary": Type of target labels ('binary' or 'categorical').
plot_confusion_matrix(model, val_generator, "Validation Data", "binary")
```

Owner

  • Name: Fabi Prezja
  • Login: fabprezja
  • Kind: user
  • Location: Jyväskylä, Finland
  • Company: University of Jyväskylä

Active in medical deep learning; else, guitars, nonfiction reading and music technology.

Citation (CITATION.bib)

@misc{fabprezja_2023_dfmlu,
  author = {Fabi Prezja},
  title = {Deep Fast Machine Learning Utils},
  month = sep,
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/fabprezja/deep-fast-machine-learning-utils}},
  doi = {10.5281/zenodo.8374468},
  url = {https://doi.org/10.5281/zenodo.8374468}
}

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 11
  • Total Committers: 2
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.091
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Fabi Prezja (8****a): 10 commits
  • Fabi Prezja (f****a@f****i): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 12 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: deepfastmlu

Machine learning utilities to help speed up your prototyping process.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 12 last month
Rankings
  • Dependent packages count: 7.4%
  • Average: 38.1%
  • Dependent repos count: 68.8%
Maintainers (1)
Last synced: 6 months ago