deepfastmlu

Machine learning utilities to help speed up the prototyping process.

https://github.com/fabprezja/deep-fast-machine-learning-utils

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

deep-learning feature-selection machine-learning neural-network utilities
Last synced: 6 months ago

Repository

Machine learning utilities to help speed up the prototyping process.

Basic Info
Statistics
  • Stars: 7
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Topics
deep-learning feature-selection machine-learning neural-network utilities
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Deep Fast Machine Learning Utils


Welcome to Deep Fast Machine Learning Utils! This library is designed to streamline and expedite your machine learning prototyping process. It offers tools for model search and feature selection that are not found in other ML libraries, and aims to complement established libraries such as TensorFlow, Keras, and scikit-learn. It also provides extra tools for dataset management and for visualizing training outcomes.

Documentation at: https://fabprezja.github.io/deep-fast-machine-learning-utils/

Note: This library is in the early stages of development.

Installation

You can install the library directly using pip:

```bash
pip install deepfastmlu
```

Citation

If you find this library useful in your research, please consider citing:

```bibtex
@misc{fabprezja_2023_dfmlu,
  author = {Fabi Prezja},
  title = {Deep Fast Machine Learning Utils},
  month = sep,
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/fabprezja/deep-fast-machine-learning-utils}},
  doi = {10.5281/zenodo.8374468},
  url = {https://doi.org/10.5281/zenodo.8374468}
}
```

If you used the Adaptive Variance Threshold (AVT) class, please cite the following article: Paper + Citation Coming Soon

Note: When referencing, please consider additional attributions to TensorFlow, scikit-learn, and Keras, as the library is built around them.

Table of Contents

  1. Model Search
  2. Feature Selection
  3. Extra Tools

Model Search

Principal Component Cascade Dense Neural Architecture Search (PCCDNAS)

PCCDNAS provides an automated method for designing dense neural networks. Using PCA (Principal Component Analysis), it systematically sets the number of neurons in each layer of the network. After applying PCA to the initial data, the neuron count for the first layer is determined based on the principal components (PCs) for a given variance threshold. Subsequently, the cascade mechanism ensures that the activations from each trained layer undergo PCA again. This process, in turn, determines the neuron count for the subsequent layers using the same principal component variance threshold criteria.

```text
PCCDNAS Core Pseudo-Algorithm (Paper Coming Soon):

1. Initialize:
   - Create an empty neural network model.
   - Create an empty list to store the number of neurons for each layer.
2. Data Initialization:
   - Accept training data and labels.
   - Center or normalize the data if required.
3. Initialize Model Search:
   - Set hyperparameters (e.g., number of layers, PCA variance threshold, etc.).
4. Build the Neural Network Model:
   - While the desired number of layers has not been reached:
     a. If at the first layer build stage, use the original training data.
     b. For subsequent layer build stages:
        - Train the model.
        - Extract the activations from the last layer (for each data point).
     c. Perform PCA on the data (original or activations).
     d. Determine the number of principal components that meet the variance threshold.
     e. Set the number of neurons in the next layer based on the determined principal component count.
     f. Add the layer to the model.
```

Usage:

```python
from deepfastmlu.model_search import PCCDNAS
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the PCCDNAS object
pccdnas = PCCDNAS()

# Initialize data for the model search
pccdnas.data_init(X_train=X_train, y_train=y_train, validation=(X_val, y_val),
                  normalize=True, unit=True)

# Initialize model search hyperparameters
pccdnas.initialize_model_search(
    epochs=10,                        # Number of training epochs
    layers=3,                         # Number of layers in the neural network
    activation='relu',                # Activation function for the layers
    pca_variance=[0.95, 0.84, 0.63],  # Desired explained variance for PCA for each layer
    loss='binary_crossentropy',       # Loss function for the model
    optimizer='adam',                 # Optimizer for the model
    metrics=['accuracy'],             # List of metrics to be evaluated during training
    output_neurons=1,                 # Number of neurons in the output layer
    out_activation='sigmoid',         # Activation function for the output layer
    stop_criteria='val_loss',         # Criteria for early stopping
    es_mode='min',                    # Mode for early stopping ('min' minimizes the stop criteria)
    dropout=0.2,                      # Dropout rate for dropout layers
    regularize=('l2', 0.01),          # Regularization type ('l2') and value (0.01)
    batch_size=32,                    # Batch size for training
    kernel_initializer='he_normal',   # Kernel initializer for the dense layers
    batch_norm=True,                  # Whether to include batch normalization layers
    es_patience=5,                    # Number of epochs with no improvement for early stopping
    verbose=1,                        # Verbosity mode (1 = progress bar)
    learn_rate=0.001                  # Learning rate for the optimizer
)

# Build the model
model, num_neurons = pccdnas.build()
print("Number of neurons in each layer:", num_neurons)
```

Feature Selection

Adaptive Variance Threshold (AVT)

Adaptive Variance Threshold is a feature selector that dynamically determines a variance threshold from a given percentile of the feature variances. Features with a variance below this threshold are dropped. Traditional (non-zero) variance-based feature selection relies on a manually chosen, dataset-dependent threshold, which does not transfer well across datasets.
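Conceptually, the adaptive threshold is just a percentile of the per-feature variances. The sketch below illustrates that idea with plain NumPy; it is a simplification for exposition, and the actual class follows the scikit-learn fit/transform API shown in the usage example:

```python
import numpy as np

def adaptive_variance_mask(X, percentile=1.5):
    """Keep features whose variance exceeds the given percentile
    of all feature variances (computed on training data only)."""
    variances = X.var(axis=0)
    threshold = np.percentile(variances, percentile)
    return variances > threshold

# Example: X_reduced = X[:, adaptive_variance_mask(X, percentile=1.5)]
```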

Usage:

```python
from sklearn.model_selection import train_test_split
from deepfastmlu.feature_select import AdaptiveVarianceThreshold

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaptiveVarianceThreshold
avt = AdaptiveVarianceThreshold(percentile=1.5, verbose=True)

# Fit AVT to the training data
avt.fit(X_train)

# Transform both training and validation data
X_train_new = avt.transform(X_train)
X_val_new = avt.transform(X_val)
```

Rank Aggregated Feature Selection

RankAggregatedFS is a feature selector that aggregates the rankings of features from multiple feature selection methods. It combines the scores or rankings of features from different methods to provide a unified ranking of features. This approach can be useful when there's uncertainty about which feature selection method to use, as it combines the strengths of multiple methods.
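A common aggregation scheme is to average each feature's rank across methods and keep the k best overall. The sketch below shows that idea in plain NumPy; it is a hypothetical simplification, and the library's exact aggregation may differ:

```python
import numpy as np

def aggregate_ranks(score_lists, k):
    """Average each feature's rank across several scorers and
    return the indices of the k best features overall."""
    ranks = []
    for scores in score_lists:
        order = np.argsort(-np.asarray(scores))  # higher score = better
        rank = np.empty_like(order)
        rank[order] = np.arange(len(order))      # rank 0 is best
        ranks.append(rank)
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:k]

# Example: selected = aggregate_ranks([mi_scores, f_scores], k=10)
```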

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import StandardScaler
from deepfastmlu.feature_select import RankAggregatedFS

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
variance_selector = VarianceThreshold(threshold=0.0)
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
f_classif_selector = SelectKBest(score_func=f_classif, k=10)

# Initialize RankAggregatedFS with multiple methods (excluding VarianceThreshold)
rank_aggregated_fs = RankAggregatedFS(methods=[mi_selector, f_classif_selector], k=10)

pipeline = Pipeline([
    ('scaler', StandardScaler()),               # Normalize the data
    ('variance_threshold', variance_selector),  # Apply VarianceThreshold
    ('rank_aggregated_fs', rank_aggregated_fs)  # Apply RankAggregatedFS
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Chained Feature Selection

ChainedFS is a feature selector that sequentially applies a list of feature selection methods. This class allows for the chaining of multiple feature selection methods, where the output of one method becomes the input for the next. This can be particularly useful when one wants to combine the strengths of different feature selection techniques or when a sequence of operations is required to refine the feature set.
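The chaining logic amounts to fitting each selector on the output of the previous one, roughly as in this simplified sketch (not the library's actual implementation):

```python
def chained_fit_transform(selectors, X, y=None):
    """Apply each feature selector in sequence, feeding the reduced
    feature matrix from one selector into the next."""
    for selector in selectors:
        X = selector.fit(X, y).transform(X)
    return X
```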

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from deepfastmlu.feature_select import ChainedFS

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
variance_selector = VarianceThreshold(threshold=0.0)
k_best_selector = SelectKBest(score_func=mutual_info_classif, k=10)

# Initialize ChainedFS and create a pipeline
chained_fs = ChainedFS([variance_selector, k_best_selector])
pipeline = Pipeline([('feature_selection', chained_fs)])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Mixing Feature Selection Approaches

In this example, we mix the feature selection methods shown previously.

Usage:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from sklearn.preprocessing import StandardScaler
from deepfastmlu.feature_select import RankAggregatedFS, AdaptiveVarianceThreshold

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create feature selection methods
adaptive_variance_selector = AdaptiveVarianceThreshold(percentile=1.5)
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
f_classif_selector = SelectKBest(score_func=f_classif, k=10)

# Initialize RankAggregatedFS with multiple methods
rank_aggregated_fs = RankAggregatedFS(methods=[mi_selector, f_classif_selector], k=10)

pipeline = Pipeline([
    ('scaler', StandardScaler()),                                 # Normalize the data
    ('adaptive_variance_threshold', adaptive_variance_selector),  # Apply AdaptiveVarianceThreshold
    ('rank_aggregated_fs', rank_aggregated_fs)                    # Apply RankAggregatedFS
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Transform the training and test data using the pipeline
X_train_new = pipeline.transform(X_train)
X_test_new = pipeline.transform(X_test)
```

Extra Tools

Data Management

Data Splitter

A class to split any folder-based dataset into partitions (train, val, and one or more test sets). The splits are stratified.

```python
from deepfastmlu.extra.data_helpers import DatasetSplitter

# Define the paths to the original dataset and the destination directory for the split datasets
data_dir = 'path/to/original/dataset'
destination_dir = 'path/to/destination/directory'

# Instantiate the DatasetSplitter class with the desired train, validation, and test set ratios
splitter = DatasetSplitter(data_dir, destination_dir, train_ratio=0.7, val_ratio=0.10,
                           test_ratio=0.10, test_ratio_2=0.10, seed=42)

# Split the dataset into train, validation, and test sets
splitter.run()
```
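Stratified here means the split ratios are applied within each class folder, so every partition keeps the original class proportions. A rough sketch of that logic, assuming a `data_dir/<class>/<file>` layout (standard library only, not the library's actual code):

```python
import os
import random

def stratified_file_split(data_dir, train_ratio=0.7, val_ratio=0.1, seed=42):
    """Return {class_name: (train_files, val_files, test_files)},
    splitting within each class folder to preserve class balance."""
    rng = random.Random(seed)
    splits = {}
    for cls in sorted(os.listdir(data_dir)):
        files = sorted(os.listdir(os.path.join(data_dir, cls)))
        rng.shuffle(files)
        n_train = int(len(files) * train_ratio)
        n_val = int(len(files) * val_ratio)
        splits[cls] = (files[:n_train],
                       files[n_train:n_train + n_val],
                       files[n_train + n_val:])
    return splits
```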

Data Sub Sampler (miniaturize)

A class to sub-sample (miniaturize) any folder-based dataset given a ratio:

```python
from deepfastmlu.extra.data_helpers import DataSubSampler

# Define the paths to the original dataset and the destination directory for the sub-sampled dataset
data_dir = 'path/to/original/dataset'
subsampled_destination_dir = 'path/to/subsampled/dataset'

# Instantiate the DataSubSampler class with the desired fraction of files to sample
subsampler = DataSubSampler(data_dir, subsampled_destination_dir, fraction=0.5, seed=42)

# Create a smaller dataset by randomly sampling a fraction (here, 50%) of files from the original dataset
subsampler.create_miniature_dataset()
```

Visualizing Results

Plot Validation Curves

Leverage the `plot_history_curves` function to visualize training and validation metrics across epochs. This function not only displays the evolution of your model's performance but also highlights the minimum loss and maximum metric values to make insights clearer.

```python
from deepfastmlu.extra.plot_helpers import plot_history_curves

# Train the model
history = model.fit(X_train, y_train_one_hot, validation_data=(X_val, y_val_one_hot),
                    epochs=25, batch_size=32)

# Visualize the training history
plot_history_curves(history, show_min_max_plot=True, user_metric='accuracy')
```

Example result: (training and validation curve plot in the original README)

Plot Generator Confusion Matrix

Utilize the `plot_confusion_matrix` function to effortlessly generate a confusion matrix for your model's predictions. Designed specifically for Keras image generators, it automatically identifies class names, offering a straightforward way to gauge classification performance.

```python
from deepfastmlu.extra.plot_helpers import plot_confusion_matrix

# Create the confusion matrix for validation data:
#   model: A trained Keras model.
#   val_generator: A Keras ImageDataGenerator used for validation.
#   "Validation Data": Name of the generator, used in the plot title.
#   "binary": Type of target labels ('binary' or 'categorical').
plot_confusion_matrix(model, val_generator, "Validation Data", "binary")
```

Owner

  • Name: Fabi Prezja
  • Login: fabprezja
  • Kind: user
  • Location: Jyväskylä, Finland
  • Company: University of Jyväskylä

Active in medical deep learning; else, guitars, nonfiction reading and music technology.

Citation (CITATION.bib)

@misc{fabprezja_2023_dfmlu,
  author = {Fabi Prezja},
  title = {Deep Fast Machine Learning Utils},
  month = sep,
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/fabprezja/deep-fast-machine-learning-utils}},
  doi = {10.5281/zenodo.8374468},
  url = {https://doi.org/10.5281/zenodo.8374468}
}

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 11
  • Total Committers: 2
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.091
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Fabi Prezja (8****a): 10 commits
  • Fabi Prezja (f****a@f****i): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 12 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: deepfastmlu

Machine learning utilities to help speed up your prototyping process.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 12 last month
Rankings
  • Dependent packages count: 7.4%
  • Average: 38.1%
  • Dependent repos count: 68.8%
Maintainers (1)
Last synced: 6 months ago