https://github.com/cosbidev/naim

Official implementation for the paper ``Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets´´

Keywords

attention-mechanism missing-data tabular-data transformers

Last synced: 5 months ago · JSON representation

Repository

Official implementation for the paper ``Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets´´

Basic Info

Host: GitHub
Owner: cosbidev
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 225 KB

Statistics

Stars: 6
Watchers: 0
Forks: 3
Open Issues: 0
Releases: 0

Topics

attention-mechanism missing-data tabular-data transformers

Created over 1 year ago · Last pushed 11 months ago

Metadata Files

Readme License

NAIM

Installation
Usage
1. Reproducing the experiments
2. Train & Test on your dataset
Citation

This document describes the implementation of ``Not Another Imputation Method´´ (NAIM) in Pytorch. NAIM is an architecture specifically designed for the analysis of tabular data, with a focus on addressing missing values in tabular data without the need for any imputation strategy.

By leveraging lookup tables, tailored for each type of feature, NAIM assigns a non-trainable vector to missing values, thus obtaining an embedded representation for every missing feature scenario.

Following this, through our innovative self-attention mechanism, all contributions from missing values in the attention matrix are ignored.

Ultimately, this approach to handling missing values paves the way for a novel method of data augmentation, inspired by methods used in classical image data augmentation. At every epoch, samples are randomly masked (where possible) to prevent co-adaptations among features and to enhance the model's generalization capability.

Installation

We used Python 3.9 for the development of the code. To install the required packages, it is sufficient to run the following command: bash pip install -r requirements.txt and install a version of Pytorch compatible with the device available. We used torch==1.13.0.

Usage

The execution of the code heavily relies on Facebook's Hydra library. Specifically, through a multitude of configuration files that define every aspect of the experiment, it is possible to conduct the desired experiment without modifying the code. These configuration files have a hierarchical structure through which they are composed into a single configuration file that serves as input to the program. More specifically, the main.py file will call the config.yaml file, from which the configuration files tree begins.

Reproducing the experiments

All the dataset used in the paper are available on the UCI Datasets repository and they are listed below:

| Dataset | UCI Link | Preprocessing | |----------------|------------|---------------| | ADULT | [Link](https://archive.ics.uci.edu/ml/datasets/adult)| Sets joined | | BankMarketing | [Link](https://archive.ics.uci.edu/ml/datasets/bank+marketing)| - | | OnlineShoppers | [Link](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)| - | | SeismicBumps | [Link](https://archive.ics.uci.edu/ml/datasets/seismic-bumps)| - | | Spambase | [Link](https://archive.ics.uci.edu/ml/datasets/spambase)| - |

To simplify the reproduction of the experiments done, it is possible to use the datasets_download.py script to download the files. Therefore, thanks also to the multirun functionality of Hydra, to reproduce the NAIM experiments carried out in the paper it is sufficient to execute the following lines of code:

```python python datasets_download.py

python main.py -m experiment=classificationwithmissing_generation experiment/databases@db=adult,bankmarketing,onlineshoppers,seismicbumps,spambase ```

These lines of code, assuming the initial configuration files have not been modified, enable the reproduction of the experiments presented in the paper using the NAIM model. These experiments generate different percentages of missing values in the training and testing sets, considering also all possible combinations of missing percentages in the different sets. Specifically, the percentages 0%, 5%, 10%, 25%, 50%, and 75% have been used as indicated by missing_percentages=[0.0, 0.05, 0.1, 0.25, 0.5, 0.75] in the classification_with_missing_generation.yaml configuration file.

For each experiment, this code produces a folder named <dataset-name>_<model-name>_<imputation-strategy>_with_missing_generation which contains everything generated by the code. In particular, the following folders and files are present:

cross_validation: this folder contains a folder for each training fold, indicated as a composition of test and validation folds <test-fold>_<val-fold>, reporting the information on the train, validation and test sets in 3 separate csv files.
preprocessing: this folder contains all the preprocessing information divided into 3 main folders:
1. numerical_preprocessing: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of numerical features.
2. categorical_preprocessing: in this folder, for each percentage of missing values considered, there is a csv file for each fold reporting the information on the preprocessing params of categorical features.
3. imputer: in this folder, for each percentage of missing values considered, there are csv files for each fold with information on the imputation strategy applied to handle missing values and a pkl file containing the imputer fitted on the training data of the fold.
saved_models: this folder contains, for each percentage of missing values considered, a folder with the model's name that includes, for each fold, a csv file with the model's parameters and a pkl or pth file containing the trained model.
predictions: this folder contains, for each percentage of missing values considered, a folder that reports the predictions obtained from the training and validation sets and separately those of the test set. More specifically, there are two files for each fold reporting the predictions for the train and validation sets, called train and val respectively. Moreover, there are additional folders, one for each percentage of missing values considered, that report for each fold the predictions made on the test set (test).
results: this folder reports, for each percentage of missing values considered, the performance on the train, validation, and test sets separately. Specifically, for the training and validation sets and then, for each percentage of missing values considered, also for the test set, two folders named balanced and unbalanced containing the performance of the various sets are reported. These are presented in 3 separate files with increasing levels of averaging:
1. all_test_performance.csv: this file presents the set's performance evaluated for each fold and each class.
2. classes_average_performance.csv: this file, computing the average performance of the folds, contains the performance for each class.
3. set_average_performance.csv: this file, calculating the average performance of the folds and the classes, contains the average performance of the set.
config.yaml: this file contains the configuration file used as input for the experiment.
<experiment-name>.log: this is the log file of the experiment.

Train & Test on your dataset

Experiment declaration

As mentioned above, the experiment configuration file is created at the time of code execution starting from the config.yaml file, in which the configuration file for the experiment to be performed is declared, along with the paths from which to load data (data_path) and where to save the outputs (output_path).

```yaml datapath: ./datasets # Path where the datasets are stored outputpath: ./outputs # Path where the outputs will be saved

defaults: # DO NOT CHANGE - self # DO NOT CHANGE - experiment: classification # Experiment to perform, classification or classificationwithmissing_generation ```

The possible options for the experiment parameter are classification and classification_with_missing_generation.

Dataset preparation

To prepare a dataset for the analysis with this code, it is sufficient to prepare a configuration file, specific for the dataset, similar to those already provided in the folder ./confs/experiment/databases. The path to the data must be specified in the path parameter in the dataset's configuration file. Thanks to the interpolation functionality of Hydra the path can be composed using the ${data_path} interpolation key, which refers to the data_path parameter of the config.yaml file. Once the dataset configuration file is prepared, it is important that it is placed in the same folder ./confs/experiment/databases and that in the file classification.yaml the name of the created configuration file is reported at the databases@db key. In particular, it is important that the dataset configuration file is structured as follows:

```yaml target: CMCutils.datasets.ClassificationDataset # DO NOT CHANGE _convert: all # DO NOT CHANGE

name: # Name of the dataset db_type: tabular # DO NOT CHANGE

classes: ["", ..., ""] # List of the classes label_type: multiclass # multiclass or binary (SPIEGARE BINARY)

task: classification # DO NOT CHANGE

path: ${datapath}/ # Relative path to the file, KEEP ${datapath}/... to compose the path condisering the data_path parameter defined in the 'config.yaml' file.

columns: # Dictionary containing features names as keys and their types as values # DO NOT REMOVE : id # Name of the ID column if present, DO NOT CHANGE THE VALUE, NAME CORRECTLY THE TARGET VARIABLE : # int, float or category : # int, float or category : # int, float or category # Other features to be inserted : target # Name of the target column containing the classes, DO NOT CHANGE THE VALUE, NAME CORRECTLY THE TARGET VARIABLE

here any pd.readcsv or pd.readexcel input parameter to correctly load the data can be added (e.g., navalues, header, indexcol)

pandasloadkwargs: navalues: [ "?" ] header: 0 indexcol: 0

datasetclass: # DO NOT CHANGE _target: CMCutils.datasets.SupervisedTabularDatasetTorch # DO NOT CHANGE _convert: all # DO NOT CHANGE ```

In the columns definition, id and target feature types can be used to define the ID and classes columns respectively.

Experiment configuration

Here we present the classification.yaml configuration file, which defines the specifics for conducting a classification pipeline using all the available data. It is also available the classification_with_missing_generation.yaml configuration file, which defines the specifics for the experiment of the paper, where predefined percentages of missing values are randomly generated in the train and test set. To execute the experiment of the paper, you need to set the experiment parameter in the config.yaml file to classification_with_missing_generation. You can also customize the missing percentages values to be tested by modifying the missing_percentages parameter in the classification_with_missing_generation.yaml file.

The classification.yaml file is the primary file where all the parameters of the experiment are defined. It begins with some general information, such as the name of the experiment, the pipeline to be executed, the seed for randomness control, training verbosity, and the percentages of missing values to be tested.

```yaml experimentname: ${db.name}${model.name}${preprocessing.imputer.method}withmissinggeneration # DO NOT CHANGE pipeline: missing # DO NOT CHANGE

seed: 42 # Seed for randomness control verbose: 1 # 0 or 1, verbosity of the training

continue_experiment: False # True or False, if the experiment should be continued from where it was interrupted ```

NOTE: In case an experiment should be interrupted, voluntarily or not, it is possible to resume it from where it was interrupted by setting the continue_experiment parameter to True.

Then, all other necessary configuration files for the different parts of the experiment are declared. It is possible to define: - the dataset to analyse (databases@db); - the cross-validation strategies to use for the test (cross_validation@test_cv) and validation (cross_validation@val_cv) sets separately; - the preprocessing steps to be performed for the numerical (preprocessing/numerical), categorical (preprocessing/categorical) and missing features (preprocessing/imputer) ; - the model to be used (model). - the metric to be used in the early stopping process (metric@train.set_metrics.<metric-name>) and in the performance evaluation (metric@performance_metrics.<metric-name>).

```yaml defaults: # DO NOT CHANGE - self # DO NOT CHANGE - paths@: default # DO NOT CHANGE - paths: experiment_paths # DO NOT CHANGE

databases@db: # Name of the configuration file of the dataset
crossvalidation@testcv: stratifiedkfold # Cross-validation strategy for the test set
crossvalidation@valcv: holdout # Cross-validation strategy for the validation set
preprocessing/numerical: normalize # normalize or standardize
preprocessing/categorical: categoricalencode # categoricalencode or onehotencode
preprocessing/imputer: noimputation # simple or knn or iterative or noimputation
modeltypeparams@dlparams: dlparams # DO NOT CHANGE
modeltypeparams@mlparams: mlparams # DO NOT CHANGE
model: naim # Name of the model to use
modeltypeparams@train.dlparams: dlparams # DO NOT CHANGE
initializer@train.initializer: xavier_uniform # DO NOT CHANGE
loss@train.loss.CE: cross_entropy # DO NOT CHANGE
regularizer@train.regularizer.l1: l1 # DO NOT CHANGE
regularizer@train.regularizer.l2: l2 # DO NOT CHANGE
optimizer@train.optimizer: adam # DO NOT CHANGE
trainutils@train.manager: trainmanager # DO NOT CHANGE
metric@train.set_metrics.auc: auc # Metric to use for the early stopping
metric@performance_metrics.auc: auc # Metric to use for the performance evaluation
metric@performance_metrics.accuracy: accuracy # Metric to use for the performance evaluation
metric@performance_metrics.recall: recall # Metric to use for the performance evaluation
metric@performance_metrics.precision: precision # Metric to use for the performance evaluation
metric@performancemetrics.f1score: f1_score # Metric to use for the performance evaluation ```

The possible options for these parts are the files contained in the folders listed in the table below.

| Params | Keys | Options | |--------------------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Dataset | databases@db | adult, bankmarketing, onlineshoppers, seismicbumps, spambase | | Cross Validation | cross_validation@test_cv cross_validation@val_cv | bootstrap, holdout, kfold, leaveoneout, predefined, stratifiedkfold | | Numerical Preprocessing | preprocessing/numerical | normalize, standardize | | Categorical Preprocessing | preprocessing/categorical | categorical_encode, onehotencode | | Imputation Strategy | preprocessing/imputer | simple, knn, iterative, no_imputation | | Model | model | naim, adaboost, dt, fttransformer, histgradientboostingtree, mlp_sklearn, rf, svm, tabnet, tabtransformer, xgboost | | Metrics | metric@train.set_metrics.<metric-name> metric@performance_metrics.<metric-name> | auc, accuracy, recall, precision, f1_score |

To modify some of the hyperparameters of the models, it is possible to modify the ml_params and dl_params files. For the ML models it is possible do define the number of estimators (n_estimators), whereas for the DL models it is possible to define the number of epochs (max_epochs), the warm-up number of epochs (min_epochs), the batchsize (`batchsize), the early stopping's (earlystoppingpatience) and the scheduler's (schedulerpatience) patience and their tolerance for performance improvement (performancetolerance), the device to use for training (device). It is also possible to define the learning rates to be tested (learningrates), but to be compatible with some of the competitor available in the models list, it is necessary to define also the initial learning rate (initlearningrate), and the final learning rate (endlearning_rate`).

./confs/experiment/modeltypeparams/ml_params.yaml

yaml n_estimators: 100 # Number of estimators for the ML models

./confs/experiment/modeltypeparams/dl_params.yaml

```yaml max_epochs: 1500 # Maximum number of epochs

min_epochs: 50 # Warm-up number of epochs

batch_size: 32 # Batch size

initlearningrate: 1e-3 # Initial learning rate

endlearningrate: 1e-8 # Final learning rate

learning_rates: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] # Learning rates for the scheduler

earlystoppingpatience: 50 # Patience for the early stopping

scheduler_patience: 25 # Patience for the scheduler

performance_tolerance: 1e-3 # Tolerance for the performance improvement

verbose: ${verbose} # DO NOT CHANGE

verbose_batch: 0 # 0 or 1 or ${verbose}, verbosity of the training for the batch

device: cuda # cpu or cuda, device to use for training ```

For a single experiment, this code produces a folder named <dataset-name>_<model-name>_<imputation-strategy> which contains everything generated by the code. In particular, the following folders and files are present:

cross_validation: this folder contains a folder for each training fold, indicated as a composition of test and validation folds <test-fold>_<val-fold>, reporting the information on the train, validation and test sets in 3 separate csv files.
preprocessing: this folder contains all the preprocessing information divided into 3 main folders:
1. numerical_preprocessing: in this folder there is a csv file for each fold reporting the information about the preprocessing params of numerical features.
2. categorical_preprocessing: in this folder there is a csv file for each fold reporting the information about the preprocessing params of categorical features.
3. imputer: in this folder there are a csv file for each fold reporting the information on the imputation strategy applied to handle missing values and a pkl file containing the imputer fitted on the training data.
saved_models: this folder contains a folder, named as the model, that for each fold includes a csv file with the model's parameters and a pkl or pth file containing the trained model.
predictions: this folder contains a folder that reports the predictions obtained from the train and validation and test sets separately. More specifically, for each fold there are 3 files reporting the predictions for the train, validation and test sets, called <test-fold>_<val-fold>_train, <test-fold>_<val-fold>_val and <test-fold>_<val-fold>_test respectively.
results: this folder reports the performance on the train, validation, and test sets separately. Specifically, for each set two folders, named balanced and unbalanced, containing the performance of the various sets are reported. The performance are presented in 3 separate files with increasing levels of averaging:
1. all_test_performance.csv: this file presents the set's performance evaluated for each fold and each class.
2. classes_average_performance.csv: this file, computing the average performance of the folds, contains the performance for each class.
3. set_average_performance.csv: this file, calculating the average performance of the folds and the classes, contains the average performance of the respective set.
config.yaml: this file contains the configuration file used as input for the experiment.
<experiment-name>.log: this is the log file of the experiment.

Contact

For any questions, please contact camillomaria.caruso@unicampus.it and valerio.guarrasi@unicampus.it.

Citation

bibtex @misc{caruso2024imputationmethodtransformerbasedmodel, title={Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets}, author={Camillo Maria Caruso and Paolo Soda and Valerio Guarrasi}, year={2024}, eprint={2407.11540}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2407.11540}, }

Owner

Name: CoSBi.dev
Login: cosbidev
Kind: organization
Location: Università Campus Bio-Medico di Roma

Website: https://www.unicampus.it/ricerca/unita-di-ricerca/sistemi-di-elaborazione-e-bioinformatica
Repositories: 3
Profile: https://github.com/cosbidev

GitHub Events

Total

Watch event: 6
Push event: 3
Fork event: 1

Last Year

Watch event: 6
Push event: 3
Fork event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

https://github.com/cosbidev/naim

Science Score: 23.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

NAIM

Installation

Usage

Reproducing the experiments

Train & Test on your dataset

Experiment declaration

Dataset preparation

here any pd.readcsv or pd.readexcel input parameter to correctly load the data can be added (e.g., navalues, header, indexcol)

Experiment configuration

./confs/experiment/modeltypeparams/ml_params.yaml

./confs/experiment/modeltypeparams/dl_params.yaml

Contact

Citation

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels