swansf-datapreprocessing-sampling-notebooks

These notebooks provide a comprehensive workflow, from start to finish, for processing and analyzing the SWAN-SF dataset. They include detailed steps for reading the dataset files, performing full preprocessing, and executing classification.

https://github.com/samresume/swansf-datapreprocessing-sampling-notebooks

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.4%) to scientific vocabulary

Keywords

data-preprocessing deep-learning gru imputation lstm machine-learning multivariate-timeseries normalization pandas python sampling smote solar-flare-prediction time-series-analysis time-series-classification timegan

Repository


Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Getting Started with the SWAN-SF Data Analysis

Welcome to our GitHub repository!

These notebooks provide a comprehensive workflow, from start to finish, for processing and analyzing the SWAN-SF dataset. They include detailed steps for reading the dataset files, performing full preprocessing, and generating a .pkl file for the processed data. Several missing value imputation techniques are implemented, such as Mean Imputation, Next-value Imputation, and our novel method, Fast Pearson Correlation-based K-nearest Neighbors (FPCKNN) Imputation.
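To make the two simpler imputation strategies concrete, here is a minimal sketch, not the notebooks' actual code. It assumes each sample is a NumPy array of shape (timesteps, features) with missing entries stored as NaN; the function names `mean_impute` and `next_value_impute` are hypothetical. FPCKNN is not reproduced here.

```python
import numpy as np

def mean_impute(sample):
    """Replace NaNs in each feature column with that column's mean over time."""
    filled = sample.astype(float).copy()
    col_means = np.nanmean(filled, axis=0)      # per-feature mean, ignoring NaNs
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def next_value_impute(sample):
    """Replace each NaN with the next observed value in the same column.

    Trailing NaNs have no "next" value, so this sketch falls back to the
    last observed value for them (an assumption, not taken from the repo).
    """
    filled = sample.astype(float).copy()
    for c in range(filled.shape[1]):
        col = filled[:, c]                      # view into `filled`
        for t in range(len(col) - 2, -1, -1):   # backward pass: copy from t + 1
            if np.isnan(col[t]):
                col[t] = col[t + 1]
        for t in range(1, len(col)):            # forward pass: trailing NaNs only
            if np.isnan(col[t]):
                col[t] = col[t - 1]
    return filled

# Toy sample: 3 timesteps, 2 features, one gap in each feature.
toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
print(mean_impute(toy))
print(next_value_impute(toy))
```

Both helpers return a new array and leave the input untouched, so the two strategies can be compared on the same sample.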

In addition, we address class overlap with the Near Decision Boundary Sample Removal (NDBSR) technique. Various normalization methods are also applied, including Min-Max Scaling, Z-Score Normalization, and our proprietary LSBZM (Log, Square Root, BoxCox, Z-Score, and Min-Max) Normalization technique.
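As an illustration of the two standard normalization methods named above, the sketch below computes per-feature statistics on one partition and applies them to any array. It assumes data shaped (samples, timesteps, features); the helper names and the `eps` guard are hypothetical. LSBZM and NDBSR are not reproduced here.

```python
import numpy as np

def fit_feature_stats(train):
    """Per-feature statistics over all samples and timesteps of `train`."""
    flat = train.reshape(-1, train.shape[-1])   # (samples * timesteps, features)
    return {
        "min": flat.min(axis=0),
        "max": flat.max(axis=0),
        "mean": flat.mean(axis=0),
        "std": flat.std(axis=0),
    }

def min_max_scale(data, stats, eps=1e-12):
    """Map each feature to [0, 1] using the fitted min and max."""
    span = np.maximum(stats["max"] - stats["min"], eps)  # avoid division by zero
    return (data - stats["min"]) / span

def z_score(data, stats, eps=1e-12):
    """Center each feature on the fitted mean and scale by the fitted std."""
    return (data - stats["mean"]) / np.maximum(stats["std"], eps)

# Toy data: 1 sample, 2 timesteps, 2 features.
train = np.array([[[0.0, 2.0], [10.0, 4.0]]])
stats = fit_feature_stats(train)
print(min_max_scale(train, stats))
print(z_score(train, stats))
```

Fitting the statistics on the training partition only and reusing them for the test partition is a common way to avoid leaking test information; whether the notebooks do this is not stated in this README.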

The notebooks further implement multiple over-sampling techniques such as SMOTE, ADASYN, TimeGAN, and Gaussian Noise Injection (GNI), as well as two under-sampling methods: Random Under Sampling and Tomek Links. These preprocessing steps collectively enhance the classification performance for predicting solar flares.
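The following sketch shows the general idea behind two of the simpler techniques, Gaussian Noise Injection and Random Under Sampling, using NumPy only. It is an illustration under assumptions, not the notebooks' implementation: the function names, the `noise_scale` parameter, and the (samples, timesteps, features) layout are hypothetical. SMOTE, ADASYN, and Tomek Links are available in imbalanced-learn and are not reimplemented here.

```python
import numpy as np

def gaussian_noise_oversample(X_min, n_new, noise_scale=0.05, seed=0):
    """Create `n_new` synthetic minority samples by adding Gaussian noise.

    Noise is scaled per feature by `noise_scale` times that feature's
    standard deviation over the minority samples (an assumed choice).
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)           # seeds to perturb
    feature_std = X_min.reshape(-1, X_min.shape[-1]).std(axis=0)
    noise = rng.normal(0.0, 1.0, size=(n_new,) + X_min.shape[1:])
    return X_min[idx] + noise * noise_scale * feature_std

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep `n_keep` randomly chosen majority samples plus all other samples."""
    rng = np.random.default_rng(seed)
    majority = np.flatnonzero(y == majority_label)
    others = np.flatnonzero(y != majority_label)
    kept = rng.choice(majority, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([kept, others]))          # preserve order
    return X[keep], y[keep]

# Toy imbalanced set: 8 majority (label 0) and 2 minority (label 1) samples.
X = np.arange(10 * 4 * 2, dtype=float).reshape(10, 4, 2)
y = np.array([0] * 8 + [1] * 2)
synthetic = gaussian_noise_oversample(X[y == 1], n_new=6)
X_small, y_small = random_undersample(X, y, majority_label=0, n_keep=3)
print(synthetic.shape, X_small.shape, y_small)
```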

The classification models used include SVM, Random Forest, k-NN, Multilayer Perceptron, LSTM, GRU, RNN, and 1D-CNN, all designed to predict solar flares within a 24-hour window.
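One practical difference between these model families is the input shape. Classical models such as SVM, Random Forest, k-NN, and a Multilayer Perceptron typically take one flat vector per sample, while LSTM, GRU, RNN, and 1D-CNN models can consume the (timesteps, features) sequence directly. The sketch below illustrates the flattening step with a tiny NumPy k-NN as a stand-in classifier. The names are hypothetical, integer class labels are assumed, and this is not the notebooks' code.

```python
import numpy as np

def flatten_samples(X):
    """Reshape (samples, timesteps, features) to (samples, timesteps * features)."""
    return X.reshape(len(X), -1)

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance).

    `y_train` must hold non-negative integer labels for np.bincount.
    """
    A = flatten_samples(X_train)
    B = flatten_samples(X_test)
    dists = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated classes, 2 timesteps x 2 features per sample.
X_train = np.concatenate([np.zeros((3, 2, 2)), np.full((3, 2, 2), 10.0)])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.stack([np.full((2, 2), 0.1), np.full((2, 2), 9.9)])
print(knn_predict(X_train, y_train, X_test))
```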

By using these files, researchers can reduce the time required to preprocess the SWAN-SF dataset by months, while achieving high accuracy in solar flare prediction.


Prerequisites

Before you start, make sure you have the following:

  • SWAN-SF Dataset: Download it from Harvard Dataverse.
  • Python Packages: Ensure you have these packages installed: pandas, numpy, matplotlib, seaborn, tensorflow, tqdm, scikit-learn (imported as sklearn), scipy, and imbalanced-learn (imported as imblearn). pickle is part of the Python standard library and needs no installation. The code for timegan is included in the repository, so no additional installation is required for this package.

Setting Up Your Environment

  1. Directory Setup: Modify the following lines in the source code to match your system's directory structure:

    data_dir = "<Your path>/SWANSF/Downloaded_Data/"
    data_dir_save = "<Your path>/SWANSF/code/"

  2. Sequential Execution: Start from Notebook 1 and proceed in order. Each notebook relies on the data prepared in the previous steps.
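Before running Notebook 1, it can help to confirm that both directories are usable. The helper below is a hypothetical convenience, not part of the repository. It checks that the download directory exists and creates the save directory if needed, using the same `data_dir` and `data_dir_save` variables as above.

```python
from pathlib import Path

def prepare_dirs(data_dir, data_dir_save):
    """Validate the input directory and make sure the output directory exists."""
    data_path = Path(data_dir)
    save_path = Path(data_dir_save)
    if not data_path.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {data_path}")
    save_path.mkdir(parents=True, exist_ok=True)  # no error if already present
    return data_path, save_path

# Example usage (replace <Your path> with a real location first):
# data_path, save_path = prepare_dirs(
#     "<Your path>/SWANSF/Downloaded_Data/",
#     "<Your path>/SWANSF/code/",
# )
```

Failing early with a clear message avoids a less obvious error partway through the first notebook.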

Notebooks Overview

  • Notebook 1: Reads SWAN-SF samples and combines them into a single .pkl file (time series samples) and a .csv file (labels for each partition).
  • Notebook 2: Focuses on Missing Value Imputation, utilizing data from Notebook 1.
  • Notebook 3: Centers on Near Decision Boundary Sample Removal.
  • Notebooks 4 & 5: Concentrate on Normalization.
  • Notebook 6: Offers Visualizations of the dataset.
  • Notebooks 7 & 8: Implement Classification using eight classifiers.
  • Notebook 9: Applies Over-sampling techniques.
  • Notebook 10: Combines Over- and Under-sampling techniques.
  • Notebooks 11, 12, & 13: Apply preprocessing techniques post-sampling (Normalization).
  • Notebooks 14, 15, 16, & 17: Implement Classification using eight classifiers after Sampling.
  • Notebook 18: Presents Final Visualizations.
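Because each notebook reads what the previous one wrote, the hand-off between them amounts to saving and reloading a .pkl file of samples and a .csv file of labels. The sketch below shows one way such a round trip could look. The file names, the single `label` column, and the array layout are assumptions for illustration and may differ from what the notebooks actually produce.

```python
import csv
import pickle
from pathlib import Path

import numpy as np

def save_partition(samples, labels, out_dir, name):
    """Write samples to <name>.pkl and labels to <name>_labels.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pkl_path = out_dir / f"{name}.pkl"
    csv_path = out_dir / f"{name}_labels.csv"
    with open(pkl_path, "wb") as f:
        pickle.dump(samples, f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"])                 # header row
        writer.writerows([[label] for label in labels])
    return pkl_path, csv_path

def load_partition(pkl_path, csv_path):
    """Read back the samples and the list of label strings."""
    with open(pkl_path, "rb") as f:
        samples = pickle.load(f)
    with open(csv_path, newline="") as f:
        labels = [row["label"] for row in csv.DictReader(f)]
    return samples, labels

# Example usage with placeholder data (2 samples, 3 timesteps, 2 features):
# X = np.zeros((2, 3, 2))
# paths = save_partition(X, ["a", "b"], "<Your path>/SWANSF/code/", "partition1")
# X_loaded, y_loaded = load_partition(*paths)
```

Only unpickle files you created yourself or obtained from a trusted source, since loading a pickle can execute arbitrary code.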

How To Cite

The paper associated with these notebooks has been published. We kindly ask you to provide a citation to acknowledge our work. Thank you for your support!

DOI: 10.3847/1538-4365/ad7c4a.

@article{EskandariNasab_2024,
  doi = {10.3847/1538-4365/ad7c4a},
  url = {https://dx.doi.org/10.3847/1538-4365/ad7c4a},
  year = {2024},
  month = {oct},
  publisher = {The American Astronomical Society},
  volume = {275},
  number = {1},
  pages = {6},
  author = {MohammadReza EskandariNasab and Shah Muhammad Hamdi and Soukaina Filali Boubrahimi},
  title = {Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters},
  journal = {The Astrophysical Journal Supplement Series}
}


Owner

  • Name: MohammadReza (Sam) EskandariNasab
  • Login: samresume
  • Kind: user
  • Location: Utah, United States
  • Company: Utah State University

Data Scientist and Machine Learning Engineer. Graduate Computer Science Student at USU

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: SWAN-SF Data Preprocessing and Sampling Notebooks
message: >-
  The paper concerning these notebooks is currently under
  review.
type: software
authors:
  - given-names: MohammadReza
    family-names: EskandariNasab
    email: reza.eskandarinasab@usu.edu
    affiliation: Utah State University
    orcid: 'https://orcid.org/0009-0004-0697-3716'
  - given-names: Shah Muhammad
    family-names: Hamdi
    email: s.hamdi@usu.edu
    affiliation: "Utah State University\t"
  - given-names: Soukaina
    family-names: Filali Boubrahimi
    affiliation: Utah State University
    email: soukaina.boubrahimi@usu.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.11564789
repository-code: >-
  https://github.com/samresume/SWANSF-DataPreprocessing-Sampling-Notebooks
repository: 'https://doi.org/10.5281/zenodo.11564789'
license: MIT
version: v1.0.0
date-released: '2024-06-11'

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2