cleaned-swansf-dataset

The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions.

https://github.com/samresume/cleaned-swansf-dataset

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Keywords

benchmark-dataset dataset multivariate-timeseries preprocessing solar-flare-prediction swansf time-series-analysis time-series-classification timegan
Last synced: 4 months ago

Repository


Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
benchmark-dataset dataset multivariate-timeseries preprocessing solar-flare-prediction swansf time-series-analysis time-series-classification timegan
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Cleaned SWAN-SF Dataset

Introduction

The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions. This version benefits from our FPCKNN imputation technique, the elimination of Class C samples to address class overlap, and the use of TimeGAN, Tomek Links, and Random Under Sampling as over- and under-sampling strategies. With LSBZM normalization applied, our optimized dataset empowers researchers to develop more precise classifiers by focusing on analysis rather than preprocessing, with the aim of significantly improving the True Skill Statistic (TSS) score.
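For context, the TSS rewards high flare recall while penalizing false alarms. Below is a minimal, illustrative sketch of how TSS can be computed for binary predictions; the function name and implementation are ours, not part of this repository.

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """TSS = TP / (TP + FN) - FP / (FP + TN), i.e. recall minus false-alarm rate."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    return tp / (tp + fn) - fp / (fp + tn)
```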

(Figure: SeriesGAN architecture)

Original Dataset

The unpreprocessed version of the SWAN-SF dataset can be accessed on the Harvard Dataverse:

  • SWAN-SF Dataset on Harvard Dataverse

For more detailed information about the SWAN-SF dataset, please refer to the following paper:

  • Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters

How to Use

Please download the dataset through the Google Drive link provided in the download.txt file. The training partitions go through every phase of our data preprocessing pipeline, including the sampling techniques. The test partitions, by contrast, only undergo imputation and normalization; no sampling techniques are applied to them.

Preprocessing Pipeline

For each partition, data and labels are kept separate to maintain clarity and organization.

Setup

Ensure you have the numpy package installed in your environment (pickle is part of the Python standard library). Use the Python code below to load the data into arrays:

```python
import pickle
import numpy as np

# Training Partitions
data_dir = "/path/to/your/PreprocessedSWANSF/train/"
X_train = []
y_train = []
num_partitions = 5

for i in range(num_partitions):
    with open(f"{data_dir}Partition{i+1}_RUS-Tomek-TimeGAN_LSBZM-Norm_WithoutC_FPCKNN-impute.pkl", 'rb') as f:
        X_train.append(pickle.load(f))
    with open(f"{data_dir}Partition{i+1}_Labels_RUS-Tomek-TimeGAN_LSBZM-Norm_WithoutC_FPCKNN-impute.pkl", 'rb') as f:
        y_train.append(pickle.load(f))
```

```python
import pickle
import numpy as np

# Test Partitions
data_dir = "/path/to/your/PreprocessedSWANSF/test/"
X_test = []
y_test = []
num_partitions = 5

for i in range(num_partitions):
    with open(f"{data_dir}Partition{i+1}_LSBZM-Norm_FPCKNN-impute.pkl", 'rb') as f:
        X_test.append(pickle.load(f))
    with open(f"{data_dir}Partition{i+1}_Labels_LSBZM-Norm_FPCKNN-impute.pkl", 'rb') as f:
        y_test.append(pickle.load(f))
```
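With the training and test partitions in memory, a minimal end-to-end sketch might flatten each multivariate time series, train a simple classifier on one partition, and evaluate on another. This is only an illustration of how the arrays can be consumed: it assumes scikit-learn is installed, and the choice of partitions and classifier is arbitrary rather than anything prescribed by this dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Flatten each sample's (num_timestamps, num_attributes) matrix into one feature vector.
X_tr = X_train[0].reshape(len(X_train[0]), -1)   # train on partition 1
X_te = X_test[1].reshape(len(X_test[1]), -1)     # evaluate on test partition 2

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_train[0])
y_pred = clf.predict(X_te)

# TSS from the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test[1], y_pred).ravel()
print("TSS:", tp / (tp + fn) - fp / (fp + tn))
```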

Data Structure

Each partition is stored as a 3D array in a .pkl file, with the shape (num_samples, num_timestamps, num_attributes).
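As a quick sanity check after loading, you can confirm this shape for every partition; the short sketch below assumes X_train and y_train were built as in the loading code above.

```python
for i, (X, y) in enumerate(zip(X_train, y_train), start=1):
    num_samples, num_timestamps, num_attributes = X.shape
    print(f"Partition {i}: {num_samples} samples x {num_timestamps} timestamps x "
          f"{num_attributes} attributes, {len(y)} labels")
```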

Attributes Order

The order of the attributes is as follows: ['R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ', 'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH', 'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX']
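To select a parameter by name rather than by position, a small lookup built from this list avoids hard-coding indices. This is an illustrative snippet; the ATTRIBUTES name is ours, not part of the dataset.

```python
ATTRIBUTES = ['R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH',
              'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ',
              'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH',
              'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX']

# Column index of TOTUSJH within the last axis of each partition array
totusjh_idx = ATTRIBUTES.index('TOTUSJH')            # -> 1
totusjh_first_sample = X_train[0][0, :, totusjh_idx]  # its time series for sample 0
```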

Data Interpretation Examples

  • X_train[0][0,:,0] corresponds to the R_VALUE attribute of the first sample of partition 1. This gives you the time series data for the R_VALUE attribute for the first sample.
  • X_train[3][20,:,1] corresponds to the TOTUSJH attribute of the twenty-first sample of partition 4. Here, you're accessing the time series data for the TOTUSJH attribute for a specific sample in partition 4.

The y_train files hold the labels for the samples, organized in a 1D vector:

  • y_train[0][2] corresponds to the label of the third Multivariate Time Series (MVTS) sample of partition 1, which can be 0 or 1, indicating the binary classification target.
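For example, the snippet below pulls out one attribute's series and checks the label balance of a partition, assuming the arrays were loaded as shown earlier.

```python
import numpy as np

# R_VALUE time series of the first sample in partition 1 (attribute index 0)
r_value_series = X_train[0][0, :, 0]

# Distribution of the binary labels (0 or 1) in partition 1
labels, counts = np.unique(y_train[0], return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```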

How To Cite

Our paper detailing this preprocessed dataset has been published. We kindly ask you to provide a citation to acknowledge our work. Thank you for your support!

DOI: 10.3847/1538-4365/ad7c4a

@article{EskandariNasab_2024,
  doi = {10.3847/1538-4365/ad7c4a},
  url = {https://dx.doi.org/10.3847/1538-4365/ad7c4a},
  year = {2024},
  month = {oct},
  publisher = {The American Astronomical Society},
  volume = {275},
  number = {1},
  pages = {6},
  author = {MohammadReza EskandariNasab and Shah Muhammad Hamdi and Soukaina Filali Boubrahimi},
  title = {Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters},
  journal = {The Astrophysical Journal Supplement Series}
}

Owner

  • Name: MohammadReza (Sam) EskandariNasab
  • Login: samresume
  • Kind: user
  • Location: Utah, United States
  • Company: Utah State University

Data Scientist and Machine Learning Engineer. Graduate Computer Science Student at USU

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Cleaned SWANSF Dataset
message: >-
  The paper related to this dataset is presently under
  review.
type: dataset
authors:
  - given-names: MohammadReza
    family-names: EskandariNasab
    email: reza.eskandarinasab@usu.edu
    affiliation: Utah State University
    orcid: 'https://orcid.org/0009-0004-0697-3716'
  - given-names: Shah Muhammad
    family-names: Hamdi
    email: s.hamdi@usu.edu
    affiliation: "Utah State University\t"
  - given-names: Soukaina
    family-names: Filali Boubrahimi
    affiliation: Utah State University
    email: soukaina.boubrahimi@usu.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.11566472
repository-code: >-
  https://github.com/samresume/Cleaned-SWANSF-Dataset
repository: 'https://doi.org/10.5281/zenodo.11566472'
license: MIT
version: v1.0.0
date-released: '2024-06-10'

GitHub Events

Total
  • Watch event: 1
  • Push event: 6
Last Year
  • Watch event: 1
  • Push event: 6