cleaned-swansf-dataset

The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions.

https://github.com/samresume/cleaned-swansf-dataset

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Keywords

benchmark-dataset dataset multivariate-timeseries preprocessing solar-flare-prediction swansf time-series-analysis time-series-classification timegan
Last synced: 4 months ago

Repository


Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
benchmark-dataset dataset multivariate-timeseries preprocessing solar-flare-prediction swansf time-series-analysis time-series-classification timegan
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

Cleaned SWAN-SF Dataset

Introduction

The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions. This version benefits from our FPCKNN imputation technique, the elimination of Class C samples to address class overlap, and the use of TimeGAN, Tomek Links, and Random Under Sampling as over- and under-sampling strategies. With LSBZM normalization applied, our optimized dataset empowers researchers to develop more precise classifiers by focusing on analysis rather than preprocessing, with the aim of significantly improving the True Skill Statistic (TSS) score.
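For context, the TSS rewards high flare recall while penalizing false alarms. Below is a minimal, illustrative sketch of how TSS can be computed for binary predictions; the function name and implementation are ours, not part of this repository.

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """TSS = TP / (TP + FN) - FP / (FP + TN), i.e. recall minus false-alarm rate."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    return tp / (tp + fn) - fp / (fp + tn)
```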

(Figure: SeriesGAN architecture)

Original Dataset

The unpreprocessed version of the SWAN-SF dataset can be accessed on the Harvard Dataverse:

  • SWAN-SF Dataset on Harvard Dataverse

For more detailed information about the SWAN-SF dataset, please refer to the following paper:

  • Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters

How to Use

Please download the dataset through the Google Drive link provided in the download.txt file. The training partitions go through every phase of our data preprocessing pipeline, including the sampling techniques. The test partitions, by contrast, only undergo imputation and normalization; no sampling techniques are applied to them.

Preprocessing Pipeline

For each partition, data and labels are kept separate to maintain clarity and organization.

Setup

Ensure you have the numpy package installed in your environment (pickle is part of the Python standard library). Use the Python code below to load the data into arrays:

```python
import pickle
import numpy as np

# Training Partitions
data_dir = "/path/to/your/PreprocessedSWANSF/train/"
X_train = []
y_train = []
num_partitions = 5

for i in range(num_partitions):
    with open(f"{data_dir}Partition{i+1}_RUS-Tomek-TimeGAN_LSBZM-Norm_WithoutC_FPCKNN-impute.pkl", 'rb') as f:
        X_train.append(pickle.load(f))
    with open(f"{data_dir}Partition{i+1}_Labels_RUS-Tomek-TimeGAN_LSBZM-Norm_WithoutC_FPCKNN-impute.pkl", 'rb') as f:
        y_train.append(pickle.load(f))
```

```python
import pickle
import numpy as np

# Test Partitions
data_dir = "/path/to/your/PreprocessedSWANSF/test/"
X_test = []
y_test = []
num_partitions = 5

for i in range(num_partitions):
    with open(f"{data_dir}Partition{i+1}_LSBZM-Norm_FPCKNN-impute.pkl", 'rb') as f:
        X_test.append(pickle.load(f))
    with open(f"{data_dir}Partition{i+1}_Labels_LSBZM-Norm_FPCKNN-impute.pkl", 'rb') as f:
        y_test.append(pickle.load(f))
```
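With the training and test partitions in memory, a minimal end-to-end sketch might flatten each multivariate time series, train a simple classifier on one partition, and evaluate on another. This is only an illustration of how the arrays can be consumed: it assumes scikit-learn is installed, and the choice of partitions and classifier is arbitrary rather than anything prescribed by this dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Flatten each sample's (num_timestamps, num_attributes) matrix into one feature vector.
X_tr = X_train[0].reshape(len(X_train[0]), -1)   # train on partition 1
X_te = X_test[1].reshape(len(X_test[1]), -1)     # evaluate on test partition 2

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_train[0])
y_pred = clf.predict(X_te)

# TSS from the binary confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test[1], y_pred).ravel()
print("TSS:", tp / (tp + fn) - fp / (fp + tn))
```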

Data Structure

Each partition is stored as a 3D array in a .pkl file, with the shape (num_samples, num_timestamps, num_attributes).
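As a quick sanity check after loading, you can confirm this shape for every partition; the short sketch below assumes X_train and y_train were built as in the loading code above.

```python
for i, (X, y) in enumerate(zip(X_train, y_train), start=1):
    num_samples, num_timestamps, num_attributes = X.shape
    print(f"Partition {i}: {num_samples} samples x {num_timestamps} timestamps x "
          f"{num_attributes} attributes, {len(y)} labels")
```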

Attributes Order

The order of the attributes is as follows: ['R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ', 'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH', 'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX']
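To select a parameter by name rather than by position, a small lookup built from this list avoids hard-coding indices. This is an illustrative snippet; the ATTRIBUTES name is ours, not part of the dataset.

```python
ATTRIBUTES = ['R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH',
              'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ',
              'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH',
              'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX']

# Column index of TOTUSJH within the last axis of each partition array
totusjh_idx = ATTRIBUTES.index('TOTUSJH')            # -> 1
totusjh_first_sample = X_train[0][0, :, totusjh_idx]  # its time series for sample 0
```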

Data Interpretation Examples

  • X_train[0][0,:,0] corresponds to the R_VALUE attribute of the first sample of partition 1. This gives you the time series data for the R_VALUE attribute for the first sample.
  • X_train[3][20,:,1] corresponds to the TOTUSJH attribute of the twenty-first sample of partition 4. Here, you're accessing the time series data for the TOTUSJH attribute for a specific sample in partition 4.

The y_train files hold the labels for the samples, organized in a 1D vector:

  • y_train[0][2] corresponds to the label of the third Multivariate Time Series (MVTS) sample of partition 1, which can be 0 or 1, indicating the binary classification target.
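For example, the snippet below pulls out one attribute's series and checks the label balance of a partition, assuming the arrays were loaded as shown earlier.

```python
import numpy as np

# R_VALUE time series of the first sample in partition 1 (attribute index 0)
r_value_series = X_train[0][0, :, 0]

# Distribution of the binary labels (0 or 1) in partition 1
labels, counts = np.unique(y_train[0], return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```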

How To Cite

Our paper detailing this preprocessed dataset has been published. We kindly ask you to provide a citation to acknowledge our work. Thank you for your support!

DOI: 10.3847/1538-4365/ad7c4a

@article{EskandariNasab_2024,
  doi = {10.3847/1538-4365/ad7c4a},
  url = {https://dx.doi.org/10.3847/1538-4365/ad7c4a},
  year = {2024},
  month = {oct},
  publisher = {The American Astronomical Society},
  volume = {275},
  number = {1},
  pages = {6},
  author = {MohammadReza EskandariNasab and Shah Muhammad Hamdi and Soukaina Filali Boubrahimi},
  title = {Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters},
  journal = {The Astrophysical Journal Supplement Series}
}

Owner

  • Name: MohammadReza (Sam) EskandariNasab
  • Login: samresume
  • Kind: user
  • Location: Utah, United States
  • Company: Utah State University

Data Scientist and Machine Learning Engineer. Graduate Computer Science Student at USU

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Cleaned SWANSF Dataset
message: >-
  The paper related to this dataset is presently under
  review.
type: dataset
authors:
  - given-names: MohammadReza
    family-names: EskandariNasab
    email: reza.eskandarinasab@usu.edu
    affiliation: Utah State University
    orcid: 'https://orcid.org/0009-0004-0697-3716'
  - given-names: Shah Muhammad
    family-names: Hamdi
    email: s.hamdi@usu.edu
    affiliation: "Utah State University\t"
  - given-names: Soukaina
    family-names: Filali Boubrahimi
    affiliation: Utah State University
    email: soukaina.boubrahimi@usu.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.11566472
repository-code: >-
  https://github.com/samresume/Cleaned-SWANSF-Dataset
repository: 'https://doi.org/10.5281/zenodo.11566472'
license: MIT
version: v1.0.0
date-released: '2024-06-10'

GitHub Events

Total
  • Watch event: 1
  • Push event: 6
Last Year
  • Watch event: 1
  • Push event: 6