cleaned-swansf-dataset
The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 8 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: samresume
- License: mit
- Default Branch: main
- Homepage: https://iopscience.iop.org/article/10.3847/1538-4365/ad7c4a
- Size: 323 KB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Cleaned SWAN-SF Dataset
Introduction
The SWAN-SF dataset is now fully preprocessed, optimized, and ready for binary classification tasks. Our team is excited to release the enhanced version of the SWAN-SF dataset across all five partitions. This version applies our FPCKNN imputation technique, removes Class C samples to address class overlap, and uses TimeGAN for over-sampling together with Tomek Links and Random Under Sampling for under-sampling. With LSBZM normalization also applied, the dataset lets researchers focus on analysis rather than preprocessing, with the aim of building more precise classifiers and improving the True Skill Statistic (TSS).
Original Dataset
The unpreprocessed version of the SWAN-SF dataset can be accessed on the Harvard Dataverse:
- SWAN-SF Dataset on Harvard Dataverse

For more detailed information about the SWAN-SF dataset, please refer to the following paper:
- Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters
How to Use
Please download the dataset through the Google Drive link provided in the download.txt file.
The training partitions encompass every phase of our data preprocessing pipeline, including various sampling techniques. Conversely, the test partitions exclusively incorporate imputation and normalization procedures, without the application of any sampling techniques.
For each partition, data and labels are kept separate to maintain clarity and organization.
Setup
Ensure the numpy package is installed in your environment; pickle is part of the Python standard library and needs no installation. Use the Python code below to load each partition into a list of arrays:
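```python
import pickle
import numpy as np  # must be installed so the pickled arrays can be loaded

# Training partitions
# Directory and file names are reproduced as shown on this page; adjust them
# if they differ from the files you downloaded.
datadir = "/path/to/your/PreprocessedSWANSF/train/"
X_train = []
y_train = []
num_partitions = 5

for i in range(num_partitions):
    # Data for partition i+1
    with open(f"{datadir}Partition{i+1}RUS-Tomek-TimeGANLSBZM-NormWithoutCFPCKNN-impute.pkl", 'rb') as f:
        X_train.append(pickle.load(f))
    # Labels for partition i+1
    with open(f"{datadir}Partition{i+1}LabelsRUS-Tomek-TimeGANLSBZM-NormWithoutCFPCKNN-impute.pkl", 'rb') as f:
        y_train.append(pickle.load(f))
```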
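```python
import pickle
import numpy as np  # must be installed so the pickled arrays can be loaded

# Test partitions
# Directory and file names are reproduced as shown on this page; adjust them
# if they differ from the files you downloaded.
datadir = "/path/to/your/PreprocessedSWANSF/test/"
X_test = []
y_test = []
num_partitions = 5

for i in range(num_partitions):
    # Data for partition i+1
    with open(f"{datadir}Partition{i+1}LSBZM-NormFPCKNN-impute.pkl", 'rb') as f:
        X_test.append(pickle.load(f))
    # Labels for partition i+1
    with open(f"{datadir}Partition{i+1}LabelsLSBZM-NormFPCKNN-impute.pkl", 'rb') as f:
        y_test.append(pickle.load(f))
```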
Data Structure
Each partition is stored as a 3D array in a .pkl file, with the shape (num_samples, num_timestamps, num_attributes).
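As a quick sanity check after loading, it can help to confirm that a partition follows this layout and that data and labels agree on the number of samples. The sketch below runs on synthetic arrays rather than the real files. The helper name `describe_partition` and the demo sizes (8 samples, 60 timestamps) are illustrative assumptions; the 24 attributes match the attribute list in this README.

```python
import numpy as np


def describe_partition(X, y):
    """Check the (num_samples, num_timestamps, num_attributes) layout and return it."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 3:
        raise ValueError(f"expected a 3D array, got {X.ndim}D")
    if y.ndim != 1 or y.shape[0] != X.shape[0]:
        raise ValueError("labels must be a 1D vector with one entry per sample")
    return X.shape


# Synthetic stand-in for one partition (sizes are made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 60, 24))
y_demo = rng.integers(0, 2, size=8)

shape = describe_partition(X_demo, y_demo)
print(shape)
```

With the real data, the same call could be applied to each pair `X_train[i]`, `y_train[i]` in a loop over the five partitions.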
Attributes Order
The order of the attributes is as follows:
['R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ', 'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH', 'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX']
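Because attributes are addressed by position along the last axis, a name-to-index mapping avoids hard-coding numbers. The following is a minimal sketch on a synthetic array: the names `ATTRIBUTES`, `ATTRIBUTE_INDEX`, and `select_attribute` are hypothetical helpers, not part of the dataset, and only the attribute order is taken from the list above.

```python
import numpy as np

# Attribute order copied from the list above.
ATTRIBUTES = [
    'R_VALUE', 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH',
    'SAVNCPP', 'USFLUX', 'TOTFZ', 'MEANPOT', 'EPSX', 'EPSY', 'EPSZ',
    'MEANSHR', 'SHRGT45', 'MEANGAM', 'MEANGBT', 'MEANGBZ', 'MEANGBH',
    'MEANJZH', 'TOTFY', 'MEANJZD', 'MEANALP', 'TOTFX',
]
ATTRIBUTE_INDEX = {name: i for i, name in enumerate(ATTRIBUTES)}


def select_attribute(X, name):
    """Return the (num_samples, num_timestamps) slice for one named attribute."""
    return X[:, :, ATTRIBUTE_INDEX[name]]


# Synthetic partition: 2 samples, 3 timestamps, 24 attributes.
X_demo = np.arange(2 * 3 * len(ATTRIBUTES), dtype=float).reshape(2, 3, len(ATTRIBUTES))
totusjh = select_attribute(X_demo, 'TOTUSJH')
print(totusjh.shape)
```

On the loaded data, `select_attribute(X_train[0], 'R_VALUE')` would return the R_VALUE time series of every sample in partition 1.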
Data Interpretation Examples
- X_train[0][0,:,0] corresponds to the R_VALUE attribute of the first sample of partition 1. This gives you the time series data for the R_VALUE attribute for the first sample.
- X_train[3][20,:,1] corresponds to the TOTUSJH attribute of the twenty-first sample of partition 4. Here, you're accessing the time series data for the TOTUSJH attribute for a specific sample in partition 4.
The y_train files hold the labels for the samples, organized in a 1D vector:
- y_train[0][2] corresponds to the label of the third Multivariate Time Series (MVTS) sample of partition 1, which can be 0 or 1, indicating the binary classification target.
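With 0/1 labels, a classifier's predictions can be scored with the TSS mentioned in the introduction. The sketch below assumes TSS means the True Skill Statistic as commonly defined in flare forecasting, TP/(TP+FN) - FP/(FP+TN), and assumes label 1 is the positive class; this README does not state either point explicitly. The function name `true_skill_statistic` and the toy label vectors are illustrative only.

```python
def true_skill_statistic(y_true, y_pred):
    """TSS = TP/(TP+FN) - FP/(FP+TN), treating label 1 as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS is undefined when one class is absent from y_true")
    return tp / (tp + fn) - fp / (fp + tn)


# Toy labels and predictions (made up for illustration).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tss = true_skill_statistic(y_true, y_pred)
print(tss)
```

TSS ranges from -1 to 1 and does not depend on the ratio between the two classes, which is one reason it is popular for imbalanced problems such as flare prediction.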
How To Cite
Our paper detailing this preprocessed dataset has been published. We kindly ask you to provide a citation to acknowledge our work. Thank you for your support!
DOI: 10.3847/1538-4365/ad7c4a.
@article{EskandariNasab_2024,
doi = {10.3847/1538-4365/ad7c4a},
url = {https://dx.doi.org/10.3847/1538-4365/ad7c4a},
year = {2024},
month = {oct},
publisher = {The American Astronomical Society},
volume = {275},
number = {1},
pages = {6},
author = {MohammadReza EskandariNasab and Shah Muhammad Hamdi and Soukaina Filali Boubrahimi},
title = {Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters},
journal = {The Astrophysical Journal Supplement Series}
}
Owner
- Name: MohammadReza (Sam) EskandariNasab
- Login: samresume
- Kind: user
- Location: Utah, United States
- Company: Utah State University
- Website: samresume.com
- Repositories: 1
- Profile: https://github.com/samresume
Data Scientist and Machine Learning Engineer. Graduate Computer Science Student at USU
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Cleaned SWANSF Dataset
message: >-
  The paper related to this dataset is presently under
  review.
type: dataset
authors:
  - given-names: MohammadReza
    family-names: EskandariNasab
    email: reza.eskandarinasab@usu.edu
    affiliation: Utah State University
    orcid: 'https://orcid.org/0009-0004-0697-3716'
  - given-names: Shah Muhammad
    family-names: Hamdi
    email: s.hamdi@usu.edu
    affiliation: "Utah State University\t"
  - given-names: Soukaina
    family-names: Filali Boubrahimi
    affiliation: Utah State University
    email: soukaina.boubrahimi@usu.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.11566472
repository-code: >-
  https://github.com/samresume/Cleaned-SWANSF-Dataset
repository: 'https://doi.org/10.5281/zenodo.11566472'
license: MIT
version: v1.0.0
date-released: '2024-06-10'
GitHub Events
Total
- Watch event: 1
- Push event: 6
Last Year
- Watch event: 1
- Push event: 6