swansf-datapreprocessing-sampling-notebooks
These notebooks provide a comprehensive workflow, from start to finish, for processing and analyzing the SWAN-SF dataset. They include detailed steps for reading the dataset files, performing full preprocessing, and executing classification.
https://github.com/samresume/swansf-datapreprocessing-sampling-notebooks
Repository
Basic Info
- Host: GitHub
- Owner: samresume
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://iopscience.iop.org/article/10.3847/1538-4365/ad7c4a
- Size: 4.26 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Getting Started with the SWAN-SF Data Analysis
Welcome to our GitHub repository!
These notebooks provide a comprehensive workflow, from start to finish, for processing and analyzing the SWAN-SF dataset. They include detailed steps for reading the dataset files, performing full preprocessing, and generating a .pkl file for the processed data. Several missing value imputation techniques are implemented, such as Mean Imputation, Next-value Imputation, and our novel method, Fast Pearson Correlation-based K-nearest Neighbors (FPCKNN) Imputation.
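To make the two simpler imputation strategies concrete, here is a minimal sketch using NumPy on a single 1-D series. It is not the notebooks' code: the function names are hypothetical, "next-value" is assumed to mean filling a gap with the next observed value, and the fallback for trailing gaps is an assumption of this sketch. FPCKNN is not reproduced here.

```python
import numpy as np


def mean_impute(series):
    """Replace NaNs with the mean of the observed values in the series."""
    filled = np.asarray(series, dtype=float).copy()
    mask = np.isnan(filled)
    if mask.all():
        return filled  # nothing observed, so there is no mean to use
    filled[mask] = np.nanmean(filled)
    return filled


def next_value_impute(series):
    """Replace each NaN with the next observed value (backward fill).

    Trailing NaNs have no later value; this sketch falls back to the
    last observed value for them (an assumption, not from the notebooks).
    """
    filled = np.asarray(series, dtype=float).copy()
    # Backward pass: carry the next observed value into earlier gaps.
    next_val = np.nan
    for i in range(len(filled) - 1, -1, -1):
        if np.isnan(filled[i]):
            filled[i] = next_val
        else:
            next_val = filled[i]
    # Forward pass: fill any trailing gaps that the backward pass left.
    last_val = np.nan
    for i in range(len(filled)):
        if np.isnan(filled[i]):
            filled[i] = last_val
        else:
            last_val = filled[i]
    return filled


example = np.array([1.0, np.nan, 3.0, np.nan])
print(mean_impute(example))
print(next_value_impute(example))
```

For a multivariate sample, the same functions could be applied to each parameter's series independently.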
In addition, we address class overlap with the Near Decision Boundary Sample Removal (NDBSR) technique. Various normalization methods are also applied, including Min-Max Scaling, Z-Score Normalization, and our proprietary LSBZM (Log, Square Root, BoxCox, Z-Score, and Min-Max) Normalization technique.
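As a rough illustration of the two standard normalization methods named above, the following sketch applies Min-Max Scaling and Z-Score Normalization column-wise to a 2-D array with NumPy. The function names, the `eps` guard against constant columns, and the (rows, features) layout are assumptions of this sketch; the LSBZM pipeline is not reproduced.

```python
import numpy as np


def min_max_scale(x, eps=1e-12):
    """Scale each column to [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    # eps avoids division by zero when a column is constant.
    return (x - lo) / np.maximum(hi - lo, eps)


def z_score(x, eps=1e-12):
    """Center each column at 0 and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / np.maximum(sigma, eps)


features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(min_max_scale(features))
print(z_score(features))
```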
The notebooks further implement multiple over-sampling techniques such as SMOTE, ADASYN, TimeGAN, and Gaussian Noise Injection (GNI), as well as two under-sampling methods: Random Under Sampling and Tomek Links. These preprocessing steps collectively enhance the classification performance for predicting solar flares.
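The simplest of these sampling ideas can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions, not the notebooks' implementation: Gaussian Noise Injection is taken to mean copying randomly chosen minority samples and adding zero-mean Gaussian noise, and the function names, `noise_std`, `seed`, and array shapes are all hypothetical.

```python
import numpy as np


def gaussian_noise_oversample(x_minority, n_new, noise_std=0.01, seed=0):
    """Create n_new synthetic samples by adding Gaussian noise to random minority samples."""
    rng = np.random.default_rng(seed)
    x_minority = np.asarray(x_minority, dtype=float)
    # Pick source samples with replacement, then perturb each copy.
    idx = rng.integers(0, len(x_minority), size=n_new)
    noise = rng.normal(0.0, noise_std, size=(n_new,) + x_minority.shape[1:])
    return x_minority[idx] + noise


def random_undersample(x_majority, n_keep, seed=0):
    """Keep n_keep majority samples chosen uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    x_majority = np.asarray(x_majority)
    idx = rng.choice(len(x_majority), size=n_keep, replace=False)
    return x_majority[idx]


# Arbitrary illustrative shapes: (samples, time steps, parameters).
minority = np.zeros((5, 60, 24))
majority = np.ones((50, 60, 24))
print(gaussian_noise_oversample(minority, n_new=20).shape)
print(random_undersample(majority, n_keep=25).shape)
```

SMOTE, ADASYN, and Tomek Links are available in the `imblearn` package listed in the prerequisites; TimeGAN code ships with the repository.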
The classification models used include SVM, Random Forest, k-NN, Multilayer Perceptron, LSTM, GRU, RNN, and 1D-CNN, all designed to predict solar flares within a 24-hour window.
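For readers unfamiliar with how a classifier consumes multivariate time series samples, here is a small k-NN sketch in plain NumPy. It assumes binary labels (1 for flare, 0 for no flare), flattens each sample into a single vector, and uses Euclidean distance with majority voting; these choices and the function name are assumptions of the sketch, not a description of the notebooks' models.

```python
import numpy as np


def knn_predict(x_train, y_train, x_test, k=3):
    """Predict binary labels by majority vote among the k nearest training samples."""
    # Flatten each sample so time series of any shape become plain vectors.
    x_train = np.asarray(x_train, dtype=float).reshape(len(x_train), -1)
    x_test = np.asarray(x_test, dtype=float).reshape(len(x_test), -1)
    y_train = np.asarray(y_train)
    # Squared Euclidean distance between every test and training sample.
    dists = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # With 0/1 labels, a mean of at least 0.5 means class 1 wins the vote.
    return (votes.mean(axis=1) >= 0.5).astype(int)


x_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_test = np.array([[0.2, 0.1], [5.5, 5.5]])
print(knn_predict(x_train, y_train, x_test, k=3))
```

Using an odd `k` avoids tied votes in the binary case.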
By using these files, researchers can reduce the time required to preprocess the SWAN-SF dataset by months, while achieving high accuracy in solar flare prediction.
Prerequisites
Before you start, make sure you have the following:
- SWAN-SF Dataset: Download it from Harvard Dataverse.
- Python Packages: Ensure you have these packages installed: `pandas`, `numpy`, `matplotlib`, `seaborn`, `tensorflow`, `tqdm`, `pickle`, `sklearn`, `scipy`, `imblearn`. The code for `timegan` is included in the repository, so no additional installation is required for this package.
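A quick way to confirm the prerequisites is to check which of the listed import names cannot be found. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical. Note that `pickle` is part of the Python standard library, and that `sklearn` and `imblearn` are the import names of the PyPI packages `scikit-learn` and `imbalanced-learn`.

```python
import importlib.util

# Import names taken from the prerequisites list above.
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn", "tensorflow",
    "tqdm", "pickle", "sklearn", "scipy", "imblearn",
]


def missing_packages(names):
    """Return the import names that cannot be located in the current environment."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED))
```

An empty list means every package can be imported from the active environment.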
Setting Up Your Environment
Directory Setup: Modify the following lines in the source code to match your system's directory structure:
```python
data_dir = "<Your path>/SWANSF/Downloaded_Data/"
data_dir_save = "<Your path>/SWANSF/code/"
```
Sequential Execution: Start from Notebook 1 and proceed in order. Each notebook relies on the data prepared in the previous steps.
Notebooks Overview
- Notebook 1: Reads SWAN-SF samples and combines them into a single `.pkl` file (time series samples) and a `.csv` file (labels for each partition).
- Notebook 2: Focuses on Missing Value Imputation, utilizing data from Notebook 1.
- Notebook 3: Centers on Near Decision Boundary Sample Removal.
- Notebook 4 & 5: Concentrate on Normalization.
- Notebook 6: Offers Visualizations of the dataset.
- Notebook 7 & 8: Implement Classification using eight classifiers.
- Notebook 9: Applies Over-sampling techniques.
- Notebook 10: Combines Over- and Under-sampling techniques.
- Notebook 11, 12, & 13: Apply preprocessing techniques post-sampling (Normalization).
- Notebook 14, 15, 16, & 17: Implement Classification using eight classifiers after Sampling.
- Notebook 18: Presents Final Visualizations.
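Because each notebook reads what the previous one wrote, the hand-off between steps amounts to a save-and-load round trip. The sketch below shows one way such a round trip could look, with samples in a `.pkl` file and labels in a `.csv` file. The file names, the `label` column, the array shape, and both helper functions are hypothetical; the actual layout of the notebooks' files may differ.

```python
import csv
import pickle
import tempfile
from pathlib import Path

import numpy as np


def save_partition(out_dir, name, samples, labels):
    """Write samples to <name>.pkl and labels to <name>_labels.csv (hypothetical names)."""
    out_dir = Path(out_dir)
    pkl_path = out_dir / f"{name}.pkl"
    csv_path = out_dir / f"{name}_labels.csv"
    with open(pkl_path, "wb") as f:
        pickle.dump(samples, f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"])
        writer.writerows([[label] for label in labels])
    return pkl_path, csv_path


def load_partition(pkl_path, csv_path):
    """Read back the samples and the integer labels written by save_partition."""
    with open(pkl_path, "rb") as f:
        samples = pickle.load(f)
    with open(csv_path, newline="") as f:
        labels = [int(row["label"]) for row in csv.DictReader(f)]
    return samples, labels


with tempfile.TemporaryDirectory() as tmp:
    samples = np.zeros((3, 60, 24))  # arbitrary illustrative shape
    paths = save_partition(tmp, "partition1", samples, [1, 0, 0])
    loaded_samples, loaded_labels = load_partition(*paths)
    print(loaded_samples.shape, loaded_labels)
```

Only unpickle files you created yourself or obtained from a trusted source, since loading a pickle can execute arbitrary code.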
How To Cite
The paper associated with these notebooks has been published. We kindly ask you to provide a citation to acknowledge our work. Thank you for your support!
DOI: 10.3847/1538-4365/ad7c4a.
@article{EskandariNasab_2024,
doi = {10.3847/1538-4365/ad7c4a},
url = {https://dx.doi.org/10.3847/1538-4365/ad7c4a},
year = {2024},
month = {oct},
publisher = {The American Astronomical Society},
volume = {275},
number = {1},
pages = {6},
author = {MohammadReza EskandariNasab and Shah Muhammad Hamdi and Soukaina Filali Boubrahimi},
title = {Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters},
journal = {The Astrophysical Journal Supplement Series}
}
Owner
- Name: MohammadReza (Sam) EskandariNasab
- Login: samresume
- Kind: user
- Location: Utah, United States
- Company: Utah State University
- Website: samresume.com
- Repositories: 1
- Profile: https://github.com/samresume
Data Scientist and Machine Learning Engineer. Graduate Computer Science Student at USU
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: SWAN-SF Data Preprocessing and Sampling Notebooks
message: >-
  The paper concerning these notebooks is currently under
  review.
type: software
authors:
  - given-names: MohammadReza
    family-names: EskandariNasab
    email: reza.eskandarinasab@usu.edu
    affiliation: Utah State University
    orcid: 'https://orcid.org/0009-0004-0697-3716'
  - given-names: Shah Muhammad
    family-names: Hamdi
    email: s.hamdi@usu.edu
    affiliation: "Utah State University\t"
  - given-names: Soukaina
    family-names: Filali Boubrahimi
    affiliation: Utah State University
    email: soukaina.boubrahimi@usu.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.11564789
repository-code: >-
  https://github.com/samresume/SWANSF-DataPreprocessing-Sampling-Notebooks
repository: 'https://doi.org/10.5281/zenodo.11564789'
license: MIT
version: v1.0.0
date-released: '2024-06-11'
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2