stellar-classification

Classification of giant and dwarf stars using machine learning.

https://github.com/jpotter80/stellar-classification

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (6.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Classification of giant and dwarf stars using machine learning.

Basic Info

Host: GitHub
Owner: jpotter80
License: MIT
Language: Python
Default Branch: main
Size: 38.5 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

Stellar Classification Project

Overview

The Stellar Classification Project classifies stars as giants or dwarfs from spectral data using machine learning. The workflow loads and cleans the data, performs exploratory data analysis, applies a log transformation to reduce skewness, and trains a Random Forest model with hyperparameter tuning, with the goal of high classification accuracy.

Project Structure

load.py: Loads the original dataset.
clean.py: Cleans the dataset by handling missing values and outliers and by normalizing the data.
explore.py: Performs exploratory data analysis on the cleaned dataset.
log_transformation.py: Applies log transformation to reduce skewness in the data.
transformation_analysis.py: Analyzes the transformed data and generates visualizations.
random_forest.py: Trains a Random Forest model on the transformed data.
hyperparameter_tuning.py: Tunes the hyperparameters of the Random Forest model for optimal performance.

Workflow

1. Loading the Dataset

The load.py script loads the original dataset from a CSV file named Star99999_raw.csv and returns a pandas DataFrame.

Libraries Used:

pandas
os
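
A minimal sketch of what such a loader might look like. The function name `load_dataset` and the default relative path are assumptions for illustration; only the file name Star99999_raw.csv and the use of pandas and os come from the description above.

```python
import os

import pandas as pd


def load_dataset(path: str = "Star99999_raw.csv") -> pd.DataFrame:
    """Read the raw star catalogue into a DataFrame (hypothetical helper)."""
    # Fail early with a clear message if the CSV is not where we expect it.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)
```

Calling `load_dataset()` from the directory that contains the CSV would return the raw table for the cleaning step.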

2. Data Cleaning

The clean.py script cleans the dataset by handling missing values and outliers and by normalizing the data. Missing values in the numeric columns (Vmag, Plx, e_Plx, B-V) are imputed with the column median. Missing SpType values are filled with 'Unknown', and missing StarType values are inferred from the cleaned data and the initial classification results.

Libraries Used:

pandas
numpy
sklearn.preprocessing.MinMaxScaler
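
The imputation and scaling described above could be sketched as follows. This is an illustration, not the project's clean.py: the function name `clean`, the coercion of stray strings to NaN, and the choice to scale all four numeric columns are assumptions, and outlier handling and StarType inference are left out because the description does not specify them.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Column names taken from the description above.
NUMERIC_COLS = ["Vmag", "Plx", "e_Plx", "B-V"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and scale numeric columns to [0, 1] (illustrative)."""
    out = df.copy()
    # Assumption: non-numeric entries should be treated as missing.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].apply(pd.to_numeric, errors="coerce")
    # Median imputation, computed per column.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].median())
    # Missing spectral types become the placeholder 'Unknown'.
    out["SpType"] = out["SpType"].fillna("Unknown")
    # Min-max normalization of the numeric columns.
    out[NUMERIC_COLS] = MinMaxScaler().fit_transform(out[NUMERIC_COLS])
    return out
```

The median is used rather than the mean because it is less sensitive to the outliers the script also has to deal with.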

3. Exploratory Data Analysis

The explore.py script performs exploratory data analysis on the cleaned dataset, including generating summary statistics and visualizations.

Libraries Used:

pandas
seaborn
matplotlib.pyplot
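
As a rough sketch of the summary-statistics part of this step, the helper below collects descriptive statistics, skewness, and correlations for the numeric columns. The function name `summarize` is hypothetical, and plotting with seaborn and matplotlib is omitted here.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return summary statistics, skewness, and correlations for numeric columns."""
    numeric = df.select_dtypes(include="number")
    return {
        "describe": numeric.describe(),  # count, mean, std, quartiles
        "skew": numeric.skew(),          # positive values indicate a long right tail
        "corr": numeric.corr(),          # pairwise Pearson correlations
    }
```

Columns with large skewness values are natural candidates for the log transformation in the next step.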

4. Log Transformation

The log_transformation.py script applies a log transformation to reduce skewness in the data and performs further cleaning.

Libraries Used:

pandas
numpy
sklearn.preprocessing.MinMaxScaler
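
One way a skew-reducing log transformation might be written is shown below. The source does not say which columns are transformed or how negative values are handled, so the `columns` argument, the use of `np.log1p`, and the shift applied to columns containing negative values are all assumptions.

```python
import numpy as np
import pandas as pd


def log_transform(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Apply log1p to the given columns, shifting them first if they go negative."""
    out = df.copy()
    for col in columns:
        # Shift only when the column contains negative values, so that
        # log1p always receives a non-negative argument.
        shift = min(out[col].min(), 0.0)
        out[col] = np.log1p(out[col] - shift)
    return out
```

`log1p(x)` computes `log(1 + x)`, which stays finite at zero and compresses large values more than small ones, pulling in a long right tail.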

5. Transformation Analysis

The transformation_analysis.py script analyzes the transformed data and generates visualizations to understand the distribution and correlations.

Libraries Used:

pandas
seaborn
matplotlib.pyplot

6. Random Forest Model

The random_forest.py script trains a Random Forest model on the transformed data and evaluates its performance. It also saves the model and performance metrics.

Libraries Used:

pandas
sklearn.model_selection.train_test_split
sklearn.ensemble.RandomForestClassifier
sklearn.metrics.classification_report
sklearn.metrics.confusion_matrix
sklearn.metrics.accuracy_score
joblib
matplotlib.pyplot
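
The training and evaluation step could look roughly like the sketch below. The function name `train_random_forest`, the 80/20 stratified split, the number of trees, and the random seed are assumptions chosen for illustration; the source only states which scikit-learn components are used.

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def train_random_forest(X, y, model_path=None, seed=42):
    """Train a Random Forest and return the model with basic test-set metrics."""
    # Hold out 20% of the rows for evaluation, keeping class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }
    # Optionally persist the fitted model for later reuse.
    if model_path is not None:
        joblib.dump(model, model_path)
    return model, metrics
```

Passing a `model_path` such as a `.joblib` file name would save the fitted model so that it can be reloaded later with `joblib.load`.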

7. Hyperparameter Tuning

The hyperparameter_tuning.py script tunes the hyperparameters of the Random Forest model, using GridSearchCV to search for the best-performing parameter combination.

Libraries Used:

pandas
sklearn.model_selection.train_test_split
sklearn.ensemble.RandomForestClassifier
sklearn.metrics.classification_report
sklearn.metrics.confusion_matrix
sklearn.metrics.accuracy_score
sklearn.model_selection.GridSearchCV
joblib
matplotlib.pyplot
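
A grid search over Random Forest parameters might be set up as in the sketch below. The parameter grid, the number of cross-validation folds, and the function name `tune_random_forest` are illustrative assumptions; the source does not list the values that were actually searched.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid: the project's real search space is not given in the README.
PARAM_GRID = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}


def tune_random_forest(X_train, y_train, param_grid=PARAM_GRID, cv=3, seed=42):
    """Exhaustively evaluate every parameter combination with cross-validation."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is a model refit on the full training data with those parameters.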

Results

Performance Metrics

The performance of the final Random Forest model is summarized below:

Accuracy: 83.27%
Confusion Matrix: [[7020 344] [ 912 4642]]
Classification Report:
```
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      7364
           1       0.93      0.84      0.88      5554

    accuracy                           0.90     12918
   macro avg       0.91      0.89      0.90     12918
weighted avg       0.90      0.90      0.90     12918
```
Precision (Dwarf): 0.77
Recall (Dwarf): 0.88
F1-Score (Dwarf): 0.82
Precision (Giant): 0.90
Recall (Giant): 0.79
F1-Score (Giant): 0.84
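
For readers who want to see how such figures follow from a confusion matrix, the sketch below recomputes accuracy, precision, recall, and F1 from the matrix reported above. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes. The values it produces agree with the classification report (accuracy of about 0.90); the separately listed 83.27% accuracy and the Dwarf and Giant figures do not follow from this matrix and may refer to a different run, which the source does not specify.

```python
import numpy as np

# Confusion matrix as reported above: rows = true class, columns = predicted class.
cm = np.array([[7020, 344], [912, 4642]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column sums)
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row sums)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print("precision =", precision.round(2))
print("recall    =", recall.round(2))
print("f1        =", f1.round(2))
```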

Usage

Clone the repository.
Install the dependencies listed in requirements.txt (or pyproject.toml).
Run the scripts in the order given in the Workflow section, from load.py through hyperparameter_tuning.py.
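
The run order can also be scripted. The driver below is a hypothetical convenience, not part of the repository: it assumes every script sits in the current working directory and can be run standalone, and the names `PIPELINE` and `run_pipeline` are made up for this example.

```python
import subprocess
import sys

# Script names and order taken from the Workflow section.
PIPELINE = [
    "load.py",
    "clean.py",
    "explore.py",
    "log_transformation.py",
    "transformation_analysis.py",
    "random_forest.py",
    "hyperparameter_tuning.py",
]


def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script with the current interpreter, stopping at the first failure."""
    commands = [[sys.executable, script] for script in scripts]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` returns the commands without executing them, which is a quick way to confirm the order before a full run.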

Contributing

Contributions are welcome. Please fork the repository and create a pull request with your changes.

License

This project is licensed under the MIT License.

Owner

Login: jpotter80
Kind: user

Repositories: 1
Profile: https://github.com/jpotter80

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "Star Dataset for Stellar Classification"
type: "dataset"
authors:
  - family-names: "Ku"
    given-names: "Wing-Fung"
doi: "10.34740/KAGGLE/DSV/1433961"
url: "https://www.kaggle.com/dsv/1433961"
date-released: 2020-01-01
publisher: "Kaggle"

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

jpotter80 (1)

Dependencies

poetry.lock pypi

asn1crypto 1.5.1
certifi 2024.2.2
charset-normalizer 3.3.2
contourpy 1.2.1
cycler 0.12.1
fonttools 4.51.0
greenlet 3.0.3
idna 3.7
joblib 1.4.2
kiwisolver 1.4.5
matplotlib 3.8.4
numpy 1.26.4
packaging 24.0
pandas 2.2.2
pg8000 1.31.2
pillow 10.3.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytz 2024.1
requests 2.31.0
scikit-learn 1.4.2
scipy 1.13.0
scramp 1.4.5
seaborn 0.13.2
six 1.16.0
sqlalchemy 2.0.30
threadpoolctl 3.5.0
typing-extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1

pyproject.toml pypi

SQLAlchemy ^2.0.30
matplotlib ^3.8.4
numpy ^1.26.4
pg8000 ^1.31.2
python ^3.12
requests ^2.31.0
scikit-learn ^1.4.2
seaborn ^0.13.2

requirements.txt pypi

certifi ==2024.2.2
charset-normalizer ==3.3.2
contourpy ==1.2.1
cycler ==0.12.1
fonttools ==4.51.0
idna ==3.7
joblib ==1.4.2
kiwisolver ==1.4.5
matplotlib ==3.8.4
numpy ==1.26.4
packaging ==24.0
pandas ==2.2.2
pillow ==10.3.0
psycopg2 ==2.9.9
pyparsing ==3.1.2
python-dateutil ==2.9.0.post0
pytz ==2024.1
requests ==2.31.0
scikit-learn ==1.4.2
scipy ==1.13.0
seaborn ==0.13.2
six ==1.16.0
threadpoolctl ==3.5.0
tzdata ==2024.1
urllib3 ==2.2.1
