stellar-classification

Classification of giant and dwarf stars using machine learning.

https://github.com/jpotter80/stellar-classification

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.6%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: jpotter80
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 38.5 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Stellar Classification Project

Overview

The Stellar Classification Project classifies stars as giants or dwarfs from spectral data using machine learning. The workflow runs in stages: load the raw dataset, clean it, explore it, apply a log transformation to reduce skewness, train a Random Forest model, and tune that model's hyperparameters.

Project Structure

  • load.py: Loads the original dataset.
  • clean.py: Cleans the dataset by handling missing values and outliers and by normalizing the data.
  • explore.py: Performs exploratory data analysis on the cleaned dataset.
  • log_transformation.py: Applies a log transformation to reduce skewness in the data.
  • transformation_analysis.py: Analyzes the transformed data and generates visualizations.
  • random_forest.py: Trains a Random Forest model on the transformed data.
  • hyperparameter_tuning.py: Tunes the hyperparameters of the Random Forest model for optimal performance.

Workflow

1. Loading the Dataset

The load.py script loads the original dataset from a CSV file named Star99999_raw.csv and returns a pandas DataFrame.

Libraries Used:

  • pandas
  • os
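
A minimal sketch of what such a loader could look like. The function name `load_dataset` and the default relative path are assumptions for illustration; only the file name Star99999_raw.csv comes from this README.

```python
import os

import pandas as pd


def load_dataset(path: str = "Star99999_raw.csv") -> pd.DataFrame:
    """Read the raw star catalogue from a CSV file into a DataFrame."""
    # Fail early with a clear message if the file is not where we expect it.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)
```

Calling `load_dataset()` from the directory that holds the CSV returns the DataFrame that the later steps consume.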

2. Data Cleaning

The clean.py script cleans the dataset by handling missing values and outliers and by normalizing the data. Missing values in the numeric columns (Vmag, Plx, e_Plx, B-V) are imputed with the column median. Missing values in the SpType column are replaced with 'Unknown', and missing values in the StarType column are inferred from the cleaned data and initial classification results.

Libraries Used:

  • pandas
  • numpy
  • sklearn.preprocessing.MinMaxScaler
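
The cleaning steps described above can be sketched roughly as follows. The column names come from this README; the function name `clean`, the numeric coercion, and the 1.5 × IQR clipping rule for outliers are assumptions, since the README does not say how outliers are handled. The StarType inference is not shown.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

NUMERIC_COLS = ["Vmag", "Plx", "e_Plx", "B-V"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, clip outliers, and scale numeric columns to [0, 1]."""
    out = df.copy()

    # Coerce to numeric so blank or malformed entries become NaN (assumed step).
    out[NUMERIC_COLS] = out[NUMERIC_COLS].apply(pd.to_numeric, errors="coerce")

    # Median imputation for numeric columns, 'Unknown' for the spectral type.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].median())
    out["SpType"] = out["SpType"].fillna("Unknown")

    # Assumed outlier rule: clip each column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1 = out[NUMERIC_COLS].quantile(0.25)
    q3 = out[NUMERIC_COLS].quantile(0.75)
    iqr = q3 - q1
    out[NUMERIC_COLS] = out[NUMERIC_COLS].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1
    )

    # Min-max normalization of the numeric columns.
    out[NUMERIC_COLS] = MinMaxScaler().fit_transform(out[NUMERIC_COLS])
    return out
```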

3. Exploratory Data Analysis

The explore.py script performs exploratory data analysis on the cleaned dataset, including generating summary statistics and visualizations.

Libraries Used:

  • pandas
  • seaborn
  • matplotlib.pyplot

4. Log Transformation

The log_transformation.py script applies a log transformation to reduce skewness in the data and performs further cleaning.

Libraries Used:

  • pandas
  • numpy
  • sklearn.preprocessing.MinMaxScaler
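
One way such a transformation might be written is shown below. The function name `log_transform`, the shift to a zero minimum, and the use of `log1p` are assumptions made so that the logarithm is defined for zero or negative inputs (parallax values, for example, can be negative); the script's actual formula is not given in this README.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def log_transform(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Apply log(1 + x) to shifted columns, then rescale them to [0, 1]."""
    out = df.copy()
    for col in columns:
        # Shift so the minimum is zero; log1p is then defined for every value.
        shifted = out[col] - out[col].min()
        out[col] = np.log1p(shifted)

    # Rescale after the transform so features share a common range.
    out[columns] = MinMaxScaler().fit_transform(out[columns])
    return out
```

Comparing `df[col].skew()` before and after the call is a quick check that the transformation reduced right skew.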

5. Transformation Analysis

The transformation_analysis.py script analyzes the transformed data and generates visualizations to understand the distribution and correlations.

Libraries Used:

  • pandas
  • seaborn
  • matplotlib.pyplot
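
A correlation heatmap is one common way to inspect correlations after a transformation. The sketch below is an assumption about what such an analysis could include; `plot_correlations` and its parameters are hypothetical names, not taken from the script.

```python
import matplotlib

matplotlib.use("Agg")  # render to files so no display is required

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlations(df: pd.DataFrame, out_path: str) -> pd.DataFrame:
    """Save a heatmap of pairwise Pearson correlations between numeric columns."""
    corr = df.select_dtypes("number").corr()

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Feature correlations")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)

    return corr
```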

6. Random Forest Model

The random_forest.py script trains a Random Forest model on the transformed data and evaluates its performance. It also saves the model and performance metrics.

Libraries Used:

  • pandas
  • sklearn.model_selection.train_test_split
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.metrics.classification_report
  • sklearn.metrics.confusion_matrix
  • sklearn.metrics.accuracy_score
  • joblib
  • matplotlib.pyplot
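
A training and evaluation step built from the libraries listed above might look like the following sketch. The function name, the 80/20 stratified split, `n_estimators=100`, and the fixed random seed are assumptions; StarType is the target column named earlier in this README.

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def train_random_forest(df, features, target="StarType", model_path=None):
    """Train a Random Forest, evaluate it on a held-out split, optionally save it."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target],
        test_size=0.2, stratify=df[target], random_state=42,
    )

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }

    # Persist the fitted model so later steps can reload it with joblib.load.
    if model_path is not None:
        joblib.dump(model, model_path)
    return model, metrics
```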

7. Hyperparameter Tuning

The hyperparameter_tuning.py script uses GridSearchCV to search for the Random Forest hyperparameters that give the best cross-validated performance.

Libraries Used:

  • pandas
  • sklearn.model_selection.train_test_split
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.metrics.classification_report
  • sklearn.metrics.confusion_matrix
  • sklearn.metrics.accuracy_score
  • sklearn.model_selection.GridSearchCV
  • joblib
  • matplotlib.pyplot
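
The grid search could be set up along these lines. The parameter grid, the 5-fold cross-validation, and the function name `tune_random_forest` are illustrative assumptions; the README does not state which values the script searches.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space; the grid used by the script is not documented here.
DEFAULT_PARAM_GRID = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}


def tune_random_forest(X_train, y_train, param_grid=None, cv=5, n_jobs=-1):
    """Exhaustively search the grid with cross-validation and return the fitted search."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_grid=param_grid or DEFAULT_PARAM_GRID,
        cv=cv,
        scoring="accuracy",
        n_jobs=n_jobs,
    )
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is a model refit on the full training set with those parameters.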

Results

Performance Metrics

The performance of the final Random Forest model is summarized below:

  • Accuracy: 83.27%
  • Confusion Matrix: [[7020 344] [ 912 4642]]
  • Classification Report:

    ```
                  precision    recall  f1-score   support

               0       0.89      0.95      0.92      7364
               1       0.93      0.84      0.88      5554

        accuracy                           0.90     12918
       macro avg       0.91      0.89      0.90     12918
    weighted avg       0.90      0.90      0.90     12918
    ```

  • Precision (Dwarf): 0.77

  • Recall (Dwarf): 0.88

  • F1-Score (Dwarf): 0.82

  • Precision (Giant): 0.90

  • Recall (Giant): 0.79

  • F1-Score (Giant): 0.84
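
The per-class figures in a classification report can be recomputed directly from a confusion matrix. The sketch below does this for the matrix listed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the helper name `metrics_from_confusion` is hypothetical. The values it yields (accuracy of about 0.90) agree with the classification report; the separately listed 83.27% accuracy and the Dwarf/Giant bullets do not follow from this matrix and may come from a different model run, which the README does not specify.

```python
import numpy as np

# Confusion matrix from the results above (rows: true class, columns: predicted).
CONFUSION = np.array([[7020, 344],
                      [912, 4642]])


def metrics_from_confusion(cm):
    """Return accuracy and per-class precision, recall, and F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
    recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = true_pos.sum() / cm.sum()
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = metrics_from_confusion(CONFUSION)
print(f"accuracy={accuracy:.4f}")
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1:       ", np.round(f1, 2))
```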

Usage

  1. Clone the repository.
  2. Ensure all dependencies are installed.
  3. Follow the workflow by executing each script in the specified order.
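
For step 3, a small runner like the one below could execute the scripts in the order given under Project Structure. It is a sketch that assumes all scripts sit in the current working directory and take no command-line arguments; `run_pipeline` is a hypothetical helper, not part of the repository.

```python
import subprocess
import sys

# Script order taken from the Project Structure section.
SCRIPTS = [
    "load.py",
    "clean.py",
    "explore.py",
    "log_transformation.py",
    "transformation_analysis.py",
    "random_forest.py",
    "hyperparameter_tuning.py",
]


def run_pipeline(scripts=SCRIPTS, dry_run=False):
    """Run each script with the current interpreter, stopping at the first failure."""
    commands = [[sys.executable, script] for script in scripts]
    for cmd in commands:
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a cheap way to confirm the order before a full run.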

Contributing

Contributions are welcome. Please fork the repository and create a pull request with your changes.

License

This project is licensed under the MIT License.

Owner

  • Login: jpotter80
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "Star Dataset for Stellar Classification"
type: "dataset"
authors:
  - family-names: "Ku"
    given-names: "Wing-Fung"
doi: "10.34740/KAGGLE/DSV/1433961"
url: "https://www.kaggle.com/dsv/1433961"
date-released: 2020-01-01
publisher: "Kaggle"

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • jpotter80 (1)

Dependencies

poetry.lock pypi
  • asn1crypto 1.5.1
  • certifi 2024.2.2
  • charset-normalizer 3.3.2
  • contourpy 1.2.1
  • cycler 0.12.1
  • fonttools 4.51.0
  • greenlet 3.0.3
  • idna 3.7
  • joblib 1.4.2
  • kiwisolver 1.4.5
  • matplotlib 3.8.4
  • numpy 1.26.4
  • packaging 24.0
  • pandas 2.2.2
  • pg8000 1.31.2
  • pillow 10.3.0
  • pyparsing 3.1.2
  • python-dateutil 2.9.0.post0
  • pytz 2024.1
  • requests 2.31.0
  • scikit-learn 1.4.2
  • scipy 1.13.0
  • scramp 1.4.5
  • seaborn 0.13.2
  • six 1.16.0
  • sqlalchemy 2.0.30
  • threadpoolctl 3.5.0
  • typing-extensions 4.11.0
  • tzdata 2024.1
  • urllib3 2.2.1
pyproject.toml pypi
  • SQLAlchemy ^2.0.30
  • matplotlib ^3.8.4
  • numpy ^1.26.4
  • pg8000 ^1.31.2
  • python ^3.12
  • requests ^2.31.0
  • scikit-learn ^1.4.2
  • seaborn ^0.13.2
requirements.txt pypi
  • certifi ==2024.2.2
  • charset-normalizer ==3.3.2
  • contourpy ==1.2.1
  • cycler ==0.12.1
  • fonttools ==4.51.0
  • idna ==3.7
  • joblib ==1.4.2
  • kiwisolver ==1.4.5
  • matplotlib ==3.8.4
  • numpy ==1.26.4
  • packaging ==24.0
  • pandas ==2.2.2
  • pillow ==10.3.0
  • psycopg2 ==2.9.9
  • pyparsing ==3.1.2
  • python-dateutil ==2.9.0.post0
  • pytz ==2024.1
  • requests ==2.31.0
  • scikit-learn ==1.4.2
  • scipy ==1.13.0
  • seaborn ==0.13.2
  • six ==1.16.0
  • threadpoolctl ==3.5.0
  • tzdata ==2024.1
  • urllib3 ==2.2.1