regression-algorithms-from-scratch
https://github.com/krishnaaggarwal2003/regression-algorithms-from-scratch
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references: none found
- ○ Academic publication links: none found
- ○ Academic email domains: none found
- ○ Institutional organization owner: no
- ○ JOSS paper metadata: none found
- ○ Scientific vocabulary similarity: low similarity (10.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: KrishnaAggarwal2003
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 262 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Regression Algorithms from Scratch
This repository demonstrates Linear Regression and Logistic Regression implemented from scratch with NumPy, including detailed training, evaluation, and visualisation. The goal is to provide a clear, educational look at how these foundational machine learning algorithms fit models under the hood.
Contents
- LR_code.ipynb: Linear Regression from scratch (for continuous targets)
- Logistic.ipynb: Logistic Regression from scratch (for binary classification)
1. Linear Regression (LR_code.ipynb)
Overview
- Data Generation: Synthetic data is created with random features, coefficients, and Gaussian noise.
- Model: Implements multivariate linear regression using gradient descent.
- Training: Tracks cost (MSE) and R² score (accuracy) over epochs, with early stopping.
- Evaluation: Reports MSE, MAE, R², and visualizes predictions, residuals, and learned coefficients.
Key Steps
Data Creation:
- Features (X), coefficients (beta), and noise are randomly generated.
- Target (Y) is computed as a linear combination of features plus noise.
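The data-creation steps above can be sketched in a few lines of NumPy. The sizes, coefficient range, and noise scale below are illustrative assumptions, not the notebook's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5  # assumed sizes for illustration

# Random feature matrix, true coefficients, and Gaussian noise
X = rng.normal(size=(n_samples, n_features))
beta = rng.uniform(-10, 10, size=n_features)
noise = rng.normal(scale=1.0, size=n_samples)

# Target is a linear combination of features plus noise
Y = X @ beta + noise
```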
Model Training:
- Custom LinearRegression class with manual gradient descent.
- Updates both coefficients and intercept.
- Early stopping if the cost converges.
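A minimal sketch of the training loop described above. The class name `LinearRegressionGD` and the hyperparameter defaults are hypothetical; the notebook's own `LinearRegression` class may differ in detail:

```python
import numpy as np

class LinearRegressionGD:
    """Illustrative gradient-descent linear regression (not the notebook's exact code)."""

    def __init__(self, lr=0.01, epochs=500, tol=1e-8):
        self.lr, self.epochs, self.tol = lr, epochs, tol

    def fit(self, X, y):
        n, d = X.shape
        self.coef_ = np.zeros(d)
        self.intercept_ = 0.0
        prev_cost = np.inf
        for _ in range(self.epochs):
            y_pred = X @ self.coef_ + self.intercept_
            error = y_pred - y
            cost = np.mean(error ** 2)
            # Early stopping once the MSE cost converges
            if abs(prev_cost - cost) < self.tol:
                break
            prev_cost = cost
            # Gradients of MSE w.r.t. coefficients and intercept
            self.coef_ -= self.lr * (2.0 / n) * (X.T @ error)
            self.intercept_ -= self.lr * (2.0 / n) * error.sum()
        return self

    def predict(self, X):
        return X @ self.coef_ + self.intercept_
```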
Evaluation & Visualization:
- Calculates MSE, MAE, and R² on the test set.
- Plots predicted vs. actual values, residual distribution, and compares true vs. learned coefficients.
Output obtained from Test-data
```
Epoch 0/500, Cost: 2.6241, Accuracy: -181.53%
...
Epoch 499/500, Cost: 0.0003, Accuracy: 99.97%

Range of Y data: -354.16 to 400.74, i.e. 754.9
Mean-squared Error: 3.6998
Mean-Absolute Error: 1.5285
R² score (Accuracy): 99.9647%
```
The model achieved excellent results. With a very high R² score of 99.9647%, it explains almost all of the variance in the Y data. The low Mean Squared Error (3.6998) and Mean Absolute Error (1.5285) relative to the wide range of the Y data (754.9) further confirm the model's high accuracy and small prediction errors.
The graph clearly shows that the blue predicted points cluster tightly around the red "Ideal Fit" line. This strong alignment visually confirms the model's excellent performance, as indicated by the high R² score (99.9647%) and low error metrics previously discussed. The model's predictions are remarkably close to the true values across the entire range of data.
This "Distribution of Residuals" histogram demonstrates the model's excellent performance by showing that its errors are normally distributed and centred around zero. This ideal distribution indicates that most predictions are highly accurate with small, unbiased errors, reinforcing the model's overall robustness and reliability.
This graph visually confirms the model's success in learning the underlying data relationships, as the "Learned Coefficients" (black bars) closely mirror the "Actual Coefficients" (blue bars) in both magnitude and direction for each feature variable. This strong alignment demonstrates the model's high accuracy in identifying the true influence of each feature on the target.
2. Logistic Regression (Logistic.ipynb)
Overview
- Data Generation: Uses make_classification to create a synthetic binary classification dataset, with optional label noise.
- Model: Implements logistic regression with options for L1, L2, or combined regularization.
- Training: Uses gradient descent, tracks loss and accuracy, and supports early stopping.
- Evaluation: Reports classification metrics, confusion matrix, ROC curve, and visualises loss/accuracy curves.
Key Steps
Data Creation:
- Features are standardised, and a bias term is added.
- Optional label noise for realism.
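The preprocessing above might look like the following. The dataset sizes and the 5% noise rate are assumed values for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Standardise features to zero mean and unit variance
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Prepend a bias column of ones
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Optional label noise: flip a small fraction of labels for realism
noise_rate = 0.05  # assumed rate
flip = rng.random(y.shape[0]) < noise_rate
y = np.where(flip, 1 - y, y)
```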
Model Training:
- Custom LogisticRegression class with manual gradient descent.
- Supports L1, L2, and combined regularization.
- Tracks cost and accuracy per epoch.
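A compact sketch of gradient-descent logistic regression with the regularization options described above. The class name `LogisticRegressionGD` and defaults are hypothetical, and this version assumes the bias is already included as a column of ones in X:

```python
import numpy as np

class LogisticRegressionGD:
    """Illustrative logistic regression with optional L1/L2 penalties."""

    def __init__(self, lr=0.1, epochs=2000, l1=0.0, l2=0.0):
        self.lr, self.epochs, self.l1, self.l2 = lr, epochs, l1, l2

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        for _ in range(self.epochs):
            p = self._sigmoid(X @ self.w)
            # Gradient of cross-entropy loss plus L1/L2 penalty terms
            grad = (X.T @ (p - y)) / n + self.l1 * np.sign(self.w) + self.l2 * self.w
            self.w -= self.lr * grad
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```

Setting both l1 and l2 to nonzero values gives the combined (elastic-net style) penalty mentioned above.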
Evaluation & Visualization:
- Classification report (precision, recall, f1-score, accuracy).
- Plots: loss curve, accuracy curve, confusion matrix, ROC curve with AUC.
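The confusion-matrix counts and the precision/recall/f1 entries of the classification report reduce to simple comparisons; these helper functions are a sketch, not the notebook's code:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```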
Output from Test-data
```
Epoch 0/2000, Cost: 0.9063, Accuracy: 61.18%
...
Epoch 1999/2000, Cost: 0.5265, Accuracy: 88.98%
```
Classification Report
The "Loss Curve for Logistic Regression" illustrates the model's optimisation process. The rapid decrease in loss followed by its convergence to a stable minimum demonstrates effective training and efficient parameter optimisation, indicating the model successfully learned from the data and reached a state of optimal performance.
The plot above shows the progression of model accuracy over 2000 training iterations. Initially, accuracy increases rapidly, indicating that the model is learning effectively during the early phase of training. Around iteration 500, accuracy begins to plateau near 0.89, suggesting the model has reached convergence. After this point, the performance stabilizes with minimal fluctuation, reflecting a well-trained model with consistent accuracy.
Confusion Matrix
The confusion matrix summarises the classification performance of the model on the test dataset:
- True Positives (1 predicted as 1): 1800
- True Negatives (0 predicted as 0): 1721
- False Positives (0 predicted as 1): 291
- False Negatives (1 predicted as 0): 188
The model demonstrates strong classification performance for both classes, with a relatively low number of misclassifications. It handles class 1 slightly better than class 0, as indicated by fewer false negatives. This matrix reinforces the overall high accuracy observed during training.
ROC curve
The ROC (Receiver Operating Characteristic) curve visualises the model's diagnostic ability across various threshold settings. The curve shows a strong upward trajectory with an Area Under the Curve (AUC) of 0.89, indicating that the model has high discriminative power in distinguishing between the two classes. An AUC close to 1.0 reflects a robust classifier, and the observed value of 0.89 suggests that the model maintains an excellent balance between sensitivity (true positive rate) and specificity (false positive rate).
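The AUC described above has an equivalent probabilistic reading: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A sketch computing it that way (not the notebook's method, which plots the ROC curve):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as P(score of random positive > score of random negative); ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```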
3. Educational Value
- No high-level model fitting: All learning logic is implemented manually.
- Step-by-step: Each notebook walks through data creation, model logic, training, and evaluation.
- Visualisation: Plots help interpret model performance and learning dynamics.
4. Requirements
- Python 3.x
- NumPy
- scikit-learn
- matplotlib
- seaborn
- tqdm
- pandas
Install requirements with:

```bash
pip install numpy scikit-learn matplotlib seaborn tqdm pandas
```
5. License
This repository is licensed under the MIT License. It is intended for educational and research purposes and demonstrates the inner workings of linear and logistic regression, including gradient descent, regularisation techniques, and performance evaluation metrics.
Owner
- Login: KrishnaAggarwal2003
- Kind: user
- Repositories: 1
- Profile: https://github.com/KrishnaAggarwal2003
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this code, please cite it using the metadata below."
title: "Regression Algorithms from Scratch"
authors:
- family-names: Aggarwal
given-names: Krishna
affiliation: Your Affiliation or University
date-released: 2025-05-28
version: "1.0.0"
repository-code: https://github.com/KrishnaAggarwal2003/Regression-Algorithms-from-Scratch
license: MIT
GitHub Events
Total
- Push event: 12
- Create event: 2
Last Year
- Push event: 12
- Create event: 2