ml_classification

AI course assignment

https://github.com/ivanstarostin1984/ml_classification

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (17.0%) to scientific vocabulary

Last synced: 6 months ago · JSON representation ·

Repository

AI course assignment

Basic Info

Host: GitHub
Owner: IvanStarostin1984
License: mit
Language: Python
Default Branch: main
Size: 770 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 9 months ago · Last pushed 9 months ago

Metadata Files

Readme Changelog License Citation

ML_classification

A tidy, production-ready re-implementation of my Google Colab notebook for predicting loan approvals with logistic regression, decision-tree and random-forest pipelines.

[ ROC-AUC 0.987 ± 0.008 ]

What’s inside & why it matters

End-to-end pipeline – data download, cleaning, 80 + engineered features, rigorous feature selection, model tuning, and statistical evaluation.
Statistical transparency – every performance number is reported with 95 % bootstrap confidence intervals and a fairness check (four-fifths rule).
Clean architecture – each stage is its own Python module under src/, ready for unit tests and continuous integration.
One-command reproducibility – make train or run the Docker image from the provided Dockerfile to train the models and regenerate all artefacts.
CI/CD ready – GitHub Actions lint + pytest on every push.
Modular utilities – feature engineering and diagnostics are available as importable helpers. Helpers like split.random_split, split.time_split and utils.set_seeds simplify experiments.

The rendered documentation lives at https://ivanstarostin1984.github.io/ML_classification.

See CHANGELOG.md for release notes.

Quick-start

```bash

Clone the repo

git clone https://github.com/IvanStarostin1984/MLclassification.git cd MLclassification

Set up the environment

pip install -r requirements.txt

or: conda env create -f environment.yml

Install the project in editable mode for development

pip install -e .

Enable automatic formatting on commits

pip install pre-commit pre-commit install

Running pre-commit needs network access or a `GIT_TOKEN` with

at least the `public_repo` scope. CI uses this token to clone the hooks,

so set it as a repository secret.

The hooks run `isort` before `black` and `flake8` so imports stay ordered.

In CI the workflow runs `pre-commit run --files` on changed files before

`flake8`, `black` and `pytest`.

This registers the `src` package so scripts like

`python scripts/download_data.py` can import it.

If you used conda, activate the environment

conda activate ml-classification

Provide your Kaggle API token before downloading the dataset.

Either place `kaggle.json` under `~/.kaggle/` or export the

`KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables:

export KAGGLEUSERNAME=yourusername

export KAGGLEKEY=yourkey

Download the Kaggle dataset

python scripts/download_data.py

The raw CSVs land in `data/raw/` (git-ignored).

A `.sha256` file keeps the checksum so the script skips re-downloading

if the dataset hasn't changed.

Train, evaluate and store artefacts in artefacts/

make train # run both models make eval # evaluate trained models and check fairness

or individually

make train-logreg make train-cart mlcls-train --model randomforest # train only the RF model mlcls-train --model randomforest -g # grid search for the RF model mlcls-train --model gboost # train the gradient boosting model mlcls-train --model gboost -g # grid search for gradient boosting mlcls-train --model svm # train the support vector machine mlcls-train --model svm -g # grid search for SVM mlcls-train --sampler smote # run with SMOTE oversampling ```

Pre-commit hooks format code and lint Markdown automatically on each commit. They run isort, black and flake8 when you commit.

Create a personal access token from GitHub Settings > Developer settings

Personal access tokens. Add this token under Settings > Secrets as GIT_TOKEN. CI reads this secret so pre-commit can clone hook repositories.

See data/README.md for dataset licence notes.

Interactive notebooks live under notebooks/. Open loan_demo.ipynb or advanced_demo.ipynb for a guided walkthrough. You can also launch them instantly on Binder via the badge in notebooks/README.md. Binder sessions do not ship with the Kaggle dataset and cannot download it without credentials.

Training produces feature-importance tables (logreg_coefficients.csv, cart_importances.csv) and bar-chart PNGs in artefacts/. All generated files are recorded in artefacts/SHA256_manifest.txt for reproducibility. Pass a DataFrame to logreg_coefficients or tree_feature_importances along with shap_csv_path to save SHAP value tables as well. Use plot_shap_summary to turn those values into a PNG stored in artefacts/.

make eval runs python -m src.evaluate to compute test metrics and the worst four-fifths ratio across protected groups (pass --group-col to override the default). Metrics are stored in artefacts/summary_metrics.csv and printed to stdout. A ratio below 0.8 warns of possible bias. You can replicate the notebook's exhaustive cross-validation using the training command with --grid-search (or -g):

bash mlcls-train --grid-search # repeated CV with extended parameter grids

This run takes longer but mirrors the notebook results.

For fairness evaluation and calibration instructions see docs/advanced_usage.rst.

Running tests

Execute the test-suite locally with:

bash make test

This sets PYTHONPATH so pytest can find the src package.

Local testing

Install the requirements before running the tests:

```bash pip install -r requirements.txt

or: conda env create -f environment.yml

```

Building the docs

Install Sphinx from requirements.txt or environment.yml first:

bash pip install -r requirements.txt # or conda env update -f environment.yml

Then generate HTML pages:

bash make docs # or cd docs && sphinx-build -b html . _build

The output appears under docs/_build/.

Use make lint-docs to check Markdown files.

A personal access token with the contents:write scope must be stored in the GH_PAGES_TOKEN repository secret so the docs workflow can push to the gh-pages branch. Without this secret the gh-pages job fails with "not found deploy key or tokens".

Building a wheel

Install the build tool and run:

bash python -m pip install build python -m build

The wheel lands in dist/.

Tagged releases run the same build in CI and attach the wheel to a GitHub release. Tag a commit with git tag v1.2.3 and push it to trigger the upload to PyPI via twine.

Command-line usage

After installing the project in editable mode you get these console commands:

bash pip install -e . mlcls-train # trains both models mlcls-train --model random_forest -g # extensive grid search mlcls-train --model gboost -g # gradient boosting grid search mlcls-train --model svm -g # SVM grid search mlcls-eval --threshold 0.6 # sets fairness metric cutoff mlcls-predict # generates predictions from a saved model mlcls-report # collects report artifacts mlcls-manifest # writes checksums for selected files mlcls-summary # prints dataset statistics

Example usage:

bash mlcls-summary --data-path data/raw/loan_approval_dataset.csv

These commands require the Kaggle dataset, which is distributed under its original licence. See data/README.md for details. The dataset is small – around 380 kB (~1000 rows) – so the default training run finishes in a few seconds. Pass -g to mlcls-train to perform the extensive grid search (5×3 cross-validation) used in the original notebook. See docs/cli_usage.rst for a walkthrough of these commands.

Prefer Docker?

bash docker build -t ml_classification . docker run --rm \ -e KAGGLE_USERNAME=$KAGGLE_USERNAME \ -e KAGGLE_KEY=$KAGGLE_KEY ml_classification

Model calibration

Run the calibration helper after training to create reliability plots:

bash python -m src.calibration

This saves logreg_calibration.png and cart_calibration.png (plus *_calibrated.joblib models) in artefacts/.

Repository layout

The project follows the target directory layout. Running make train now executes both the logistic regression and decision-tree pipelines located under src/models.

text legacy/ai_arisha.py ← legacy Colab script (read-only) AGENTS.md ← contributor guidelines and architecture notes .github/workflows/ci.yml ← CI pipeline (Black, flake8, pytest) scripts/download_data.py ← Kaggle dataset pull helper src/ ← Python package skeleton src/models/logreg.py ← logistic regression pipeline src/models/cart.py ← decision-tree pipeline src/models/random_forest.py ← random-forest pipeline src/models/gradient_boosting.py ← gradient boosting pipeline src/features.py ← FeatureEngineer class src/diagnostics.py ← chi-square & correlation plots src/preprocessing.py ← ColumnTransformer helpers src/selection.py ← VIF & tree-based selector src/calibration.py ← probability calibration CLI src/evaluation_utils.py ← evaluation helpers src/cv_utils.py ← cross-validation utilities src/manifest.py ← SHA-256 manifest writer src/feature_importance.py← importance tables src/pipeline_helpers.py ← grid-search utilities src/reporting.py ← report assembly helpers src/diagnostics_stats.py ← stats for diagnostics src/utils.py ← general helpers tests/ ← pytest suite data/README.md ← dataset licence notes notebooks/README.md ← Colab/Binder demo stub binder/environment.yml ← Binder spec binder/postBuild ← install step Dockerfile, Makefile ← reproducible build & workflow helpers environment.yml ← Conda spec (Python ≥ 3.10) pyproject.toml ← project build metadata requirements.txt ← pip fallback LICENSE ← MIT README.md ← you are here

Key results (hold-out test set)

| Model | ROC-AUC | PR-AUC | Fairness (4/5 rule) | | ------------------- | :-------: | :----: | ---------------------- | | Logistic Regression | 0.987 | 0.991 | Pass (gender, marital) | | Decision Tree | 0.961 | 0.972 | Pass (gender, marital) |

Values reproduced from the accompanying statistical report.

How to cite

bibtex @misc{Starostin2025LoanApproval, author = {Ivan Starostin}, title = {ML\_classification: Loan-approval prediction pipelines}, year = {2025}, url = {https://github.com/IvanStarostin1984/ML_classification} }

See CITATION.cff for other citation formats.

Author

Ivan Starostin – LinkedIn

Owner

Name: Ivan Starostin
Login: IvanStarostin1984
Kind: user
Location: Potsdam
Company: University of Europe for Applied Sciences

Repositories: 1
Profile: https://github.com/IvanStarostin1984

MD, PhD, software engineering student

Citation (CITATION.cff)

cff-version: 1.2.0
message: 'If you use this work, please cite it as below.'
authors:
  - family-names: Starostin
    given-names: Ivan
title: 'ML_classification: Loan-approval prediction pipelines'
year: 2025
url: 'https://github.com/IvanStarostin1984/ML_classification'

GitHub Events

Total

Delete event: 15
Push event: 52
Public event: 1
Pull request event: 30
Create event: 16

Last Year

Delete event: 15
Push event: 52
Public event: 1
Pull request event: 30
Create event: 16

Dependencies

.github/workflows/ci.yml actions

actions/checkout v3 composite
actions/setup-node v4 composite
actions/setup-python v4 composite
actions/upload-artifact v4 composite
rhysd/actionlint v1.7.7 composite
tj-actions/changed-files v41 composite

.github/workflows/gh-pages.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
peaceiris/actions-gh-pages v3 composite

.github/workflows/release.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
softprops/action-gh-release v1 composite

Dockerfile docker

python 3.10-slim build

binder/environment.yml pypi

environment.yml pypi

pyproject.toml pypi

imbalanced-learn *
joblib *
kaggle *
matplotlib *
numpy *
pandas *
scikit-learn *
scipy *
seaborn *
statsmodels *

requirements.txt pypi

black *
flake8 *
imbalanced-learn *
joblib *
kaggle *
matplotlib *
numpy *
pandas *
pytest *
scikit-learn *
scipy *
seaborn *
shap *
sphinx *
statsmodels *

ml_classification

Science Score: 44.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

ML_classification

What’s inside & why it matters

Quick-start

Clone the repo

Set up the environment

or: conda env create -f environment.yml

Install the project in editable mode for development

Enable automatic formatting on commits

Running pre-commit needs network access or a GIT_TOKEN with

at least the public_repo scope. CI uses this token to clone the hooks,

so set it as a repository secret.

The hooks run isort before black and flake8 so imports stay ordered.

In CI the workflow runs pre-commit run --files on changed files before

flake8, black and pytest.

This registers the src package so scripts like

python scripts/download_data.py can import it.

If you used conda, activate the environment

Provide your Kaggle API token before downloading the dataset.

Either place kaggle.json under ~/.kaggle/ or export the

KAGGLE_USERNAME and KAGGLE_KEY environment variables:

export KAGGLEUSERNAME=yourusername

export KAGGLEKEY=yourkey

Download the Kaggle dataset

The raw CSVs land in data/raw/ (git-ignored).

A .sha256 file keeps the checksum so the script skips re-downloading

if the dataset hasn't changed.

Train, evaluate and store artefacts in artefacts/

or individually

Running tests

Local testing

or: conda env create -f environment.yml

Building the docs

Building a wheel

Command-line usage

Model calibration

Repository layout

Key results (hold-out test set)

How to cite

Author

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Dependencies

Running pre-commit needs network access or a `GIT_TOKEN` with

at least the `public_repo` scope. CI uses this token to clone the hooks,

The hooks run `isort` before `black` and `flake8` so imports stay ordered.

In CI the workflow runs `pre-commit run --files` on changed files before

`flake8`, `black` and `pytest`.

This registers the `src` package so scripts like

`python scripts/download_data.py` can import it.

Either place `kaggle.json` under `~/.kaggle/` or export the

`KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables:

The raw CSVs land in `data/raw/` (git-ignored).

A `.sha256` file keeps the checksum so the script skips re-downloading