ads503_project_g1

GitHub Repository for University of San Diego's Applied Data Science Program. Summer 2025. Group 1 Members: Jun Clemente, Darren Chen, Graham Ward

https://github.com/gw-00/ads503_project_g1

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: gw-00
  • Default Branch: main
  • Size: 519 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 8 months ago
Metadata Files
Readme Citation

README.md

Machine Learning Beyond the Cleveland Dataset: Cross-Cohort Coronary Disease Prediction using Expanded Clinical Features

ADS503 - Applied Predictive Modeling

Team 1

Installation

To get started with this project, clone the repository onto your local machine using the commands below:

```{bash}
git clone https://github.com/gw-00/ads503_project_g1.git
cd ads503_project_g1
```

Contributors

Methods

  • Pre-processing
  • Exploratory Data Analysis
  • Data visualization
  • Statistical Modeling and Machine Learning
    • Logistic Regression
    • Random Forest
    • PLS Discriminant Analysis
    • k-Nearest Neighbors
    • Penalized Regression
      • Lasso Penalization
      • Ridge Penalization
      • Elastic Net Model
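The three penalized variants listed above differ only in the penalty added to the logistic loss. The sketch below illustrates that relationship in plain Python; it is not the project's R code. It assumes the common elastic net parameterization in which a mixing weight `alpha` blends the L1 and L2 terms (`alpha = 1` gives Lasso, `alpha = 0` gives Ridge). The function names, `lam`, and the toy inputs are hypothetical.

```python
from math import exp, log


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2).

    alpha = 1 is the Lasso penalty, alpha = 0 is the Ridge penalty.
    The intercept is conventionally left unpenalized, so it is not passed in.
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def penalized_log_loss(X, y, beta, intercept, lam, alpha):
    """Mean negative log-likelihood of a logistic model plus the penalty."""
    nll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(intercept + sum(b * x for b, x in zip(beta, xi)))
        nll -= yi * log(p) + (1 - yi) * log(1.0 - p)
    return nll / len(y) + penalty(beta, lam, alpha)


# Toy coefficients (invented for illustration only).
beta = [1.0, -2.0]
lasso_pen = penalty(beta, lam=0.5, alpha=1.0)    # L1 only
ridge_pen = penalty(beta, lam=0.5, alpha=0.0)    # L2 only
enet_pen = penalty(beta, lam=0.5, alpha=0.5)     # equal mix of both

# With all coefficients at zero the penalty vanishes and every
# prediction is 0.5, so the loss reduces to log(2).
X = [[0.2, 1.0], [1.5, -0.3]]
y = [1, 0]
null_loss = penalized_log_loss(X, y, [0.0, 0.0], 0.0, lam=0.5, alpha=0.5)
```

In practice `lam` and `alpha` would be chosen by cross-validation rather than fixed by hand.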

Technologies

  • RStudio
  • Quarto
  • R
  • Generative AI
    • ChatGPT

Abstract

The goal of this study was to develop a predictive model for identifying coronary heart disease using patient data from four medical centers around the globe. Using the complete 76-feature heart disease data set from the UCI Machine Learning Repository, records from the Veterans Administration in Long Beach, the Hungarian Institute of Cardiology, the University Hospital in Zurich, and the Cleveland Clinic were merged, pre-processed, and then modeled. Exploratory data analysis (EDA), data cleaning, and imputation were performed to handle extensive missing values and highly correlated features, so that neither would degrade model performance or inflate the bias and variance of the models. Multiple classification models were developed: Logistic Regression, Random Forest, Partial Least Squares Discriminant Analysis (PLS-DA), K-Nearest Neighbors (KNN), and Penalized Logistic Regression (Lasso, Ridge, and Elastic Net).
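The merging, imputation, and correlation handling described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's R code. The toy records, the feature names (`age`, `chol`, `chol_x2`), the choice of median imputation, and the 0.9 correlation threshold are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, median


def pool_cohorts(cohorts):
    """Stack per-site records into one list, tagging each row with its site."""
    pooled = []
    for site, rows in cohorts.items():
        for row in rows:
            pooled.append({**row, "site": site})
    return pooled


def impute_median(rows, features):
    """Replace missing values (None) with the median of the observed values."""
    for f in features:
        fill = median(r[f] for r in rows if r[f] is not None)
        for r in rows:
            if r[f] is None:
                r[f] = fill
    return rows


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(rows, features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for f in features:
        col = [r[f] for r in rows]
        if all(abs(pearson(col, [r[k] for r in rows])) < threshold for k in kept):
            kept.append(f)
    return kept


# Invented toy records; chol_x2 is a deliberately redundant copy of chol.
cohorts = {
    "cleveland": [
        {"age": 63, "chol": 233, "chol_x2": 466},
        {"age": None, "chol": 204, "chol_x2": 408},
    ],
    "hungarian": [
        {"age": 54, "chol": 180, "chol_x2": 360},
        {"age": 37, "chol": 250, "chol_x2": 500},
    ],
}
features = ["age", "chol", "chol_x2"]

pooled = impute_median(pool_cohorts(cohorts), features)
kept = drop_correlated(pooled, features)
```

Keeping the `site` tag on each pooled row makes it possible to check later whether a model performs differently across cohorts.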

Problem Statement

Previous predictive model development for coronary heart disease has focused on a simplified data set of 14 features and has typically used only the Cleveland subset of the data. These approaches offer accessibility and a complete data set for modeling, but they omit the remaining 62 potentially valuable predictors in the full 76-feature data set.

Goal

Enhance the predictive accuracy of coronary heart disease models by employing a richer, more detailed feature set, with the aim of improving performance metrics across the classification algorithms developed.

Non-goals

  1. Individual Health Tracking: Data collected will not involve personally identifiable health data.
  2. Medical or Clinical Recommendations: Medical treatments, vaccination protocols, and individual health interventions will not be prescribed or evaluated.

Data Sources

Acknowledgements

Portions of this codebase and documentation were developed with assistance from Generative AI, ChatGPT (OpenAI), June 2025.

References

Presentations and Projects

  1. Project Presentation:
  2. Project Slides:
  3. Document Link:
  4. Project Repo: https://github.com/gw-00/ads503_project_g1

Owner

  • Name: Graham
  • Login: gw-00
  • Kind: user

GitHub Events

Total
  • Watch event: 1
  • Delete event: 9
  • Issue comment event: 4
  • Member event: 2
  • Push event: 42
  • Pull request review event: 1
  • Pull request event: 19
  • Fork event: 2
  • Create event: 13
Last Year
  • Watch event: 1
  • Delete event: 9
  • Issue comment event: 4
  • Member event: 2
  • Push event: 42
  • Pull request review event: 1
  • Pull request event: 19
  • Fork event: 2
  • Create event: 13

Dependencies

requirements.txt pypi