ohss-prediction-imbalanced-data

https://github.com/aricept094/ohss-prediction-imbalanced-data

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to: medrxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Aricept094
License: mit
Language: Python
Default Branch: main
Homepage: https://www.medrxiv.org/content/10.1101/2024.04.17.24305980v1
Size: 68.4 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

Prediction of Complicated Ovarian Hyperstimulation Syndrome using Machine Learning

This repository contains the Python code for the study titled: "Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence".

Description

Ovarian Hyperstimulation Syndrome (OHSS) is a significant complication in Assisted Reproductive Technology (ART). Predicting complicated OHSS (moderate/severe cases requiring intervention) is challenging, particularly due to the highly imbalanced nature of clinical datasets.

This project implements a comprehensive machine learning framework designed to:

1. Predict the likelihood of complicated OHSS in patients undergoing infertility treatments.
2. Address severe class imbalance using advanced data augmentation techniques (specifically exploring variants from the smote-variants library).
3. Systematically optimize the entire prediction pipeline (preprocessing, feature selection, model selection, and hyperparameters) using Ray Tune.
4. Identify key clinical factors contributing to complicated OHSS risk using SHAP (SHapley Additive exPlanations).
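To make the idea of oversampling a minority class concrete, here is a minimal sketch of the basic SMOTE interpolation step in plain Python. It is not the smote-variants implementation and not the IPADE-ID method; the function name `smote_like_oversample`, its parameters, and the toy points are illustrative assumptions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Assumes `minority` holds at least two distinct points (lists of floats).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        # Pick a random point on the segment between base and partner.
        t = rng.random()
        synthetic.append([b + t * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Toy minority-class samples with two features each (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies. The variants in smote-variants differ mainly in how they choose base points and neighbours, or in how they filter the results.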

The framework explores various ML models (including Logistic Regression, SVM, SGD, Ridge Regression, KNN, Tree-based models) and integrates them into an ensemble Voting Classifier. The optimization process aims to maximize recall for the minority class (complicated OHSS) while maintaining reasonable overall performance.

The best model identified in the associated study utilized IPADE-ID for data augmentation combined with an ensemble of Stochastic Gradient Descent, Support Vector Machine, and Ridge Regression classifiers, achieving high recall (0.9) for complicated OHSS prediction.
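The ensemble idea can be sketched as a per-sample majority vote over the labels predicted by each base model. This is only an illustration in plain Python: the README does not state whether the Voting Classifier uses hard or soft voting, so hard voting is an assumption here, and the prediction lists are made up.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote per sample."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Made-up predictions from three base models for four patients (1 = complicated OHSS).
sgd_pred = [1, 0, 1, 0]
svc_pred = [1, 1, 0, 0]
ridge_pred = [0, 1, 1, 0]

ensemble_pred = hard_vote([sgd_pred, svc_pred, ridge_pred])
print(ensemble_pred)  # [1, 1, 1, 0]
```

With three models and two classes there can be no ties, which is one practical reason to use an odd number of base classifiers.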

Related Publication

Title: Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence

Authors: Arash Ziaee¹, Hamed Khosravi², Tahereh Sadeghi³, Imtiaz Ahmed⁴, Maliheh Mahmoudinia⁵*

1. General Physician, Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
2. Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US
3. Assistant Professor of Nursing, Department of Pediatrics, School of Nursing and Midwifery, Nursing and Midwifery Care Research Center, Akbar Hospital, Mashhad University of Medical Sciences, Mashhad, Iran
4. Assistant Professor, Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US
5. Assistant Professor of Obstetrics & Gynecology, Fellowship of Infertility, Supporting the Family and the Youth of Population Research Core, Department of Obstetrics and Gynecology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran (*Corresponding Author)

Status: Currently under review. Preprint: https://www.medrxiv.org/content/10.1101/2024.04.17.24305980v1

Please cite the publication if you use this code or research findings.

Installation

1. Clone the repository:

       git clone https://github.com/Aricept094/OHSS-Prediction-Imbalanced-Data.git
       cd OHSS-Prediction-Imbalanced-Data

2. Create a virtual environment (recommended):

       python -m venv venv
       source venv/bin/activate  # On Windows use `venv\Scripts\activate`

3. Install dependencies. The required libraries are listed in requirements.txt. Install them using pip:

       pip install -r requirements.txt

   Key dependencies include: pandas, numpy, scikit-learn, ray[tune], smote-variants, optuna, xgboost, lightgbm, shap, joblib, pyarrow.

Important Note on smote-variants:

To ensure reproducibility of the results presented in this work, a specific modification was required for the smote-variants library. The standard version (tested with version 1.0.0) contains an issue where the internal model_selection routine fails when evaluating parameter sets containing NumPy data types (e.g., np.False_, np.str_) due to its use of ast.literal_eval. Therefore, requirements.txt installs a custom version with the fix from the following repository: https://github.com/Aricept094/smotevariantsliteralevalfixforRay.git@main#egg=smote_variants
Python Version: Developed and tested using Python 3.10
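The kind of failure described in the smote-variants note can be reproduced with the standard library alone. The sketch below does not touch smote-variants or the patched fork; the helper name `try_literal_eval` and the parameter strings are illustrative assumptions. It only shows that `ast.literal_eval` accepts plain Python literals but rejects text containing NumPy-style scalar representations such as `np.False_`.

```python
import ast


def try_literal_eval(text):
    """Return (True, value) if text is a plain Python literal, else (False, message)."""
    try:
        return True, ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        return False, str(exc)


# A parameter set written with plain Python literals parses without trouble.
ok_plain, value = try_literal_eval("{'proportion': 1.0, 'flag': False}")

# The same parameter set as it looks when a NumPy boolean is printed with repr()
# under NumPy 2.x; `np.False_` is an attribute lookup, not a literal, so parsing fails.
ok_numpy, error = try_literal_eval("{'proportion': 1.0, 'flag': np.False_}")

print(ok_plain, value)  # True {'proportion': 1.0, 'flag': False}
print(ok_numpy)         # False
```

One common workaround for this class of problem is to convert NumPy scalars to native Python values (for example with `.item()`) before they are turned into text. Whether the patched fork takes this approach is not stated here.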

Data

IMPORTANT: Due to patient privacy regulations and the ethics approval (Mashhad University of Medical Sciences ethics committee IR.MUMS.REC.1395.326), the original patient dataset cannot be shared publicly in this repository.

The script (tune_ohss_pipeline.py) expects input data in the following format:

*   Two separate Excel files: train_data.xlsx and test_data.xlsx.
*   These files should be placed in the directory specified by the hardcoded paths train_data_path and test_data_path within the script (currently /home/[USEER_NAME]/my_data/). You will need to modify these paths.
*   Required Features: The input data files (train_data.xlsx, test_data.xlsx) must contain the target variable OHSS and all of the following features (column names), exactly as listed. The script will fail if any of these columns are missing or named differently. Descriptions are based on the associated study:

*   `age`: Patient's age at baseline (Years).
*   `weight`: Patient's weight at baseline (Kilograms).
*   `Height`: Patient's height at baseline (Centimeters).
*   `Durationinfertility`: Duration of infertility at baseline (Years or relevant time unit).
*   `FSH`: Follicle-Stimulating Hormone level at baseline (IU/L).
*   `LH`: Luteinizing Hormone level at baseline (IU/L).
*   `numberofcunsumptiondrug`: Total number of stimulation drug doses consumed during the cycle.
*   `durationofstimulation`: Duration of ovarian stimulation (Days).
*   `numberRfulicule`: Count of follicles in the right ovary on the day of hCG trigger.
*   `numberLfulicule`: Count of follicles in the left ovary on the day of hCG trigger.
*   `numberofoocyte`: Total count of oocytes retrieved on egg retrieval day.
*   `metagha1oocyte`: Count of Metaphase I (MI) oocytes retrieved.
*   `metaghaze2oocyte`: Count of Metaphase II (MII) oocytes retrieved (mature oocytes).
*   `necrozeoocyte`: Count of necrotic (dead) oocytes retrieved.
*   `lowQualityoocyte`: Count of oocytes deemed low quality upon retrieval.
*   `Gvoocyte`: Count of Germinal Vesicle (GV) oocytes retrieved (immature).
*   `macrooocyte`: Count of macro-oocytes (abnormally large) retrieved.
*   `partogenesoocyte`: Count of spontaneously activated / parthenogenetic oocytes observed at denudation.
*   `numberembrio`: Total count of embryos developed post-retrieval.
*   `countspermogram`: Sperm count/concentration from baseline spermogram analysis.
*   `motilityspermogram`: Percentage of motile sperm from baseline spermogram analysis (%).
*   `morfologyspermogram`: Percentage of sperm with normal morphology from baseline spermogram analysis (%).
*   `gradeembrio`: Quality grade assigned to the embryo(s) (Categorical: e.g., Grade 1, 2, 3).
*   `Typecycle`: Type of treatment protocol used (Categorical: e.g., GnRH Agonist, GnRH Antagonist).
*   `reasoninfertility`: Identified cause or source of infertility (Categorical: e.g., Female Factor, Male Factor, Both, Unexpected/Unexplained).
*   `Typeofcunsumptiondrug`: Specific type of stimulation drug used (Categorical: e.g., Cinnal-f, Gonal-f, hMG).
*   `typeoftrigger`: Type of drug used for final oocyte maturation trigger (Categorical: e.g., GnRH Agonist, hCG, Dual Trigger).
*   `Typedrug`: Type of Drug Regimen/Protocol Detail (Categorical).
*   `pregnancy`: History of previous pregnancy (Categorical: Positive/Negative).
*   `mense`: Regularity of menstrual cycle at baseline (Categorical: Regular/Irregular).
*   `Infertility`: Type of infertility at baseline (Categorical: Primary/Secondary).

Target Variable (OHSS): Must be present as a column. Encoded as 0 for Uncomplicated OHSS (corresponding to original categories 'Painless' or 'Mild' OHSS) and 1 for Complicated OHSS (corresponding to original categories 'Moderate' or 'Severe' OHSS).
Data Preprocessing: The script assumes the input data has already been handled for missing values (e.g., using imputation methods like those described in the paper - Random Forest for continuous, mean for categorical).
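The target encoding described above can be sketched as a small lookup. The mapping of categories to 0 and 1 follows this README; the exact spelling of the category labels in the raw data, and the helper name `encode_ohss`, are assumptions.

```python
# Original OHSS categories mapped to the binary target used by the script.
OHSS_ENCODING = {"Painless": 0, "Mild": 0, "Moderate": 1, "Severe": 1}


def encode_ohss(category):
    """Map an original OHSS category to 0 (uncomplicated) or 1 (complicated)."""
    key = category.strip().title()  # tolerate stray spaces and letter case
    if key not in OHSS_ENCODING:
        raise ValueError(f"Unknown OHSS category: {category!r}")
    return OHSS_ENCODING[key]


print([encode_ohss(c) for c in ["Painless", "Mild", "Moderate", "Severe"]])  # [0, 0, 1, 1]
```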

For testing purposes: You may want to create a small, synthetic, or anonymized dummy dataset following the expected structure and including all required column names (features + target) to run the script and verify its functionality.
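A dummy dataset of this kind could be generated along the following lines. This is a sketch under stated assumptions: the column names come from the list above, but the value ranges, the integer coding of categorical features, and the 10% positive rate are all made up, and the values carry no clinical meaning.

```python
import random

NUMERIC_COLUMNS = [
    "age", "weight", "Height", "Durationinfertility", "FSH", "LH",
    "numberofcunsumptiondrug", "durationofstimulation", "numberRfulicule",
    "numberLfulicule", "numberofoocyte", "metagha1oocyte", "metaghaze2oocyte",
    "necrozeoocyte", "lowQualityoocyte", "Gvoocyte", "macrooocyte",
    "partogenesoocyte", "numberembrio", "countspermogram",
    "motilityspermogram", "morfologyspermogram",
]
CATEGORICAL_COLUMNS = [
    "gradeembrio", "Typecycle", "reasoninfertility", "Typeofcunsumptiondrug",
    "typeoftrigger", "Typedrug", "pregnancy", "mense", "Infertility",
]
TARGET = "OHSS"


def make_dummy_rows(n_rows, positive_rate=0.1, seed=0):
    """Build rows of placeholder values with every required column present."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Assumption: arbitrary small integers stand in for real measurements.
        row = {name: rng.randint(0, 40) for name in NUMERIC_COLUMNS}
        # Assumption: categorical features are stored as small integer codes.
        row.update({name: rng.randint(0, 2) for name in CATEGORICAL_COLUMNS})
        # Imbalanced binary target: roughly `positive_rate` of rows are class 1.
        row[TARGET] = 1 if rng.random() < positive_rate else 0
        rows.append(row)
    return rows


rows = make_dummy_rows(200)
```

With pandas and openpyxl installed, the rows can be written to the expected file with `pandas.DataFrame(rows).to_excel("train_data.xlsx", index=False)`. If the script expects categorical features as text labels rather than integer codes, adjust the generator accordingly.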

Usage

1. Modify Paths: Open tune_ohss_pipeline.py and update the train_data_path, test_data_path, and log_directory_base variables to point to your data location and desired output directory.
2. Prepare Data: Ensure your train_data.xlsx and test_data.xlsx files are in the correct location and contain all the required features (with exact names and described content) listed in the "Data" section above, along with the OHSS target variable (encoded 0/1).
3. Run the Script: Execute the main script from your terminal within the activated virtual environment:

       python tune_ohss_pipeline.py

4. Process: This script will initiate a Ray Tune hyperparameter optimization process, running a large number of trials (defined by num_samples=15000) to find the best combination of preprocessing steps, feature subsets, data augmentation techniques (SMOTE variants), models, and hyperparameters based on the recall_mean metric.
5. Output:
- The script will create a main log directory (log_directory_base) named with the execution timestamp.
- Inside, it will create subdirectories for each Ray Tune trial.
- Each trial directory will contain logs, saved intermediate dataframes (.csv), configuration details, and potentially saved unfitted/fitted model files (.joblib).
- The console will output progress, and the best configuration found during the run will be printed at the end.
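Conceptually, the search samples many pipeline configurations, scores each one, and keeps the best. The sketch below uses plain random search in standard-library Python as a stand-in; it is not Ray Tune and does not reproduce the script. The search space entries, the `toy_score` function, and its numbers are hypothetical placeholders for training a pipeline and measuring recall_mean.

```python
import random

# Hypothetical search space; the real one is `search_space` inside the script.
SEARCH_SPACE = {
    "oversampler": ["none", "SMOTE", "IPADE_ID"],
    "model": ["sgd", "svc", "ridge"],
    "n_features": [5, 10, 20, 31],
}


def sample_config(rng):
    """Draw one pipeline configuration uniformly from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def toy_score(config):
    """Placeholder for training a pipeline and measuring its recall (made-up numbers)."""
    score = 0.5
    score += 0.2 if config["oversampler"] != "none" else 0.0
    score += 0.1 if config["model"] == "svc" else 0.0
    score += 0.005 * config["n_features"]
    return score


def random_search(num_samples, seed=0):
    """Evaluate num_samples random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        config = sample_config(rng)
        trials.append((toy_score(config), config))
    return max(trials, key=lambda trial: trial[0])


best_score, best_config = random_search(num_samples=50)
print(best_score, best_config)
```

Ray Tune adds parallel execution, per-trial logging, and smarter search algorithms on top of this basic loop, which is what makes a run of 15000 trials practical.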

Reproducing Results

Running tune_ohss_pipeline.py executes the hyperparameter search framework described in the paper. The goal is to find high-performing pipeline configurations for predicting complicated OHSS.

The script explores the search_space defined within it.
The best configuration printed at the end represents the optimal pipeline found in that specific run, maximizing the custom recall_mean metric.
Due to the stochastic nature of the search and model training, the exact best configuration might vary slightly between runs.
The configuration reported in the paper (IPADE-ID augmentation + SGD/SVC/Ridge ensemble) was identified through this optimization process and represents one such high-performing result achieving Recall=0.9 for Class 1 and Accuracy=0.76.
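High minority-class recall and moderate accuracy can coexist on imbalanced data, as the short calculation below illustrates. The labels are made-up counts chosen only so that the arithmetic lands on the same two figures; they are not the study's confusion matrix.

```python
def recall_and_accuracy(y_true, y_pred, positive=1):
    """Compute recall for the positive class and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    accuracy = correct / len(pairs)
    return recall, accuracy


# Made-up example: 10 complicated cases among 100 patients.
y_true = [1] * 10 + [0] * 90
# The model catches 9 of the 10 positives but raises 23 false alarms.
y_pred = [1] * 9 + [0] * 1 + [1] * 23 + [0] * 67

recall, accuracy = recall_and_accuracy(y_true, y_pred)
print(recall, accuracy)  # 0.9 0.76
```

A pipeline tuned for recall accepts extra false alarms in exchange for missing fewer complicated cases, which is why accuracy alone is a poor guide on imbalanced clinical data.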

Citation

If you use this code or the findings from the associated study in your research, please cite both the original paper and this software repository.

Paper:

*   Ziaee, A., Khosravi, H., Sadeghi, T., Ahmed, I., & Mahmoudinia, M. (Year). Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence. [Journal Name/Conference Proceedings, Volume, Pages, DOI - Update when published]

Software:

*   Please use the citation information provided in the CITATION.cff file or the "Cite this repository" button on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions regarding the research or code, please contact the author:

*   Arash Ziaee: ziaeia961@mums.ac.ir

Owner

Name: Arash Ziaee
Login: Aricept094
Kind: user
Location: Iran

Repositories: 1
Profile: https://github.com/Aricept094

Medical Student at Mashhad University of Medical Sciences | MPH Student at Shiraz University of Medical Sciences | Former 3D4MEDICAL Student Ambassador | Part-T

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it using the metadata from this file."
title: "Code for: Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence"
version: 1.0.0 # Initial release version
doi: "[DOI for the software - e.g., via Zenodo - To be added]" # Optional but recommended
date-released: "[YYYY-MM-DD - e.g., 2024-05-21]"
authors:
  - given-names: "Arash"
    family-names: "Ziaee"
    affiliation: "Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX" # Add ORCID if available
  - given-names: "Hamed"
    family-names: "Khosravi"
    affiliation: "Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Tahereh"
    family-names: "Sadeghi"
    affiliation: "Department of Pediatrics, School of Nursing and Midwifery, Nursing and Midwifery Care Research Center, Akbar Hospital, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Imtiaz"
    family-names: "Ahmed"
    affiliation: "Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Maliheh"
    family-names: "Mahmoudinia"
    affiliation: "Supporting the Family and the Youth of Population Research Core, Department of Obstetrics and Gynecology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
    email: "mahmoudiniam@mums.ac.ir"
repository-code: "https://github.com/Aricept094/OHSS-Prediction-Imbalanced-Data.git"
keywords:
  - "Ovarian Hyperstimulation Syndrome"
  - "OHSS"
  - "Machine Learning"
  - "Assisted Reproductive Technology"
  - "ART"
  - "In Vitro Fertilization"
  - "IVF"
  - "Data Augmentation"
  - "Prediction"
  - "Ray Tune"
  - "SMOTE"
  - "Ensemble Learning"
license: "MIT"
# references: # Optional: Link to the paper itself
#   - type: article
#     title: "Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence"
#     authors: # Repeat authors list or simplify
#       - given-names: "Arash"
#         family-names: "Ziaee"
#       - given-names: "Hamed"
#         family-names: "Khosravi"
#       - given-names: "Tahereh"
#         family-names: "Sadeghi"
#       - given-names: "Imtiaz"
#         family-names: "Ahmed"
#       - given-names: "Maliheh"
#         family-names: "Mahmoudinia"
#     journal: "[Journal Name - To be added]"
#     doi: "[DOI of paper - To be added]"
#     year: "[Year - e.g., 2024]"

GitHub Events

Total

Push event: 2
Public event: 1

Last Year

Push event: 2
Public event: 1

Dependencies

requirements.txt pypi

GitPython ==3.1.44
Mako ==1.3.9
Markdown ==3.7
MarkupSafe ==3.0.2
MiniSom ==2.3.5
PyYAML ==6.0.2
Pygments ==2.19.1
SQLAlchemy ==2.0.40
Werkzeug ==3.1.3
absl-py ==2.2.1
aiohappyeyeballs ==2.6.1
aiohttp ==3.11.14
aiohttp-cors ==0.8.0
aiosignal ==1.3.2
alembic ==1.15.2
annotated-types ==0.7.0
anyio ==4.9.0
astunparse ==1.6.3
async-timeout ==5.0.1
attrs ==25.3.0
cachetools ==5.5.2
certifi ==2025.1.31
charset-normalizer ==3.4.1
click ==8.1.8
colorful ==0.5.6
colorlog ==6.9.0
contourpy ==1.3.1
cycler ==0.12.1
distlib ==0.3.9
docker-pycreds ==0.4.0
docutils ==0.21.2
et_xmlfile ==2.0.0
exceptiongroup ==1.2.2
fastapi ==0.115.12
filelock ==3.18.0
flatbuffers ==25.2.10
fonttools ==4.56.0
frozenlist ==1.5.0
fsspec ==2025.3.1
gast ==0.6.0
gitdb ==4.0.12
google-api-core ==2.24.2
google-auth ==2.38.0
google-pasta ==0.2.0
googleapis-common-protos ==1.69.2
greenlet ==3.1.1
grpcio ==1.71.0
h11 ==0.14.0
h5py ==3.13.0
httptools ==0.6.4
idna ==3.10
intel-cmplr-lib-ur ==2025.1.0
intel-openmp ==2025.1.0
joblib ==1.4.2
jsonschema ==4.23.0
jsonschema-specifications ==2024.10.1
keras ==3.9.1
kiwisolver ==1.4.8
libclang ==18.1.1
lightgbm ==4.6.0
markdown-it-py ==3.0.0
matplotlib ==3.10.1
mdurl ==0.1.2
metric-learn ==0.7.0
mkl ==2025.1.0
ml_dtypes ==0.5.1
msgpack ==1.1.0
multidict ==6.2.0
namex ==0.0.8
numpy ==2.1.3
nvidia-nccl-cu12 ==2.26.2
opencensus ==0.11.4
opencensus-context ==0.1.3
openpyxl ==3.1.5
opt_einsum ==3.4.0
optree ==0.14.1
optuna ==4.2.1
packaging ==24.2
pandas ==2.2.3
pillow ==11.1.0
platformdirs ==4.3.7
prometheus_client ==0.21.1
propcache ==0.3.1
proto-plus ==1.26.1
protobuf ==5.29.4
psutil ==7.0.0
py-spy ==0.4.0
pyarrow ==19.0.1
pyasn1 ==0.6.1
pyasn1_modules ==0.4.2
pydantic ==2.11.1
pydantic_core ==2.33.0
pyparsing ==3.2.3
python-dateutil ==2.9.0.post0
python-dotenv ==1.1.0
pytz ==2025.2
ray ==2.44.1
referencing ==0.36.2
requests ==2.32.3
rich ==14.0.0
rpds-py ==0.24.0
rsa ==4.9
scikit-learn ==1.6.1
scipy ==1.15.2
seaborn ==0.13.2
sentry-sdk ==2.24.1
setproctitle ==1.3.5
six ==1.17.0
smart-open ==7.1.0
smmap ==5.0.2
sniffio ==1.3.1
starlette ==0.46.1
statistics ==1.0.3.5
tbb ==2022.1.0
tcmlib ==1.3.0
tensorboard ==2.19.0
tensorboard-data-server ==0.7.2
tensorboardX ==2.6.2.2
tensorflow ==2.19.0
tensorflow-io-gcs-filesystem ==0.37.1
termcolor ==2.5.0
threadpoolctl ==3.6.0
tqdm ==4.67.1
typing-inspection ==0.4.0
typing_extensions ==4.13.0
tzdata ==2025.2
umf ==0.10.0
urllib3 ==2.3.0
uvicorn ==0.34.0
uvloop ==0.21.0
virtualenv ==20.29.3
wandb ==0.19.8
watchfiles ==1.0.4
websockets ==15.0.1
wrapt ==1.17.2
xgboost ==3.0.0
yarl ==1.18.3
