ohss-prediction-imbalanced-data
https://github.com/aricept094/ohss-prediction-imbalanced-data
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to medrxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Aricept094
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://www.medrxiv.org/content/10.1101/2024.04.17.24305980v1
- Size: 68.4 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Prediction of Complicated Ovarian Hyperstimulation Syndrome using Machine Learning
This repository contains the Python code for the study titled: "Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence".
Description
Ovarian Hyperstimulation Syndrome (OHSS) is a significant complication in Assisted Reproductive Technology (ART). Predicting complicated OHSS (moderate/severe cases requiring intervention) is challenging, particularly due to the highly imbalanced nature of clinical datasets.
This project implements a comprehensive machine learning framework designed to:
1. Predict the likelihood of complicated OHSS in patients undergoing infertility treatments.
2. Address severe class imbalance using advanced data augmentation techniques (specifically exploring variants from the smote-variants library).
3. Systematically optimize the entire prediction pipeline (preprocessing, feature selection, model selection, and hyperparameters) using Ray Tune.
4. Identify key clinical factors contributing to complicated OHSS risk using SHAP (SHapley Additive exPlanations).
The framework explores various ML models (including Logistic Regression, SVM, SGD, Ridge Regression, KNN, Tree-based models) and integrates them into an ensemble Voting Classifier. The optimization process aims to maximize recall for the minority class (complicated OHSS) while maintaining reasonable overall performance.
The best model identified in the associated study utilized IPADE-ID for data augmentation combined with an ensemble of Stochastic Gradient Descent, Support Vector Machine, and Ridge Regression classifiers, achieving high recall (0.9) for complicated OHSS prediction.
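The ensemble step can be pictured with a small hard-voting sketch. It is plain Python for illustration only: the function name `majority_vote` and the toy predictions are assumptions, not code from `tune_ohss_pipeline.py`, and this README does not state whether the study's Voting Classifier uses hard or soft voting.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label predictions by hard (majority) voting.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties are broken in favour of the smallest label.
    """
    if not predictions_per_model:
        raise ValueError("need at least one model's predictions")
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        # Iterate labels in sorted order so max() returns the smallest on a tie.
        combined.append(max(sorted(counts), key=counts.get))
    return combined


# Toy predictions from three hypothetical fitted models (1 = complicated OHSS).
sgd_pred = [1, 0, 1, 0]
svm_pred = [1, 1, 0, 0]
ridge_pred = [0, 1, 1, 0]

ensemble_pred = majority_vote([sgd_pred, svm_pred, ridge_pred])
print(ensemble_pred)  # [1, 1, 1, 0]
```

With three voters and two classes a tie cannot occur, which is one practical reason to use an odd number of base models in a hard-voting ensemble.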
Related Publication
Title: Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence
Authors: Arash Ziaee¹, Hamed Khosravi², Tahereh Sadeghi³, Imtiaz Ahmed⁴, Maliheh Mahmoudinia⁵*
- ¹ General Physician, Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- ² Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US
- ³ Assistant Professor of Nursing, Department of Pediatrics, School of Nursing and Midwifery, Nursing and Midwifery Care Research Center, Akbar Hospital, Mashhad University of Medical Sciences, Mashhad, Iran
- ⁴ Assistant Professor, Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US
- ⁵ Assistant Professor of Obstetrics & Gynecology, Fellowship of Infertility, Supporting the Family and the Youth of Population Research Core, Department of Obstetrics and Gynecology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran (*Corresponding Author)
Status: Currently under review. Preprint: https://www.medrxiv.org/content/10.1101/2024.04.17.24305980v1
Please cite the publication if you use this code or research findings.
Installation
Clone the repository:

    git clone https://github.com/Aricept094/OHSS-Prediction-Imbalanced-Data.git
    cd OHSS-Prediction-Imbalanced-Data

Create a virtual environment (Recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install dependencies: The required libraries are listed in `requirements.txt`. Install them using pip:

    pip install -r requirements.txt

Key dependencies include: `pandas`, `numpy`, `scikit-learn`, `ray[tune]`, `smote-variants`, `optuna`, `xgboost`, `lightgbm`, `shap`, `joblib`, `pyarrow`. (Ensure `requirements.txt` is generated and included in the repo.)

Important Note on `smote-variants`: To ensure reproducibility of the results presented in this work, a specific modification was required for the `smote-variants` library. The standard version (tested with version 1.0.0) contains an issue where the internal `model_selection` routine fails when evaluating parameter sets containing NumPy data types (e.g., `np.False_`, `np.str_`) due to its use of `ast.literal_eval`. Therefore, `requirements.txt` installs a custom version with a fix from the following repository: https://github.com/Aricept094/smotevariantsliteralevalfixforRay.git@main#egg=smote_variants

Python Version: Developed and tested using Python 3.10.
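The `smote-variants` failure mode noted above can be reproduced with the standard library alone. In NumPy 2.x the `repr` of a boolean scalar is the text `np.False_`, which `ast.literal_eval` rejects because it is not a Python literal. The sketch below shows that, plus one possible workaround: converting NumPy-style scalars to built-in types before they are turned into strings. The helper names `can_literal_eval` and `to_builtin`, the stand-in class, and the parameter names are hypothetical, and this is not necessarily how the patched fork addresses the issue.

```python
import ast


def can_literal_eval(text):
    """Return True if ast.literal_eval accepts the given source text."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


def to_builtin(value):
    """Recursively convert NumPy-style scalars (objects exposing .item()) to built-ins."""
    if isinstance(value, dict):
        return {to_builtin(k): to_builtin(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(to_builtin(v) for v in value)
    item = getattr(value, "item", None)
    return item() if callable(item) else value


class FakeNumpyBool:
    """Stand-in for a NumPy boolean scalar so the sketch runs without NumPy."""

    def __init__(self, flag):
        self._flag = flag

    def item(self):
        return self._flag

    def __repr__(self):
        return "np.True_" if self._flag else "np.False_"


# Hypothetical parameter set, as a tuning framework might hand it over.
params = {"proportion": 1.0, "use_weights": FakeNumpyBool(False)}

print(can_literal_eval(repr(params)))              # False
print(can_literal_eval(repr(to_builtin(params))))  # True
```

Real NumPy scalars also expose `.item()`, so the same conversion would apply to them; multi-element arrays would need separate handling.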
Data
IMPORTANT: Due to patient privacy regulations and the ethics approval (Mashhad University of Medical Sciences ethics committee IR.MUMS.REC.1395.326), the original patient dataset cannot be shared publicly in this repository.
The script (tune_ohss_pipeline.py) expects input data in the following format:
* Two separate Excel files: train_data.xlsx and test_data.xlsx.
* These files should be placed in a directory specified by the hardcoded paths train_data_path and test_data_path within the script (currently /home/[USEER_NAME]/my_data/). You will need to modify these paths.
* Required Features: The input data files (train_data.xlsx, test_data.xlsx) must contain the target variable OHSS and all of the following features (column names), exactly as listed. The script will fail if any of these columns are missing or named differently. Descriptions are based on the associated study:
* `age`: Patient's age at baseline (Years).
* `weight`: Patient's weight at baseline (Kilograms).
* `Height`: Patient's height at baseline (Centimeters).
* `Durationinfertility`: Duration of infertility at baseline (Years or relevant time unit).
* `FSH`: Follicle-Stimulating Hormone level at baseline (IU/L).
* `LH`: Luteinizing Hormone level at baseline (IU/L).
* `numberofcunsumptiondrug`: Total number of stimulation drug doses consumed during the cycle.
* `durationofstimulation`: Duration of ovarian stimulation (Days).
* `numberRfulicule`: Count of follicles in the right ovary on the day of hCG trigger.
* `numberLfulicule`: Count of follicles in the left ovary on the day of hCG trigger.
* `numberofoocyte`: Total count of oocytes retrieved on egg retrieval day.
* `metagha1oocyte`: Count of Metaphase I (MI) oocytes retrieved.
* `metaghaze2oocyte`: Count of Metaphase II (MII) oocytes retrieved (mature oocytes).
* `necrozeoocyte`: Count of necrotic (dead) oocytes retrieved.
* `lowQualityoocyte`: Count of oocytes deemed low quality upon retrieval.
* `Gvoocyte`: Count of Germinal Vesicle (GV) oocytes retrieved (immature).
* `macrooocyte`: Count of macro-oocytes (abnormally large) retrieved.
* `partogenesoocyte`: Count of spontaneously activated / parthenogenetic oocytes observed at denudation.
* `numberembrio`: Total count of embryos developed post-retrieval.
* `countspermogram`: Sperm count/concentration from baseline spermogram analysis.
* `motilityspermogram`: Percentage of motile sperm from baseline spermogram analysis (%).
* `morfologyspermogram`: Percentage of sperm with normal morphology from baseline spermogram analysis (%).
* `gradeembrio`: Quality grade assigned to the embryo(s) (Categorical: e.g., Grade 1, 2, 3).
* `Typecycle`: Type of treatment protocol used (Categorical: e.g., GnRH Agonist, GnRH Antagonist).
* `reasoninfertility`: Identified cause or source of infertility (Categorical: e.g., Female Factor, Male Factor, Both, Unexpected/Unexplained).
* `Typeofcunsumptiondrug`: Specific type of stimulation drug used (Categorical: e.g., Cinnal-f, Gonal-f, hMG).
* `typeoftrigger`: Type of drug used for final oocyte maturation trigger (Categorical: e.g., GnRH Agonist, hCG, Dual Trigger).
* `Typedrug`: Type of Drug Regimen/Protocol Detail (Categorical).
* `pregnancy`: History of previous pregnancy (Categorical: Positive/Negative).
* `mense`: Regularity of menstrual cycle at baseline (Categorical: Regular/Irregular).
* `Infertility`: Type of infertility at baseline (Categorical: Primary/Secondary).
* Target Variable (`OHSS`): Must be present as a column. Encoded as 0 for Uncomplicated OHSS (corresponding to original categories 'Painless' or 'Mild' OHSS) and 1 for Complicated OHSS (corresponding to original categories 'Moderate' or 'Severe' OHSS).
* Data Preprocessing: The script assumes that missing values in the input data have already been handled (e.g., using the imputation methods described in the paper: Random Forest for continuous variables, mean for categorical variables).
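Because column names must match exactly, including case and the original spellings, a quick check before a long run can save time. The following is a minimal sketch: the column names and the target encoding are taken from the list above, while the helper names `missing_columns` and `encode_ohss` are hypothetical and not part of `tune_ohss_pipeline.py`.

```python
REQUIRED_FEATURES = [
    "age", "weight", "Height", "Durationinfertility", "FSH", "LH",
    "numberofcunsumptiondrug", "durationofstimulation",
    "numberRfulicule", "numberLfulicule", "numberofoocyte",
    "metagha1oocyte", "metaghaze2oocyte", "necrozeoocyte",
    "lowQualityoocyte", "Gvoocyte", "macrooocyte", "partogenesoocyte",
    "numberembrio", "countspermogram", "motilityspermogram",
    "morfologyspermogram", "gradeembrio", "Typecycle", "reasoninfertility",
    "Typeofcunsumptiondrug", "typeoftrigger", "Typedrug", "pregnancy",
    "mense", "Infertility",
]
TARGET = "OHSS"

# Original severity categories mapped to the binary target described above.
SEVERITY_TO_TARGET = {"Painless": 0, "Mild": 0, "Moderate": 1, "Severe": 1}


def missing_columns(columns):
    """Return required column names (features + target) absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_FEATURES + [TARGET] if name not in present]


def encode_ohss(severity):
    """Map an original severity label to 0 (uncomplicated) or 1 (complicated)."""
    return SEVERITY_TO_TARGET[severity]


# A header with "height" instead of "Height" is reported as incomplete.
columns = [c for c in REQUIRED_FEATURES if c != "Height"] + ["height", TARGET]
print(missing_columns(columns))  # ['Height']
print(encode_ohss("Moderate"))   # 1
```

With pandas, the same check could be run as `missing_columns(df.columns)` after loading each Excel file.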
For testing purposes: You may want to create a small, synthetic, or anonymized dummy dataset following the expected structure and including all required column names (features + target) to run the script and verify its functionality.
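One way to build such a dummy dataset is sketched below using only the standard library. Everything about the values is an assumption: the numeric ranges are arbitrary placeholders with no clinical meaning, the category labels are copied from the examples in the feature list (the two `Typedrug` labels are invented), and the real script may expect integer-coded categories instead of strings. The function name `make_dummy_rows` is hypothetical.

```python
import random

NUMERIC_FEATURES = [
    "age", "weight", "Height", "Durationinfertility", "FSH", "LH",
    "numberofcunsumptiondrug", "durationofstimulation",
    "numberRfulicule", "numberLfulicule", "numberofoocyte",
    "metagha1oocyte", "metaghaze2oocyte", "necrozeoocyte",
    "lowQualityoocyte", "Gvoocyte", "macrooocyte", "partogenesoocyte",
    "numberembrio", "countspermogram", "motilityspermogram",
    "morfologyspermogram",
]

# Example levels only; the encoding used by the real data is not documented here.
CATEGORICAL_LEVELS = {
    "gradeembrio": ["Grade 1", "Grade 2", "Grade 3"],
    "Typecycle": ["GnRH Agonist", "GnRH Antagonist"],
    "reasoninfertility": ["Female Factor", "Male Factor", "Both", "Unexplained"],
    "Typeofcunsumptiondrug": ["Cinnal-f", "Gonal-f", "hMG"],
    "typeoftrigger": ["GnRH Agonist", "hCG", "Dual Trigger"],
    "Typedrug": ["A", "B"],  # invented placeholder labels
    "pregnancy": ["Positive", "Negative"],
    "mense": ["Regular", "Irregular"],
    "Infertility": ["Primary", "Secondary"],
}


def make_dummy_rows(n_rows, positive_rate=0.1, seed=0):
    """Generate synthetic rows with every required column plus the OHSS target."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Arbitrary integers stand in for all numeric measurements.
        row = {name: rng.randint(0, 50) for name in NUMERIC_FEATURES}
        for name, levels in CATEGORICAL_LEVELS.items():
            row[name] = rng.choice(levels)
        # Keep the target imbalanced, roughly `positive_rate` positives.
        row["OHSS"] = 1 if rng.random() < positive_rate else 0
        rows.append(row)
    return rows


rows = make_dummy_rows(200)
print(len(rows), "rows with", len(rows[0]), "columns each")
```

If pandas and openpyxl are installed, `pandas.DataFrame(rows).to_excel("train_data.xlsx", index=False)` would write the rows in the Excel format the script expects.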
Usage
- Modify Paths: Open `tune_ohss_pipeline.py` and update the `train_data_path`, `test_data_path`, and `log_directory_base` variables to point to your data location and desired output directory.
- Prepare Data: Ensure your `train_data.xlsx` and `test_data.xlsx` files are in the correct location and contain all the required features (with exact names and described content) listed in the "Data" section above, along with the `OHSS` target variable (encoded 0/1).
- Run the Script: Execute the main script from your terminal within the activated virtual environment: `python tune_ohss_pipeline.py`
- Process: This script will initiate a Ray Tune hyperparameter optimization process, running a large number of trials (defined by `num_samples=15000`) to find the best combination of preprocessing steps, feature subsets, data augmentation techniques (SMOTE variants), models, and hyperparameters based on the `recall_mean` metric.
- Output:
    - The script will create a main log directory (`log_directory_base`) named with the execution timestamp.
    - Inside, it will create subdirectories for each Ray Tune trial.
    - Each trial directory will contain logs, saved intermediate dataframes (`.csv`), configuration details, and potentially saved unfitted/fitted model files (`.joblib`).
    - The console will output progress, and the best configuration found during the run will be printed at the end.
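Conceptually, the search described under "Process" samples many configurations from a search space, scores each one, and keeps the best. The sketch below shows that idea as a plain random search in standard-library Python. It is not Ray Tune and not the script's code: the reduced search space, the option labels, and the scoring function are all invented for illustration, and a real trial would train the pipeline and report `recall_mean` instead.

```python
import random


def random_search(search_space, objective, num_samples, seed=0):
    """Sample `num_samples` configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_samples):
        # Draw one option per dimension of the search space.
        config = {name: rng.choice(options) for name, options in search_space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical, heavily reduced search space (labels only, nothing is trained).
search_space = {
    "oversampler": ["none", "SMOTE", "IPADE_ID"],
    "model": ["logistic_regression", "svm", "sgd", "ridge"],
    "n_features": [5, 10, 20, 31],
}


def toy_objective(config):
    """Arbitrary placeholder score, NOT a result from the study."""
    score = 0.5
    if config["oversampler"] != "none":
        score += 0.2
    if config["model"] in ("svm", "sgd"):
        score += 0.1
    return score


best_config, best_score = random_search(search_space, toy_objective, num_samples=200)
print(best_config, best_score)
```

Ray Tune adds parallel trial execution, logging per trial, and pluggable search algorithms on top of this basic loop, which is why each trial gets its own subdirectory in the output.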
Reproducing Results
Running tune_ohss_pipeline.py executes the hyperparameter search framework described in the paper. The goal is to find high-performing pipeline configurations for predicting complicated OHSS.
- The script explores the `search_space` defined within it.
- The best configuration printed at the end represents the optimal pipeline found in that specific run, maximizing the custom `recall_mean` metric.
- Due to the stochastic nature of the search and model training, the exact best configuration might vary slightly between runs.
- The configuration reported in the paper (IPADE-ID augmentation + SGD/SVC/Ridge ensemble) was identified through this optimization process and represents one such high-performing result achieving Recall=0.9 for Class 1 and Accuracy=0.76.
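To see why recall for Class 1 is tracked alongside accuracy on imbalanced data, consider the small worked example below. The labels are made up for illustration and are not the study's data; the function names are hypothetical, and the custom `recall_mean` metric itself is not reproduced here because its definition lives in the script.

```python
def recall_for_class(y_true, y_pred, positive=1):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data: 5 complicated cases (1) among 20 patients.
y_true = [1] * 5 + [0] * 15
# A model that catches 4 of 5 positives at the cost of 5 false alarms.
y_pred = [1, 1, 1, 1, 0] + [0] * 10 + [1] * 5
# A degenerate model that never predicts the minority class.
always_negative = [0] * 20

print(recall_for_class(y_true, y_pred))           # 0.8
print(accuracy(y_true, y_pred))                   # 0.7
print(recall_for_class(y_true, always_negative))  # 0.0
print(accuracy(y_true, always_negative))          # 0.75
```

The degenerate model scores higher on accuracy while missing every complicated case, which is the trade-off a recall-focused objective is meant to avoid.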
Citation
If you use this code or the findings from the associated study in your research, please cite both the original paper and this software repository.
Paper: * Ziaee, A., Khosravi, H., Sadeghi, T., Ahmed, I., & Mahmoudinia, M. (Year). Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence. [Journal Name/Conference Proceedings, Volume, Pages, DOI - Update when published]
Software:
* Please use the citation information provided in the CITATION.cff file or the "Cite this repository" button on GitHub.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For questions regarding the research or code, please contact the author:
* Arash Ziaee : ziaeia961@mums.ac.ir
Owner
- Name: Arash Ziaee
- Login: Aricept094
- Kind: user
- Location: Iran
- Repositories: 1
- Profile: https://github.com/Aricept094
Medical Student at Mashhad University of Medical Sciences | MPH Student at Shiraz University of Medical Sciences | Former 3D4MEDICAL Student Ambassador | Part-T
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it using the metadata from this file."
title: "Code for: Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence"
version: 1.0.0 # Initial release version
doi: "[DOI for the software - e.g., via Zenodo - To be added]" # Optional but recommended
date-released: "[YYYY-MM-DD - e.g., 2024-05-21]"
authors:
  - given-names: "Arash"
    family-names: "Ziaee"
    affiliation: "Student Research Committee, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX" # Add ORCID if available
  - given-names: "Hamed"
    family-names: "Khosravi"
    affiliation: "Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Tahereh"
    family-names: "Sadeghi"
    affiliation: "Department of Pediatrics, School of Nursing and Midwifery, Nursing and Midwifery Care Research Center, Akbar Hospital, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Imtiaz"
    family-names: "Ahmed"
    affiliation: "Department of Industrial & Management Systems Engineering, West Virginia University, Morgantown, WV, US"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
  - given-names: "Maliheh"
    family-names: "Mahmoudinia"
    affiliation: "Supporting the Family and the Youth of Population Research Core, Department of Obstetrics and Gynecology, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran"
    # orcid: "https://orcid.org/XXXX-XXXX-XXXX-XXXX"
    email: "mahmoudiniam@mums.ac.ir"
repository-code: "https://github.com/Aricept094/OHSS-Prediction-Imbalanced-Data.git"
keywords:
- "Ovarian Hyperstimulation Syndrome"
- "OHSS"
- "Machine Learning"
- "Assisted Reproductive Technology"
- "ART"
- "In Vitro Fertilization"
- "IVF"
- "Data Augmentation"
- "Prediction"
- "Ray Tune"
- "SMOTE"
- "Ensemble Learning"
license: "MIT"
# references: # Optional: Link to the paper itself
# - type: article
# title: "Prediction of Complicated Ovarian Hyperstimulation Syndrome in Assisted Reproductive Treatment Through Artificial Intelligence"
# authors: # Repeat authors list or simplify
# - given-names: "Arash"
# family-names: "Ziaee"
# - given-names: "Hamed"
# family-names: "Khosravi"
# - given-names: "Tahereh"
# family-names: "Sadeghi"
# - given-names: "Imtiaz"
# family-names: "Ahmed"
# - given-names: "Maliheh"
# family-names: "Mahmoudinia"
# journal: "[Journal Name - To be added]"
# doi: "[DOI of paper - To be added]"
# year: "[Year - e.g., 2024]"
GitHub Events
Total
- Push event: 2
- Public event: 1
Last Year
- Push event: 2
- Public event: 1
Dependencies
- GitPython ==3.1.44
- Mako ==1.3.9
- Markdown ==3.7
- MarkupSafe ==3.0.2
- MiniSom ==2.3.5
- PyYAML ==6.0.2
- Pygments ==2.19.1
- SQLAlchemy ==2.0.40
- Werkzeug ==3.1.3
- absl-py ==2.2.1
- aiohappyeyeballs ==2.6.1
- aiohttp ==3.11.14
- aiohttp-cors ==0.8.0
- aiosignal ==1.3.2
- alembic ==1.15.2
- annotated-types ==0.7.0
- anyio ==4.9.0
- astunparse ==1.6.3
- async-timeout ==5.0.1
- attrs ==25.3.0
- cachetools ==5.5.2
- certifi ==2025.1.31
- charset-normalizer ==3.4.1
- click ==8.1.8
- colorful ==0.5.6
- colorlog ==6.9.0
- contourpy ==1.3.1
- cycler ==0.12.1
- distlib ==0.3.9
- docker-pycreds ==0.4.0
- docutils ==0.21.2
- et_xmlfile ==2.0.0
- exceptiongroup ==1.2.2
- fastapi ==0.115.12
- filelock ==3.18.0
- flatbuffers ==25.2.10
- fonttools ==4.56.0
- frozenlist ==1.5.0
- fsspec ==2025.3.1
- gast ==0.6.0
- gitdb ==4.0.12
- google-api-core ==2.24.2
- google-auth ==2.38.0
- google-pasta ==0.2.0
- googleapis-common-protos ==1.69.2
- greenlet ==3.1.1
- grpcio ==1.71.0
- h11 ==0.14.0
- h5py ==3.13.0
- httptools ==0.6.4
- idna ==3.10
- intel-cmplr-lib-ur ==2025.1.0
- intel-openmp ==2025.1.0
- joblib ==1.4.2
- jsonschema ==4.23.0
- jsonschema-specifications ==2024.10.1
- keras ==3.9.1
- kiwisolver ==1.4.8
- libclang ==18.1.1
- lightgbm ==4.6.0
- markdown-it-py ==3.0.0
- matplotlib ==3.10.1
- mdurl ==0.1.2
- metric-learn ==0.7.0
- mkl ==2025.1.0
- ml_dtypes ==0.5.1
- msgpack ==1.1.0
- multidict ==6.2.0
- namex ==0.0.8
- numpy ==2.1.3
- nvidia-nccl-cu12 ==2.26.2
- opencensus ==0.11.4
- opencensus-context ==0.1.3
- openpyxl ==3.1.5
- opt_einsum ==3.4.0
- optree ==0.14.1
- optuna ==4.2.1
- packaging ==24.2
- pandas ==2.2.3
- pillow ==11.1.0
- platformdirs ==4.3.7
- prometheus_client ==0.21.1
- propcache ==0.3.1
- proto-plus ==1.26.1
- protobuf ==5.29.4
- psutil ==7.0.0
- py-spy ==0.4.0
- pyarrow ==19.0.1
- pyasn1 ==0.6.1
- pyasn1_modules ==0.4.2
- pydantic ==2.11.1
- pydantic_core ==2.33.0
- pyparsing ==3.2.3
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.1.0
- pytz ==2025.2
- ray ==2.44.1
- referencing ==0.36.2
- requests ==2.32.3
- rich ==14.0.0
- rpds-py ==0.24.0
- rsa ==4.9
- scikit-learn ==1.6.1
- scipy ==1.15.2
- seaborn ==0.13.2
- sentry-sdk ==2.24.1
- setproctitle ==1.3.5
- six ==1.17.0
- smart-open ==7.1.0
- smmap ==5.0.2
- sniffio ==1.3.1
- starlette ==0.46.1
- statistics ==1.0.3.5
- tbb ==2022.1.0
- tcmlib ==1.3.0
- tensorboard ==2.19.0
- tensorboard-data-server ==0.7.2
- tensorboardX ==2.6.2.2
- tensorflow ==2.19.0
- tensorflow-io-gcs-filesystem ==0.37.1
- termcolor ==2.5.0
- threadpoolctl ==3.6.0
- tqdm ==4.67.1
- typing-inspection ==0.4.0
- typing_extensions ==4.13.0
- tzdata ==2025.2
- umf ==0.10.0
- urllib3 ==2.3.0
- uvicorn ==0.34.0
- uvloop ==0.21.0
- virtualenv ==20.29.3
- wandb ==0.19.8
- watchfiles ==1.0.4
- websockets ==15.0.1
- wrapt ==1.17.2
- xgboost ==3.0.0
- yarl ==1.18.3