https://github.com/datasig-ac-uk/sepsis_label_extraction
Repository
Basic Info
- Host: GitHub
- Owner: datasig-ac-uk
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Size: 18.1 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Subtle Variation in Sepsis-III Definitions Influences Predictive Performance of Machine Learning
The early detection of sepsis is a key research priority to help facilitate timely intervention. Criteria used to identify the onset time of sepsis from health records vary, hindering comparison and progress in this field. We considered the effects of variations in sepsis onset definition on the predictive performance of three representative models (i.e. Light gradient boosting machine (LGBM), Long short term memory (LSTM) and Cox proportional-hazards models (CoxPHM)) for early sepsis detection.
This repository is the official implementation of the paper entitled "Subtle Variation of Sepsis-III Definitions Influences Predictive Performance of Machine Learning".
This repository contains code for the following parts in our experimental pipeline:
1. Extracting the sepsis labelling from the MIMIC-III data based on three sepsis criteria H1-3 and their variants (see src/database).
2. Training three types of models (i.e. LGBM, LSTM and CoxPHM) for the early sepsis prediction on the datasets produced in Step 1 (see src/models).
3. Evaluating each trained model using the test metrics (e.g. AUROC) and producing the visualization plots (see src/visualization).
Environment Setup
The code has been tested successfully using Python 3.7; thus we suggest using this version or a later version of Python. A typical process for installing the package dependencies involves creating a new Python virtual environment.
To install the required packages, run the following:
```console
pip install -r requirements.txt
```
Finally, to prepare the environment for running the code, run the following:
```console
source pythonpath.sh
```
Data Extraction Pipeline
To train and evaluate our models, we will change the relational format of the MIMIC-III database to a pivoted view which includes key demographic information, vital signs, and laboratory readings. We will also create tables for the possible sepsis onset times of each patient. We will subsequently output the pivoted data to comma-separated value (CSV) files, which serve as input for model training and evaluation.
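The idea of a pivoted view can be illustrated with a minimal sketch in plain Python. This is not the repository's extraction code (that lives in src/database and runs against PostgreSQL); the row layout, the column names `stay_id` and `hour`, and the helper names `pivot` and `to_csv` are assumptions made purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format rows: (stay_id, hour, measurement, value).
# These names are illustrative and are not taken from the MIMIC-III schema.
LONG_ROWS = [
    (1, 0, "heart_rate", 88.0),
    (1, 0, "lactate", 1.9),
    (1, 1, "heart_rate", 92.0),
    (2, 0, "heart_rate", 75.0),
]


def pivot(rows):
    """Turn long-format rows into one wide record per (stay_id, hour)."""
    wide = defaultdict(dict)
    for stay_id, hour, name, value in rows:
        wide[(stay_id, hour)][name] = value
    return dict(wide)


def to_csv(wide, columns):
    """Write the pivoted records as CSV text; missing readings stay empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["stay_id", "hour", *columns])
    for (stay_id, hour), values in sorted(wide.items()):
        writer.writerow([stay_id, hour, *(values.get(c, "") for c in columns)])
    return buffer.getvalue()


wide = pivot(LONG_ROWS)
csv_text = to_csv(wide, ["heart_rate", "lactate"])
print(csv_text)
```

Each measurement type becomes its own column, and a reading that was not recorded at a given time is left as an empty field rather than filled in.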
Prior to running any of the data extraction commands, make sure to change to the src/database subdirectory:
```console
cd src/database
```
Next, follow the instructions in the data extraction README.md, using the sections that match your setup: PostgreSQL installed directly on your machine, or PostgreSQL running in a Docker container.
Model Training and Testing Pipeline
Feature Extraction
To generate the derived features mentioned in our paper, simply run the following:
```console
python3 src/features/generate_features.py
```
The preceding command will save features required for model training/tuning/evaluation to data/processed.
Model tuning/training/evaluation
Initiate model tuning, training and evaluation using the main.py script. This script takes four optional arguments: --model, --step, --n_cpus, and --n_gpus:
```console
python3 src/models/main.py --model MODEL_NAME --step STEP_NAME --n_cpus N_CPUS --n_gpus N_GPUS
```
where MODEL_NAME is one of LGBM, LSTM, or CoxPHM, and STEP_NAME is one of tune, train, or eval. Furthermore, N_CPUS is the number of CPUs and N_GPUS is the number of GPUs.
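For readers who want to see how such an interface might be declared, here is a hedged sketch using `argparse`. It is not the actual contents of main.py: the function name `build_parser`, the use of `None` to mean "all models" or "all steps", and the default of 1 CPU and 1 GPU (taken from the no-argument behaviour described in this README) are assumptions.

```python
import argparse

MODELS = ("LGBM", "LSTM", "CoxPHM")
STEPS = ("tune", "train", "eval")


def build_parser():
    """Hypothetical parser mirroring the four optional arguments described above."""
    parser = argparse.ArgumentParser(description="Sketch of a main.py-style interface")
    # Assumption: leaving --model or --step unset means "run all of them".
    parser.add_argument("--model", choices=MODELS, default=None)
    parser.add_argument("--step", choices=STEPS, default=None)
    parser.add_argument("--n_cpus", type=int, default=1)
    parser.add_argument("--n_gpus", type=int, default=1)
    return parser


args = build_parser().parse_args(["--model", "LGBM", "--step", "tune", "--n_cpus", "4"])
print(args.model, args.step, args.n_cpus, args.n_gpus)
```

Restricting `--model` and `--step` with `choices` means a misspelled model or step name is rejected before any long-running job starts.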
For each of the three models (LGBM, LSTM, and CoxPHM), the required sequence of steps is tune, train, eval:
1. tune: For a given model, running the tuning step computes and saves optimal hyperparameters for subsequent training and evaluation.
2. train: The model is trained and saved to the model/ directory for subsequent evaluation.
3. eval: Evaluation involves generating numerical results and predictions, which are respectively saved to outputs/results and outputs/predictions.
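As an illustration of the kind of test metric produced in the eval step, the sketch below computes AUROC directly from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. This is a plain-Python illustration with made-up labels and scores, not the repository's evaluation code, and the function name `auroc` is an assumption.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example data, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auroc(labels, scores)
print(result)
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples; library implementations use a rank-based computation instead.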
Note: To run all three steps above in the required order for all three models on 1 CPU and 1 GPU, simply run main.py without any arguments, i.e.
```console
python3 src/models/main.py
```
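If you prefer to launch each step explicitly (for example, to log or resume between steps), the required order can be sketched as a small driver script. The helper names `build_commands` and `run_all` are hypothetical, and iterating model by model is an assumption; the README only states that tune, train, and eval must run in that order for each model.

```python
import subprocess
import sys

MODELS = ("LGBM", "LSTM", "CoxPHM")
STEPS = ("tune", "train", "eval")  # required order for every model


def build_commands(n_cpus=1, n_gpus=1):
    """Return one argv list per (model, step) pair, keeping tune -> train -> eval."""
    return [
        [sys.executable, "src/models/main.py", "--model", model, "--step", step,
         "--n_cpus", str(n_cpus), "--n_gpus", str(n_gpus)]
        for model in MODELS
        for step in STEPS
    ]


def run_all(dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for command in build_commands():
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


commands = build_commands()
run_all(dry_run=True)
```

With `dry_run=True` the script only prints the nine commands; pass `dry_run=False` from the repository root to execute them.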
The full pipeline could take several days to complete. Alternatively, you can download our pretrained models and obtain the results directly by running the following commands:
```console
bash pretrained_models.sh
python3 src/models/main.py --model MODEL_NAME --step eval
```
Visualizations
To reproduce all the plots in the paper, run the following command after completing the model evaluation step:
```console
python3 src/visualization/main_plots.py
```
Owner
- Name: DataSig
- Login: datasig-ac-uk
- Kind: organization
- Website: https://datasig.web.ox.ac.uk/
- Repositories: 3
- Profile: https://github.com/datasig-ac-uk
A rough path between mathematics and data science
GitHub Events
Total
- Push event: 2
- Create event: 2
Last Year
- Push event: 2
- Create event: 2
Dependencies
- postgres latest build
- postgres latest build
- aiosignal 1.3.2
- attrs 25.1.0
- autograd 1.7.0
- autograd-gamma 0.5.0
- beautifulsoup4 4.13.3
- certifi 2025.1.31
- charset-normalizer 3.4.1
- click 8.1.8
- colorama 0.4.6
- contourpy 1.3.1
- cycler 0.12.1
- dill 0.3.9
- filelock 3.17.0
- fonttools 4.56.0
- formulaic 1.1.1
- frozenlist 1.5.0
- fsspec 2025.2.0
- gdown 5.2.0
- idna 3.10
- iisignature 0.24
- interface-meta 1.2.5
- interface-meta 1.3.0
- jinja2 3.1.5
- joblib 1.4.2
- jsonschema 4.23.0
- jsonschema-specifications 2024.10.1
- kiwisolver 1.4.8
- lifelines 0.30.0
- lightgbm 4.6.0
- markupsafe 3.0.2
- matplotlib 3.10.0
- matplotlib-venn 1.1.2
- mpmath 1.3.0
- msgpack 1.1.0
- networkx 3.4.2
- numpy 2.2.3
- nvidia-cublas-cu12 12.4.5.8
- nvidia-cuda-cupti-cu12 12.4.127
- nvidia-cuda-nvrtc-cu12 12.4.127
- nvidia-cuda-runtime-cu12 12.4.127
- nvidia-cudnn-cu12 9.1.0.70
- nvidia-cufft-cu12 11.2.1.3
- nvidia-curand-cu12 10.3.5.147
- nvidia-cusolver-cu12 11.6.1.9
- nvidia-cusparse-cu12 12.3.1.170
- nvidia-cusparselt-cu12 0.6.2
- nvidia-nccl-cu12 2.21.5
- nvidia-nvjitlink-cu12 12.4.127
- nvidia-nvtx-cu12 12.4.127
- packaging 24.2
- pandas 2.2.3
- pillow 11.1.0
- protobuf 5.29.3
- pyparsing 3.2.1
- pysocks 1.7.1
- python-dateutil 2.9.0.post0
- pytz 2025.1
- pyyaml 6.0.2
- ray 2.42.1
- referencing 0.36.2
- requests 2.32.3
- rpds-py 0.23.1
- scikit-learn 1.6.1
- scipy 1.15.2
- seaborn 0.13.2
- setuptools 75.8.1
- six 1.17.0
- soupsieve 2.6
- sympy 1.13.1
- threadpoolctl 3.5.0
- torch 2.6.0
- tqdm 4.67.1
- triton 3.2.0
- typing-extensions 4.12.2
- tzdata 2025.1
- urllib3 2.3.0
- wrapt 1.17.2
- dill (>=0.3.9,<0.4.0)
- gdown (>=5.2.0,<6.0.0)
- iisignature @ git+https://github.com/bottler/iisignature.git
- joblib (>=1.4.2,<2.0.0)
- lifelines (>=0.30.0,<0.31.0)
- lightgbm (>=4.6.0,<5.0.0)
- matplotlib (>=3.10.0,<4.0.0)
- matplotlib-venn (>=1.1.2,<2.0.0)
- numpy (>=2.2.3,<3.0.0)
- pandas (>=2.2.3,<3.0.0)
- pillow (>=11.1.0,<12.0.0)
- ray (>=2.42.1,<3.0.0)
- requests (>=2.32.3,<3.0.0)
- scikit-learn (>=1.6.1,<2.0.0)
- scipy (>=1.15.2,<2.0.0)
- seaborn (>=0.13.2,<0.14.0)
- torch (>=2.6.0,<3.0.0)
- dill ==0.3.1.1
- gdown ==3.13.0
- iisignature ==0.24
- joblib ==0.14.0
- lifelines ==0.26.0
- lightgbm ==2.3.1
- matplotlib ==3.1.3
- matplotlib_venn ==0.11.6
- numpy ==1.17.4
- pandas ==1.2.4
- pillow ==8.3.1
- ray ==0.8.6
- requests ==2.26.0
- scikit_learn ==0.23.2
- scipy ==1.3.3
- seaborn ==0.11.1
- torch ==1.6.0
- matplotlib ==3.4.2
- numpy ==1.20.3
- pandas ==1.2.4
- psycopg2 ==2.8.6
- scikit-learn ==0.24.2