transform-emr
This is a decoder-only transformer model that frames event prediction from EMR records as a sequential text-generation problem. This project is part of my thesis research.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.5%) to scientific vocabulary
Keywords
Repository
This is a decoder-only transformer model that frames event prediction from EMR records as a sequential text-generation problem. This project is part of my thesis research.
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Event Prediction in EMRs
This repository implements a two-phase deep learning pipeline for modeling longitudinal Electronic Medical Records (EMRs). The architecture combines temporal embeddings, patient context, and Transformer-based sequence modeling to predict or impute patient events over time.
This repo is part of an unpublished thesis and will be finalized post-submission. Please do not reuse without permission.
The results shown here (in evaluation.ipynb) are on random data, as my research dataset is private. The model will be used on actual EMR data stored in a closed environment; for that purpose, the project is organized as an installable package with the following structure:
```bash
event-prediction-in-diabetes-care/
│
├── transform_emr/                      # Core Python package
│   ├── config/                         # Configuration modules
│   │   ├── __init__.py
│   │   ├── dataset_config.py
│   │   └── model_config.py
│   │
│   ├── __init__.py
│   ├── dataset.py                      # Dataset, DataProcessor and Tokenizer
│   ├── embedder.py                     # Embedding model (EMREmbedding) + training
│   ├── transformer.py                  # Transformer architecture (GPT) + training
│   ├── train.py                        # Full training pipeline (2-phase)
│   ├── inference.py                    # Inference pipeline
│   ├── evaluation.ipynb                # Evaluation notebook
│   ├── loss.py                         # Utility module for special (auxiliary) loss criteria
│   ├── utils.py                        # Utility functions for the package (plots + penalties)
│   └── debug_tools.py                  # Debug loop for epochs (logits)
│
├── data/                               # External data folder (for synthetic or real EMR)
│   ├── generate_synthetic_data.ipynb   # Generates synthetic data similar in structure to the original (for tests)
│   ├── train/
│   └── test/
│
├── unittests/                          # Unit and integration tests (dataset / model / utils)
│
├── .gitignore
├── requirements.txt
├── LICENCE
├── CITATION.cff
├── setup.py
├── pyproject.toml
└── README.md
```
🛠️ Installation
Install the project as an editable package from the root directory:
```bash
pip install -e .
```
Ensure your working directory is set to the root of this repository, and that the path is set properly in your local environment.
🚀 Usage
1. Prepare Dataset and Update Config
```python
import pandas as pd

from transform_emr.dataset import EMRDataset, EMRTokenizer, DataProcessor
from transform_emr.config.dataset_config import *
from transform_emr.config.model_config import *

# Load data (verify your paths are properly defined)
temporal_df = pd.read_csv(TRAIN_TEMPORAL_DATA_FILE, low_memory=False)
ctx_df = pd.read_csv(TRAIN_CTX_DATA_FILE)

print("[Pre-processing]: Building tokenizer...")
processor = DataProcessor(temporal_df, ctx_df, scaler=None)
temporal_df, ctx_df = processor.run()

tokenizer = EMRTokenizer.from_processed_df(temporal_df)
train_ds = EMRDataset(temporal_df, ctx_df, tokenizer=tokenizer)
MODEL_CONFIG['ctx_dim'] = train_ds.context_df.shape[1]  # Dynamically update the context dimension
```
2. Train Model
```python
from transform_emr.train import run_two_phase_training

run_two_phase_training()
```
Model checkpoints and the scaler are saved under `checkpoints/phase1/` and `checkpoints/phase2/`.
You can also split this pipeline into its components, running prepare_data(), phase_one(), and phase_two() separately (see the sketch below), but you'll need to adjust the imports. Use the structure of train.py as a reference.
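A minimal sketch of what that split might look like, assuming prepare_data(), phase_one(), and phase_two() are importable from transform_emr.train and pass the objects shown between them; the actual signatures in train.py may differ:
```python
# Hypothetical sketch -- check transform_emr/train.py for the real signatures.
from transform_emr.train import prepare_data, phase_one, phase_two

# Build the tokenizer and datasets once.
train_ds, val_ds, tokenizer = prepare_data()

# Phase 1: train the EMREmbedding (token + time + patient context).
embedder = phase_one(train_ds, val_ds, tokenizer)

# Phase 2: train the GPT decoder on top of the learned embeddings.
model = phase_two(train_ds, val_ds, embedder)
```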
3. Inference from the Model
```python
import random
import joblib
from pathlib import Path

import pandas as pd

from transform_emr.embedder import EMREmbedding
from transform_emr.transformer import GPT
from transform_emr.dataset import DataProcessor, EMRTokenizer, EMRDataset
from transform_emr.inference import infer_event_stream
from transform_emr.config.dataset_config import *
from transform_emr.config.model_config import *
# Load test data
df = pd.read_csv(TEST_TEMPORAL_DATA_FILE, low_memory=False)
ctx_df = pd.read_csv(TEST_CTX_DATA_FILE)
# Load tokenizer and scaler
tokenizer = EMRTokenizer.load(Path(CHECKPOINT_PATH) / "tokenizer.pt")
scaler = joblib.load(Path(CHECKPOINT_PATH) / "scaler.pkl")
# Run preprocessing
processor = DataProcessor(df, ctx_df, scaler=scaler, max_input_days=5)
df, ctx_df = processor.run()
patient_ids = df["PatientID"].unique()
df_subset = df[df["PatientID"].isin(patient_ids)].copy()
ctx_subset = ctx_df.loc[patient_ids].copy()
# Create dataset
dataset = EMRDataset(df_subset, ctx_subset, tokenizer=tokenizer)
# Load models
embedder, _, _, _, _ = EMREmbedding.load(EMBEDDER_CHECKPOINT, tokenizer=tokenizer)
model, _, _, _, _ = GPT.load(TRANSFORMER_CHECKPOINT, embedder=embedder)
model.eval()
# Run inference
result_df = infer_event_stream(model, dataset, temperature=1.0) # optional: adjust temperature
```
The resulting result_df will include both input events and generated events, and will have these columns: {"PatientID", "Step", "Token", "IsInput", "IsOutcome", "IsTerminal", "TimePoint"}
You can analyze the model's performance by comparing the input (dataset.tokens_df) to the output (see the sketch below):
- Were all complications generated?
- Were all complications generated on time? (Set a forgiving boundary, like a 24h window.)
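A rough sketch of such a comparison, assuming dataset.tokens_df exposes the same PatientID / Token / TimePoint columns as result_df, that complication events share a COMPLICATION_ token prefix (a hypothetical naming convention), and that TimePoint is measured in hours:
```python
import pandas as pd

def complication_recall(gold_df: pd.DataFrame, generated_df: pd.DataFrame, tol_hours: float = 24.0):
    """Fraction of gold complications that were generated at all, and within the time window."""
    # COMPLICATION_ prefix and column names are assumptions -- adjust to the real vocabulary.
    gold = gold_df[gold_df["Token"].str.startswith("COMPLICATION_")]
    gen = generated_df[~generated_df["IsInput"].astype(bool)
                       & generated_df["Token"].str.startswith("COMPLICATION_")]

    total, hits, timely = 0, 0, 0
    for (pid, token), rows in gold.groupby(["PatientID", "Token"]):
        total += 1
        match = gen[(gen["PatientID"] == pid) & (gen["Token"] == token)]
        if match.empty:
            continue
        hits += 1
        # Forgiving boundary: any generated occurrence within tol_hours of the first gold occurrence.
        if (match["TimePoint"] - rows["TimePoint"].iloc[0]).abs().min() <= tol_hours:
            timely += 1
    return hits / max(total, 1), timely / max(total, 1)

recall, timely_recall = complication_recall(dataset.tokens_df, result_df)
```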
4. Using as a module
You can perform local tests (not unit tests) by running the .py files directly, using the module as a package, as long as the file you run has a main section (`if __name__ == "__main__":`).
For example, run this from the root:
```bash
python -m transform_emr.train
```
Or:
```bash
python -m transform_emr.inference
```
Both modules have a main entry point to train, or to run inference with a trained model.
🧪 Running Unit-Tests
Run all tests:
Without validation prints:
```bash
pytest unittests/
```
With validation prints:
```bash
pytest -q -s unittests/
```
📦 Packaging Notes
To package without data/checkpoints:
```powershell
# Clean up any existing temp folder
Remove-Item -Recurse -Force .\transform_emr_temp -ErrorAction SilentlyContinue

# Recreate the temp folder
New-Item -ItemType Directory -Path .\transform_emr_temp | Out-Null

# Copy only what's needed
Copy-Item -Path .\transform_emr\* -Destination .\transform_emr_temp\transform_emr -Recurse
Copy-Item -Path .\setup.py, .\README.md, .\requirements.txt -Destination .\transform_emr_temp

# Zip it
Compress-Archive -Path .\transform_emr_temp\* -DestinationPath .\emr_model.zip -Force

# Clean up
Remove-Item -Recurse -Force .\transform_emr_temp
```
📌 Notes
- This project uses synthetic EMR data (`data/train/` and `data/test/`).
- For best results, ensure consistent preprocessing when saving/loading models.
- `model_config.py`: `MODEL_CONFIG.ctx_dim` should only be updated after dataset initialization to avoid embedding size mismatches. Update this value with your full context dimension (without the PatientID index).
🔄 End-to-End Workflow
Raw EMR Tables
│
▼
Per-patient Event Tokenization (with normalized absolute timestamps)
│
▼
🧠 Phase 1 – Train EMREmbedding (token + time + patient context)
│
▼
📚 Phase 2 – Pre-train a Transformer decoder over learned embeddings, as a next-token-prediction task.
│
▼
→ Predict next medical events and deduce outcome predictions from them (in evaluation.ipynb)
📦 Module Overview
1. dataset.py – Temporal EMR Preprocessing
| Component | Role |
|---------------------|--------------------------------------------------------------------------------------------------|
| DataProcessor | Performs all necessary data processing, from the raw input data to tokens_df. |
| EMRTokenizer | Transforms a processed temporal_df into a tokenizer that can be saved and passed between objects for compatibility. |
| EMRDataset | Converts raw EMR tables into per-patient token sequences with relative time. |
| collate_emr() | Pads sequences and returns tensors. |
📌 Why it matters:
Medical data varies in density and structure across patients. This dynamic preprocessing handles irregularity while preserving medically-relevant sequencing via START/END logic and relative timing.
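For intuition, a single patient's preprocessed sequence might look roughly like this (tokens and times are illustrative only, not the package's exact vocabulary):
```python
# Illustrative only -- (token, hours since admission) pairs for one patient.
example_sequence = [
    ("[CTX]", 0.0),                     # global patient-context token
    ("ADMISSION", 0.0),
    ("GLUCOSE_TREND_Inc_START", 2.5),   # hierarchical event token (see Phase 1 below)
    ("GLUCOSE_TREND_Inc_END", 6.0),
    ("RELEASE", 96.0),                  # terminal outcome token
]
```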
2. embedder.py – EMR Representation Learning
| Component | Role |
|--------------------|---------------------------------------------------------------------------------------------------|
| Time2Vec | Learns periodic + trend encoding from inter-event durations. |
| EMREmbedding | Combines token, time, and patient context embeddings. Adds [CTX] token for global patient info. |
| train_embedder() | Trains the embedding model with teacher-forced next-token prediction. |
⚙️ Phase 1: Learning Event Representations
Phase 1 learns a robust, patient-aware representation of patient event sequences. It isolates the core structure of patient timelines without being confounded by the autoregressive depth of Transformers.
The embedder uses:
- 4 levels of tokens - The event token is separated into 4 hierarchical components to impose similarity between tokens of the same domain: GLUCOSE -> GLUCOSE_TREND -> GLUCOSE_TREND_Inc -> GLUCOSE_TREND_Inc_START
- 1 level of time - absolute time from ADMISSION, to capture global patterns and relationships between non-sequential events.
The training uses a next-token-prediction loss (k-window BCE) + a time-prediction MSE (Δt) + an MLM loss. The MLM objective avoids masking tokens whose removal would damage the broader meaning, such as ADMISSION, [CTX], and TERMINAL_OUTCOMES.
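For reference, a minimal Time2Vec layer following the Kazemi et al. (2019) formulation might look like the sketch below; it is not necessarily the exact implementation in embedder.py:
```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """t2v(t)[0] = w0*t + b0 (linear trend); t2v(t)[i] = sin(wi*t + bi) for i > 0 (periodic)."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(1, 1)               # trend component
        self.periodic = nn.Linear(1, out_dim - 1)   # periodic components

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len) absolute or inter-event times
        t = t.unsqueeze(-1)  # (batch, seq_len, 1)
        return torch.cat([self.linear(t), torch.sin(self.periodic(t))], dim=-1)
```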
3. transformer.py – Causal Language Model over EMR Timelines
| Component | Role |
|--------------------|---------------------------------------------------------------------------------------------------|
| GPT | Transformer decoder stack over learned embeddings for next-token prediction, with an additional head for Δt prediction. The model takes a trained embedder as input. |
| CausalSelfAttention | Multi-head attention using a causal mask to enforce chronology. |
| train_transformer() | Complete training logic for the model, using BCE with multi-hot targets to account for EMR irregularities. |
⚙️ Phase 2: Learning Sequence Dependencies
Once the EMR structure is captured, the transformer learns to model sequential dependencies in event progression:
- What tends to follow a certain event?
- How does timing affect outcomes?
- How does patient context modulate the trajectory?
The training uses a next-token-prediction loss (k-window BCE) + a time-prediction MSE (Δt) + structural penalties. Training is guided by teacher forcing, showing the model the correct context at every step (exposing [0, t-1] at step t out of T, where T is block_size), while also masking logits for illegal predictions based on the true trajectory. As training progresses, the model's input ([0, t-1]) is partially masked (CBM) to teach the model to handle inaccuracies in generation, while avoiding masking the same protected tokens as in the EMREmbedding phase, plus MEAL, _START and _END tokens, so as not to clash with the penalties the model receives. A sketch of the causal masking and multi-hot k-window targets is shown below.
The training flow uses a warm-up period in which the model learns patterns over a frozen embedder (so that sharp early gradients do not cause forgetting in the embedder's weights).
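For concreteness, here is a small sketch of the two building blocks mentioned above: a causal attention mask and multi-hot targets over a k-step window for the BCE loss. Function names, shapes, and padding handling are assumptions, not the package's exact code:
```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: position i may only attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

def k_window_targets(token_ids: torch.Tensor, vocab_size: int, k: int = 3) -> torch.Tensor:
    # token_ids: (batch, seq_len). For each step t, every token occurring within the next
    # k positions becomes a positive target (multi-hot), which tolerates the irregular
    # ordering of near-simultaneous EMR events.
    batch, seq_len = token_ids.shape
    targets = torch.zeros(batch, seq_len, vocab_size)
    for t in range(seq_len - 1):
        window = token_ids[:, t + 1 : t + 1 + k]   # up to k future tokens
        targets[:, t, :].scatter_(-1, window, 1.0)
    return targets  # use with torch.nn.functional.binary_cross_entropy_with_logits
```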
4. inference.py – Generating output from the model
| Component | Role |
|--------------------|---------------------------------------------------------------------------------------------------|
| get_token_embedding() | Selects a token and returns its embedding from an input embedder. |
| infer_event_stream() | Generates a predicted stream of events for an input (test) dataset, using a masking process to block prediction of illegal tokens relative to the predictions so far. |
NOTE: Unlike the parallel batching in the training process, inference on the transformer is step-by-step, hence slow (especially with the updating of illegal tokens on the fly).
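In outline, the generation loop behaves roughly like the sketch below (simplified: the real infer_event_stream() also predicts Δt, tracks per-patient state, and produces the result_df columns described above; model(...) is assumed here to return per-position logits):
```python
import torch

@torch.no_grad()
def generate_step_by_step(model, input_ids, illegal_token_fn, terminal_ids,
                          max_new_tokens=100, temperature=1.0):
    # input_ids: (1, t) tensor holding the patient's observed history as token ids.
    generated = input_ids
    for _ in range(max_new_tokens):
        logits = model(generated)[:, -1, :] / temperature     # next-step logits only
        # Block tokens that are illegal given the trajectory so far
        # (e.g. an _END token whose matching _START never appeared).
        logits[:, illegal_token_fn(generated)] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one token
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() in terminal_ids:                 # stop at a terminal outcome
            break
    return generated
```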
5. evaluation.ipynb – Evaluation of the model's performance based on repeated invocations of inference.py.
| Component | Role |
|--------------------|---------------------------------------------------------------------------------------------------|
| evaluate_events | Calculates full classification evaluation metrics given a gold-standard DataFrame and a generated DataFrame. |
| evaluate_across_k | Handles inference + evaluation for a pre-trained model across all values of K. |
| plot_metrics_trend | Plots global evaluation metrics over K. |
| build_3x_matrix | Was the model able to predict a future RELEASE / COMPLICATION / DEATH? |
| build_full_outcome_matrix | Was the model able to predict a future specific OUTCOME (from dataset_config)? |
| build_timeaware_matrix | Was the model able to predict a future specific OUTCOME (from dataset_config) at the correct time? |
✅ Model Capabilities
- ✔️ Handles irregular time-series data using relative deltas and Time2Vec.
- ✔️ Captures both short- and long-range dependencies with deep transformer blocks.
- ✔️ Supports variable-length patient histories using custom collate and attention masks.
- ✔️ Imputes and predicts events in structured EMR timelines.
📚 Citation & Acknowledgments
This work builds on and adapts ideas from the following sources:
Time2Vec (Kazemi et al., 2019):
The temporal embedding design is adapted from the Time2Vec formulation.
📄 A. Kazemi, S. Ghamizi, A.-H. Karimi. "Time2Vec: Learning a Vector Representation of Time." NeurIPS 2019 Time Series Workshop.
arXiv:1907.05321
nanoGPT (Karpathy, 2023):
The training loop and transformer backbone are adapted from nanoGPT,
with modifications for multi-stream EMR inputs, multiple embeddings, and a k-step prediction loss.
Owner
- Login: shaharoded
- Kind: user
- Repositories: 3
- Profile: https://github.com/shaharoded
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this model or code, please cite the following work:"
title: "Predicting Clinical Outcomes and Assessing the Effects of Compliance to Diabetes Care Guidelines in Hospitals Using Time-Dependent Machine Learning Methods"
authors:
- family-names: Oded
given-names: Shahar
date-released: 2025-05-23
version: "1.0"
url: https://github.com/shaharoded/Transform-EMR
repository-code: https://github.com/shaharoded/Transform-EMR
license: CC-BY-NC-4.0
type: thesis
abstract: "This repository contains code and models for structured event prediction in EMR data using multi-phase training with a GPT architecture. Developed as part of a graduate thesis."
GitHub Events
Total
- Push event: 86
Last Year
- Push event: 86
Dependencies
- numpy *
- pandas *
- scikit-learn *
- torch *
- matplotlib *
- numpy *
- pandas *
- scikit-learn *
- torch >=1.13.1
- tqdm *