zensols-mimicsid

MIMIC-III corpus parsing and section prediction with MedSecId (COLING paper)

https://github.com/plandes/mimicsid

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary

Keywords

biomedical clinical docker medical mimic-iii natural-language-processing parsers

Repository

MIMIC-III corpus parsing and section prediction with MedSecId (COLING paper)

Basic Info
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

MIMIC-III corpus parsing and section prediction with MedSecId

This repository contains a Python package to automatically segment and identify sections of clinical notes, such as electronic health record (EHR) medical documents. It also provides access to the MedSecId section annotations with MIMIC-III corpus parsing from the paper A New Public Corpus for Clinical Section Identification: MedSecId. See the medsecid repository to reproduce the results from the paper.

This package provides the following:

  • The same access to MIMIC-III data as provided in the mimic package.
  • Access to the annotated MedSecId notes as an easy-to-use Python object graph.
  • Pretrained model inferencing, which produces a Python object graph similar to the annotations (it provides the class PredictedNote instead of AnnotatedNote).

Documentation

See the full documentation. The API reference is also available.

Installation

Because this library has many dependencies and many moving parts, it is best to create a new environment using conda:

    wget https://github.com/plandes/mimicsid/raw/refs/heads/master/environment.yml
    conda env create -f environment.yml
    conda activate mimicsid

The library can also be installed with pip from the PyPI repository: pip3 install zensols.mimicsid

The models used by the package are automatically downloaded on the first use.

If you only want to predict sections using the pretrained model, you only need to install the package. However, if you want to access the annotated notes, you must install a Postgres MIMIC-III database as described in the installation section of the mimic package.

Usage

This package provides models to predict sections of a medical note and access to the MIMIC-III section annotations available on Zenodo. The first time it is run it will take a while to download the annotation set and the pretrained models.

See the examples for the complete code and additional documentation.

Prediction Usage

The SectionPredictor class creates section annotation span IDs/types and header token spans. See the example below:

```python
from zensols.nlp import FeatureToken
from zensols.mimic import Section
from zensols.mimicsid import PredictedNote, ApplicationFactory
from zensols.mimicsid.pred import SectionPredictor

if (__name__ == '__main__'):
    # get the section predictor from the application context in the app
    section_predictor: SectionPredictor = ApplicationFactory.section_predictor()

    # read in a test note to predict
    with open('../../test-resources/note.txt') as f:
        content: str = f.read().strip()

    # predict the sections of the note that was read in and print it
    note: PredictedNote = section_predictor.predict([content])[0]
    note.write()

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)

    # concepts or special MIMIC tokens from the history of present illness section
    sec = note.sections_by_name['history-of-present-illness'][0]
    tok: FeatureToken
    for tok in sec.body_doc.token_iter():
        print(tok, tok.mimic_, tok.cui_)
```

Annotation Access

Annotated notes are provided as a Python Note class, which contains most of the MIMIC-III data from the NOTEEVENTS table. This includes not only the text, but parsed FeatureDocument instances. However, you must build a Postgres database and provide a login to it in the application as detailed below:

```python
from zensols.config import IniConfig
from zensols.mimic import Section
from zensols.mimicsid import ApplicationFactory
from zensols.mimic import Note
from zensols.mimicsid import AnnotatedNote, NoteStash

if (__name__ == '__main__'):
    # create a configuration with the Postgres database login
    config = IniConfig('db.conf')
    # get the dict like data structure that has notes by row_id
    note_stash: NoteStash = ApplicationFactory.note_stash(
        **config.get_options(section='mimic_postgres_conn_manager'))

    # get a note by `row_id`
    note: Note = note_stash[14793]

    # iterate through the note object graph
    sec: Section
    for sec in note.sections.values():
        print(sec.id, sec.name)
```

Models

You can mix and match section and header models (see Performance Metrics). By default, the package uses the best performing models, but you can select the model you want by adding a configuration file and specifying it on the command line with -c:

    [mimicsid_default]
    section_prediction_model = bilstm-crf-tok-fasttext
    header_prediction_model = bilstm-crf-tok-glove-300d
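
As a minimal sketch, the configuration above could also be generated and checked from Python with the standard library's configparser. The helper names write_model_config and read_model_config and the file name models.conf are hypothetical and not part of this package. Only the section and option names come from the configuration shown above.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Dict


def write_model_config(path: Path, section_model: str, header_model: str) -> None:
    """Write an INI file that selects the section and header models."""
    parser = configparser.ConfigParser()
    parser['mimicsid_default'] = {
        'section_prediction_model': section_model,
        'header_prediction_model': header_model,
    }
    with open(path, 'w') as f:
        parser.write(f)


def read_model_config(path: Path) -> Dict[str, str]:
    """Read the model selection back as a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser['mimicsid_default'])


if __name__ == '__main__':
    # use a temporary directory so the sketch leaves no files behind
    with tempfile.TemporaryDirectory() as tmp:
        conf_path = Path(tmp) / 'models.conf'  # hypothetical file name
        write_model_config(
            conf_path, 'bilstm-crf-tok-fasttext', 'bilstm-crf-tok-glove-300d')
        print(read_model_config(conf_path))
```

A file written this way would then be passed to the command line with -c as described above.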

The resources live on Zenodo and are automatically downloaded to the ~/.cache directory (or a similar home directory on Windows) the first time the program is used.

MedCAT Models

The mednlp package dependency uses the default MedCAT model.

Performance Metrics

The distributed models add the test set to the training set to improve inference performance, which is why only the validation metrics are given. The validation set performance of the pretrained models is given below, where:

  • wF1 is the weighted F1
  • mF1 is the micro F1
  • MF1 is the macro F1
  • acc is the accuracy
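
To make the four metrics concrete, the following is a minimal sketch using the conventional definitions of weighted, micro and macro averaging for single-label predictions. It is an assumption that the reported numbers were computed this way. The function names and the toy labels are illustrative only and are not part of this package.

```python
from collections import Counter
from typing import Dict, Sequence


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compute weighted, micro and macro F1 and accuracy for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # label p was predicted, but it was wrong
            fn[g] += 1  # gold label g was missed
    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold instances per label
    per_label = {lb: f1(tp[lb], fp[lb], fn[lb]) for lb in labels}
    n = len(gold)
    return {
        # per-label F1 averaged with weights proportional to label support
        'wF1': sum(per_label[lb] * support[lb] for lb in labels) / n,
        # F1 over counts pooled across all labels
        'mF1': f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # unweighted mean of the per-label F1 scores
        'MF1': sum(per_label.values()) / len(labels),
        'acc': sum(tp.values()) / n,
    }


if __name__ == '__main__':
    # toy labels, chosen only for illustration
    gold = ['hpi', 'hpi', 'hpi', 'meds', 'meds', 'labs']
    pred = ['hpi', 'hpi', 'meds', 'meds', 'meds', 'labs']
    print({k: round(v, 3) for k, v in summarize(gold, pred).items()})
```

Under these definitions, micro F1 equals accuracy whenever every token receives exactly one label, which is consistent with the identical mF1 and acc columns in the tables of this section.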

Fundamental API changes have necessitated subsequent versions of the model. Each version of this package is tied to a model version. While minor changes in each version might introduce language parsing differences, such as sentence chunking, the resulting differences in metrics are most likely statistically insignificant.

Version 0.1.1

This version was released to accommodate Zensols framework upgrades.

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| BiLSTM-CRF_tok (fastText)     | Section | bilstm-crf-tok-fasttext-section-type   | 0.921 | 0.929 | 0.787 | 0.929 |
| BiLSTM-CRF_tok (GloVE 300D)   | Section | bilstm-crf-tok-glove-300d-section-type | 0.939 | 0.944 | 0.841 | 0.944 |
| BiLSTM-CRF_tok (fastText)     | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.961 | 0.996 |
| BiLSTM-CRF_tok (GloVE 300D)   | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.962 | 0.996 |

Version 0.1.0

Adding biomedical NER improved the 0.1.0 models (see Model Differences). In addition to the reported validation scores of the production models below, the BiLSTM-CRF_tok (GloVE 300D) section model achieved an improved weighted F1 of 0.9572, micro F1 of 0.959, and macro F1 of 0.8163.

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| BiLSTM-CRF_tok (fastText)     | Section | bilstm-crf-tok-fasttext-section-type   | 0.923 | 0.933 | 0.764 | 0.933 |
| BiLSTM-CRF_tok (GloVE 300D)   | Section | bilstm-crf-tok-glove-300d-section-type | 0.936 | 0.941 | 0.810 | 0.941 |
| BiLSTM-CRF_tok (fastText)     | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.961 | 0.996 |
| BiLSTM-CRF_tok (GloVE 300D)   | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.964 | 0.996 |

Version 0.0.3

This version was released to accommodate Zensols framework upgrades.

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| BiLSTM-CRF_tok (fastText)     | Section | bilstm-crf-tok-fasttext-section-type   | 0.911 | 0.917 | 0.792 | 0.917 |
| BiLSTM-CRF_tok (GloVE 300D)   | Section | bilstm-crf-tok-glove-300d-section-type | 0.929 | 0.933 | 0.810 | 0.933 |
| BiLSTM-CRF_tok (fastText)     | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.965 | 0.996 |
| BiLSTM-CRF_tok (GloVE 300D)   | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.962 | 0.996 |

Version 0.0.2

This version was released to accommodate Zensols framework upgrades.

| Name                          | Type    | Id                                     | wF1   | mF1   | MF1   | acc   |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| BiLSTM-CRF_tok (fastText)     | Section | bilstm-crf-tok-fasttext-section-type   | 0.918 | 0.925 | 0.797 | 0.925 |
| BiLSTM-CRF_tok (GloVE 300D)   | Section | bilstm-crf-tok-glove-300d-section-type | 0.917 | 0.922 | 0.809 | 0.922 |
| BiLSTM-CRF_tok (fastText)     | Header  | bilstm-crf-tok-fasttext-header         | 0.996 | 0.996 | 0.959 | 0.996 |
| BiLSTM-CRF_tok (GloVE 300D)   | Header  | bilstm-crf-tok-glove-300d-header       | 0.996 | 0.996 | 0.962 | 0.996 |

Differences from the Paper Repository

The paper's medsecid repository has quite a few differences, mostly around reproducibility, whereas this repository is designed to be a package used for research that applies the model. To reproduce the results of the paper, please refer to the medsecid repository. To use the best performing model (the BiLSTM-CRF token model) from that paper, use this repository.

Perhaps the largest difference is that this repository has a pretrained model and code for header tokens. This is a separate model whose header token predictions are "merged" with the section ID/type predictions.
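
This document does not spell out how the two sets of predictions are merged. As a rough illustration only, the sketch below attaches each predicted header span to the section span that contains it. The SectionSpan class, the merge_headers function, the character-offset representation and the example text are all assumptions made for this sketch and are not the package's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


@dataclass
class SectionSpan:
    """A hypothetical predicted section: a type label and its character span."""
    section_id: str
    span: Span
    header_spans: List[Span] = field(default_factory=list)


def merge_headers(sections: List[SectionSpan],
                  headers: List[Span]) -> List[SectionSpan]:
    """Attach each header span to the first section that fully contains it."""
    for h_start, h_end in headers:
        for sec in sections:
            s_start, s_end = sec.span
            if s_start <= h_start and h_end <= s_end:
                sec.header_spans.append((h_start, h_end))
                break
    return sections


if __name__ == '__main__':
    # made-up note text; offsets are computed rather than hard coded
    text = 'HISTORY OF PRESENT ILLNESS: chest pain. MEDICATIONS: aspirin.'
    med = text.index('MEDICATIONS')
    sections = [
        SectionSpan('history-of-present-illness', (0, med)),
        SectionSpan('medications', (med, len(text))),
    ]
    headers = [(0, len('HISTORY OF PRESENT ILLNESS')),
               (med, med + len('MEDICATIONS'))]
    for sec in merge_headers(sections, headers):
        print(sec.section_id, [text[s:e] for s, e in sec.header_spans])
```

Keeping the header spans as a separate list on each section mirrors the idea that the header model and the section model are trained and run independently.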

The differences in performance between the section ID/type models and the metrics reported in the paper involve several factors. The primary difference is that the released models were trained on the test data, with only validation performance metrics reported, to increase the pretrained model performance. Other changes include:

  • Uses the mednlp package, which uses MedCAT to parse clinical medical text. This includes changes such as fixing misspellings and expanding acronyms.
  • Uses the mimic package, which builds on the mednlp package and parses MIMIC-III text by configuring the spaCy tokenizer to deal with pseudo tokens (e.g. [**First Name**]). This is a significant change given how these tokens are treated between the models and term mapping (Pt. becomes patient). This was changed so the model will work well on non-MIMIC data.
  • Feature set differences, such as those provided by the Zensols Deep NLP package.
  • Model changes include LSTM hidden layer parameter size and activation function.
  • White space tokens are removed in the medsecid repository and added back in this package to give additional cues to the model on when to break a section. However, this might have had the opposite effect.
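
To illustrate what the pseudo tokens mentioned in the list above look like in raw text, here is a minimal regular expression sketch. It is not how the mimic package configures the spaCy tokenizer. The pattern, the function names, the mask string and the example sentence are assumptions made for illustration.

```python
import re
from typing import List

# matches MIMIC-III style placeholders such as [**First Name**]
PSEUDO_TOKEN = re.compile(r'\[\*\*(.*?)\*\*\]')


def find_pseudo_tokens(text: str) -> List[str]:
    """Return the inner text of every pseudo token found in ``text``."""
    return PSEUDO_TOKEN.findall(text)


def mask_pseudo_tokens(text: str, mask: str = '<PSEUDO>') -> str:
    """Replace every pseudo token with a single mask string."""
    return PSEUDO_TOKEN.sub(mask, text)


if __name__ == '__main__':
    # made-up sentence in the style of a de-identified note
    sent = 'Pt. seen by Dr. [**Last Name**] on [**2101-3-4**].'
    print(find_pseudo_tokens(sent))
    print(mask_pseudo_tokens(sent))
```

Treating each bracketed placeholder as one unit, rather than letting a tokenizer split it on punctuation, is the kind of handling the bullet on pseudo tokens refers to.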

There are also changes in the Python interpreter and libraries used:

  • Python was upgraded from 3.9 to 3.11.
  • PyTorch was upgraded from 1.9.1 to 2.1.2.
  • spaCy was upgraded from 3.0.7 to 3.6.1.
  • HuggingFace Transformers was upgraded from 4.11.3 to 4.35.2.
  • scispaCy was upgraded from 0.4.0 to 0.5.3.

Model Differences

Starting with Version 0.1.0, named entities include those predicted from the scispaCy biomedical NER (en_ner_bionlp13cg_md) trained model. Compressed model files are also smaller in size.

Training

This document explains how to create and package models for distribution.

Preprocessing Step

  1. To train the model, first install the MIMIC-III Postgres database per the mimic package instructions in the Installation section.
  2. Copy the system configuration file: cp config/system-template.conf config/system.conf
  3. Add the MIMIC-III Postgres credentials and database configuration to config/system.conf.
  4. Vectorize the batches using the preprocessing script: ./src/bin/preprocess.sh. This also creates cached hospital admission and spaCy data parse files.

Training and Testing

To get performance metrics on the test set after training on the training set, use the command ./dist traintest -c config/glove300.conf for the section ID model. The configuration file can be any of those in the models directory. For the header model use:

    ./dist traintest -c config/glove300.conf --override mimicsid_default.model_type=header

Training Production Models

TL;DR: if you're feeling lucky:

  1. Update the new model version in:
    • resources/default.conf for property msid_model:version.
    • dist-resources/app.conf for property deeplearn_model_packer:version.
  2. Run detached from the console since it will take about a day to train all four models: nohup src/bin/all.sh > train.log 2>&1 &
  3. Recreate the environment file: make envfile

However, there are many moving parts and libraries with many things that can go wrong. More in-depth training instructions follow.

To train models for use in your projects, train on both the training and test sets. This still leaves the validation set to decide when to save the model, namely on epochs where the validation loss decreases:

  1. Update the version in the deeplearn_model_packer section in file dist-resources/app.conf.
  2. Update the same version in the msid_model section in file resources/default.conf.
  3. Preprocess the data (see the preprocessing section).
  4. Important: Remember to remove the passwords and database configuration in config/system.conf: cp config/system.conf config/system-sensitive-data.conf; cat /dev/null > config/system.conf
  5. Run the script that trains the models and packages them: src/bin/package.sh.
  6. Revert the configuration files: git checkout -- dist-resources/app.conf resources/default.conf
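
The role the validation set plays in this section, deciding when to save the model, can be sketched in a few lines. This is a generic illustration of saving on validation loss decrease. The framework's actual training loop is not shown in this document, and the function name and loss values below are made up.

```python
from typing import Iterable, List


def epochs_to_save(valid_losses: Iterable[float]) -> List[int]:
    """Return the zero-based epochs at which a checkpoint would be saved.

    A checkpoint is saved only when the validation loss is lower than the
    loss of every earlier epoch.
    """
    best = float('inf')
    saved: List[int] = []
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved


if __name__ == '__main__':
    # made-up validation losses for five epochs
    print(epochs_to_save([0.9, 0.7, 0.75, 0.6, 0.65]))
```

Because the test set is folded into the training data for production models, the validation loss is the only held-out signal left to guard against saving an overfit model.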

Citation

If you use this project in your research please use the following BibTeX entry:

    @inproceedings{landes-etal-2022-new,
        title = "A New Public Corpus for Clinical Section Identification: {M}ed{S}ec{I}d",
        author = "Landes, Paul and Patel, Kunal and Huang, Sean S. and Webb, Adam and Di Eugenio, Barbara and Caragea, Cornelia",
        booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
        month = oct,
        year = "2022",
        address = "Gyeongju, Republic of Korea",
        publisher = "International Committee on Computational Linguistics",
        url = "https://aclanthology.org/2022.coling-1.326",
        pages = "3709--3721"
    }

Also please cite the Zensols Framework:

    @inproceedings{landes-etal-2023-deepzensols,
        title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
        author = "Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia",
        editor = "Tan, Liling and Milajevs, Dmitrijs and Chauhan, Geeticka and Gwinnup, Jeremy and Rippeth, Elijah",
        booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
        month = dec,
        year = "2023",
        address = "Singapore, Singapore",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.nlposs-1.16",
        pages = "141--146"
    }

Docker

A docker image is now available as well.

To use the docker image, do the following:

  1. Create (or obtain) the Postgres docker image.
  2. Clone this repository: git clone --recurse-submodules https://github.com/plandes/mimicsid
  3. Set the working directory to the repo: cd mimicsid
  4. Copy the configuration from the installed mimicdb image configuration: make -C docker/mimicdb SRC_DIR=<cloned mimicdb directory> cpconfig
  5. Start the container: make -C docker/app up
  6. Test sectioning a document: make -C docker/app testdumpsec
  7. Log in to the container: make -C docker/app devlogin
  8. Output a note to a temporary file: mimic note 1118471 > note.txt
  9. Predict the sections on the note: mimicsid predict note.txt
  10. Look at the section predictions: cat preds/note-pred.txt

Changelog

An extensive changelog is available here.

Community

Please star this repository and let me know how and where you use this API. Contributions as pull requests, feedback and any other input are welcome.

License

MIT License

Copyright (c) 2022 - 2025 Paul Landes

Owner

  • Name: Paul Landes
  • Login: plandes
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
title: 'Baseline model for the MedSecId paper'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
date-released: 2022-10-12
repository-code: https://github.com/plandes/deeplearn
authors:
  - given-names: Paul
    family-names: Landes
    email: landes@mailc.net
    affiliation: University of Illinois at Chicago
    orcid: 'https://orcid.org/0000-0003-0985-0864'
preferred-citation:
  type: conference-paper
  authors:
    - given-names: Paul
      family-names: Landes
      email: landes@mailc.net
      affiliation: University of Illinois at Chicago
      orcid: 'https://orcid.org/0000-0003-0985-0864'
    - given-names: Kunal
      family-names: Patel
      affiliation: Department of Emergency Medicine, University of Illinois at Chicago
    - given-names: Sean
      family-names: Huang
      affiliation: Department of Internal Medicine and Geriatrics, University of Illinois at Chicago
    - given-names: Adam
      family-names: Webb
      affiliation: Department of Emergency Medicine, University of Illinois at Chicago
    - given-names: Barbara
      family-names: Di Eugenio
      affiliation: University of Illinois at Chicago
    - given-names: Cornelia
      family-names: Caragea
      affiliation: University of Illinois at Chicago
  title: 'A New Public Corpus for Clinical Section Identification: MedSecId'
  url: https://aclanthology.org/2022.coling-1.326
  year: 2022
  conference:
    name: Proceedings of the 29th International Conference on Computational Linguistics
    city: Gyeongju
    country: KR
    date-start: 2022-10-12
    date-end: 2022-10-17

GitHub Events

Total
  • Issues event: 2
  • Watch event: 2
  • Issue comment event: 3
  • Push event: 17
  • Create event: 3
Last Year
  • Issues event: 2
  • Watch event: 2
  • Issue comment event: 3
  • Push event: 17
  • Create event: 3

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 212
  • Total Committers: 1
  • Avg Commits per committer: 212.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 23
  • Committers: 1
  • Avg Commits per committer: 23.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Paul Landes l****s@m****t 212
Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 16 days
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 7.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 18 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • griff4692 (1)
  • mxhm (1)
  • stevenbedrick (1)
  • evanbrociner (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 57 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 15
  • Total maintainers: 1
pypi.org: zensols-mimicsid

This repository contains a Python package to automatically segment and identify sections of clinical notes, such as electronic health record (EHR) medical documents.

  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 57 Last month
Rankings
Dependent packages count: 8.9%
Downloads: 20.2%
Average: 26.5%
Dependent repos count: 50.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

src/python/requirements.txt pypi
  • zensols.deepnlp *
  • zensols.mimic *
docker/app/Dockerfile docker
  • debian 12.0 build
docker/app/docker-compose.yml docker
  • plandes/mimicsid latest
  • postgres 9.6
src/python/requirements-model.txt pypi
src/python/setup.py pypi