octess

An algorithm to automatically extract clinical and demographic data from Cirrus SD-OCT macular cube reports

https://github.com/michaelbalas/octess



Repository

Basic Info

Host: GitHub
Owner: MichaelBalas
License: gpl-3.0
Language: Python
Default Branch: master
Size: 13.5 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

OCTess

Welcome to the repository for our research paper on automating data extraction from macular cube spectral domain optical coherence tomography (SD-OCT) reports using optical character recognition (OCR) and deep learning. The algorithm we developed, named OCTess (a portmanteau of OCT and Tesseract), is a highly accurate, efficient, and time-saving alternative to manual data extraction.

Summary

In this study, we developed an OCR algorithm, OCTess, to automatically extract clinical and demographic data from Cirrus SD-OCT macular cube reports. Our algorithm uses multiple models from Tesseract, an open-source OCR software library, together with pixel-based bounding box coordinates for each field of interest in the macular cube report. The region extracted for each field is processed through a series of image processing operations and then converted to text.

OCTess extracts SD-OCT macular cube data with near-perfect accuracy, equivalent to that of a human, while being significantly more efficient.
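
To make the approach described in the Summary more concrete, here is a minimal sketch of reading a single field: crop a fixed pixel bounding box from the report image, apply some simple preprocessing, and pass the result to Tesseract. The field names, coordinates, and preprocessing steps are hypothetical placeholders chosen for illustration, not the values or operations OCTess actually uses. The only library calls assumed are standard OpenCV functions and `pytesseract.image_to_string`.

```python
import numpy as np

# Hypothetical bounding boxes as (left, top, right, bottom) in pixels.
# The real coordinates used by OCTess are not reproduced here.
FIELD_BOXES = {
    "patient_id": (100, 50, 400, 90),
    "central_subfield_thickness": (620, 900, 700, 940),
}


def crop_field(page: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Return the sub-image covered by a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return page[top:bottom, left:right]


def preprocess(crop: np.ndarray) -> np.ndarray:
    """Grayscale, upscale 2x, and binarize a crop (assumed steps)."""
    import cv2  # opencv-python, listed in requirements.txt

    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY) if crop.ndim == 3 else crop
    gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return binary


def read_field(page: np.ndarray, name: str) -> str:
    """OCR a single named field from a report page image."""
    import pytesseract  # requires a local Tesseract installation

    crop = preprocess(crop_field(page, FIELD_BOXES[name]))
    # --psm 7 tells Tesseract to treat the image as a single text line.
    return pytesseract.image_to_string(crop, config="--psm 7").strip()
```

Fixed coordinates only work because every report shares the same layout. A report rendered at a different resolution would need its boxes rescaled first.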

Getting Started

To use OCTess, please follow these steps:

1. Clone this repository.
2. Ensure you have the required dependencies installed, as listed in requirements.txt.
3. Move your Cirrus SD-OCT PDF/PNG files into the Input/ directory. Alternatively, you can use the 5 example files that are already provided.
4. Run the bash script ./run.sh to execute the OCR algorithm and validate the results using the provided dataset.
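
Because the Input/ directory may hold a mix of PDF and PNG reports, the sketch below shows one way such a directory could be sorted and its PDFs rendered to PNG. The function names and the output naming scheme are hypothetical and are not taken from pdf_to_img.py. `convert_from_path` is the pdf2image function and needs Poppler installed on the system.

```python
from pathlib import Path


def split_inputs(input_dir: Path) -> tuple[list[Path], list[Path]]:
    """Return (pdfs, pngs) found directly inside input_dir, sorted by name."""
    files = sorted(p for p in input_dir.iterdir() if p.is_file())
    pdfs = [p for p in files if p.suffix.lower() == ".pdf"]
    pngs = [p for p in files if p.suffix.lower() == ".png"]
    return pdfs, pngs


def pdf_to_png(pdf_path: Path, dpi: int = 300) -> list[Path]:
    """Render each page of a PDF to a PNG next to the source file."""
    from pdf2image import convert_from_path  # needs Poppler on the PATH

    outputs = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for index, page in enumerate(pages, start=1):
        # Hypothetical naming scheme: <stem>_page<N>.png
        out = pdf_path.with_name(f"{pdf_path.stem}_page{index}.png")
        page.save(out, "PNG")
        outputs.append(out)
    return outputs
```

The rendering resolution matters if later steps rely on fixed pixel coordinates, so whichever `dpi` is chosen should stay constant across all reports.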

Repository Structure

Input/: Input your raw SD-OCT macular cube reports in this directory. Delete the example files if you do not need them

tessdata/: Directory of saved Tesseract deep learning and legacy models

patterns/: Regex pattern rules used for data extraction

pdf_to_img.py: Python script to convert PDF files to PNG format (if they are not already PNG)

extract_OCT.py: Python script to extract data from each PNG file, organize it into a table and generate OCTess.xlsx

verify_OCT.py: Python script that performs a series of verifications and highlights regions of OCTess.xlsx that may be erroneous

requirements.txt: Lists the necessary dependencies for this project
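
The patterns/ and verify_OCT.py entries above describe regex-based extraction rules and post-hoc verification. As a rough illustration of both ideas, the sketch below pulls a numeric value out of raw OCR text and flags rows whose value is missing or implausible. The pattern, the field name, and the plausibility limits are assumptions made for this example and are not taken from the repository.

```python
import re
from typing import Optional

# Hypothetical pattern: a 2-4 digit integer, optionally followed by a unit.
THICKNESS_PATTERN = re.compile(r"(\d{2,4})\s*(?:µm|um)?")

# Hypothetical plausibility limits (micrometres), used only for this example.
PLAUSIBLE_RANGE = {"central_subfield_thickness": (100, 1000)}


def parse_thickness(raw: str) -> Optional[int]:
    """Extract the first 2-4 digit integer from OCR output, or None if absent."""
    match = THICKNESS_PATTERN.search(raw)
    return int(match.group(1)) if match else None


def flag_rows(rows: list[dict], field: str) -> list[int]:
    """Return indices of rows whose value is missing or outside the range."""
    low, high = PLAUSIBLE_RANGE[field]
    flagged = []
    for index, row in enumerate(rows):
        value = row.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(index)
    return flagged
```

Flagged row indices could then be highlighted in the output spreadsheet for manual review, for instance with an openpyxl `PatternFill`.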

Contributing

We welcome contributions to improve the algorithm or expand its applicability. Please feel free to submit issues, pull requests, or contact the authors directly.

Author Contact:

Michael Balas: michael.balas@mail.utoronto.ca

Rajeev H. Muni: rajeev.muni@utoronto.ca

License

This project is licensed under the GNU GPLv3 License. See the LICENSE file for details.

Citation

If you use this code or the results from our research paper, please cite our work:

Balas, M., Herman, J., Bhambra, N., Longwell, J., Popovic, M., Melo, I., & Muni, R. (2023). OCTess: An Optical Character Recognition Algorithm for Automated Data Extraction of Spectral Domain Optical Coherence Tomography Reports. RETINA. https://doi.org/10.1097/IAE.0000000000003990

Owner

Name: Michael Balas
Login: MichaelBalas
Kind: user
Location: Toronto, ON

Repositories: 1
Profile: https://github.com/MichaelBalas

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Balas"
  given-names: "Michael"
  orcid: "https://orcid.org/0000-0002-5948-0331"
title: "My Research Software"
version: 1.0.0
doi: 10.1097/IAE.0000000000003990
date-released: 2023-11-06
url: "https://github.com/MichaelBalas/OCTess"
preferred-citation:
  type: article
  authors:
  - family-names: "Balas"
    given-names: "Michael"
    orcid: "https://orcid.org/0000-0002-5948-0331"
  - family-names: "Herman"
    given-names: "Josh"
  - family-names: "Bhambra"
    given-names: "Nishaant"
  - family-names: "Longwell"
    given-names: "Jack"
  - family-names: "Popovic"
    given-names: "Marko"
  - family-names: "Melo"
    given-names: "Isabela"
  - family-names: "Muni"
    given-names: "Rajeev"
  doi: "10.1097/IAE.0000000000003990"
  journal: "RETINA"
  month: 11
  #start: 1 # First page number
  #end: 10 # Last page number
  title: "OCTess: An Optical Character Recognition Algorithm for Automated Data Extraction of Spectral Domain Optical Coherence Tomography Reports"
  #issue: 1
  #volume: 1
  year: 2023

GitHub Events

Total

Watch event: 1
Fork event: 1

Last Year

Watch event: 1
Fork event: 1

Dependencies

requirements.txt pypi

Jinja2 *
numpy *
opencv-python *
openpyxl *
pandas *
pdf2image *
pytesseract *
