octess
An algorithm to automatically extract clinical and demographic data from Cirrus SD-OCT macular cube reports
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (16.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: MichaelBalas
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Size: 13.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
OCTess
Welcome to the repository for our research paper on automating data extraction from macular cube spectral domain optical coherence tomography (SD-OCT) reports using optical character recognition (OCR) and deep learning. The algorithm we developed, named OCTess (a portmanteau of OCT and Tesseract), is a highly accurate, efficient, and time-saving alternative to manual data extraction.
Summary
In this study, we developed OCTess, an OCR algorithm that automatically extracts clinical and demographic data from Cirrus SD-OCT macular cube reports. The algorithm combines multiple models from Tesseract, an open-source OCR library, with pixel-based bounding box coordinates for each field of interest in the macular cube report. Each field's image region is passed through a series of image processing operations and then converted to text.
OCTess extracts SD-OCT macular cube data with near-perfect accuracy, equivalent to that of a human extractor, while being significantly more efficient.
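The crop-then-recognize idea described in the summary can be illustrated with a minimal sketch. It is not the code in extract_OCT.py: the field names, the box coordinates, the fixed threshold of 128, and the choice of Tesseract page segmentation mode 7 (single text line) are all assumptions made for illustration. Only `pytesseract.image_to_string` is a real library call, and it needs the Tesseract binary installed.

```python
from typing import Dict, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical field boxes; placeholder numbers, not the coordinates OCTess uses.
FIELD_BOXES: Dict[str, Box] = {
    "patient_id": (10, 5, 60, 20),
    "central_thickness": (10, 30, 40, 45),
}


def crop_field(page: np.ndarray, box: Box) -> np.ndarray:
    """Cut one field out of a grayscale page image (rows are y, columns are x)."""
    left, top, right, bottom = box
    return page[top:bottom, left:right]


def binarize(region: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Map a grayscale crop to pure black and white.

    A stand-in for the real preprocessing chain, which is not specified here.
    """
    return np.where(region >= threshold, 255, 0).astype(np.uint8)


def read_field(page: np.ndarray, box: Box, psm: int = 7) -> str:
    """Crop, binarize, and OCR a single field (assumed to be one line of text)."""
    import pytesseract  # imported lazily; requires the Tesseract binary

    crop = binarize(crop_field(page, box))
    return pytesseract.image_to_string(crop, config=f"--psm {psm}").strip()
```

With a page loaded as a 2-D `uint8` array, `read_field(page, FIELD_BOXES["patient_id"])` would return the recognized text for that box. Because the boxes are fixed pixel coordinates, every page would need to be rendered at the same resolution for the crops to line up.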
Getting Started
To use OCTess, please follow these steps:
1. Clone this repository
2. Ensure you have the required dependencies installed, as listed in requirements.txt
3. Move your Cirrus SD-OCT PDF/PNG files into the Input/ directory. Alternatively, you can use the 5 example files that are already provided
4. Run the bash script ./run.sh to execute the OCR algorithm and validate the results using the provided dataset
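Step 3 accepts PDF files, which are converted to PNG before extraction. A conversion of that kind might look like the sketch below, which uses `convert_from_path` from pdf2image (listed in the dependencies). The 300 DPI setting, the output naming scheme, and the function names are assumptions, not a description of what pdf_to_img.py or run.sh actually do.

```python
from pathlib import Path
from typing import List


def png_name(pdf_path: Path, page_index: int, page_count: int) -> str:
    """Name the PNG for one page; multi-page PDFs get a 1-based page suffix."""
    if page_count == 1:
        return f"{pdf_path.stem}.png"
    return f"{pdf_path.stem}_p{page_index + 1}.png"


def convert_pdfs(input_dir: str = "Input", dpi: int = 300) -> List[Path]:
    """Render every PDF in input_dir to PNG files saved next to it."""
    from pdf2image import convert_from_path  # imported lazily; requires Poppler

    written: List[Path] = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
        for index, page in enumerate(pages):
            out_path = pdf_path.with_name(png_name(pdf_path, index, len(pages)))
            page.save(out_path, "PNG")
            written.append(out_path)
    return written
```

Keeping the DPI constant matters in a design like this, since bounding boxes expressed in pixels only match the report layout at the resolution they were measured for.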
Repository Structure
- Input/: Place your raw SD-OCT macular cube reports in this directory. Delete the example files if you do not need them
- tessdata/: Directory of saved Tesseract deep learning and legacy models
- patterns/: Regex pattern rules used for data extraction
- pdf_to_img.py: Python script to convert PDF files to PNG format (if they are not already PNG)
- extract_OCT.py: Python script to extract data from each PNG file, organize it into a table, and generate OCTess.xlsx
- verify_OCT.py: Python script that performs a series of verifications and highlights regions of OCTess.xlsx that may be erroneous
- requirements.txt: Lists the necessary dependencies for this project
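To give a feel for how regex rules and a verification pass can work together, here is a small sketch using only the standard library. The field names, patterns, and character substitutions are hypothetical; the actual rules are in patterns/ and the actual checks in verify_OCT.py, and they may differ substantially.

```python
import re
from typing import Dict, List

# Hypothetical per-field patterns; the project's real rules live in patterns/.
FIELD_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "exam_date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "central_thickness": re.compile(r"\d{2,4}"),
}

# Letters that OCR commonly confuses with digits in numeric fields (assumed list).
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def clean_numeric(raw: str) -> str:
    """Trim whitespace and replace look-alike letters with digits."""
    return raw.strip().translate(DIGIT_FIXES)


def flag_suspect_fields(row: Dict[str, str]) -> List[str]:
    """Return the fields whose value is missing or does not fully match its pattern."""
    suspects: List[str] = []
    for field, pattern in FIELD_PATTERNS.items():
        value = row.get(field, "")
        if pattern.fullmatch(value) is None:
            suspects.append(field)
    return suspects
```

For example, `flag_suspect_fields({"exam_date": "11/06/2023", "central_thickness": "2O5"})` flags `central_thickness` because of the letter `O`, while applying `clean_numeric` first turns the value into `205`, which passes. Flagged fields are the kind of cells a verification step could highlight for manual review.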
Contributing
We welcome contributions to improve the algorithm or expand its applicability. Please feel free to submit issues, pull requests, or contact the authors directly.
Author Contact:
Michael Balas: michael.balas@mail.utoronto.ca
Rajeev H. Muni: rajeev.muni@utoronto.ca
License
This project is licensed under the GNU GPLv3 License. See the LICENSE file for details.
Citation
If you use this code or the results from our research paper, please cite our work:
Balas, M., Herman, J., Bhambra, N., Longwell, J., Popovic, M., Melo, I., & Muni, R. (2023). OCTess: An Optical Character Recognition Algorithm for Automated Data Extraction of Spectral Domain Optical Coherence Tomography Reports. RETINA. https://doi.org/10.1097/IAE.0000000000003990

Owner
- Name: Michael Balas
- Login: MichaelBalas
- Kind: user
- Location: Toronto, ON
- Repositories: 1
- Profile: https://github.com/MichaelBalas
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Balas"
    given-names: "Michael"
    orcid: "https://orcid.org/0000-0002-5948-0331"
title: "My Research Software"
version: 1.0.0
doi: 10.1097/IAE.0000000000003990
date-released: 2023-11-06
url: "https://github.com/MichaelBalas/OCTess"
preferred-citation:
  type: article
  authors:
    - family-names: "Balas"
      given-names: "Michael"
      orcid: "https://orcid.org/0000-0002-5948-0331"
    - family-names: "Herman"
      given-names: "Josh"
    - family-names: "Bhambra"
      given-names: "Nishaant"
    - family-names: "Longwell"
      given-names: "Jack"
    - family-names: "Popovic"
      given-names: "Marko"
    - family-names: "Melo"
      given-names: "Isabela"
    - family-names: "Muni"
      given-names: "Rajeev"
  doi: "10.1097/IAE.0000000000003990"
  journal: "RETINA"
  month: 11
  #start: 1 # First page number
  #end: 10 # Last page number
  title: "OCTess: An Optical Character Recognition Algorithm for Automated Data Extraction of Spectral Domain Optical Coherence Tomography Reports"
  #issue: 1
  #volume: 1
  year: 2023
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Dependencies
- Jinja2 *
- numpy *
- opencv-python *
- openpyxl *
- pandas *
- pdf2image *
- pytesseract *