Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: TheLion-ai
  • License: cc-by-4.0
  • Language: Python
  • Default Branch: main
  • Size: 992 KB
Statistics
  • Stars: 51
  • Watchers: 5
  • Forks: 5
  • Open Issues: 31
  • Releases: 0
Created over 2 years ago · Last pushed 11 months ago
Metadata Files
Readme Citation

README.md

UMIE_datasets


🤩 About the Project

Warning: This project is currently in its alpha stage and may undergo major changes.

This repository provides a suite of unified scripts to standardize, preprocess, and integrate 882,774 images from 20 open-source medical imaging datasets, spanning modalities such as X-ray, CT, and MRI. The scripts allow for seamless and fast download of a diverse medical dataset. We create a unified set of annotations that allows the datasets to be merged without mislabelling. Each dataset is preprocessed with a custom sklearn pipeline whose steps are reusable across datasets. The code was designed so that preprocessing a new dataset is simple: it requires only reusing the available pipeline steps, with customization performed by setting the appropriate pipeline params.

The labels and segmentation masks were unified to be compliant with RadLex ontology.
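The reusable-step design described above can be sketched as follows. This is a simplified, dependency-free stand-in for the project's actual sklearn transformers; every class, parameter, and file name here is illustrative, not the real UMIE API:

```python
# Simplified stand-in for the reusable-pipeline-step pattern.
# The real UMIE steps are sklearn transformers configured via pipeline
# params; the names below are illustrative only.

class ConvertToPng:
    """Hypothetical step: map source file names to .png file names."""

    def __init__(self, target_ext: str = ".png"):
        self.target_ext = target_ext  # pipeline param customizes the step

    def transform(self, paths):
        return [p.rsplit(".", 1)[0] + self.target_ext for p in paths]


class AddUmiePrefix:
    """Hypothetical step: prefix files with a unified dataset uid."""

    def __init__(self, uid: int):
        self.uid = uid

    def transform(self, paths):
        return [f"{self.uid}_{p}" for p in paths]


def run_pipeline(steps, data):
    """Apply each step in order, like sklearn's Pipeline.transform."""
    for step in steps:
        data = step.transform(data)
    return data


files = ["scan_001.dcm", "scan_002.nii"]
out = run_pipeline([ConvertToPng(), AddUmiePrefix(uid=0)], files)
# out == ["0_scan_001.png", "0_scan_002.png"]
```

Because each step only exposes `transform` plus a few params, the same steps can be recombined per dataset; adding a new dataset then amounts to choosing steps and setting their params.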

Preprocessing modules

Datasets

| uid | Dataset | Modality | Task |
|:---:|:---|:---:|:---:|
| 0 | KITS-23 | CT | Classification/Segmentation |
| 1 | CoronaHack | XRAY | Classification |
| 2 | Alzheimers Dataset | MRI | Classification |
| 3 | Brain Tumor Classification | MRI | Classification |
| 4 | COVID-19 Detection X-Ray | XRAY | Classification |
| 5 | Finding and Measuring Lungs in CT Data | CT | Segmentation |
| 6 | Brain CT Images with Intracranial Hemorrhage Masks | CT | Classification |
| 7 | Liver and Liver Tumor Segmentation | CT | Classification, Segmentation |
| 8 | Brain MRI Images for Brain Tumor Detection | MRI | Classification |
| 9 | Knee Osteoarthritis Dataset with Severity Grading | XRAY | Classification |
| 10 | Brain Tumor Progression | MRI | Segmentation |
| 11 | Chest X-ray 14 | XRAY | Classification |
| 12 | COCA - Coronary Calcium and chest CTs | CT | Segmentation |
| 13 | BrainMetShare | MRI | Segmentation |
| 14 | CT-ORG | CT | Segmentation |
| 17 | LIDC-IDRI | CT | Segmentation |
| 18 | CMMD | MG | Classification |

Using the datasets

Installing requirements

```bash
poetry install
```

Creating the dataset

Due to the copyright restrictions of the source datasets, we can't share the files directly. To obtain the full dataset you have to download the source datasets yourself and run the preprocessing scripts.

0. KITS-23

  1. Clone the KITS-23 repository.
  2. Enter the KITS-23 directory and install the packages with pip.

```bash
cd kits23
pip3 install -e .
```

  3. Run the following command to download the data to the `dataset/` folder.

```bash
kits23_download_data
```

  4. Fill in the `source_path` and `target_path` of `KITS23Pipeline()` in `config/runner_config.py`, e.g.

```python
KITS23Pipeline(
    path_args={
        "source_path": "kits23/dataset",  # Path to the dataset directory in the KITS23 repo
        "target_path": TARGET_PATH,
        "labels_path": "kits23/dataset/kits23.json",  # Path to kits23.json
    },
    dataset_args=dataset_config.KITS23,
),
```

1. Xray CoronaHack - Chest X-Ray Dataset

  1. Go to the CoronaHack page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `CoronaHackPipeline()` in `config/runner_config.py`.

2. Alzheimer's Dataset (4 classes of images)

  1. Go to the Alzheimer's Dataset page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `AlzheimersPipeline()` in `config/runner_config.py`.

3. Brain Tumor Classification (MRI)

  1. Go to the [Brain Tumor Classification](https://www.kaggle.com/datasets/nikhilpandey360/chest-xray-masks-and-labels) page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `BrainTumorClassificationPipeline()` in `config/runner_config.py`.
4. COVID-19 Detection X-Ray

  1. Go to the [COVID-19 Detection X-Ray](https://www.kaggle.com/datasets/darshan1504/covid19-detection-xray-dataset) page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Remove the **TrainData** folder; we do not want augmented data at this stage.
  6. Fill in the `source_path` with the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`.
5. Finding and Measuring Lungs in CT Data

  1. Go to the [Finding and Measuring Lungs in CT Data](https://www.kaggle.com/datasets/kmader/finding-lungs-in-ct-data) page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive/2d_images` folder in `FindingAndMeasuringLungsPipeline()` in `config/runner_config.py`. Fill in `masks_path` with the location of the `archive/2d_masks` folder.
6. Brain CT Images with Intracranial Hemorrhage Masks

  1. Go to the [Brain With Intracranial Hemorrhage](https://www.kaggle.com/datasets/vbookshelf/computed-tomography-ct-images) page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `BrainWithIntracranialHemorrhagePipeline()` in `config/runner_config.py`. Fill in `masks_path` with the same path as the `source_path`.
7. Liver and Liver Tumor Segmentation (LITS)

  1. Go to [Liver and Liver Tumor Segmentation](https://www.kaggle.com/datasets/andrewmvd/lits-png).
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`. Fill in `masks_path` too.
8. Brain MRI Images for Brain Tumor Detection

  1. Go to the [Brain MRI Images for Brain Tumor Detection](https://www.kaggle.com/datasets/jjprotube/brain-mri-images-for-brain-tumor-detection) page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `BrainTumorDetectionPipeline()` in `config/runner_config.py`.
9. Knee Osteoarthritis Dataset with Severity Grading

  1. Go to [Knee Osteoarthritis Dataset with Severity Grading](https://www.kaggle.com/datasets/shashwatwork/knee-osteoarthritis-dataset-with-severity).
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract `archive.zip`.
  5. Fill in the `source_path` with the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`.
10. Brain Tumor Progression

  1. Go to the [Brain Tumor Progression](https://wiki.cancerimagingarchive.net/display/Public/Brain-Tumor-Progression#339481190e2ccc0d07d7455ab87b3ebb625adf48) dataset on The Cancer Imaging Archive.
11. Chest X-ray 14

  1. Go to [Chest X-ray 14](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345).
  2. Create an account.
  3. Download the `images` folder and `DataEntry2017_v2020.csv`.
12. COCA - Coronary Calcium and chest CTs

  1. Go to [COCA - Coronary Calcium and chest CTs](https://stanfordaimi.azurewebsites.net/datasets/e8ca74dc-8dd4-4340-815a-60b41f6cb2aa).
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
  5. Fill in the `source_path` with the location of the `cocacoronarycalciumandchestcts-2/Gated_release_final/patient` folder. Fill in `masks_path` with the location of the `cocacoronarycalciumandchestcts-2/Gated_release_final/calcium_xml` XML file.
13. BrainMetShare

  1. Go to [BrainMetShare](https://aimi.stanford.edu/brainmetshare).
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with [azcopy](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10).
14. CT-ORG

  1. Go to the [CT-ORG](https://www.cancerimagingarchive.net/collection/ct-org/) page on The Cancer Imaging Archive.
  2. Download the data.
  3. Extract `PKG - CT-ORG`.
  4. Fill in the `source_path` with the location of the `OrganSegmentations` folder in `CtOrgPipeline()` in `config/runner_config.py`. Fill in `masks_path` with the same path as the `source_path`.
17. LIDC-IDRI

  1. Go to [LIDC-IDRI](https://www.cancerimagingarchive.net/collection/lidc-idri/).
  2. Download "Images" using the [NBIA Data Retriever](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images), and "Radiologist Annotations/Segmentations".
  3. Extract `LIDC-XML-only.zip`.
  4. Fill in the `source_path` in `CmmdPipeline()` in `config/runner_config.py` with the location of the `manifest-{xxxxxxxxxxxxx}/LIDC-IDRI` directory.
  5. Fill in the `masks_path` in `CmmdPipeline()` in `config/runner_config.py` with the location of the `LIDC-XML-only/` directory.
18. CMMD - The Chinese Mammography Database

  1. Go to [CMMD](https://www.cancerimagingarchive.net/collection/cmmd/).
  2. Download the `.tcia` file from the Data Access table.
  3. Download the [NBIA Data Retriever](https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images) to be able to download the images.
  4. Download `CMMD_clinicaldata_revision.xlsx` from the Data Access table for label information.
  5. Fill in the `source_path` in `CmmdPipeline()` in `config/runner_config.py` with the location of the `manifest-{xxxxxxxxxxxxx}/CMMD` folder.
  6. Fill in the `labels_path` in `CmmdPipeline()` in `config/runner_config.py` with the location of the `CMMD_clinicaldata_revision.xlsx` file.

To preprocess a dataset that is not among the above, look through the preprocessing folder. It contains reusable steps for changing imaging formats, extracting masks, creating file trees, etc. Check the config file to see which mask and label encodings are available, and append new label and mask encodings if needed.
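As a rough illustration of that last step, appending a new label encoding could look like the sketch below. The dict layout, names, and codes are assumptions for illustration; the real encodings live in the project's config files:

```python
# Illustrative sketch only: the real label/mask encodings live in the
# project's config files, and the names and codes below are placeholders.

# Existing label encodings (placeholder codes, not real RadLex IDs).
label_encodings = {
    "normal": "000",
    "neoplasm": "001",
}

# Append a new label for a dataset not covered yet, instead of
# redefining the shared table (which could mis-align merged datasets).
new_labels = {"pneumonia": "002"}
for name, code in new_labels.items():
    # Guard against silently overwriting an existing shared encoding.
    assert name not in label_encodings, f"label {name!r} already defined"
    label_encodings[name] = code
```

Appending rather than rewriting keeps previously processed datasets consistent with the shared annotation set.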

Overall, the dataset should contain 882,774 images in `.png` format:

  • CT - 500k+
  • X-Ray - 250k+
  • MRI - 100k+

🎯 Roadmap

  • [x] dcm
  • [x] jpg
  • [x] nii
  • [x] tif
  • [x] Shared radlex ontology
  • [ ] Huggingface datasets
  • [ ] Data dashboards

:wave: Contributors

:handshake: Contact

Barbara Klaudel

TheLion.AI

Development

Pre-commits

Install pre-commit: https://pre-commit.com/#installation

If you are using VS Code, install the extension: https://marketplace.visualstudio.com/items?itemName=MarkLarah.pre-commit-vscode

To do a dry run of the pre-commit hooks and see whether your code passes, run:

```bash
pre-commit run --all-files
```

Adding python packages

Dependencies are handled by the poetry framework. To add a new dependency, run:

```bash
poetry add <package_name>
```

Debugging

To modify and debug the app, development in containers can be useful.

Testing

```bash
bash run_tests.sh
```

Owner

  • Name: TheLion-ai
  • Login: TheLion-ai
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Klaudel"
  given-names: "Barbara"
  orcid: "https://orcid.org/0000-0003-2422-3666"
- family-names: "Obuchowski"
  given-names: "Aleksander"
  orcid: "https://orcid.org/0000-0003-0617-9059"
- family-names: "Frąckowski"
  given-names: "Piotr"
- family-names: "Komor"
  given-names: "Andrzej"
- family-names: "Bober"
  given-names: "Kacper"
- family-names: "Badyra"
  given-names: "Wasyl"
title: "Towards Medical Foundational Model -- a Unified Dataset for Pretraining Medical Imaging Models"
version: 0.0.0
date-released: 2024-06-14
url: "https://github.com/TheLion-ai/UMIE_datasets"

GitHub Events

Total
  • Issues event: 23
  • Watch event: 26
  • Delete event: 11
  • Issue comment event: 1
  • Push event: 63
  • Pull request review comment event: 21
  • Pull request event: 23
  • Pull request review event: 25
  • Fork event: 5
  • Create event: 13
Last Year
  • Issues event: 23
  • Watch event: 26
  • Delete event: 11
  • Issue comment event: 1
  • Push event: 63
  • Pull request review comment event: 21
  • Pull request event: 23
  • Pull request review event: 25
  • Fork event: 5
  • Create event: 13

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 12
  • Total pull requests: 10
  • Average time to close issues: 3 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 3
  • Total pull request authors: 6
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 12
  • Pull requests: 10
  • Average time to close issues: 3 months
  • Average time to close pull requests: 11 days
  • Issue authors: 3
  • Pull request authors: 6
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • A-Huli (30)
  • wbadyra (1)
  • bkuchnowski (1)
  • coduri (1)
  • andrekomor (1)
Pull Request Authors
  • A-Huli (4)
  • Piotr1219 (3)
  • AleksanderObuchowski (3)
  • staszewski (3)
  • KacperKnitter (2)
  • Dumbldore (2)
  • KacperRogala (2)
  • andrekomor (2)
  • mariusz-kosakowski (1)
  • bkuchnowski (1)
  • coduri (1)
  • DmitriyValetov (1)
  • mickamin (1)
Top Labels
Issue Labels
good first issue (7) enhancement (6) bug (3) documentation (1)
Pull Request Labels
bug (1)

Dependencies

.github/workflows/ci2.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pre-commit/action v3.0.0 composite
.github/workflows/python-app.yml actions
  • Gr1N/setup-poetry v8 composite
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
poetry.lock pypi
  • argcomplete 3.1.6
  • black 23.12.0
  • cfgv 3.4.0
  • charset-normalizer 3.3.2
  • click 8.1.7
  • colorama 0.4.6
  • commitizen 3.13.0
  • coverage 7.3.3
  • decli 0.6.1
  • defusedxml 0.7.1
  • distlib 0.3.8
  • exceptiongroup 1.2.0
  • filelock 3.13.1
  • identify 2.5.33
  • importlib-metadata 6.11.0
  • iniconfig 2.0.0
  • jinja2 3.1.2
  • joblib 1.3.2
  • markupsafe 2.1.3
  • mypy-extensions 1.0.0
  • nibabel 5.2.0
  • nodeenv 1.8.0
  • numpy 1.26.2
  • opencv-python 4.8.1.78
  • packaging 23.2
  • pandas 2.1.4
  • pathspec 0.12.1
  • pillow 10.1.0
  • platformdirs 4.1.0
  • pluggy 1.3.0
  • pre-commit 3.6.0
  • prompt-toolkit 3.0.36
  • pyaml 23.9.7
  • pydicom 2.4.4
  • pytest 7.4.3
  • pytest-cov 4.1.0
  • python-dateutil 2.8.2
  • pytz 2023.3.post1
  • pyyaml 6.0.1
  • questionary 2.0.1
  • ruff 0.0.282
  • scikit-learn 1.3.2
  • scipy 1.11.4
  • setuptools 69.0.2
  • six 1.16.0
  • termcolor 2.4.0
  • threadpoolctl 3.2.0
  • tomli 2.0.1
  • tomlkit 0.12.3
  • tqdm 4.66.1
  • types-pyyaml 6.0.12.12
  • typing-extensions 4.9.0
  • tzdata 2023.3
  • untangle 1.2.1
  • virtualenv 20.25.0
  • wcwidth 0.2.12
  • zipp 3.17.0
pyproject.toml pypi
  • black ^23.7.0 develop
  • commitizen ^3.6.0 develop
  • coverage ^7.3.2 develop
  • pre-commit ^3.5.0 develop
  • pytest ^7.4.0 develop
  • pytest-cov ^4.1.0 develop
  • ruff ^0.0.282 develop
  • types-pyyaml ^6.0.12.12 develop
  • nibabel ^5.2.0
  • opencv-python ^4.8.0.74
  • pandas ^2.1.3
  • pillow ^10.0.0
  • pyaml ^23.9.7
  • pydicom ^2.4.3
  • python ^3.9
  • pyyaml ^6.0.1
  • scikit-learn ^1.3.2
  • tqdm ^4.66.1
  • types-pyyaml ^6.0.12.12
  • untangle ^1.2.1