wholeslidedata

A package for working with whole-slide data including a fast batch iterator that can be used to train deep learning models.

https://github.com/diagnijmegen/pathology-whole-slide-data

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.9%) to scientific vocabulary

Keywords

computational-pathology deep-learning histopathology iterator pathology python whole-slide-annotation whole-slide-data whole-slide-dataset whole-slide-image whole-slide-imaging whole-slide-mask wsa wsd wsi wsm

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: DIAGNijmegen
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://diagnijmegen.github.io/pathology-whole-slide-data/
Size: 93.7 MB

Statistics

Stars: 100
Watchers: 4
Forks: 30
Open Issues: 6
Releases: 1

Topics

Created over 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Citation

README.md

WholeSlideData

This repository contains software at a major version zero. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

Please check out the Documentation and the CHANGELOG.

Overview

Introduction
Installation
Main Features

Introduction

WholeSlideData aims to provide the tools to work with whole slide images and annotations from different vendors and annotation software. The main contribution is a batch iterator that enables users to sample patches from the data efficiently, fast, and easily.

Efficient

WholeSlideData keeps the annotations in a JSON format internally and uses the Shapely library for essential computations on basic geometries. With this design, keeping all annotations in memory requires less memory than converting them to masks. Furthermore, this package generates patches and labels on the fly, which removes the need to save them to disk.
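
To make the memory argument concrete, here is a back-of-the-envelope comparison. The slide size, the number of polygons, and the vertex counts are illustrative assumptions, and the estimate counts only raw coordinate data (it ignores Python and Shapely object overhead), so treat it as a rough sketch rather than a measurement.

```python
# Rough memory comparison: polygon coordinates vs. a dense label mask.
# All sizes below are illustrative assumptions, not measurements.

def polygon_bytes(num_vertices: int, bytes_per_coord: int = 8) -> int:
    """Bytes needed for the raw (x, y) coordinates of one polygon."""
    return num_vertices * 2 * bytes_per_coord


def mask_bytes(width: int, height: int, bytes_per_pixel: int = 1) -> int:
    """Bytes needed for a dense label mask covering the whole slide."""
    return width * height * bytes_per_pixel


# Hypothetical slide of 100,000 x 100,000 pixels with 500 polygons of 200 vertices.
annotation_total = 500 * polygon_bytes(200)
mask_total = mask_bytes(100_000, 100_000)

print(f"polygons: {annotation_total / 1e6:.1f} MB")  # 1.6 MB
print(f"mask:     {mask_total / 1e9:.1f} GB")        # 10.0 GB
```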

Fast

Extracting patches from whole-slide images on the fly is slow compared to loading patches that were previously saved to disk as PNGs or NumPy arrays. However, saving to disk has disadvantages, because it produces a static dataset: for example, you cannot easily switch to other patch shapes or magnifications. To overcome the relatively slow serial extraction of patches from whole-slide images, WholeSlideData takes advantage of Concurrent Buffer, which uses shared memory and loads patches quickly via multiple workers. Using many extra CPUs will increase the RAM needed. Nevertheless, we think that this package offers a good trade-off: sampling is fast and efficient, and switching to different settings is easy.
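
The idea of spreading patch extraction over several workers can be sketched with the Python standard library. This is not how Concurrent Buffer is implemented: it uses a plain thread pool instead of shared memory, and `extract_patch` and `load_batch` are hypothetical names that only stand in for a slow read from a whole-slide image.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def extract_patch(seed: int, size: int = 4) -> list:
    """Stand-in for a slow patch read; returns a size x size grid of pixel values."""
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(size)] for _ in range(size)]


def load_batch(seeds: list, workers: int = 4) -> list:
    """Extract one patch per seed with several workers; result order follows `seeds`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_patch, seeds))


batch = load_batch(list(range(8)))
print(len(batch), len(batch[0]), len(batch[0][0]))  # 8 4 4
```

More workers shorten the wait per batch only while the reads themselves are the bottleneck, and each worker holds its own patches in memory, which matches the RAM trade-off described above.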

Ease

WholeSlideData uses a configuration system called Dicfg that allows users to configure the batch iterator in a single config file. Using multiple configuration files is also possible, which helps to keep the configuration for your project clean and well structured. Dicfg has some parallels with Hydra, so if you are familiar with Hydra, it should be straightforward to write your configuration file for the batch iterator. Dicfg lets you configure any setting, build instances, and insert these instances as dependencies for other functions or classes directly in the config file. Furthermore, users can build on top of base classes and only need to change the configuration file to use custom code, without changing any part of the batch iterator.
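
The general pattern of building instances from a config and passing them on as dependencies can be illustrated with a toy builder. This is not Dicfg syntax or its implementation: the `"target"` key and the `build` function are hypothetical conventions chosen for this sketch, and `types.SimpleNamespace` merely stands in for real sampler and iterator classes.

```python
import importlib


def build(node):
    """Recursively build objects from a nested config.

    A dict with a "target" key (a hypothetical convention, not Dicfg syntax)
    is replaced by the result of calling that dotted path, with the remaining
    (already built) keys passed as keyword arguments.
    """
    if isinstance(node, dict):
        built = {key: build(value) for key, value in node.items() if key != "target"}
        if "target" in node:
            module_name, _, attr = node["target"].rpartition(".")
            factory = getattr(importlib.import_module(module_name), attr)
            return factory(**built)
        return built
    if isinstance(node, list):
        return [build(item) for item in node]
    return node


config = {
    "target": "types.SimpleNamespace",  # stand-in for an iterator class
    "batch_size": 8,
    "sampler": {
        "target": "types.SimpleNamespace",  # stand-in for a sampler class
        "name": "random",
        "seed": 123,
    },
}

iterator = build(config)
print(iterator.batch_size, iterator.sampler.name)  # 8 random
```

Swapping the sampler for custom code then only means changing the dotted path in the config, which is the property the paragraph above describes.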

Installation

```bash
pip install git+https://github.com/DIAGNijmegen/pathology-whole-slide-data@main
```

Wholeslidedata supports various image backends. You will have to install at least one of these image backends.

Openslide is currently the default image backend, but you can easily switch between different image backends in the config file.

For example:

```yaml
wholeslidedata:
    default:
        image_backend: asap
```

Main Features

Whole-slide images

Opening a Whole Slide image.

```python
from wholeslidedata.image.wholeslideimage import WholeSlideImage

wsi = WholeSlideImage('path_to_image.tif')
patch = wsi.get_patch(x, y, width, height, spacing)
```

Whole-slide annotations

Currently, wholeslidedata supports annotations from the following annotation software: ASAP, QuPath, Virtum, and Histomicstk.

```python
from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation

wsa = WholeSlideAnnotation('path_to_annotation.xml')
annotations = wsa.select_annotations(x, y, width, height)
```
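
Selecting annotations for a region boils down to a spatial query. The sketch below shows the simplest form of such a query, a bounding-box overlap test in plain Python. It is not the library's implementation, and `Box`, `bounds`, and `overlaps` are hypothetical helpers. The region is given here as explicit corner coordinates, which need not match how `select_annotations` interprets `x` and `y`.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def bounds(points) -> Box:
    """Axis-aligned bounding box of a list of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


# Hypothetical annotations: label -> polygon vertices.
annotations = {
    "tumor": [(0, 0), (100, 0), (100, 100), (0, 100)],
    "stroma": [(500, 500), (600, 500), (550, 600)],
}
region = Box(50, 50, 250, 250)

selected = [label for label, points in annotations.items()
            if overlaps(bounds(points), region)]
print(selected)  # ['tumor']
```

A bounding-box test is only a coarse filter. An exact answer needs a geometric intersection test on the polygons themselves, which is the kind of computation a geometry library such as Shapely provides.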

Batch iterator

The batch iterator needs to be configured via a user config file. In the user config file, custom and built-in sampling strategies can be configured, such as random, balanced, area-based, and more. Additionally, custom and built-in sample and batch callbacks can be composed, such as fit_shape, one-hot-encoding, albumentations, and more. For a complete overview, please check out the main config file and all the sub config files.
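
To illustrate what a balanced sampling strategy means, the following sketch cycles through the labels so that each class is drawn equally often, regardless of how many annotations it has. It is a minimal illustration, not the library's sampler: `balanced_picks` and the toy annotation list are assumptions made for this example.

```python
import itertools
import random
from collections import Counter


def balanced_picks(samples, n, rng):
    """Cycle through the labels, then pick a random annotation within the label."""
    by_label = {}
    for label, annotation_id in samples:
        by_label.setdefault(label, []).append(annotation_id)
    label_cycle = itertools.cycle(sorted(by_label))
    picks = []
    for _ in range(n):
        label = next(label_cycle)
        picks.append((label, rng.choice(by_label[label])))
    return picks


# Hypothetical, heavily imbalanced annotations: (label, annotation id).
samples = ([("stroma", i) for i in range(90)]
           + [("tumor", i) for i in range(9)]
           + [("lymphocytes", 0)])

picks = balanced_picks(samples, n=9, rng=random.Random(0))
print(Counter(label for label, _ in picks))
```

With purely random sampling over the same list, about 90% of the picks would be stroma. The balanced variant yields three picks per class here.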

Example of a basic user config file (user_config.yml)

```yaml
wholeslidedata:
    default:
        yaml_source:
            training:
                - wsa:
                    path: /tmp/TCGA-21-5784-01Z-00-DX1.xml
                  wsi:
                    path: /tmp/TCGA-21-5784-01Z-00-DX1.tif

        labels:
            stroma: 1
            tumor: 2
            lymphocytes: 3

        batch_shape:
            shape: [256, 256, 3]
            batch_size: 8
            spacing: 0.5
```
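
As a quick sanity check on the `batch_shape` values above, the arithmetic below works out what one batch amounts to. It assumes that `spacing` is given in micrometers per pixel, that patches are stacked along a leading batch axis, and that pixels are stored as 8-bit values. None of these assumptions is stated in the config itself.

```python
shape = (256, 256, 3)  # patch shape from the config
batch_size = 8
spacing = 0.5          # assumed to be micrometers per pixel

# Patches stacked along a leading batch axis (assumption).
batch_array_shape = (batch_size, *shape)

# Physical extent covered by one patch along each spatial axis.
extent_um = (shape[0] * spacing, shape[1] * spacing)

# Size of one image batch, assuming 1 byte per channel value.
batch_bytes = batch_size * shape[0] * shape[1] * shape[2]

print(batch_array_shape)  # (8, 256, 256, 3)
print(extent_um)          # (128.0, 128.0)
print(batch_bytes)        # 1572864
```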

Creating a batch iterator

```python
from wholeslidedata.iterators import create_batch_iterator

with create_batch_iterator(mode='training',
                           user_config='user_config.yml',
                           number_of_batches=10,
                           cpus=4) as training_iterator:

    for x_batch, y_batch, batch_info in training_iterator:
        pass
```

Acknowledgments

Created in the #EXAMODE project

Owner

Name: Diagnostic Image Analysis Group
Login: DIAGNijmegen
Kind: organization
Location: Radboud University Medical Center, Nijmegen, The Netherlands

Website: www.diagnijmegen.nl
Repositories: 41
Profile: https://github.com/DIAGNijmegen

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: van Rijthoven
    given-names: Mart
    orcid: https://orcid.org/0000-0003-3758-4348
title: WholeSlideData
doi: 10.5281/zenodo.7558991
date-released: 2023-01-22

GitHub Events

Total

Issues event: 3
Watch event: 9
Member event: 3
Issue comment event: 16
Push event: 7
Pull request event: 9
Fork event: 4

Last Year

Issues event: 3
Watch event: 9
Member event: 3
Issue comment event: 16
Push event: 7
Pull request event: 9
Fork event: 4

Committers

Last synced: almost 3 years ago

All Time

Total Commits: 496
Total Committers: 8
Avg Commits per committer: 62.0
Development Distribution Score (DDS): 0.069

Top Committers

Name	Email	Commits
martvanrijthoven	m**n@g**m	462
thijsgelton	t**n@h**m	15
rj678	u**r@g**m	8
kkmz253	r**i@a**m	4
Witali Aswolinskiy	w**y@w**e	3
Jakub Kaczmarzyk	j**k@g**m	2
Leander van Eekelen	4**n@u**m	1
robinlomans	r**r@r**l	1

Committer Domains (Top 20 + Academic)

radboudumc.nl: 1
astrazeneca.com: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 35
Total pull requests: 28
Average time to close issues: about 1 month
Average time to close pull requests: about 17 hours
Total issue authors: 23
Total pull request authors: 12
Average comments per issue: 2.71
Average comments per pull request: 1.43
Merged pull requests: 26
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 10
Pull requests: 13
Average time to close issues: 1 day
Average time to close pull requests: about 22 hours
Issue authors: 9
Pull request authors: 6
Average comments per issue: 2.6
Average comments per pull request: 1.38
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

martvanrijthoven (5)
michelbotros (5)
leandervaneekelen (3)
robinlomans (2)
Vishwesh4 (2)
JoeySpronck (2)
mdeleeuw1 (1)
rj678 (1)
BestVIncent (1)
kaczmarj (1)
CyrilRJK (1)
siemdejong (1)
polejowska (1)
sumeetgadagkar (1)
ZhangRan24 (1)

Pull Request Authors

JoeySpronck (12)
thijsgelton (5)
leandervaneekelen (4)
rolandnemeth000 (4)
martvanrijthoven (3)
siemdejong (3)
carlijnlems (2)
kaczmarj (2)
tsikup (2)
rj678 (1)
daangeijs (1)
robinlomans (1)
Mat-Po (1)

Top Labels

Issue Labels

question (10) enhancement (6) bug (2) documentation (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 432 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 16
Total maintainers: 1

pypi.org: wholeslidedata

Homepage: https://github.com/DIAGNijmegen/pathology-whole-slide-data
Documentation: https://wholeslidedata.readthedocs.io/
License: LICENSE.txt
Latest release: 0.0.16
published about 3 years ago

Versions: 16
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 432 Last month

Rankings

Forks count: 8.6%

Stargazers count: 9.0%

Dependent packages count: 10.1%

Average: 13.8%

Downloads: 19.5%

Dependent repos count: 21.6%

Maintainers (1)

martvanrijthoven

Last synced: 7 months ago

Dependencies

docs/requirements.txt pypi

PyYAML >=5.4.1
concurrentbuffer >=0.0.3
creationism >=0.0.3
ipython *
jupyter *
jupytext *
myst-parser *
nbsphinx *
numpy >=1.18.1
opencv-python-headless >=4.4.0
openslide-python >=1.1.1
pydata_sphinx_theme *
scikit-image >=0.17.2
scipy >=1.5.2
shapely >=1.7.1
sphinx *
sphinx-autodoc-typehints *
sphinxcontrib-napoleon *

requirements.txt pypi

PyYAML >=5.4.1
albumentations >=1.1.0
concurrentbuffer >=0.0.7
creationism >=0.0.3
jsonschema >=4.4.0
matplotlib >=3.3.1
numpy >=1.20.2
opencv-python-headless >=4.4.0
openslide-python >=1.1.1
rtree >=1.0.0
scikit-image >=0.17.2
scipy >=1.5.2
shapely ==1.7.1

setup.py pypi

PyYAML >=5.4.1
concurrentbuffer >=0.0.7
creationism >=0.0.5
jsonschema >=4.4.0
numpy >=1.20.2
opencv-python-headless >=4.4.0
openslide-python >=1.1.1
rtree ==1.0.0
scikit-image >=0.17.2
scipy >=1.5.2
shapely >=1.7.1

.github/workflows/docs.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
peaceiris/actions-gh-pages v3 composite

.github/workflows/tests.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

Dockerfile docker

nvidia/cuda 11.1.1-cudnn8-runtime-ubuntu20.04 build

tests/requirements.txt pypi

PyYAML >=6.0 test
albumentations ==1.2.1 test
boto3 >=1.26.45 test
click >=8.1.3 test
concurrentbuffer >=0.0.8 test
dicfg >=0.0.7 test
gdown >=4.6.0 test
lxml >=4.6.3 test
matplotlib >=3.6.2 test
numpy >=1.23.4 test
opencv-python-headless >=4.6.0.66 test
openslide-python >=1.2.0 test
rtree >=1.0.0 test
scikit-image >=0.19.3 test
shapely >=1.7.1 test
sourcelib >=0.0.3 test
tiffslide >=1.10.1 test