ai4bmr-datasets

https://github.com/ai4scr/ai4bmr-datasets

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.6%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: AI4SCR
License: mit
Language: Python
Default Branch: main
Size: 7.2 MB

Statistics

Stars: 2
Watchers: 4
Forks: 2
Open Issues: 3
Releases: 0

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme License Codemeta

SpatialProteomicsNet

SpatialProteomicsNet is an open-source Python package providing a unified interface to access, prepare, and load spatial proteomics datasets (e.g., Imaging Mass Cytometry and MIBI) for machine learning workflows.

For a detailed description, motivation, and full list of supported technologies and use cases, please refer to paper.md.

📦 Supported Datasets

The package supports the following public spatial proteomics datasets:

Keren et al. 2018 – IMC of triple-negative breast cancer [@Keren2018]
Jackson et al. 2020 – IMC of breast cancer [@jacksonSinglecellPathologyLandscape2020]
Danenberg et al. 2022 – IMC of breast cancer [@danenbergBreastTumorMicroenvironment2022]
Cords et al. 2024 – IMC of NSCLC [@cordsCancerassociatedFibroblastPhenotypes2024]

| | name | images | masks | markers | annotated cells | clinical samples | |-:|:--------------|-------:|------:|--------:|----------------:|-----------------:| | | Danenberg2022 | 794 | 794 | 39 | 1123466 | 794 | | | Cords2024 | 2070 | 2070 | 43 | 5984454 | 2072 | | | Jackson2020 | 735 | 735 | 35 | 1224411 | 735 | | | Keren2018 | 41 | 41 | 36 | 201656 | 41 |

Table 1: Summary statistics of supported spatial proteomics datasets in the package.

Dummy datasets are also included for development and testing purposes.

🧱 Data Model

All datasets are represented in a consistent format, enabling the same code to work across different datasets. After preparing and setting up the dataset instance, the following attributes are available:

images – Dictionary of image objects
panel – Imaging panel description (DataFrame), including a key target
masks – Dictionary of mask objects
metadata – Observation-level metadata (pandas DataFrame) including a key label
clinical – Clinical / Sample-level metadata (pandas DataFrame)
intensity – Object / Single-cell intensity features (DataFrame)
spatial – Object / Single-cell spatial features (DataFrame)

These components are enforced and standardized across datasets.

Data Versions

A dataset can have different versions of images, masks, features, and metadata, allowing for flexibility in data access.
To load the data as published, use {image,mask,metadata,feature}_version="published". Be aware that for most datasets, this means the loaded data is not harmonized. You may encounter inconsistencies such as:

objects present in the masks that are not listed in the metadata,
images without matching annotations,
or cells in the metadata with no corresponding segmentation.

To work with a cleaned and more consistent dataset, some datasets offer an annotated version. This version:

has cleaned masks to include only objects found in both the segmentation and the metadata,
only contains images that have corresponding metadata.

Efficient Data Loading: Images and masks are lazily loaded from disk to avoid unnecessary memory use. Tabular data are stored in Parquet format for fast access.

⚙️ Installation

Prerequisites

Python >= 3.10
R >= 4.4.1

To install R, please visit the R project website and follow the installation instructions for your operating system. Alternatively, you can install R using conda.

Environment Setup

We recommend using a virtual environment to install the required dependencies.

Using `venv`

```bash

Create a virtual environment

python3 -m venv .venv source .venv/bin/activate

Install the package

pip install git+https://github.com/AI4SCR/ai4bmr-datasets.git ```

Using `conda`

```bash

Create a conda environment

conda create -n ai4bmr-datasets python=3.12 -y conda activate ai4bmr-datasets

Install the package

pip install git+https://github.com/AI4SCR/ai4bmr-datasets.git

conda install conda-forge::r-base # optional, if R is not already installed

```

🚀 Quickstart

This section provides a brief guide to get started with the package.

Note on Memory Requirements

Please be aware that storing the datasets can consume a significant amount of memory. The approximate requirements are as follows: - Keren2018: ~10GB - Jackson2020: ~180GB - Danenberg2022: ~66GB - Cords2024: ~785GB

1. Initialize and Prepare a Dataset

First, import the desired dataset and initialize it. You can specify a base_dir to control where the data is stored. If not provided, it defaults to ~/.cache/ai4bmr-datasets or the path set in the AI4BMR_DATASETS_DIR environment variable. We recommend configuring the environment variables in a .env file for convenience.

```python from ai4bmr_datasets import Keren2018 from pathlib import Path

Define the base directory for storing the dataset

basedir = None # use defaults basedir = Path('path/to/data/') # define your own path

Initialize the dataset

dataset = Keren2018(basedir=basedir, imageversion="published", maskversion="published", featureversion="published", loadintensity=True, metadataversion="published", loadmetadata=True )

Download and preprocess the data. This only needs to be run once.

dataset.prepare_data() ```

2. Setup and Access Data

After preparing the data, use the setup() method to load the data. By setting load_intensity and load_metadata to True, you can access cell-level features and clinical data. You can list the available data versions using list_{image, mask, metadata, features}_versions().

```python

List available data versions

dataset.listimageversions() dataset.listmaskversions() dataset.listmetadataversions() dataset.listfeaturesversions()

Setup the dataset with a specific version and load features

dataset.setup()

Access core components of the dataset

print("Sample IDs:", dataset.sample_ids) print("Available images:", list(dataset.images.keys())) print("Available masks:", list(dataset.masks.keys())) ```

3. Work with Image and Mask Objects

Image and mask objects are loaded lazily. The data is only loaded into memory when you access the .data attribute.

```python import matplotlib.pyplot as plt

Get the first sample ID

sampleid = dataset.sampleids[0]

Access the image and mask data

img = dataset.images[sampleid].data mask = dataset.masks[sampleid].data

print(f"Image shape for sample {sampleid}: {img.shape}") print(f"Mask shape for sample {sampleid}: {mask.shape}")

panel = dataset.panel idx = panel.target == 'ds_dna'

Visualize the image and mask for the DNA channel

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5)) axs[0].imshow(img[idx].squeeze()) axs[1].imshow(mask > 0) for ax, title in zip(axs, ['channel', 'masks']): ax.settitle(title) ax.setaxisoff() ax.figure.tightlayout() ax.figure.show()

```

4. Access Cell-Level Features and Metadata

Once loaded, you can directly access the dataframes for intensity, metadata, and clinical data.

```python

Access the intensity matrix (cells x markers)

print("Intensity matrix shape:", dataset.intensity.shape)

Access the metadata (cells x annotations)

print("Metadata shape:", dataset.metadata.shape)

Access the clinical data (patients x annotations)

print("Clinical data shape:", dataset.clinical.shape) ```

🤝 Contributing

We welcome contributions to improve the project, including:

Adding new datasets
Fixing bugs
Improving performance or usability

📌 Contribution Guidelines

If you'd like to contribute:

Fork the repository and create a feature branch.
Add or update relevant tests if applicable.
Submit a pull request with a clear description of your changes.

Please open a discussion or issue before implementing large features to ensure alignment with project goals.

🐛 Reporting Issues

If you encounter bugs, incorrect behavior, or missing functionality:

Open an issue with a clear title and minimal reproducible example.
Include relevant error messages and version information if possible.

We monitor issues regularly and appreciate detailed, constructive reports.

💬 Seeking Support

If you need help using the software:

Check the existing issues and discussions first.
For general questions or usage advice, feel free to open a discussion topic.
For dataset-specific problems, include dataset identifiers and loading code if applicable.

We aim to foster a welcoming and respectful community. Please be kind and collaborative in your interactions.

📬 Contact

For questions, issues, or contributions, feel free to contact:

Adriano Martinelli
📧 adriano.martinelli@chuv.ch
📧 adrianom@student.ethz.ch

Owner

Name: AI for Single Cell Research
Login: AI4SCR
Kind: organization
Email: art@zurich.ibm.com
Location: Switzerland

Repositories: 8
Profile: https://github.com/AI4SCR

JOSS Publication

SpatialProteomicsNet: A unified interface for spatial proteomics data access for computer vision and machine learning

Published

March 06, 2026

DOI

10.21105/joss.08780

Volume 11, Issue 119, Page 8780

Authors

Adriano Martinelli

University Hospital Lausanne (CHUV), Lausanne, Switzerland, ETH Zurich, Zurich, Switzerland, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

Marianna Rapsomaniki

University Hospital Lausanne (CHUV), Lausanne, Switzerland, University of Lausanne (UNIL), Lausanne, Switzerland, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

Editor

Ujjwal Karn

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    {
      "@id": "https://orcid.org/0000-0002-9288-5103",
      "@type": "Person",
      "email": "adriano.martinelli@chuv.ch",
      "name": "Adriano Martinelli",
      "affiliation": "ETH Zurich"
    }
  ],
  "identifier": "TODO",
  "codeRepository": "https://github.com/AI4SCR/ai4bmr-datasets",
  "datePublished": "2025-06-06",
  "dateModified": "2025-06-06",
  "dateCreated": "2025-06-06",
  "description": "a lightweight Python package that provides standardized access to curated spatial proteomics datasets (IMC, MIBI) for machine learning and computer vision",
  "keywords": "spatial omics,imaging mass cytometry,spatial proteomics",
  "license": "MIT",
  "title": "SpatialOmicsNet",
  "version": "v1.0.0"
}

GitHub Events

Total

Issues event: 4
Watch event: 3
Delete event: 8
Issue comment event: 4
Push event: 114
Public event: 1
Pull request review event: 3
Pull request event: 19
Fork event: 2
Create event: 13

Last Year

Issues event: 4
Watch event: 3
Delete event: 8
Issue comment event: 4
Push event: 114
Public event: 1
Pull request review event: 3
Pull request event: 19
Fork event: 2
Create event: 13

Dependencies

pyproject.toml pypi

fastparquet *
matplotlib *
numpy *
openpyxl *
pandas *
pyarrow *
pydantic *
pydantic-settings *
python-dotenv *
scikit-image *
tifffile *
torch *
tqdm *

ai4bmr-datasets

Science Score: 26.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

SpatialProteomicsNet

📦 Supported Datasets

🧱 Data Model

Data Versions

⚙️ Installation

Prerequisites

Environment Setup

Using venv

Create a virtual environment

Install the package

Using conda

Create a conda environment

Install the package

conda install conda-forge::r-base # optional, if R is not already installed

🚀 Quickstart

1. Initialize and Prepare a Dataset

Define the base directory for storing the dataset

Initialize the dataset

Download and preprocess the data. This only needs to be run once.

2. Setup and Access Data

List available data versions

Setup the dataset with a specific version and load features

Access core components of the dataset

3. Work with Image and Mask Objects

Get the first sample ID

Access the image and mask data

Visualize the image and mask for the DNA channel

4. Access Cell-Level Features and Metadata

Access the intensity matrix (cells x markers)

Access the metadata (cells x annotations)

Access the clinical data (patients x annotations)

🤝 Contributing

📌 Contribution Guidelines

🐛 Reporting Issues

💬 Seeking Support

📬 Contact

Owner

JOSS Publication

SpatialProteomicsNet: A unified interface for spatial proteomics data access for computer vision and machine learning

Authors

Editor

Tags

CodeMeta (codemeta.json)

GitHub Events

Total

Last Year

Dependencies

Using `venv`

Using `conda`