mu-big-data-and-data-science-image-pathology-pipeline

Master's Thesis in Data Science and Big Data

https://github.com/sanchezis/mu-big-data-and-data-science-image-pathology-pipeline


Repository

Basic Info

Host: GitHub
Owner: sanchezis
License: other
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 18.5 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed about 1 year ago

MSc - Thesis in Data Science and Big Data

Generation of a Distributed Pipeline for Feature Extraction in Pathological Medical Images

Israel Llorens

General Information
Technologies
Setup
Execution

General Information

This project implements a pipeline in which each step performs an information extraction or data mining operation on data derived from pathological images.

Several libraries are used along the way, and their capabilities are compared under conditions resembling a real medical environment.

Technologies

Prerequisites

The following technologies must be installed, or otherwise accessible, either locally or on a distributed system.

Spark cluster version ^3.4.4
Python 3.11
Java JDK version 11.0.24
Scala version 2.13
Libraries: pyspark (3.4.4), pylint (3.1.0), numpy (^1.20.2), pandas (^2.0.0)
OpenSlide
TiaToolbox, HistomicsTK, StarDist
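
The Python side of these prerequisites can be checked with a small script. The sketch below is illustrative and not part of the repository; the function names `installed_versions` and `python_matches` are hypothetical. It only reports what is installed and does not enforce the version constraints listed above.

```python
import sys
from importlib import metadata

# Package names taken from the prerequisites list; the check itself is an assumption.
REQUIRED_PACKAGES = ("pyspark", "pylint", "numpy", "pandas")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_matches(required=(3, 11), current=None):
    """Return True when the interpreter's major.minor version equals `required`."""
    current = sys.version_info if current is None else current
    return tuple(current[:2]) == tuple(required)


if __name__ == "__main__":
    print("Python 3.11:", python_matches())
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the project's virtual environment gives a quick view of which libraries are present before attempting a full pipeline run.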

Setup

Install Dependencies

Spark can be installed locally either through the project's code dependencies or by running the script below on the machine that will host the environment.

⚠️ This Spark environment must be configured to use AWS! You must download the JAR files required for the connection. For Spark version 3.4.4, these are:

aws-java-sdk-bundle-1.12.262.jar
hadoop-aws-3.3.4.jar

Both must be saved in the Spark JAR directory ($SPARK_HOME/jars).
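
The README does not say where to obtain these JARs. Assuming they are fetched from Maven Central (an assumption, not something the repository states), the following sketch builds the download URL and destination path for each file. The helper names `jar_url` and `jar_targets` are hypothetical, and `/opt/spark` matches the install location used in the script below.

```python
from pathlib import Path

# Maven Central base URL; using it as the download source is an assumption.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# (group id, artifact id, version) for the two JARs named above.
AWS_JARS = [
    ("com.amazonaws", "aws-java-sdk-bundle", "1.12.262"),
    ("org.apache.hadoop", "hadoop-aws", "3.3.4"),
]


def jar_url(group, artifact, version):
    """Build the Maven Central URL for a JAR from its coordinates."""
    group_path = group.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


def jar_targets(spark_home="/opt/spark"):
    """Return (url, destination) pairs, one per JAR, under <spark_home>/jars."""
    jars_dir = Path(spark_home) / "jars"
    return [
        (jar_url(group, artifact, version), jars_dir / f"{artifact}-{version}.jar")
        for group, artifact, version in AWS_JARS
    ]


if __name__ == "__main__":
    for url, destination in jar_targets():
        print(f"{url} -> {destination}")
```

The sketch only prints the pairs; each URL can then be downloaded to its destination with `wget` or any other tool.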

```bash
# Download Spark 3.4.4
wget https://archive.apache.org/dist/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz

# Extract the archive and move it to /opt/spark
tar -xvzf spark-3.4.4-bin-hadoop3.tgz
sudo mv spark-3.4.4-bin-hadoop3 /opt/spark

# Configure environment variables
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc
```

You can install the necessary Python requirements using poetry.

```bash
pip install poetry
poetry install
```

These commands will install all dependencies used in this project.

Code Preparation

All of the following commands must run successfully.

The project includes a framework called digital_pathology with an associated directory of unit and integration tests. Running them verifies that the functionality used in each feature extraction stage keeps working correctly.

Unit Testing

```bash
poetry run pytest tests/unit
```

Integration Testing

```bash
poetry run pytest tests/integration
```

Code Style and Best Practices Checks

```bash
poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs digital_pathology tests

poetry run pylint data_transformations tests
```

Execution

The project has the following directory structure.

```bash
/
├─ /digital_pathology   # Contains the main Python library
│                       # with the code for the transformations
│
├─ /jobs                # Contains the entry points to the jobs,
│                       # performs argument parsing, and is
│                       # passed to `spark-submit`
│
├─ /notebooks           # Contains the notebooks for Databricks
│                       # implementation and pipeline execution in
│                       # cluster/cloud environment
│
├─ /tests
│  ├─ /units            # contains basic unit tests for the code
│  └─ /integration      # contains integration tests for the jobs
│                       # and the setup
│
├─ .gitignore
├─ .pylintrc            # configuration for pylint
├─ LICENCE
├─ poetry.lock
├─ pyproject.toml
└─ README.md            # The current file
```

The tests in the repository run automatically, and their status is shown by the corresponding badge. To execute each process, run the run.sh script or follow the instructions below.

⚠️ The pipeline can be executed locally!

```bash
poetry build && poetry run spark-submit \
    --master local \
    --py-files dist/digital_pathology-*.whl \
    jobs/<JOB_STEP>.py \
    <INPUT_FILE_PATH> \
    <OUTPUT_PATH>
```

JOB_STEP: the pipeline step to run, each of which performs a specific feature extraction, image processing, or data mining process. Possible values are `download`, `ingest`, `preprocess`, or `extract`.
INPUT_FILE_PATH: the input pathological image data source, if the step requires one.
OUTPUT_PATH: the output directory where the results of all executed steps are stored.
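
To illustrate how a job script might receive the two trailing arguments of the command above, here is a minimal sketch of an entry point. It is not the repository's code: the names `parse_args` and `main`, the application name, and the decision to make both arguments required are all assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments that follow the job script on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline job entry point")
    parser.add_argument("input_file_path", help="pathological image data source")
    parser.add_argument("output_path", help="directory where step results are stored")
    return parser.parse_args(argv)


def main(argv=None):
    """Parse arguments, start a Spark session, and run a placeholder step."""
    args = parse_args(argv)

    # Imported here so that argument parsing can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("digital-pathology-job").getOrCreate()
    try:
        # Placeholder: a real step would call into the digital_pathology library here.
        print(f"Would process {args.input_file_path} into {args.output_path}")
    finally:
        spark.stop()
```

A real job script would call `main()` under the usual `if __name__ == "__main__":` guard so that `spark-submit` can run it directly.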

Creating a Package for Production Environment Execution

Running the script

```bash
scripts/build.sh
```

will create a dist folder with the following files:

```bash
/
...
│
├─ /dist
│  ├─ /digital_pathology-@VERSION.tar           # zipped Python library to be used for executors
│  └─ /digital_pathology-@VERSION-py3-none-any  # library wheels for setup and installation
│
...
```

These files can be used in any real or production deployment, whether local, cloud, or cluster.
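
Since the same package can target different environments, it may help to assemble the `spark-submit` invocation programmatically and vary only the master. The sketch below mirrors the local command shown in the Execution section. The helper `build_submit_command`, the example paths, and the wheel file name are hypothetical.

```python
import shlex

# Step names taken from the JOB_STEP description in the Execution section.
JOB_STEPS = ("download", "ingest", "preprocess", "extract")


def build_submit_command(step, input_path, output_path, wheel, master="local"):
    """Assemble the spark-submit argument list for one pipeline step."""
    if step not in JOB_STEPS:
        raise ValueError(f"unknown step {step!r}; expected one of {JOB_STEPS}")
    return [
        "spark-submit",
        "--master", master,
        "--py-files", wheel,
        f"jobs/{step}.py",
        input_path,
        output_path,
    ]


# Example values are placeholders, not files that exist in the repository.
cmd = build_submit_command(
    "extract",
    "data/slide.svs",
    "out/",
    "dist/digital_pathology-0.0.0-py3-none-any.whl",
    master="local[*]",
)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run`, and the command can be prefixed with `poetry run` when executed from the project environment.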

License

EUPL

Universidad Internacional de Valencia

Owner

Name: Israel Llorens
Login: sanchezis
Kind: user
Location: Valencia

Website: https://www.linkedin.com/in/israel-llorens-68845438
Repositories: 1
Profile: https://github.com/sanchezis

Principal Lead Data Scientist, Predictive Analytics, Machine Learning and Big Data Software Engineer

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Llorens
    given-names: Israel
    orcid: "https://orcid.org/0009-0006-4029-9977"
title: "Medical Image Data Bank - Generación de una tubería distribuida para la extracción de características en imágenes médicas patológicas"
version: 2.0.4
identifiers:
  - type: doi
    value: None
date-released: 2024-04-14
url: "https://github.com/sanchezis/MU-Big-Data-and-Data-Science-Image-Pathology-Pipeline"
