mu-big-data-and-data-science-image-pathology-pipeline
Master's Thesis in Data Science and Big Data
https://github.com/sanchezis/mu-big-data-and-data-science-image-pathology-pipeline
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
MSc - Thesis in Data Science and Big Data
Generation of a Distributed Pipeline for Feature Extraction in Pathological Medical Images
General Information
This project aims to implement a pipeline in which each step performs an information extraction or data mining operation on data derived from pathological images.
Several libraries will be used and their capabilities compared, as if they were running in a real medical environment.
Technologies
Prerequisites
The following technologies must be installed, or be accessible, either locally or on a distributed system.
- Spark cluster version ^3.4.4
- Python 3.11
- Java JDK version 11.0.24
- Scala version 2.13
- Libraries: pyspark (3.4.4), pylint (3.1.0), numpy (^1.20.2), pandas (^2.0.0)
- OpenSlide
- TiaToolbox, HistomicsTK, StarDist
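As a quick sanity check of the Python-side prerequisites, a sketch like the following reports which of the listed libraries are installed and at what version. The `PINNED` mapping and the helper names `installed_versions` and `mismatches` are illustrative assumptions, not part of the project; only the two exactly pinned versions from the list are included.

```python
from importlib import metadata

# Exact pins taken from the prerequisites list; caret ranges are left out.
PINNED = {"pyspark": "3.4.4", "pylint": "3.1.0"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(pinned, found):
    """Return the packages whose installed version differs from the pin."""
    return {
        name: (wanted, found.get(name))
        for name, wanted in pinned.items()
        if found.get(name) != wanted
    }


if __name__ == "__main__":
    report = mismatches(PINNED, installed_versions(PINNED))
    for name, (wanted, got) in report.items():
        print(f"{name}: expected {wanted}, found {got}")
```

Running it inside the project's environment prints one line per package whose version differs from the pin, and nothing when everything matches.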
Setup
Install Dependencies
Spark can be installed locally, either through the project's code dependencies or by running the script below on the machine that will host the environment.
⚠️ This Spark environment must be configured to use AWS! You must download the JAR files needed to connect. If using Spark version 3.4.4, these are the files:
- aws-java-sdk-bundle-1.12.262.jar
- hadoop-aws-3.3.4.jar

They must be saved in the Spark JAR directory.
```bash
# Download Spark 3.4.4
wget https://archive.apache.org/dist/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz

# Extract
tar -xvzf spark-3.4.4-bin-hadoop3.tgz
sudo mv spark-3.4.4-bin-hadoop3 /opt/spark

# Configure environment variables
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc
```
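Once Spark is in place, it can be worth confirming that the two AWS JARs were actually copied. The sketch below assumes the JAR directory is the `jars` folder inside the Spark installation (the layout of the standard binary distribution) and that Spark lives in `/opt/spark` as in the commands above; the helper name `missing_jars` is hypothetical.

```python
from pathlib import Path

# JAR names listed in the setup notes for Spark 3.4.4.
REQUIRED_JARS = ("aws-java-sdk-bundle-1.12.262.jar", "hadoop-aws-3.3.4.jar")


def missing_jars(spark_home, required=REQUIRED_JARS):
    """Return the required JAR names that are not in <spark_home>/jars."""
    jars_dir = Path(spark_home) / "jars"
    return [name for name in required if not (jars_dir / name).is_file()]


if __name__ == "__main__":
    # /opt/spark is the install location used in the setup commands.
    missing = missing_jars("/opt/spark")
    print("All AWS JARs present" if not missing else f"Missing: {missing}")
```

An empty result means both files were found; otherwise the list names the files still to be downloaded.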
You can install the necessary Python requirements using poetry.
```bash
pip install poetry
poetry install
```
These commands will install all dependencies used in this project.
Code Preparation
All of the following commands must run successfully.
The project includes a framework called digital_pathology with an associated directory of unit and integration tests. Running them verifies that the functionality used in each feature extraction stage works correctly.
- Unit Testing
```bash
poetry run pytest tests/unit
```
- Integration Testing
```bash
poetry run pytest tests/integration
```
- Code style and best practices checks
```bash
poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs digital_pathology tests
poetry run pylint digital_pathology tests
```
Execution
The project has the following directory structure.
```bash
/
├─ /digital_pathology     # Contains the main Python library
│                         # with the code for the transformations
│
├─ /jobs                  # Contains the entry points to the jobs;
│                         # each performs argument parsing and is
│                         # passed to `spark-submit`
│
├─ /notebooks             # Contains the notebooks for Databricks
│                         # implementation and pipeline execution in a
│                         # cluster/cloud environment
│
├─ /tests
│  ├─ /unit               # Contains basic unit tests for the code
│  └─ /integration        # Contains integration tests for the jobs
│                         # and the setup
│
├─ .gitignore
├─ .pylintrc              # Configuration for pylint
├─ LICENCE
├─ poetry.lock
├─ pyproject.toml
└─ README.md              # The current file
```
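To illustrate the role of the jobs directory, an entry point might look roughly like the sketch below: it parses the two positional arguments and hands them on to a transformation. The function names `parse_args` and `main` are hypothetical, and the call into `digital_pathology` is left as a comment because the library's actual API is not shown here.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the input and output paths passed after the job script."""
    parser = argparse.ArgumentParser(description="Pipeline job entry point (sketch)")
    parser.add_argument("input_path", help="input pathological image data source")
    parser.add_argument("output_path", help="directory where the step writes results")
    return parser.parse_args(argv)


def main(argv=None):
    """Parse arguments and report them; the transformation call is omitted."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    # A real job would create a SparkSession here and call into the
    # digital_pathology library; that call is an assumption and is left out.
    print(f"input={args.input_path} output={args.output_path}")
    return args
```

A real script would end by calling `main()` under the usual `if __name__ == "__main__":` guard, so that `spark-submit` can run the file directly.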
The repository's tests run automatically, and their status is shown by the corresponding badge. To execute each process, run the run.sh script or follow the instructions below.
⚠️ The pipeline can be executed locally!
```bash
poetry build && poetry run spark-submit \
    --master local \
    --py-files dist/digital_pathology-*.whl \
    jobs/<JOB_STEP>.py \
    <INPUT_FILE_PATH> \
    <OUTPUT_PATH>
```
- JOB_STEP: pipeline step that performs a specific feature extraction, image processing, or data mining process. Possible values are `download`, `ingest`, `preprocess`, or `extract`.
- INPUT_FILE_PATH: path to the input pathological image data source, if the step needs one.
- OUTPUT_PATH: output directory where the results of all executed steps are stored.
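The invocation above can also be assembled programmatically, for example to run several steps one after another. The following is a minimal sketch under stated assumptions: the helper name `build_submit_command`, the `JOB_STEPS` tuple, and the example paths and wheel name are illustrative and not taken from the project. The wheel path is passed explicitly because the `*` glob in the shell command is expanded by the shell, not by `spark-submit`.

```python
# Step names as listed for the job step argument.
JOB_STEPS = ("download", "ingest", "preprocess", "extract")


def build_submit_command(step, input_path, output_path, wheel, master="local"):
    """Return the spark-submit argument list for one pipeline step."""
    if step not in JOB_STEPS:
        raise ValueError(f"unknown job step: {step!r}")
    return [
        "spark-submit",
        "--master", master,
        "--py-files", wheel,
        f"jobs/{step}.py",
        input_path,
        output_path,
    ]


if __name__ == "__main__":
    # Hypothetical wheel name and paths, for illustration only.
    cmd = build_submit_command(
        "preprocess",
        "data/input",
        "data/output",
        "dist/digital_pathology-0.0.0-py3-none-any.whl",
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; prefixing it with `["poetry", "run"]` would mirror the shell command when running inside the Poetry environment.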
Creating a Package for Production Environment Execution
Running the script
```bash
scripts/build.sh
```
will create a dist folder with the following files:
```bash
/
...
│
├─ /dist
│  ├─ /digital_pathology-@VERSION.tar           # compressed Python library to be used by executors
│  └─ /digital_pathology-@VERSION-py3-none-any  # library wheel for setup and installation
│
...
```
These files can be used in any real or production deployment, whether local, cloud, or cluster.
License
Universidad Internacional de Valencia
Owner
- Name: Israel Llorens
- Login: sanchezis
- Kind: user
- Location: Valencia
- Website: https://www.linkedin.com/in/israel-llorens-68845438
- Repositories: 1
- Profile: https://github.com/sanchezis
Principal Lead Data Scientist, Predictive Analytics, Machine Learning and Big Data Software Engineer
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Llorens
    given-names: Israel
    orcid: "https://orcid.org/0009-0006-4029-9977"
title: "Medical Image Data Bank - Generación de una tubería distribuida para la extracción de características en imágenes médicas patológicas"
version: 2.0.4
identifiers:
  - type: doi
    value: None
date-released: 2024-04-14
url: "https://github.com/sanchezis/MU-Big-Data-and-Data-Science-Image-Pathology-Pipeline"
```
GitHub Events
Total
- Watch event: 1
- Push event: 74
Last Year
- Watch event: 1
- Push event: 74