singledeep

Code to run singleDeep for phenotype prediction of samples from scRNA-Seq data

https://github.com/genyo-bioinformatics/singledeep

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.6%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

Code to run singleDeep for phenotype prediction of samples from scRNA-Seq data

Basic Info

Host: GitHub
Owner: GENyO-BioInformatics
License: gpl-3.0
Language: Python
Default Branch: main
Size: 15.1 MB

Statistics

Stars: 6
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created almost 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

singleDeep

A deep learning workflow for samples phenotype prediction with single-cell RNA-Seq data.

Dependencies

SingleDeep has been tested with R 4.2.0 and Python 3.8.3, although it is likely to work with other versions. To install the Python dependencies with pip, run the following code.

{bash} pip install -r requirements.txt

Preparing the data

The starting point of singleDeep is a Seurat or scanpy object with the scRNA-Seq data processed and the cell populations annotated. The file format should be RData or h5ad for Seurat and scanpy objects respectively. There are several resources to process scRNA-Seq data, for instance the excellent book Single-cell best practices (https://www.sc-best-practices.org/). Scaling is optional, but it is important to know whether it has been performed or not, since it has to be indicated to the singleDeep main script (parameter scale). To reduce the computational cost, it is strongly recommended to perform some gene selection during the data preprocessing. For instance, it is common to select a number of highly variable genes.

You can run the R script PrepareData.R from the terminal to read the input file and prepare all the necessary data to run singleDeep. It is strongly recommended to read the script help to understand the parameters and be used adequately. You can access this help with the following command.

{bash} Rscript PrepareData.R --help

Among the script parameters, there are two specially relevant:

maxCells: If a cell type has a large amount of cells (e.g., > 100,000), there is no benefit for the neural networks training and the computational cost is significantly increased. For this reason, it is recommended to use this parameter to discard random cells until reaching the specified value (50,000 by default, but usually less cells can be used without affecting the models performance).
filterGenes: If this parameter is included, ribosomal, mitochondrial and non-coding genes will be discarded from the data. These genes are usually not measured correctly in single-cell experiments and may be problematic for data analysis and interpretation. Therefore, we recommend to include this parameter unless there is some specific interest for these genes.

Here is an example of how to run the script for a scanpy object:

```{bash} Rscript PrepareData.R --inputPath dataset.h5ad --fileType scanpy --sampleColumn indcov --clusterColumn cgcov --clinicalColumns Disease --maxCells 30000 --outPath ./singleDeep_input

```

Training and internal validation

This is the main step of the singleDeep workflow. Check carefully the parameters of singleDeep.py, since many of them are essential for the analyses to be run correctly. This script trains a deep learning model for each cell population previously annotated. Internal validation is performed with a nested cross-validation, with the inner loop used to calculate the performance of the cluster (MCC parameter) and the outer loop performs the actual training and prediction for the individual cells. Based on these per-cell predictions, a per-sample prediction is performed for each cell population. A final prediction considering all the cell populations is calculated, and the performance metrics are calculated comparing these predictions and the real labels provided.

To read the script help, run the following command:

{bash} python singleDeep.py --help

Following the example of PrepareData.R, this script would be executed with default parameters using the following command.

{bash} python singleDeep.py --inPath ./singleDeep_input

The output files, saved by default in the folder results, are the following:

testResults.tsv: Mean performance metrics for the test samples (outer fold of the nested cross-validation)
clusterResults.tsv: Mean performance metrics for the individual cell populations. This is useful to know which clusters are the most informative for the classification
folds_performance: Folder with performance metrics for each cell population and individual external fold. These tables give information about the performance consistency along the different data folds in each cell population
gene_contributions: Folder with the contributions of each gene in each sample and cell population. Read the DeepLIFT article for more information (https://arxiv.org/abs/1704.02685)
models.pt (only if --saveModel parameter is passed): File necessary for external validation

Training reports

SingleDeep saves interactive reports that can be used to track the training and validation processes. It is necessary to install TensorFlow to open these reports (check https://www.tensorflow.org/install). Once installed, the reports may be opened with this command:

{bash} tensorboard --logdir ./log

The reports to be plotted may be selected in the left panel. The structure of the reports names for the inner cross-validation are cell population + fold K out + fold K in. For instance, the first inner fold for the second outer fold and B cell population would be Bfold21. The outer cross-validation is named as test (e.g. Bfold3test would contain the results for the third fold of the outer loop). Furthermore, if the sameModel parameter of the singleDeep.py script is True, reports finishing with whole are generated, containing the training metrics for the entire dataset.

Reports contain the MCC, accuracy and cross-entropy loss for both the test and training along the training epochs.

Using the trained models with external data

If the parameter saveModel of singleDeep.py is set to True, the neural networks for each cell population are trained with the whole dataset and saved to the models.pt file, together with other necessary data. To use these models to predict the phenotype of a new dataset, it is necessary to project the cells of this dataset to the training dataset. This step is essential to apply the model of each cell population to the corresponding cells. There are several methods to do this step, such as scmap (https://bioconductor.org/packages/release/bioc/html/scmap.html) in R or the scanpy's function tl.ingest (https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.ingest.html). Furthermore, according to our experience, it is also necessary to correct the batch effect between the datasets with methods such as Combat or Harmony, as well as scaling together the datasets and selecting the variable genes from the merged data.

The function PrepareData.R may be used on this new dataset following the previous instructions. Once these data are ready, the script singleDeep_pretrained.py should be used to perform the prediction. You can check the help with the following code:

{bash} python singleDeep_pretrained.py --help

Essentially, the required parameters are inputPath, the path to the new dataset (output of PrepareData.R); modelFile, the object with the trained models (models.pt from singleDeep.py); scale to indicate if scaling should be performed; and outFile, the file to save the predictions. Take into account that singleDeep transforms the phenotype categories into numbers (0, 1...) following the alphabetical order. For instance, if the caregories are "Control" and "Disease", the output will be 0 for Control and 1 for Disease respectively.

Use example

We created a toy dataset comprising 20 samples, each containing 5 distinct cell types under 2 different conditions. You can generate this dataset using the simulation_toy_dataset.R script. To run the singleDeep model on this simulated data, use the following command:

{bash} python singleDeep.py --inPath ./toy_dataset/data --sampleColumn Sample --logPath log_toy --resultsPath results_toy --varColumn Condition --num_epochs 50 --resultsFilenames Condition --KOuter 3 --KInner 2

Please note that you may encounter 'zerodivision' warnings, which are expected if all samples are predicted with the same label. The process will take a few minutes to complete, after which the results will be available in the `resultstoy` folder.

Owner

Name: GENyO BioInformatics
Login: GENyO-BioInformatics
Kind: organization
Email: bioinfo@genyo.es
Location: Granada, Andalucia, Spain

Website: http://www.genyo.es/en/bioinformatics-unit
Repositories: 6
Profile: https://github.com/GENyO-BioInformatics

Next Generation Sequencing pipelines, workflows and tools to analyse the data derived from it. We are mainly focused on Cancer and Autoinmune diseases.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Martorell-Marugán"
  given-names: "Jordi"
  orcid: "https://orcid.org/0000-0002-5186-0735"
- family-names: "López-Domínguez"
  given-names: "Raúl"
  orcid: "https://orcid.org/0000-0001-8634-117X"
- family-names: "Villatoro-García"
  given-names: "Juan Antonio"
  orcid: "https://orcid.org/0000-0001-5752-7784"
- family-names: "Toro-Domínguez"
  given-names: "Daniel"
  orcid: "https://orcid.org/0000-0001-8440-312X"
- family-names: "Chierici"
  given-names: "Marco"
  orcid: "https://orcid.org/0000-0001-9791-9301"
- family-names: "Jurman"
  given-names: "Giuseppe"
  orcid: "https://orcid.org/0000-0002-2705-5728"
- family-names: "Carmona-Sáez"
  given-names: "Pedro"
  orcid: "https://orcid.org/0000-0002-6173-7255"
title: "singleDeep"
version: 1.0.0
doi: 10.6084/m9.figshare.26136463
date-released: 2024-08-28
url: "https://github.com/GENyO-BioInformatics/singleDeep"

GitHub Events

Total

Watch event: 6

Last Year

Watch event: 6

Dependencies

requirements.txt pypi

captum ==0.6.0
numpy ==1.23.0
pandas ==1.5.3
scanpy ==1.9.6
scikit-learn ==1.3.1
tensorboard ==2.9.0
torch ==1.11.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

singledeep

Science Score: 54.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

singleDeep

Dependencies

Preparing the data

Training and internal validation

Training reports

Using the trained models with external data

Use example

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Dependencies