https://github.com/carmonalab/scib-pipeline

Snakemake pipeline that works with the scIB package to benchmark data integration methods.


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: carmonalab
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 232 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Fork of theislab/scib-pipeline
Created over 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

Pipeline for benchmarking supervised integration of single-cell RNA-seq atlases

This repository contains the snakemake pipeline for our benchmarking analysis of scRNA-seq data integration tools. It is based on the scib package and on the previous scib-pipeline from the Theis Lab (Luecken et al., 2020).

Compared to this previous study, our benchmark focuses on (semi-)supervised tools, with the addition of our new version of STACAS. It includes our new integration metric CiLISI, computed with our scIntegrationMetrics R package. It also assesses how robust (semi-)supervised integration tools are to noise, which we introduce by shuffling cell type labels and by providing only partial annotations. Finally, it tests the capacity of (semi-)supervised tools to separate cell types when they are guided with a broader annotation.
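To make the idea of a cell type-aware batch-mixing score more concrete, here is a minimal sketch in the spirit of CiLISI. It is not the scIntegrationMetrics implementation: the function names are hypothetical, neighbours are plain unweighted k-nearest neighbours rather than the perplexity-based neighbourhoods used by LISI, and the scaling to [0, 1] is an assumption based on the description of the metric. For each cell, it looks only at neighbours of the same cell type and measures how many batches are effectively represented among them.

```python
import math
from collections import Counter, defaultdict


def inverse_simpson(labels):
    """Effective number of distinct labels (1.0 means a single label only)."""
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(labels).values())


def simple_cilisi(embedding, batches, cell_types, k=4):
    """Simplified cell type-aware batch mixing, averaged per cell, in [0, 1].

    Assumption: unweighted kNN within each cell type (hypothetical helper,
    not the scIntegrationMetrics API).
    """
    by_type = defaultdict(list)
    for i, cell_type in enumerate(cell_types):
        by_type[cell_type].append(i)

    scores = []
    for members in by_type.values():
        n_batches = len({batches[i] for i in members})
        # Mixing is undefined for cell types seen in one batch or with too few cells.
        if n_batches < 2 or len(members) <= k:
            continue
        for i in members:
            neighbours = sorted(
                (j for j in members if j != i),
                key=lambda j: math.dist(embedding[i], embedding[j]),
            )[:k]
            lisi = inverse_simpson([batches[j] for j in neighbours])
            # Rescale from [1, n_batches] to [0, 1].
            scores.append((lisi - 1.0) / (n_batches - 1.0))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy 1-D embeddings: one cell type, two batches.
one_type = ["T cell"] * 20
mixed_cells = [(float(i),) for i in range(20)]
mixed_batches = [i % 2 for i in range(20)]
separated_cells = [(float(i),) for i in range(10)] + [(100.0 + i,) for i in range(10)]
separated_batches = [0] * 10 + [1] * 10

mixed_score = simple_cilisi(mixed_cells, mixed_batches, one_type)
separated_score = simple_cilisi(separated_cells, separated_batches, one_type)
print(mixed_score, separated_score)
```

In this toy example the interleaved batches score close to 1 and the fully separated batches score 0. Restricting neighbours to the same cell type avoids penalising a method for keeping biologically distinct populations apart.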

Major modifications regarding the original pipeline

  • Adding STACAS and semi-supervised STACAS
  • Using embedding output (PCA computed on scaled integrated data with Seurat) for R based methods
  • Embedding/latent space for integration with a fixed size (e.g. 30, 50) for all tools (reduced space or bottleneck layer of autoencoders)
  • Testing noisy and missing cell type labels to guide integration (i.e. partially removing and shuffling cell type labels)
  • New batch-correction metric CiLISI (cell type-aware LISI) computed with scIntegrationMetrics
  • Packages of integration tools updated
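The noisy and missing label scenarios can be illustrated with a short sketch. This is not the pipeline's own code: `perturb_labels`, its parameters and the `"unknown"` placeholder are assumptions chosen for illustration. A fraction of cells has its label removed, and a separate fraction has its labels permuted among themselves.

```python
import random


def perturb_labels(labels, frac_unknown=0.15, frac_shuffled=0.20,
                   seed=0, unknown="unknown"):
    """Return a copy of `labels` with some labels removed and some shuffled.

    Hypothetical helper: the two affected subsets of cells are disjoint.
    """
    rng = random.Random(seed)
    n = len(labels)
    order = list(range(n))
    rng.shuffle(order)

    n_unknown = round(frac_unknown * n)
    n_shuffled = round(frac_shuffled * n)
    unknown_idx = order[:n_unknown]
    shuffled_idx = order[n_unknown:n_unknown + n_shuffled]

    noisy = list(labels)
    # Missing annotations: replace the label with a placeholder.
    for i in unknown_idx:
        noisy[i] = unknown
    # Noisy annotations: permute the labels among the selected cells.
    permuted = [labels[i] for i in shuffled_idx]
    rng.shuffle(permuted)
    for i, label in zip(shuffled_idx, permuted):
        noisy[i] = label
    return noisy


labels = ["T cell"] * 50 + ["B cell"] * 50
noisy = perturb_labels(labels, frac_unknown=0.15, frac_shuffled=0.20, seed=1)
n_changed = sum(a != b for a, b in zip(labels, noisy))
print(noisy.count("unknown"), n_changed)
```

Because labels are permuted rather than redrawn, some shuffled cells keep their original label by chance, so the number of changed labels is at most the sum of the two fractions.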

Installation

As in the original pipeline, to reproduce the results from this study, two separate conda environments are needed for python and R operations. Please make sure you have either mambaforge or conda installed on your system to be able to use the pipeline. We recommend using mamba, which is also available for conda, for faster package installations with a smaller memory footprint.

We provide python and R environment YAML files in envs/, together with an installation script that sets up the correct environments in a single command, based on the R version you want to use. Our new pipeline currently only supports R 4.1. Call the script as follows:

```shell
bash envs/create_conda_environments.sh -r 4.1
```

Once installation is successful, you will have the python environment scib-pipeline-R<version> and the R environment scib-R<version> that you must specify in the config file.

| R version | Python environment name | R environment name | Test data config YAML file |
|-----------|-------------------------|--------------------|----------------------------|
| 4.1 | scib-pipeline-R4.1 | scib-R4.1 | configs/test_data-R4.1.yml |
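As a quick sanity check before editing the config file, the expected environment names can be derived from the R version and compared against the environments that are installed. The sketch below is only an illustration: the helper names and the example paths are made up, and in practice the list of paths could be taken from the `envs` field of `conda env list --json`.

```python
from pathlib import PurePath


def expected_env_names(r_version):
    """Environment names following the scib-pipeline-R<version> / scib-R<version> scheme."""
    return {
        "python": f"scib-pipeline-R{r_version}",
        "r": f"scib-R{r_version}",
    }


def missing_envs(env_paths, r_version):
    """Names of expected environments that are absent from `env_paths`."""
    installed = {PurePath(path).name for path in env_paths}
    expected = expected_env_names(r_version).values()
    return sorted(name for name in expected if name not in installed)


# Made-up example paths; replace with the paths reported by conda.
example_paths = ["/opt/conda", "/opt/conda/envs/scib-pipeline-R4.1"]
missing = missing_envs(example_paths, "4.1")
print(missing)
```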

Running the Pipeline

This repository contains a snakemake pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.

Setup Configuration File

The parameters and input files are specified in config files. A description of the config formats and example files can be found in configs/. You can use the example config that uses the test data to get the pipeline running quickly, and then modify a copy of it to work with your own data.

Pipeline Commands

To call the pipeline on the test data, e.g. using R 4.1 to reproduce our benchmark with the original annotations guiding the supervised tools, start with a dry run:

```shell
snakemake --configfile configs/test_original_annotations-R4.1.yaml -n
```

This gives you an overview of the jobs that will be run. In order to execute these jobs with up to 10 cores, call

```shell
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 10
```
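If you prefer to launch these calls from a script, the two commands above can be assembled programmatically. The following sketch only builds the argument lists; `snakemake_command` is a hypothetical helper, not part of the pipeline, and it uses no flags beyond those shown above.

```python
import shlex


def snakemake_command(configfile, cores=None, dry_run=False):
    """Build the snakemake argument list for a dry run or a real run."""
    cmd = ["snakemake", "--configfile", configfile]
    if dry_run:
        cmd.append("-n")  # only list the jobs that would be run
    if cores is not None:
        cmd += ["--cores", str(cores)]
    return cmd


config = "configs/test_original_annotations-R4.1.yaml"
dry = snakemake_command(config, dry_run=True)
full = snakemake_command(config, cores=10)
print(shlex.join(dry))
print(shlex.join(full))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` from within the activated python environment, so that a failing dry run stops the script before any job is started.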

We strongly recommend running this snakemake pipeline on an HPC cluster. For example, using slurm and the config file configs/cluster.yml, you can run the workflow as follows:

```shell
mkdir -p cluster/snakemake/; \
snakemake -j 100 --configfile configs/test_original_annotations-R4.1.yaml \
    --cluster-config configs/cluster.yml \
    --cluster "sbatch -A {cluster.account} \
        -p {cluster.partition} \
        -N {cluster.N} \
        -t {cluster.time} \
        --job-name {cluster.name} \
        --mem {cluster.mem} \
        --cpus-per-task {cluster.cpus-per-task} \
        --output {cluster.output} \
        --error {cluster.error}"
```

You can then generate a table gathering the snakemake benchmark files (CPU time, memory usage, etc.):

```shell
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 1 benchmarks
```
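For readers who want to inspect the benchmark files directly, the sketch below shows one way to collect them into a single table. It is not the pipeline's `benchmarks` rule: the `*.benchmark` file pattern, the file names and the `gather_benchmarks` helper are assumptions. It relies only on the tab-separated layout that Snakemake writes for benchmark files, with columns such as `s` (wall-clock seconds) and `max_rss` (peak memory in MB).

```python
import csv
import tempfile
from pathlib import Path


def to_float(value):
    """Convert a benchmark field to float, mapping unparsable values to NaN."""
    try:
        return float(value)
    except ValueError:
        return float("nan")


def gather_benchmarks(root, pattern="**/*.benchmark"):
    """Collect run time and peak memory from all benchmark files under `root`."""
    root = Path(root)
    rows = []
    for path in sorted(root.glob(pattern)):
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle, delimiter="\t"):
                rows.append({
                    "file": str(path.relative_to(root)),
                    "seconds": to_float(record["s"]),
                    "max_rss_mb": to_float(record["max_rss"]),
                })
    return rows


# Demo with two made-up benchmark files.
header = "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\tcpu_time\n"
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scanorama.benchmark").write_text(
        header + "12.5\t0:00:12\t850.2\t1900.0\t840.0\t845.0\t0.0\t1.2\t95.0\t11.9\n")
    (root / "harmony.benchmark").write_text(
        header + "30.0\t0:00:30\t1200.0\t2500.0\t1190.0\t1195.0\t0.0\t2.0\t98.0\t29.4\n")
    table = gather_benchmarks(root)

for row in table:
    print(row)
```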

More snakemake commands can be found in the documentation.

Reproduce/Visualize our results

We provide the config files to reproduce our 3 different analyses, together with the R Markdown files we used to generate the figures from the results of the pipeline, which you can find in the results directory.

| Analysis | Config YAML file | Rmarkdown file |
|----------|------------------|----------------|
| original annotations | testoriginalannotations-R4.1.yml | originalAnnotationAnalysis.Rmd |
| robustness to noise | testsupervisedmethods-R4.1.yml | SupervisedToolAnalysis.Rmd |
| final benchmark | testfinalbenchmark-R4.1.yml | finalBenchmarkAnalysis.Rmd |

Failed integration

Some tools fail to integrate certain tasks. To complete the workflow and set the integration metrics to NA for these scenarios, you can use the script integration_fail_file.py as follows:

```shell
python scripts/integration_fail_file.py -c configs/review_tests.yaml -t Pancreas_rm8 -l unknown_15_shuffled_20 -m seuratrpca -v
```

Tools

Tools that are compared include:

  • STACAS
  • Scanorama
  • scANVI
  • FastMNN
  • scGen
  • scVI
  • Seurat v4 (CCA and RPCA)
  • Harmony

References

Benchmarking atlas-level data integration in single-cell genomics. Luecken et al., 2020

Semi-supervised integration of single-cell transcriptomics data. Andreatta et al., 2023

Owner

  • Name: Cancer Systems Immunology Lab
  • Login: carmonalab
  • Kind: organization
  • Location: Lausanne, Switzerland

At Ludwig Cancer Research Lausanne and Department of Oncology, University of Lausanne & Swiss Institute of Bioinformatics

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Dependencies

.github/workflows/pipeline.yml actions
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
tests/requirements.txt pypi
  • pytest * test
  • pytest-icdiff * test
  • pytest-runner * test