https://github.com/carmonalab/scib-pipeline-2
Snakemake pipeline that works with the scIB package to benchmark data integration methods.
Repository
Basic Info
- Host: GitHub
- Owner: carmonalab
- License: mit
- Default Branch: main
- Size: 228 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of carmonalab/scib-pipeline
Created over 3 years ago
· Last pushed over 3 years ago
https://github.com/carmonalab/scib-pipeline-2/blob/main/
# Pipeline for benchmarking atlas-level single-cell integration

This repository contains the snakemake pipeline for our benchmarking study of data integration tools. In this study, we benchmark 16 methods ([see here](#tools)) with 4 combinations of preprocessing steps, leading to 68 method combinations, on 85 batches of gene expression and chromatin accessibility data. The pipeline uses the [`scib`](https://github.com/theislab/scib.git) package and allows for reproducible and automated analysis of the different steps and combinations of preprocessing and integration methods.

## Resources

- On our [website](https://theislab.github.io/scib-reproducibility) we visualise the results of the study.
- The scib package that is used in this pipeline can be found [here](https://github.com/theislab/scib).
- For reproducibility and visualisation we have a dedicated repository: [scib-reproducibility](https://github.com/theislab/scib-reproducibility).
- The data used in the study is available on [figshare](https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968).

### Please cite:

Luecken, M.D., Büttner, M., Chaichoompu, K. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2022). https://doi.org/10.1038/s41592-021-01336-8

## Installation

To reproduce the results from this study, two separate conda environments are needed for python and R operations. Please make sure you have either [`mambaforge`](https://github.com/conda-forge/miniforge) or [`conda`](https://conda.io/projects/conda) installed on your system to be able to use the pipeline. We recommend using [`mamba`](https://mamba.readthedocs.io), which is also available for conda, for faster package installations with a smaller memory footprint.

We provide python and R environment YAML files in `envs/`, together with an installation script that sets up the correct environments in a single command, based on the R version you want to use. The pipeline currently supports R 3.6 and R 4.0, and we generally recommend using R 4.0. Call the script as follows, e.g. for R 4.0:

```shell
bash envs/create_conda_environments.sh -r 4.0
```

Check the script's help output to get the full list of arguments it accepts.

```shell
bash envs/create_conda_environments.sh -h
```

Once installation is successful, you will have the python environment `scib-pipeline-R<version>` and the R environment `scib-R<version>`, which you must specify in the [config file](#setup-configuration-file).

| R version | Python environment name | R environment name | Test data config YAML file   |
|-----------|-------------------------|--------------------|------------------------------|
| 4.0       | `scib-pipeline-R4.0`    | `scib-R4.0`        | `configs/test_data-R4.0.yml` |
| 3.6       | `scib-pipeline-R3.6`    | `scib-R3.6`        | `configs/test_data-R3.6.yml` |

> **Note**: The installation script only works for the environments listed in the table above.
> The environments used in our [study][paper] are included for reproducibility purposes and are described in `envs/`.

For a more detailed description of the environment files and how to install the different environments manually, please refer to the README in `envs/`.

## Running the Pipeline

This repository contains a [snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.

### Generate Test data

A script in `data/` can be used to generate test data. This is useful to ensure that the installation was successful before moving on to a larger dataset. The pipeline expects an `anndata` object with normalised and log-transformed counts in `adata.X` and counts in `adata.layers['counts']`.
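
The expected input layout can be checked before starting a run. The following is a minimal sketch, not part of the pipeline: `check_pipeline_input` is a hypothetical helper, the "values are all whole numbers" test is only a heuristic for spotting raw counts, and the toy object uses a simple library-size scaling as a stand-in for real normalisation. It relies on duck typing, so it should work on any object exposing `.X` and `.layers`, such as an `anndata.AnnData`.

```python
from types import SimpleNamespace

import numpy as np


def _dense(matrix):
    """Return a dense NumPy array for dense or scipy-sparse input."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


def check_pipeline_input(adata):
    """Return a list of problems; an empty list means the layout looks right.

    Hypothetical helper (assumption, not part of scib or this pipeline).
    """
    if "counts" not in adata.layers:
        return ["adata.layers['counts'] is missing"]

    problems = []
    counts = _dense(adata.layers["counts"])
    x = _dense(adata.X)

    if counts.shape != x.shape:
        problems.append("adata.X and adata.layers['counts'] differ in shape")
    # Raw counts should be non-negative whole numbers.
    if (counts < 0).any() or not np.allclose(counts, np.round(counts)):
        problems.append("adata.layers['counts'] does not look like raw counts")
    # Heuristic: log-normalised values are rarely all whole numbers.
    if x.size and np.allclose(x, np.round(x)):
        problems.append("adata.X looks like raw counts, not log-normalised values")
    return problems


# Stand-in object so the sketch runs without a dataset on disk.
counts = np.array([[0, 3, 1], [2, 0, 5]], dtype=float)
size_factors = counts.sum(axis=1, keepdims=True) / counts.sum(axis=1).mean()
toy = SimpleNamespace(X=np.log1p(counts / size_factors), layers={"counts": counts})
print(check_pipeline_input(toy))  # prints []
```

With a real dataset, the same function could be applied to the object returned by `anndata.read_h5ad`.
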

More information on how to use the data generation script can be found in `data/README.md`.

### Setup Configuration File

The parameters and input files are specified in config files. A description of the config formats and example files can be found in `configs/`. You can use the example config that uses the test data to get the pipeline running quickly, and then modify a copy of it to work with your own data.

### Pipeline Commands

To call the pipeline on the test data, e.g. using R 4.0:

```shell
snakemake --configfile configs/test_data-R4.0.yaml -n
```

This gives you an overview of the jobs that will be run. To execute these jobs with up to 10 cores, call

```shell
snakemake --configfile configs/test_data-R4.0.yaml --cores 10
```

More snakemake commands can be found in the [documentation](https://snakemake.readthedocs.io/).

### Visualise the Workflow

A dependency graph of the workflow can be created at any time and is useful for gaining a general understanding of the workflow. Snakemake can create a `graphviz` representation of the rules, which can be piped into an image file.

```shell
snakemake --configfile configs/test_data-R3.6.yaml --rulegraph | dot -Tpng -Grankdir=TB > dependency.png
```

## Tools

Tools that are compared include:

- [Scanorama](https://github.com/brianhie/scanorama)
- [scANVI](https://github.com/chenlingantelope/HarmonizationSCANVI)
- [FastMNN](https://bioconductor.org/packages/batchelor/)
- [scGen](https://github.com/theislab/scgen)
- [BBKNN](https://github.com/Teichlab/bbknn)
- [scVI](https://github.com/YosefLab/scVI)
- [Seurat v3 (CCA and RPCA)](https://github.com/satijalab/seurat)
- [Harmony](https://github.com/immunogenomics/harmony)
- [Conos](https://github.com/hms-dbmi/conos) [tutorial](https://htmlpreview.github.io/?https://github.com/satijalab/seurat.wrappers/blob/master/docs/conos.html)
- [Combat](https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.combat.html) [paper](https://academic.oup.com/biostatistics/article/8/1/118/252073)
- [MNN](https://github.com/chriscainx/mnnpy)
- [TrVae](https://github.com/theislab/trvae)
- [DESC](https://github.com/eleozzr/desc)
- [LIGER](https://github.com/MacoskoLab/liger)
- [SAUCIE](https://github.com/KrishnaswamyLab/SAUCIE)

[paper]: https://doi.org/10.1038/s41592-021-01336-8
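
Returning to the dry-run and execution calls under Pipeline Commands: for scripted runs, they could be assembled in Python along the following lines. This is a sketch under stated assumptions. `snakemake_command` is a hypothetical helper that is not part of this repository, and it only uses the flags that appear in this README (`--configfile`, `-n`, `--cores`).

```python
import shlex


def snakemake_command(configfile, cores=None, dry_run=False):
    """Build the argument list for a snakemake call.

    Hypothetical helper (assumption): mirrors the two commands in the README.
    """
    cmd = ["snakemake", "--configfile", configfile]
    if dry_run:
        cmd.append("-n")  # only list the jobs, do not run them
    if cores is not None:
        cmd += ["--cores", str(cores)]
    return cmd


config = "configs/test_data-R4.0.yaml"
print(shlex.join(snakemake_command(config, dry_run=True)))
print(shlex.join(snakemake_command(config, cores=10)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from within the activated python environment, so that a failed run raises an error instead of passing silently.
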
Owner
- Name: Cancer Systems Immunology Lab
- Login: carmonalab
- Kind: organization
- Location: Lausanne, Switzerland
- Website: https://agora-cancer.ch/laboratory/carmona-lab
- Twitter: carmonation
- Repositories: 16
- Profile: https://github.com/carmonalab
At Ludwig Cancer Research Lausanne and Department of Oncology, University of Lausanne & Swiss Institute of Bioinformatics