deconvolution_benchmarking
https://github.com/medicalgenomicslab/deconvolution_benchmarking
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 18 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: MedicalGenomicsLab
- License: gpl-3.0
- Language: Jupyter Notebook
- Default Branch: master
- Size: 8.64 MB
Statistics
- Stars: 18
- Watchers: 3
- Forks: 2
- Open Issues: 1
- Releases: 4
Metadata Files
README.md
Deconvolution benchmarking
Source code to reproduce results described in the article Performance of tumour microenvironment deconvolution methods in breast cancer using single-cell simulated bulk mixtures submitted to Nature Communications on 31st July 2022.
DOI:
- System requirements
- Experiment structure
- Prepare training and test data
- Method execution instructions
System requirements
We generated the results for this study using Python/3.6.1, R/3.5.0, and R/4.0.2 on CentOS Linux 7 (Core). The dependencies for the Python and R languages are listed below.
Python dependencies
- cuda 10.1
- scaden 1.1.2
- pandas 1.1.5
- numpy 1.19.5
- anndata 0.6.22.post1
- scanpy 1.4.4.post1
- gtfparse 1.2.0
- plotly 4.1.0
- smote-variants 0.4.0
- blackcellmagic 0.0.2
R dependencies
The method DWLS requires an R/3.5.0 environment, while all other R-based methods require an R/4.0.2 environment.
R/3.5.0 dependencies
- quadprog 1.5.5
- reshape 0.8.8
- e1071 1.6.8
- Seurat 2.3.3
- ROCR 1.0.7
- varhandle 2.0.5
- MAST 1.6.0
R/4.0.2 dependencies
- BisqueRNA 1.0.5
- Biobase 2.50.0
- TED 1.4
- scBio 1.4
- Seurat 4.0.3
- foreach 1.5.1
- doSNOW 1.0.19
- parallel 4.0.2
- raster 3.4.13
- fields 12.5
- limma 3.46.0
- LiblineaR 2.10.12
- sp 1.4.5
- EPIC 1.1.5
- jsonlite 1.7.2
- hspe 0.1
- MuSiC 0.2.0
- xbioc 1.7.2
Singularity dependencies
The method CIBERSORTx does not provide source code and can only be executed via a Docker container. As Docker requires root access, which is not available on our infrastructure, we converted the CIBERSORTx Docker image to a Singularity image. This requires:
- Docker 20.10.17
- Singularity 3.6.1
Experiment structure
There are 10 experiments with the overall directory structure as follows:
```
.
├── 01_purity_levels_experiment/
│   ├── include_normal_epithelial/
│   │   └── data/
│   └── exclude_normal_epithelial/
│       └── data/
├── 02_normal_epithelial_lineages_experiment/
│   └── data/
├── 03_immune_lineages_experiment/
│   ├── minor_level/
│   │   └── data/
│   └── subset_level/
│       └── data/
├── 04_tcga_bulk_validation/
│   └── data/
├── 05_external_scrna_validation/
│   ├── bassez_et_al/
│   │   └── data/
│   └── pal_et_al/
│       └── data/
├── 06_batch_effect_validation/
│   ├── bassez_et_al/
│   │   └── data/
│   └── pal_et_al/
│       └── data/
├── params/
├── src/
├── .gitignore
└── README.md
```
Each experiment is contained within its own directory, with its own data and execution files to run each deconvolution method either as a Jupyter notebook or a .pbs job.
Each experiment needs a data/ directory, which should contain:
```
data/
├── Whole_miniatlas_meta*.csv
├── Whole_miniatlas_umap.coords.tsv
├── training/
│   ├── scRNA_ref.h5ad (single-cell reference profiles)
│   └── training_sim_mixts.h5ad (training simulated RNA-Seq data, only for Scaden)
├── test/
│   └── test_sim_mixts.h5ad (test simulated bulk RNA-seq data)
├── pretrained_scaden/ (pretrained Scaden models. Please refer to the "Scaden" section)
├── expected_results/ (expected results. Please refer to the "Collect test results" section)
└── other experiment-specific files
```
The project is 12 terabytes (TB) in size. For data enquiries, please contact Nic Waddell (corresponding author) and Khoa Tran (first author). We will work with you to determine the best way to share the data.
Prepare training and test data
Each deconvolution requires either single-cell reference profiles or simulated mixtures as training data to estimate cell populations in the test_sim_mixts.h5ad files. From the scRNA_ref.h5ad and training_sim_mixts.h5ad files in training/, users can prepare method-specific training data by running the Jupyter notebook 9_1_prepare_method_specific_data.ipynb in each experiment directory. Once finished, method-specific sub-folders will be created in the data/ directory:
```
data/
├── bisque/
├── bprism/
├── cbx/
├── cpm/
├── dwls/
├── epic/
├── hspe/
├── music/
├── scaden/
....
```
NOTE: You will need at least 4 CPUs and 275GiB memory to run this step for each experiment.
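The simulated bulk mixtures themselves are built from the single-cell reference by sampling cells with known type fractions and averaging their expression profiles. The snippet below is a minimal numpy sketch of that general idea on toy data; it is not the study's actual simulation code, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-cell reference: 200 cells x 50 genes across 4 cell types.
n_cells, n_genes, n_types = 200, 50, 4
counts = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)
labels = rng.integers(0, n_types, size=n_cells)

def simulate_mixture(counts, labels, fractions, cells_per_mixture=100):
    """Sample cells according to known type fractions and average
    their profiles into one pseudo-bulk sample."""
    chosen = []
    for ctype, frac in enumerate(fractions):
        pool = np.flatnonzero(labels == ctype)
        n = int(round(frac * cells_per_mixture))
        if n > 0:
            chosen.append(rng.choice(pool, size=n, replace=True))
    return counts[np.concatenate(chosen)].mean(axis=0)

# The fractions used to build the mixture are the ground truth
# that deconvolution methods are later scored against.
fractions = np.array([0.5, 0.2, 0.2, 0.1])
bulk = simulate_mixture(counts, labels, fractions)
```

Repeating this for many random fraction vectors yields a labelled training/test set of pseudo-bulk samples.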
Method execution instructions
Besides CIBERSORTx and Scaden, the .pbs run scripts of all methods point to the R code in src/. CIBERSORTx execution is based on Singularity, and Scaden execution is based on the bash command scaden (please refer to System requirements).
CPUs and memory requirements for each job are already specified in corresponding .pbs files.
NOTE: each method can produce different sets of results depending on the specified parameters. For example, BayesPrism lets users optionally input marker genes and cell states (cell subtypes), or use only marker genes. Each of these choices produces a different set of results, meaning the results folder in the .pbs script for each method needs to be adjusted accordingly.
BisqueRNA (bisque)
Data files:
- logged_scRNA_ref.csv: bisque requires the single-cell reference data to be logged
- phenotypes.csv: cell labels. Matches the order in which cells are listed in logged_scRNA_ref.csv
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_bisque.pbs file
- Run method as pbs job
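The logging step for the bisque reference can be sketched in a few lines of pandas. This is only an illustration: it assumes a log2(x + 1) transform and a genes-by-cells layout, both of which are assumptions here (the data-preparation notebook defines the actual transform):

```python
import numpy as np
import pandas as pd

# Toy genes-x-cells raw-count reference (layout is illustrative only).
scRNA_ref = pd.DataFrame(
    [[0, 10, 3], [7, 0, 1]],
    index=["GENE_A", "GENE_B"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# log2(x + 1) keeps zero counts at zero and compresses large values.
logged_scRNA_ref = np.log2(scRNA_ref + 1)

# phenotypes.csv lists one label per cell, in the same column order
# as the logged reference.
phenotypes = pd.Series(["T_cell", "B_cell", "T_cell"], index=scRNA_ref.columns)
```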
BayesPrism (bprism)
Data files:
- scRNA_ref.csv: single-cell reference profiles. Columns are cell ids and rows are gene symbols
- single_cell_labels.csv: single-cell labels. Matches the order in which cells are listed in scRNA_ref.csv
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_bprism.pbs file
- (Optional) Increase number of available CPUs in the PBS -l option and N_CPUS variable.
- Run method as pbs job
CIBERSORTx (CBX)
Data files:
- scRNA_ref.txt: the single-cell reference matrix. Columns are cells and column names are cell types. CBX infers cell types from column names and therefore does not need a single-cell labels file like other methods.
- test_counts_*_pur_lvl.txt: as noted above, CBX requires the bulk expressions to be in its directory.
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR in 1_run_cbx.pbs file
- Copy test_counts_<PUR_LVL>_pur_lvl.txt files from test/ to cbx/
- Download the fractions_latest.sif file from Google Drive. This is the same container provided by CIBERSORTx. We simply converted the image from Docker to Singularity, as Docker requires sudo access and Singularity does not
- Update Singularity commands
- Path to the Singularity image fractions_latest.sif
- --token with the token requested from cibersortx.stanford.edu
Cell Population Mapping (CPM)
Data:
We slightly modified the CPM source code to split the execution in half. The first half uses 48 CPUs and 300-500 GiB of memory. The second half uses only 4 CPUs and 100 GiB of memory.
- scRNA_ref_1330_per_ctype_*.txt: single-cell reference data. We use at most 1,330 cells per cell type.
- cell_state.csv: the cell-state space that CPM uses to map how cells transition from one cell type to another. We use UMAP coordinates as the cell-state space.
- single_cell_label.csv: cell labels. Matches the order in which cells are listed in scRNA_ref_1330_per_ctype_*.txt
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_cpm_1.pbs and 1_run_cpm_2.pbs files
- Update path to the 1_CPM_functions.R file in src/1_run_cpm_1.R and src/1_run_cpm_2.R
- Run 1_run_cpm_1.pbs first as a pbs job
- Wait for 1_run_cpm_1.pbs to finish. Then run src/1_run_cpm_2.pbs as PBS job
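The per-type cap on the CPM reference can be applied with a simple shuffle-then-head pattern in pandas. The sketch below is illustrative only; the column names and toy counts are assumptions, not the study's actual preprocessing code:

```python
import pandas as pd

MAX_PER_TYPE = 1330  # cap used for the CPM reference in this study

# Toy label table; in the real data each row is one cell in the reference.
labels = pd.DataFrame({
    "cell_id": [f"cell_{i}" for i in range(5000)],
    "cell_type": (["Epithelial"] * 2000 + ["T_cell"] * 1500
                  + ["B_cell"] * 800 + ["Myeloid"] * 700),
})

# Shuffle once, then keep at most MAX_PER_TYPE cells of each type;
# types with fewer cells are kept in full.
subsampled = (
    labels.sample(frac=1, random_state=0)
          .groupby("cell_type")
          .head(MAX_PER_TYPE)
)
```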
DWLS
Data:
- scRNA_ref.txt: the single-cell reference matrix
- single_cell_labels.csv: cell labels. Matches the order in which cells are listed in the single-cell reference
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_dwls_1.pbs and 1_run_dwls_2.pbs files
- Run method as pbs job. We restructured the original DWLS into 2 stages: 1_run_dwls_1.pbs and 1_run_dwls_2.pbs, using the src files 1_run_dwls_1.R and 1_run_dwls_2.R respectively.
- 1_run_dwls_1.pbs runs differential expression analysis using either Seurat or MAST and generates cell-type-specific .RData files
- 1_run_dwls_2.R collects differentially expressed genes produced in the previous step and conducts deconvolution.
We also provide the pbs script 1_run_dwls_all.pbs as an alternative, should any users prefer to run DWLS in a single stage.
EPIC
Data:
EPIC is the only method that requires a pre-built signature matrix instead of single-cell reference. We provided EPIC with the signature matrix that CBX generated during its execution. This is the reason why you see data files for EPIC in the cbx_sig_matrix folder.
- reference_profiles.csv: signature matrix that CBX produced
- marker_gene_symbols.csv: marker genes among all genes in the signature matrix. EPIC weights these genes higher than non-marker genes. In our study this is merely a procedural necessity, as we list all genes in the signature matrix as marker genes.
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_epic.pbs file
- Run method as pbs job
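Listing every signature gene as a "marker" is a couple of lines of bookkeeping. The snippet below only illustrates that step; the toy gene names and matrix layout are assumptions:

```python
import pandas as pd

# Toy CBX-style signature matrix: rows are genes, columns are cell types.
reference_profiles = pd.DataFrame(
    {"T_cell": [5.0, 0.1, 2.0], "B_cell": [0.2, 6.0, 1.5]},
    index=["CD3D", "MS4A1", "PTPRC"],
)

# Declaring every signature gene a marker satisfies EPIC's input format
# without actually up-weighting any subset of genes.
marker_gene_symbols = list(reference_profiles.index)
```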
hspe
Data:
- scRNA_ref.csv: the single-cell reference matrix
- pure_samples.json: cell labels in a dictionary. Matches the order in which cells are listed in the single-cell reference
- logged_test_counts_*_pur_lvl.txt: as mentioned above, hspe requires the bulk expressions to be logged.
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_hspe.pbs file
- Run method as pbs job
MuSiC
Data:
- scRNA_ref.csv: the single-cell reference matrix
- pheno.csv: cell labels and patient ids of cells in the single-cell reference. Matches the order in which cells are listed in the single-cell reference
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR and SRC_DIR paths in 1_run_music.pbs file
- Run method as pbs job
Scaden
Data:
- train_counts.h5ad: Scaden is the only method that requires simulated bulk mixtures as training data. It also requires these mixtures to be compressed into an AnnData (.h5ad) object.
- test_counts_*_pur_lvl.txt: as mentioned above, Scaden requires the bulk expressions to be in its directory.
Running:
- Replace user email in PBS -M option
- Modify WORK_DIR in 1_run_scaden.pbs file
- Copy test_counts_<PUR_LVL>_pur_lvl.txt files from test/ folder into scaden/ folder
NOTE 1:
scaden is built using the tensorflow library and requires cuda 10.1 to run. In this study, we installed scaden and cuda 10.1 in 2 separate environments, reflected in the .pbs script as:
```
source /software/scaden/scaden-0.9.4-venv/bin/activate
module load gpu/cuda/10.1
```
This means that without the CUDA 10.1 driver, Scaden will default to using CPUs instead of GPUs, which takes significantly longer.
NOTE 2:
The Scaden method is a neural network and therefore produces slightly different results when trained from scratch. However, results from a newly trained model should be similar to results generated by the pre-trained models.
For testing Scaden pre-trained models:
- Edit the 01_run_scaden.pbs script to point to data/pretrained_scaden/ instead of data/scaden/
- Copy test_counts_<PUR_LVL>_pur_lvl.txt files from test/ folder into pretrained_scaden/ folder
- Comment out scaden train in 01_run_scaden.pbs. The scaden process step should produce processed data files identical to those in data/scaden
- Run 01_run_scaden.pbs
Collect test results
Once all methods have been executed, you can collect results from all methods using the 2_collate_test_results.ipynb Jupyter notebook in each experiment directory.
Simply update the prefix variable at the top of each notebook with the correct path to the experiment folder and run each notebook from top to bottom. After that, all methods' results will be collated into the data/results folder.
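Conceptually, the collation step boils down to reading each method's output CSV, tagging it with the method name, and concatenating. The pandas sketch below does this on toy files; the names predictions.csv and all_methods.csv are hypothetical (the notebook defines the real file names and layout):

```python
from pathlib import Path
import pandas as pd

# Stand-ins for per-method outputs; real files live under each experiment's
# data/ directory (data/bisque/, data/epic/, ...).
prefix = Path("toy_experiment/data")
for method in ["bisque", "epic", "music"]:
    method_dir = prefix / method
    method_dir.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(
        {"sample": ["s1", "s2"], "T_cell": [0.6, 0.3], "B_cell": [0.4, 0.7]}
    ).to_csv(method_dir / "predictions.csv", index=False)

# Collate every method's predictions into a single table under data/results/.
results_dir = prefix / "results"
results_dir.mkdir(exist_ok=True)
frames = []
for csv_path in sorted(prefix.glob("*/predictions.csv")):
    df = pd.read_csv(csv_path)
    df.insert(0, "method", csv_path.parent.name)  # tag rows with method name
    frames.append(df)
collated = pd.concat(frames, ignore_index=True)
collated.to_csv(results_dir / "all_methods.csv", index=False)
```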
In addition, you can also find all results from this study in data/expected_results/. The expected_results/ sub-folder of each experiment should contain:
```
└── data/
    └── expected_results/
        ├── bisque.csv
        ├── bprism.csv
        ├── cbx.csv
        ├── cpm.csv
        ├── dwls.csv
        ├── epic.csv
        ├── hspe.csv
        ├── music.csv
        ├── scaden.csv
        └── truth.csv
```
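With the per-method CSVs and truth.csv side by side, accuracy can be summarised by comparing predicted and true cell-type fractions. The sketch below uses RMSE on toy tables purely as an illustration; the notebooks in this repository define the actual metrics used in the study:

```python
import numpy as np
import pandas as pd

# Toy fraction tables shaped like the expected-results CSVs:
# one row per simulated mixture, one column per cell type.
truth = pd.DataFrame(
    {"T_cell": [0.5, 0.2], "B_cell": [0.5, 0.8]}, index=["mix_1", "mix_2"]
)
predicted = pd.DataFrame(
    {"T_cell": [0.45, 0.25], "B_cell": [0.55, 0.75]}, index=["mix_1", "mix_2"]
)

def rmse(pred: pd.DataFrame, true: pd.DataFrame) -> float:
    """Root-mean-square error over all mixtures and cell types."""
    aligned = pred.loc[true.index, true.columns]  # match row/column order
    return float(np.sqrt(((aligned - true) ** 2).to_numpy().mean()))

score = rmse(predicted, truth)
```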
Owner
- Name: MedicalGenomicsLab
- Login: MedicalGenomicsLab
- Kind: organization
- Repositories: 1
- Profile: https://github.com/MedicalGenomicsLab
Citation (CITATION.cff)
```yaml
cff-version: 2.0.0
message: "Please cite this code as follows."
authors:
  - family-names: "Tran"
    given-names: "Khoa A."
  - family-names: "Addala"
    given-names: "Venkateswar"
  - family-names: "Johnston"
    given-names: "Rebecca L."
  - family-names: "Lovell"
    given-names: "David"
  - family-names: "Bradley"
    given-names: "Andrew"
  - family-names: "Koufariotis"
    given-names: "Lambros T."
  - family-names: "Wood"
    given-names: "Scott"
  - family-names: "Wu"
    given-names: "Sunny Z."
  - family-names: "Roden"
    given-names: "Dan"
  - family-names: "Al-Eryani"
    given-names: "Ghamdan"
  - family-names: "Swarbrick"
    given-names: "Alexander"
  - family-names: "Williams"
    given-names: "Elizabeth D."
  - family-names: "Pearson"
    given-names: "John V."
  - family-names: "Kondrashova"
    given-names: "Olga"
  - family-names: "Waddell"
    given-names: "Nicola"
title: "Performance of tumour microenvironment deconvolution methods in breast cancer using single-cell simulated bulk mixtures"
version: 2.0.0
doi: 10.5281/zenodo.8237541
date-released: 2023-08-11
url: "https://github.com/MedicalGenomicsLab/deconvolution_benchmarking"
```
GitHub Events
Total
- Issues event: 1
- Watch event: 13
- Issue comment event: 1
Last Year
- Issues event: 1
- Watch event: 13
- Issue comment event: 1