https://github.com/broadinstitute/celligner_ms

Code related to the Celligner manuscript

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to biorxiv.org
✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
✓ Institutional organization owner: Organization broadinstitute has institutional domain (www.broadinstitute.org)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary

Last synced: 7 months ago

Repository

Basic Info

Host: GitHub
Owner: broadinstitute
Language: R
Default Branch: master
Size: 156 KB

Statistics

Stars: 46
Watchers: 7
Forks: 22
Open Issues: 4
Releases: 2

Created about 6 years ago · Last pushed over 5 years ago

Metadata Files

Readme

Celligner_ms

This repo contains code associated with the manuscript describing Celligner, our method for aligning tumor and cell line transcriptional profiles.

Data

The data associated with this analysis are available from public data repositories, from supplementary data files associated with the manuscript (https://www.biorxiv.org/content/10.1101/2020.03.25.008342v1.supplementary-material), and on Figshare: https://figshare.com/articles/Celligner_data/11965269.

The cell line data used as input can be found at depmap.org (the file is DepMap Public 19Q4 CCLE_expression_full.csv).

The tumor data used as input are from the Treehouse dataset, available here: https://xenabrowser.net/datapages/?dataset=TumorCompendium_v10_PolyA_hugo_log2tpm_58581genes_2019-07-25.tsv&host=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Organization of repo

The code is organized into config files, helper functions, and analysis/figure generation scripts.

NOTE: The functions can be run using data available with the manuscript and data from publicly available resources (primarily depmap.org and Xena Browser).

Configs

global_params.R: Define global params shared across analysis scripts. Includes parameters used to run Celligner alignment and parameters used for creating plots.

Helper functions

analysis_helpers.R: Define helper functions used throughout the analysis and creation of figures.
Celligner_helpers.R: Define helper functions used for the Celligner alignment method.

Analysis/fig-gen

Celligner_methods.R: Functions to run the individual stages and the entire Celligner alignment method.
There are separate scripts for each of the main and supplementary figure panels within the manuscript.

Running Celligner

This method was run on macOS High Sierra v10.13.6 using RStudio version 1.2.5033 and R version 3.6.2 (2019-12-12).

Download/clone the repo

Download or clone this repo and open it as a new project in RStudio.

Install dependencies:

Use the install_packages.R script to install the necessary R packages to run the methods.

Download the necessary data:

Data files should be stored in the directory passed to run_Celligner(). Four files are needed to run Celligner. By default they are named:
- TCGA_mat.tsv
- CCLE_mat.csv
- Celligner_info.csv
- hgnc_complete_set_7.24.2018.txt

TCGA_mat.tsv is the matrix of log2(TPM+1) expression values for the tumor samples. The file used in the paper can be downloaded from Xena Browser: https://xenabrowser.net/datapages/?dataset=TumorCompendium_v10_PolyA_hugo_log2tpm_58581genes_2019-07-25.tsv&host=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443 (this file should be renamed TCGA_mat.tsv to use the default naming).

CCLE_mat.csv is the matrix of log2(TPM+1) expression values for the cell line samples. The file used in the paper is the DepMap Public 19Q4 'CCLE_expression_full.csv' file, which can be downloaded from depmap.org: https://depmap.org/portal/download (this file should be renamed CCLE_mat.csv to use the default naming).

Celligner_info.csv is a matrix of sample info, which can be downloaded from the Figshare repo here: https://figshare.com/articles/Celligner_data/11965269. This file contains the sample names for the tumors and cell lines, as well as information such as the cancer lineage, subtype, primary vs. metastatic status, and tumor purity of the samples. These features are used for plotting the data, but not for the Celligner method itself. If this file is not provided, then a default matrix will be created using the row names of TCGA_mat and CCLE_mat as the sampleIDs.

hgnc_complete_set_7.24.2018.txt is a table of gene IDs, and is used to convert between HGNC gene IDs and Ensembl IDs. The version of this table used in the paper can be downloaded from the Figshare repo here: https://figshare.com/articles/Celligner_data/11965269. This file was downloaded from HGNC, and the latest version of the file can be downloaded from here: ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt (using this version will change the genes used).
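Before starting a long run, it can help to confirm that the data directory holds the expected files. The following is a minimal sketch in base R, assuming the default file names; the helper missing_celligner_inputs() is hypothetical and is not part of this repo.

```r
# Default input file names (assumed to match the defaults described above).
# Celligner_info.csv is described as optional, but it is listed here so
# that its absence is at least reported.
expected_files <- c(
  "TCGA_mat.tsv",
  "CCLE_mat.csv",
  "Celligner_info.csv",
  "hgnc_complete_set_7.24.2018.txt"
)

# Hypothetical helper (not part of the repo): returns the names of any
# expected files that are absent from data_dir.
missing_celligner_inputs <- function(data_dir, files = expected_files) {
  files[!file.exists(file.path(data_dir, files))]
}

# Example: a freshly created, empty temporary directory lacks every file.
demo_dir <- tempfile("celligner_data_")
dir.create(demo_dir)
missing_celligner_inputs(demo_dir)
```

An empty result (character(0)) means every expected file was found.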

Running the method:

The run_Celligner() method (found in Celligner_methods.R) combines all steps of the Celligner method. It loads the data, finds differentially expressed genes, runs contrastive principal components analysis, runs mutual nearest neighbors batch correction, and creates a Seurat object containing the aligned data and a 2D UMAP projection of the aligned data.
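To give a feel for the contrastive principal components analysis step, here is a generic sketch in base R. It is not the code in Celligner_methods.R: it assumes the common formulation in which contrastive components are eigenvectors of the target covariance minus alpha times the background covariance. The function name contrastive_pcs(), the alpha parameter, and the random example data are all assumptions made for illustration.

```r
# Illustrative contrastive PCA (not the repo's implementation).
# target, background: sample x feature numeric matrices with the same columns.
contrastive_pcs <- function(target, background, n_pcs = 4, alpha = 1) {
  stopifnot(ncol(target) == ncol(background), n_pcs <= ncol(target))
  # Contrastive covariance: variance enriched in target relative to background.
  cov_diff <- stats::cov(target) - alpha * stats::cov(background)
  # eigen() with symmetric = TRUE returns eigenvalues in decreasing order.
  eig <- eigen(cov_diff, symmetric = TRUE)
  keep <- seq_len(n_pcs)
  list(values = eig$values[keep],
       vectors = eig$vectors[, keep, drop = FALSE])
}

# Example on random data: 6 features, keep the top 4 components.
set.seed(1)
target <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
background <- matrix(rnorm(150 * 6), nrow = 150, ncol = 6)
cpcs <- contrastive_pcs(target, background, n_pcs = 4)
dim(cpcs$vectors)
```

This sketch computes the full eigendecomposition and then keeps the leading components, which is simple but scales poorly with the number of genes.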

Running run_Celligner() with default parameters takes less than 1 hour, while running with the exact parameters used in the manuscript (setting the global parameter fast_cPCA to NULL) can take about 10 hours.
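The mutual nearest neighbors (MNN) step named in the method summary relies on pairs of samples, one from each dataset, that are each among the other's nearest neighbors. The base R sketch below shows only that pair-finding idea with Euclidean distance. It is not the repo's implementation, it does not perform the correction itself, and the function name find_mnn_pairs() and the toy coordinates are assumptions.

```r
# Illustrative MNN pair finding (not the repo's implementation).
# x, y: sample x feature numeric matrices with the same columns.
find_mnn_pairs <- function(x, y, k = 1) {
  stopifnot(ncol(x) == ncol(y), k >= 1, k <= nrow(x), k <= nrow(y))
  n_x <- nrow(x)
  n_y <- nrow(y)
  # Cross-distance block: rows index samples of x, columns samples of y.
  all_dist <- as.matrix(stats::dist(rbind(x, y)))
  d <- all_dist[seq_len(n_x), n_x + seq_len(n_y), drop = FALSE]

  # Mark, for each x sample, its k nearest y samples.
  x_to_y <- matrix(FALSE, n_x, n_y)
  for (i in seq_len(n_x)) x_to_y[i, order(d[i, ])[seq_len(k)]] <- TRUE
  # Mark, for each y sample, its k nearest x samples.
  y_to_x <- matrix(FALSE, n_x, n_y)
  for (j in seq_len(n_y)) y_to_x[order(d[, j])[seq_len(k)], j] <- TRUE

  # A pair is mutual when it is marked in both directions.
  pairs <- which(x_to_y & y_to_x, arr.ind = TRUE)
  data.frame(x_index = pairs[, "row"], y_index = pairs[, "col"],
             row.names = NULL)
}

# Toy example: the third y sample is nobody's nearest neighbor when k = 1.
x <- rbind(c(0, 0), c(10, 10))
y <- rbind(c(0.1, 0), c(10, 10.2), c(4, 4))
find_mnn_pairs(x, y, k = 1)
```

Raising k admits more pairs, at the cost of pairing samples that are less similar.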

Using the output:

run_Celligner() outputs a Seurat object (named comb_obj), which is used to package the data and run dimensionality reduction methods. To learn more about Seurat, see https://satijalab.org/seurat/. To access information in the Seurat object, use these commands:
- To get the Celligner-aligned output: Seurat::GetAssayData(comb_obj)
- To get the metadata: comb_obj@meta.data
- To get the coordinates for the 2D UMAP projection: Seurat::Embeddings(comb_obj, reduction = 'umap')
- To use Seurat to plot the results (colored by cancer lineage): Seurat::DimPlot(comb_obj, reduction = 'umap', group.by = 'lineage', pt.size = 0.5) + ggplot2::theme(legend.position = 'none')

Notes:

By default the global parameter fast_cPCA is set to 10, which reduces the time needed to calculate the contrastive principal components (cPCs) by estimating only the top cPCs needed for the method. By default the method uses only the top four cPCs, so fast_cPCA can be set to any number >= 4. To recreate the exact output from the paper, set fast_cPCA to NULL so that all cPCs are calculated. This is quite slow.
If using your own data (not the data recommended above), you will need to write your own load_data method. Later methods assume the following:
- TCGA_mat is a sample x gene matrix whose rows are the tumor sample IDs and whose columns are Ensembl gene IDs.
- CCLE_mat is a sample x gene matrix whose rows are the cell line sample IDs and whose columns are Ensembl gene IDs.
- The TCGA_ann and CCLE_ann matrices output by load_data have the columns sampleID, lineage, subtype, and Primary/Metastasis. These columns are used only for plotting the results, not for the method. sampleID must match the row names of TCGA_mat and CCLE_mat, but the other columns can be set to NA without affecting the results.
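When writing a custom load_data method, a quick structural check of its output can catch mismatches early. The sketch below is a hypothetical base R helper (check_celligner_inputs() is not part of the repo) that tests the assumptions listed above for one expression matrix and its annotation table. The sample IDs are made up, and the Ensembl-style column names are placeholders.

```r
# Hypothetical checker (not part of the repo): returns a character vector
# describing problems, or character(0) if the inputs look consistent.
check_celligner_inputs <- function(mat, ann) {
  required_cols <- c("sampleID", "lineage", "subtype", "Primary/Metastasis")
  problems <- character(0)
  if (is.null(rownames(mat)) || is.null(colnames(mat))) {
    problems <- c(problems,
                  "matrix needs sample IDs as row names and gene IDs as column names")
  }
  missing_cols <- setdiff(required_cols, colnames(ann))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("annotation is missing columns:",
                        paste(missing_cols, collapse = ", ")))
  }
  if ("sampleID" %in% colnames(ann) &&
      !setequal(ann$sampleID, rownames(mat))) {
    problems <- c(problems, "sampleID values do not match the matrix row names")
  }
  problems
}

# Toy example: two samples, two genes, plotting columns left as NA.
mat <- matrix(0, nrow = 2, ncol = 2,
              dimnames = list(c("s1", "s2"),
                              c("ENSG00000141510", "ENSG00000012048")))
ann <- data.frame(sampleID = c("s1", "s2"), lineage = NA, subtype = NA,
                  `Primary/Metastasis` = NA, check.names = FALSE)
check_celligner_inputs(mat, ann)
```

Note the check.names = FALSE argument, which keeps the column name Primary/Metastasis from being rewritten by data.frame().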

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Watch event: 3
Fork event: 1

Last Year

Watch event: 3
Fork event: 1

Committers

Last synced: 10 months ago

All Time

Total Commits: 29
Total Committers: 1
Avg Commits per committer: 29.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Allie Warren	a**n@b**g	29

Committer Domains (Top 20 + Academic)

broadinstitute.org: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 9
Total pull requests: 0
Average time to close issues: 2 months
Average time to close pull requests: N/A
Total issue authors: 7
Total pull request authors: 0
Average comments per issue: 1.22
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

zenglongjin (2)
PietroD (2)
zhenzhenyang-psu (1)
Optimistix (1)
ghost (1)
levinyi (1)
p-smirnov (1)
