https://github.com/broadinstitute/single_cell_classification

Methods to use SNPs or gene expression to classify single cell RNAseq to reference profiles


Repository

Basic Info

Host: GitHub
Owner: broadinstitute
Language: R
Default Branch: master
Size: 6.65 MB

Statistics

Stars: 29
Watchers: 3
Forks: 8
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed about 6 years ago

Metadata Files

Readme

single_cell_classification

Methods to use SNPs or gene expression to classify single cell RNAseq to reference profiles

These methods were used in the paper: 'Multiplexed single-cell profiling of post-perturbation transcriptional responses to define cancer vulnerabilities and therapeutic mechanism of action'

Running single_cell_classification:

This method was run on macOS High Sierra v10.13.6 using RStudio version 1.2.5033 and R version 3.6.2 (2019-12-12).

Clone or download the git repo.
Install the R package dependencies using the install_packages.R script.
Open single_cell_classification as your project root directory.
To run classifications, use source(here::here('src', 'run_SNP_classification.R')); run_SNP_classification()
To run the QC methods, use source(here::here('src', 'run_QC.R')); run_all_QC()

run_SNP_classification() and run_all_QC() read data from the 'data' folder within the single_cell_classification folder. A test data set of 588 cells originating from 5 cell lines is included. It contains the cell barcodes, gene list, and expression matrix, the SNP ref and alt allele count matrices for the single cell data, and the SNP ref and alt allele count matrices for the 5 reference profiles. On the test data, run_SNP_classification() and run_all_QC() each take <1 min to run.

Files needed:

These scripts are intended to run on 10X single cell RNAseq processed using Cell Ranger (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger).

For run_SNP_classification, these files are needed:

bulk_ref.csv
bulk_alt.csv
sc_ref.csv
sc_alt.csv
barcodes.tsv
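
Before running the classification, it can help to confirm that all five inputs are present. The following is a minimal sketch in base R, assuming the files sit directly in one folder; `missing_inputs` is a hypothetical helper and is not part of the repository.

```r
# Hypothetical helper: report which required input files are absent from a folder.
snp_inputs <- c("bulk_ref.csv", "bulk_alt.csv", "sc_ref.csv", "sc_alt.csv", "barcodes.tsv")

missing_inputs <- function(data_dir, required = snp_inputs) {
  required[!file.exists(file.path(data_dir, required))]
}

# Demonstration on a temporary folder that holds only two of the five files.
demo_dir <- tempfile("data_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("bulk_ref.csv", "sc_ref.csv")))
print(missing_inputs(demo_dir))
```

An empty result means every required file was found.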

Matrices of reference and alternate allele counts at a given set of SNP sites are needed for both the reference cell lines and the single cell data. The methods assume that the ref and alt matrices are samples x SNP sites. For bulk RNAseq, freebayes (https://github.com/ekg/freebayes) was run with forced calling at a set of 100,000 SNP sites, and the freebayes vcf files for each sample were then combined using load_data_helpers.R/combine_bulk_reference_profiles to create ref and alt allele count matrices. For the single cell RNAseq, ref and alt allele count matrices were produced using scAlleleCount (https://github.com/barkasn/scAlleleCount). The single cell barcodes, as output by cellranger count, are also required.
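
To make the samples x SNP sites layout concrete, here is a small sketch with made-up counts. The sample names, site names, and the alternate allele fraction calculation are illustrative assumptions, not the repository's code.

```r
# Toy ref/alt allele count matrices laid out as samples x SNP sites.
# All values and names are made up for illustration.
ref <- matrix(c(10, 0, 3,
                 0, 8, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("cell_line_A", "cell_line_B"),
                              c("snp1", "snp2", "snp3")))
alt <- matrix(c(0, 5, 3,
                7, 0, 0),
              nrow = 2, byrow = TRUE,
              dimnames = dimnames(ref))

# The ref and alt matrices must describe the same samples and the same sites.
stopifnot(identical(dimnames(ref), dimnames(alt)))

# Alternate allele fraction per sample and site; sites with no reads become NA.
depth <- ref + alt
alt_fraction <- ifelse(depth > 0, alt / depth, NA_real_)
print(alt_fraction)
```

Sites without coverage carry no information about a sample's genotype, which is why they are marked NA here rather than 0.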

To run the QC scripts or gene_expression_classification, these files are needed. The first three are output by the cellranger count method:

matrix.mtx
genes.tsv
barcodes.tsv
classifications.csv - which is the output of run_SNP_classification()
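
As a rough illustration of how the three cellranger count files fit together, the sketch below writes a tiny stand-in data set to a temporary folder and reads it back with the Matrix package. The gene IDs, symbols, and barcodes are made up, and the repository's own loading code may differ.

```r
library(Matrix)

# Write a tiny stand-in for Cell Ranger output so the sketch runs on its own.
demo_dir <- tempfile("cellranger_")
dir.create(demo_dir)
toy <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2), x = c(4, 1, 2), dims = c(3, 2))
writeMM(toy, file.path(demo_dir, "matrix.mtx"))
writeLines(c("ENSG1\tGENE1", "ENSG2\tGENE2", "ENSG3\tGENE3"),
           file.path(demo_dir, "genes.tsv"))
writeLines(c("AAAC-1", "AAAG-1"), file.path(demo_dir, "barcodes.tsv"))

# Read the three files and label the matrix: rows are genes, columns are cells.
counts <- readMM(file.path(demo_dir, "matrix.mtx"))
genes <- read.delim(file.path(demo_dir, "genes.tsv"), header = FALSE,
                    stringsAsFactors = FALSE)
barcodes <- readLines(file.path(demo_dir, "barcodes.tsv"))

# The row and column counts must agree with the gene and barcode lists.
stopifnot(nrow(counts) == nrow(genes), ncol(counts) == length(barcodes))
dimnames(counts) <- list(genes[[2]], barcodes)
print(counts)
```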

Output:

run_SNP_classification, using defaults, will output a matrix containing the reference sample classifications for each cell, as well as other classification metrics, such as:

singlet_ID : the most likely reference sample classification for that cell
singlet_dev : fraction of deviance explained by the top reference sample
singlet_dev_z : (z-scored) fraction of deviance explained by the top reference sample
singlet_margin : difference between the fraction of deviance explained by the most likely reference sample and the second most likely reference sample
singlet_z_margin : difference between the (z-scored) fraction of deviance explained by the most likely reference sample and the second most likely reference sample
doublet_z_margin : difference between the (z-scored) fraction of deviance explained by the second most likely reference sample and the third most likely reference sample
doublet_dev_imp : the difference between the fraction of deviance explained by the doublet model and the fraction of deviance explained by the singlet model (a measure of whether the cell is a doublet or a singlet)
doublet_CL1 : the most likely reference cell line if this cell is a doublet
doublet_CL2 : the next most likely reference cell line if this cell is a doublet
tot_reads : total reads in this cell at the SNP sites used for classification
num_SNPs : number of SNPs detected in this cell of the SNP sites used for classification
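
The margin-type metrics can be illustrated with a toy example for a single cell. This sketch assumes that the fraction of deviance explained has already been computed for each reference sample, and it assumes that z-scoring means standardizing across the reference samples. The values are made up, and the repository's implementation may differ in detail.

```r
# Toy fraction of deviance explained by each reference sample for one cell.
dev_explained <- c(line_A = 0.62, line_B = 0.10, line_C = 0.08,
                   line_D = 0.05, line_E = 0.05)

# Rank the reference samples from best to worst fit.
ranked <- sort(dev_explained, decreasing = TRUE)
singlet_ID <- names(ranked)[1]
singlet_dev <- unname(ranked[1])
singlet_margin <- unname(ranked[1] - ranked[2])

# Assumption: z-scoring standardizes the values across the reference samples.
z <- (dev_explained - mean(dev_explained)) / sd(dev_explained)
ranked_z <- sort(z, decreasing = TRUE)
z_margin_top <- unname(ranked_z[1] - ranked_z[2])

print(c(singlet_dev = singlet_dev, singlet_margin = singlet_margin,
        z_margin_top = z_margin_top))
```

A large margin indicates that one reference sample explains the cell's SNP profile much better than any other, which is what a confident singlet call looks like.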

run_QC outputs a Seurat object that includes a cell quality classification for each cell (in the meta.data object). Each cell is classified as one of:

normal : cells that are used for downstream analysis
doublet : cell is more likely a multiplet; discarded for downstream analysis
low quality : cell is low quality in terms of RNAseq quality metrics or in terms of our ability to classify it as one of the reference samples; discarded for downstream analysis
empty droplet : cells with distinct gene expression profiles whose SNP profiles did not match any single reference cell line (or pairwise combination of cell lines) but instead resembled a mixture of SNPs from all the in-pool cell lines, suggesting that these are empty droplets containing ambient mRNA from the pool; discarded for downstream analysis
low confidence : cells that could not be confidently classified as any of the reference samples; discarded for downstream analysis
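
Downstream analysis keeps only the normal cells. The sketch below shows that filtering step on a toy data frame standing in for the Seurat meta.data. The column name `cell_quality` and the label spellings are hypothetical, so check the actual meta.data column names before adapting it.

```r
# Toy stand-in for the per-cell metadata; column names and labels are assumptions.
meta <- data.frame(
  barcode = c("AAAC-1", "AAAG-1", "AACA-1", "AACC-1"),
  singlet_ID = c("line_A", "line_B", "line_A", NA),
  cell_quality = c("normal", "doublet", "normal", "low_quality"),
  stringsAsFactors = FALSE
)

# Count cells in each quality class.
print(table(meta$cell_quality))

# Keep only the cells labelled normal, then count them per reference sample.
normal_cells <- meta[meta$cell_quality == "normal", ]
print(table(normal_cells$singlet_ID))
```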

gene_expression_classification can be used for comparison to SNP-based classifications, but is not recommended for primary classification.

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Watch event: 3

Last Year

Watch event: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: 12 days
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.5
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
