sckb

Use kallisto and bustools to call the gene-cell matrix for both spliced and unspliced transcripts of scRNA-seq

https://github.com/hsu-che-wei/sckb

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (10.2%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Use kallisto and bustools to call the gene-cell matrix for both spliced and unspliced transcripts of scRNA-seq

Basic Info

Host: GitHub
Owner: Hsu-Che-Wei
License: mit
Language: Shell
Default Branch: master
Size: 104 MB

Statistics

Stars: 4
Watchers: 2
Forks: 4
Open Issues: 1
Releases: 0

Created over 6 years ago · Last pushed about 4 years ago

Metadata Files

Readme License Citation

scKB

Use kallisto and bustools to call the gene-by-cell matrix for both spliced and unspliced transcripts of scRNA-seq

Download

In Linux:

``` git clone https://github.com/Hsu-Che-Wei/scKB.git

cd ./scKB

unzip whitelist files and gene annotation file

unzip 10xv2whitelist.txt.zip unzip 10xv3whitelist.txt.zip gunzip Arabidopsis_thaliana.TAIR10.43.gtf.gz

make scKB executable

chmod 777 scKB ```

Preparation

Install "kallisto" and "bustools" in your Unix/Linux environment :

The easiest way is through "conda". If you are new to conda, follow this nice tutorial. conda install -c bioconda bustools conda install -c bioconda kallisto It is recommended that you create a new conda environment for only the use of scKB : ``` # create the environment named scKB, which will have kallisto, bustools and R (version 3.6.1) installed conda env create -f scKB.yml

# activate the environment conda activate scKB

# start the R session R ``` 2. Install package "BUSpaRse" and "BSgenome" in your R environment (tested R version = 3.6.1) :

The two packages are hosted in "Bioconductor", make sure your are in R and type :

``` if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!require(devtools)) install.packages("devtools")

devtools::install_github("BUStools/BUSpaRse") BiocManager::install("BSgenome") ``` For macOS user, if you encounter any error when installing BUSpaRse, then pay this site a visit.

Prepare your genome file (.fasta) and annotation file (.gtf) :

You will need to convert your fasta into a BSgenome object. In case the genome you want to use is already available as a BSgenome object, simply install it from Bioconductor and load it :

In R :

``` # Install Arabidopsis TAIR genome BiocManager::install("BSgenome.Athaliana.TAIR.TAIR9")

# Install Rice MSU genome BiocManager::install("BSgenome.Osativa.MSU.MSU7")

# Load the genomes as BSgenome object library(BSgenome.Athaliana.TAIR.TAIR9) library(BSgenome.Osativa.MSU.MSU7)

# Check the genomes BSgenome.Athaliana.TAIR.TAIR9

# Arabidopsis genome: # organism: Arabidopsis thaliana (Arabidopsis) # provider: TAIR # provider version: TAIR9 # release date: June 9, 2009 # release name: TAIR9 Genome Release # 7 sequences: # Chr1 Chr2 Chr3 Chr4 Chr5 ChrM ChrC
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given # sequence)

BSgenome.Osativa.MSU.MSU7

# Rice genome: # organism: Oryza sativa (Rice) # provider: MSU # provider version: MSU7 # release date: October 31, 2011 # release name: MSU7 Genome Release # 16 sequences: # Chr1 Chr2 Chr3 Chr4 Chr5 Chr6 Chr7 Chr8 Chr9 Chr10 Chr11 Chr12 ChrM ChrC ChrUn ChrSy
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given # sequence) ``` Make sure that the name of chromosomes/sequence is exactly the same in annotation file (.gtf) and the BSgenome object!

If the genome is not available as BSgenome object or you have added new sequences manually to the genome in fasta format, then you will need to convert the fasta file into BSgenome object following this manual (check out section 2.2.5 and 2.3).

Extract the intron information and make kallisto index

An example showing how to extract intron information and prepare kallisto index for Arabidopsis singel cell sample prepared using "10X v3 chemistry" is provided here.

In R : ``` setwd("~/to/where/you/clone/the/repo/scKB")

# Load the package and genome library(BUSpaRse) library(BSgenome.Athaliana.TAIR.TAIR9)

# X = directory to your annotation file, L set to 91 if you are using 10X v3 chemistry, 98 if you are using 10X v2 chemistry getvelocityfiles(X = "./Arabidopsisthaliana.TAIR10.43.gtf", L = 91, Genome = BSgenome.Athaliana.TAIR.TAIR9, outpath = "./", isoformaction = "separate", chrsonly=FALSE, style="Ensembl")

# After running getvelocityfiles, the working directory should have files "cDNAintrons.fa", "cDNAtxtocapture.txt", "intronstxto_capture.txt" and "tr2g.tsv"

# Index the intron file with kallisto system("kallisto index -i ./cDNAintrons10xv3.idx ./cDNA_introns.fa")

quit() ```

scKB

In Unix/Linux environment, make sure that both "kallisto" and "bustools" can be found and executed.

Usage :

``` scKB [-h] | [-f fastqdir -i intronindexdir -d intermediatefilesouputdir -s singlecelltechniquesupported -t numberofthreadsused -w whitelistdir -n resultsoutput_dir]

-h (--help) = software usage -f (--file) = directory to fastq files -i (--index) = directory to intron index files (includes the .idx file itself) -d (--dirintermediate) = directory to intermediate files output -s (--supportedtech) = single cell technique supported

short name description

10xv1 10x version 1 chemistry 10xv2 10x version 2 chemistry 10xv3 10x version 3 chemistry CELSeq CEL-Seq CELSeq2 CEL-Seq version 2 DropSeq DropSeq inDrops inDrops SCRBSeq SCRB-Seq SureCell SureCell for ddSEQ

-t (--thread) = number of threads used -w (--whitelist) = directory to whitelist file (includes the text file itself) -n (--dir_final) = directory to final output that includes gene cell matrix ``` In Linux:

``` cd ~/to/where/you/clone/the/repo/scKB

run scKB, it is recommended to name the output directory as sample name using "-n" flag if one expect to use COPILOT with default parameters for downstream preprocessing

./scKB -f ./toydata -i ./cDNAintrons10xv3.idx -d ./ -s 10xv3 -t 8 -w ./10xv3whitelist.txt -n ./col0_toy

After running scKB, the output directory (in this case "./coltoy") should have 8 files: "inspect.json", "runinfo.json", "spliced.barcodes.txt", "spliced.genes.txt", "spliced.mtx", "unspliced.barcodes.txt", "unspliced.genes.txt" and "unspliced.mtx"

json files have the reads/aligning stats of the sample, mtx files are gene-by-cell matrices, txt files have cell barcodes and gene ids information

```

Caution!

1. Under the directory that you submit to -f, make sure that you have sequences of at least one run (I1 (optional), R1 and R2 gzipped fastq). Multiple runs of the same sample should be put under the same directory. Each run of scKB will call the gene-by-cell matrix for one sample.

2. The directory to intron file (see Preparation step 4), index file (-i), whitelist (-w) and intermediate files (-d) can be the same one, yet the final output directory (-n) should be different.

3. Only demonstration of processing 10X version 3 chemistry data is provided here, and only the whitelist files for 10X version 2 and version 3 are provided, yet, the techniques listed in usage are also supported.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

sckb

Science Score: 26.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

scKB

Download

unzip whitelist files and gene annotation file

make scKB executable

Preparation

scKB

run scKB, it is recommended to name the output directory as sample name using "-n" flag if one expect to use COPILOT with default parameters for downstream preprocessing

After running scKB, the output directory (in this case "./coltoy") should have 8 files: "inspect.json", "runinfo.json", "spliced.barcodes.txt", "spliced.genes.txt", "spliced.mtx", "unspliced.barcodes.txt", "unspliced.genes.txt" and "unspliced.mtx"

json files have the reads/aligning stats of the sample, mtx files are gene-by-cell matrices, txt files have cell barcodes and gene ids information

Owner

GitHub Events

Total

Last Year