https://github.com/brickmanlab/tapseq_workflow

A Snakemake workflow for TAP-seq data processing

Repository

Basic Info
  • Host: GitHub
  • Owner: brickmanlab
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 17.6 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of argschwind/TAPseq_workflow
Created over 3 years ago · Last pushed over 3 years ago

# TAP-seq workflow

To run the pipeline on DANGPU, see [DANGPU.md](DANGPU.md) instead.

This repository contains a [snakemake](https://snakemake.readthedocs.io/en/stable/index.html)
workflow that handles all data processing from fastq files to transcript counts and perturbation
status matrices from TAP-seq experiments.

The workflow can be downloaded by simply cloning the repository into a location of choice:

```bash
git clone https://github.com/argschwind/TAPseq_workflow.git
```

The workflow can be executed through snakemake and conda. This only requires that conda and
snakemake are installed. All other required dependencies will be installed through conda.

The data processing workflow consists of the following steps:

## 1. Create TAP-seq alignment references

TAP-seq uses custom alignment references with added CROP-seq vector transcripts to identify the
perturbation of each cell. The rules in `rules/create_alignment_refs.smk` create such alignment
references. The workflow supports references containing either the whole transcriptome or only
TAP-seq target gene loci. The type of an alignment reference is encoded in its name, which follows
the pattern **`species_type_suffix`**. A reference named `hg38_tapseq_ref` therefore specifies a
TAP-seq reference, i.e. one containing only target gene loci, based on the human genome hg38. `ref`
serves as a suffix to distinguish different references of the same type, e.g. references for
different perturbation or target gene pools.
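
As a rough illustration of the naming pattern, the sketch below splits a reference name into its
three parts. `parse_ref_name` and `AlignmentRef` are hypothetical names that are not part of the
workflow, and allowing underscores inside the suffix is an assumption made here.

```python
from typing import NamedTuple


class AlignmentRef(NamedTuple):
    species: str   # genome tag, e.g. "hg38"
    ref_type: str  # reference type, e.g. "tapseq"
    suffix: str    # label distinguishing references of the same type


def parse_ref_name(name: str) -> AlignmentRef:
    """Split a name of the form species_type_suffix (hypothetical helper)."""
    # Split on the first two underscores only, so the suffix may contain "_"
    # (an assumption, not something the workflow documents).
    parts = name.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'species_type_suffix', got {name!r}")
    return AlignmentRef(*parts)


ref = parse_ref_name("hg38_tapseq_ref")
print(ref.species, ref.ref_type, ref.suffix)  # prints: hg38 tapseq ref
```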

Alignment references for all samples are specified in the first part of the `config.yml` file.
Parameters for reference creation are also found in the config file. CROP-seq vector and target
gene lists can be provided via files in `meta_data/cropseq_vectors` and
`meta_data/target_gene_panels`. Which lists are used for each specified alignment reference is
defined in the config file under `step 1: create alignment reference`. See the example files and
references provided with the workflow for details.

This example workflow specifies two human alignment references with the same CROP-seq vectors, one
containing the whole transcriptome and one containing only the example TAP-seq target gene loci.
Alignment references for mouse, for example, could be created by changing or adding URLs for mm10
in `download_genome_annot` in `config.yml` (for whole transcriptome) and changing or adding the mm10
BSgenome object in `create_tapseq_ref` (for target genes only). Using the species tag in alignment
reference names enables creating references for multiple species or genomes within one project.

All alignment references used by the workflow can be created using the following command. The
`--sjdbOverhang` STAR parameter might have to be adjusted in the `create_genomedir` section in the
config file.

```bash
# create all alignment references defined in the config file (--jobs = number of threads to use in
# parallel, please adjust; -n = dryrun, remove it to execute)
snakemake --use-conda --jobs 2 alignment_references -n
```

Snakemake runs jobs in parallel, but the number of threads you can provide depends on your system.
By default STAR uses up to 5 threads, which can be adjusted in the `create_genomedir` and
`star align` sections of the config file. Providing 10 cores (`--jobs 10`) would therefore allow 2
STAR processes to run fully in parallel, each using 5 threads.
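
The arithmetic behind this can be sketched as follows. `star_parallelism` is a hypothetical helper,
and the sketch assumes the STAR thread count is declared through Snakemake's `threads` directive, in
which case Snakemake caps a job's threads at the number of cores provided.

```python
from typing import Tuple


def star_parallelism(cores: int, threads_per_job: int = 5) -> Tuple[int, int]:
    """Return (STAR jobs running at once, threads used by each job)."""
    if cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads_per_job must be positive")
    # A job cannot use more threads than the cores made available.
    threads = min(threads_per_job, cores)
    return cores // threads, threads


print(star_parallelism(10))  # prints: (2, 5)
print(star_parallelism(2))   # prints: (1, 2)
```

With `--jobs 2`, as in the example command, a single STAR process would run at a time under these
assumptions, using 2 threads.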

## 2. Align reads

Reads can be aligned to the created references using the snakemake rules in `rules/align_reads.smk`.
This is based on the [Drop-seq tools](http://mccarrolllab.org/dropseq/) workflow. Input paired-end
fastq files for every sample are specified under `samples` in the config file. This assumes that
files for each sample are located in one directory and follow the naming scheme
`prefix_sample_1_sequence.txt.gz` and `prefix_sample_2_sequence.txt.gz` for read 1 and read 2 files.
If your files follow a different naming scheme, adapt the `get_fastq_files` function in
`rules/align_reads.smk`.
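
To make the naming scheme concrete, here is a minimal sketch of how a read pair could be located
for one sample. `find_fastq_pair` is a hypothetical stand-in and not the workflow's actual
`get_fastq_files` implementation, which may differ.

```python
from pathlib import Path
from typing import Tuple


def find_fastq_pair(fastq_dir: str, sample: str) -> Tuple[Path, Path]:
    """Find the read 1 and read 2 files of one sample (hypothetical helper).

    Expects exactly one file per read matching
    <prefix>_<sample>_<read>_sequence.txt.gz inside fastq_dir.
    """
    directory = Path(fastq_dir)
    pair = []
    for read in (1, 2):
        pattern = f"*_{sample}_{read}_sequence.txt.gz"
        matches = sorted(directory.glob(pattern))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected one file matching {pattern!r} in {directory}, "
                f"found {len(matches)}"
            )
        pair.append(matches[0])
    return pair[0], pair[1]


# Example call (directory and sample name are placeholders):
# read1, read2 = find_fastq_pair("raw_data/sampleA", "sampleA")
```

Adapting to another naming scheme would then amount to changing the glob pattern.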

The cell number and alignment reference for each sample are specified by `cell_numbers` and
`align_ref` in the config file. Parameters for each workflow step can also be changed via the config
file.

Reads for all samples can be aligned by running the following command. This also creates an
alignment report for every sample in `results/alignment`.

```bash
# align reads for all samples
snakemake --use-conda --jobs 2 align_reads -n
```

## 3. Extract digital gene expression (DGE)

The last step of data processing consists of extracting transcript counts and the perturbation
status for each cell in every sample. The main outputs of this step are `dge.txt` and
`perturbation_status.txt` files for every sample, containing the transcript counts and detected
CROP-seq vector perturbations per cell. This step also applies chimeric read filtering as proposed
by [Dixit et al., 2016](https://www.biorxiv.org/content/10.1101/093237v1.full).
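
The idea behind this kind of filter can be sketched as follows: for every cell barcode and UMI,
compute the share of reads mapping to each gene (often called transcripts per transcript, TPT) and
drop assignments whose share is low, since they likely stem from chimeric PCR products. This is an
illustrative sketch only. `filter_chimeric`, the input layout and the 0.25 threshold are
assumptions, not the workflow's actual code or settings.

```python
from collections import Counter
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (cell barcode, UMI, gene)


def filter_chimeric(read_counts: Dict[Key, int], min_tpt: float = 0.25) -> Dict[Key, int]:
    """Keep (cell, UMI, gene) entries with a read share of at least min_tpt."""
    # Total reads per (cell barcode, UMI) across all genes.
    totals = Counter()
    for (cell, umi, _gene), n in read_counts.items():
        totals[(cell, umi)] += n
    # An entry survives if its gene accounts for enough of that UMI's reads.
    return {
        key: n
        for key, n in read_counts.items()
        if n > 0 and n / totals[(key[0], key[1])] >= min_tpt
    }


counts = {
    ("AAAC", "UMI1", "GATA1"): 9,  # 9 of 10 reads -> kept
    ("AAAC", "UMI1", "MYC"): 1,    # 1 of 10 reads -> dropped as likely chimeric
    ("AAAC", "UMI2", "MYC"): 4,    # 4 of 4 reads  -> kept
}
print(sorted(filter_chimeric(counts)))
# prints: [('AAAC', 'UMI1', 'GATA1'), ('AAAC', 'UMI2', 'MYC')]
```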

DGE data can be extracted for all samples using the following command, which also creates DGE
reports in `results/dge`.

```bash
snakemake --use-conda --jobs 2 extract_dge -n
```

## 4. Executing the whole workflow

The whole workflow can be executed for all samples at once using the snakemake "all" rule by simply
running:

```bash
snakemake --use-conda
```

Owner

  • Name: Brickman group
  • Login: brickmanlab
  • Kind: organization
  • Location: Copenhagen

Professor Joshua Brickman at Center for Stem Cell Medicine (reNEW), University of Copenhagen
