https://github.com/baerlachlan/smk-rnaseq-counts

Snakemake workflow for estimating read counts from RNA-seq data

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 9.6%)
Last synced: 6 months ago

Repository

Snakemake workflow for estimating read counts from RNA-seq data

Basic Info
  • Host: GitHub
  • Owner: baerlachlan
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 1.95 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 21
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
  • Readme
  • License

README.md

Snakemake workflow for estimating read counts from RNA-seq data

This Snakemake workflow implements the pre-processing steps to estimate gene- and transcript-level read counts from raw RNA-seq data.

Workflow summary

1. Raw data

Raw RNA-seq data is expected in FASTQ format. The raw data may reside in any location; however, the chosen location must be specified in config/config.yaml.
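As a purely hypothetical illustration of pointing the workflow at a raw-data location (the key name below is invented; the authoritative key names are documented in the comments of config/config.yaml):

```yaml
# Hypothetical sketch only - consult config/config.yaml for the
# real key names expected by this workflow.
fastq_dir: /data/my_project/raw_fastq
```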

2. Trim

Trimming is performed with fastp.

Module: trim.smk

  • Input: Raw data at chosen location (FASTQ)
  • Output: results/trim (FASTQ)

3. Merge (optional)

Merging involves concatenating files of the same sample. This step will only be performed when multiple sequencing units are specified for the same sample identifier in config/units.tsv. A common example for multiple sequencing units is when a sample is split across multiple lanes.

Module: merge.smk

  • Input: results/trim (FASTQ)
  • Output: results/merge (FASTQ)
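Conceptually, the merge step is a byte-level concatenation of each sample's per-unit FASTQ files (this is valid even for gzipped FASTQ, since gzip members concatenate into a single valid stream). A minimal sketch of the idea, not the workflow's actual rule:

```python
import shutil

def merge_units(unit_fastqs, merged_path):
    """Concatenate per-unit FASTQ files for one sample into a single file.

    Works on plain or gzipped FASTQ, because concatenated gzip members
    form a valid gzip stream.
    """
    with open(merged_path, "wb") as out:
        for fq in unit_fastqs:
            with open(fq, "rb") as src:
                shutil.copyfileobj(src, out)
```

In the actual workflow this only runs when config/units.tsv lists multiple sequencing units under one sample identifier.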

4. Align

Alignment to the genome is performed with STAR.

Module: align.smk

  • Input: results/trim and/or results/merge (FASTQ)
  • Output: results/align (BAM)

5. Deduplicate (optional)

Deduplication using Unique Molecular Identifiers (UMIs) and mapping position is performed with UMI-tools. This is currently only available for gene-level counts with featureCounts.

Module: deduplicate.smk

  • Input: results/align (BAM)
  • Output: results/deduplicate (BAM)
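For intuition, UMI-based deduplication collapses reads that share both a mapping position and a UMI. The sketch below is a drastically simplified model with invented field names; the real UMI-tools algorithm additionally clusters similar UMIs to absorb sequencing errors:

```python
def dedup_by_umi(reads):
    """Keep the best-mapping read per (chrom, pos, umi) key.

    Simplified illustration of positional UMI deduplication; UMI-tools
    also clusters UMIs by edit distance, which this sketch omits.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["umi"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())
```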

6. Gene-level counts (optional)

Summarisation of read counts to the gene-level is performed with featureCounts.

Module: featureCounts.smk

  • Input: results/align or results/deduplicate (BAM)
  • Output: results/featureCounts (TSV)
    • NOTE: by default, the workflow runs featureCounts three times on all samples, producing three subdirectories of results/featureCounts: unstranded, stranded and reverse. This allows strandedness to be inferred from the count summary statistics. If the strandedness of the library is already known, it can be specified in config/config.yaml, and only a single subdirectory will be produced.
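The strandedness inference described in the note can be pictured with a toy heuristic: compare the "Assigned" totals from the three featureCounts runs. The function and threshold below are illustrative assumptions, not the workflow's actual logic:

```python
def infer_strandedness(assigned):
    """Guess library strandedness from featureCounts 'Assigned' totals.

    assigned: dict with keys 'unstranded', 'stranded', 'reverse'.
    If the forward and reverse runs assign similar read counts, the
    library is likely unstranded; otherwise the direction assigning
    more reads wins. The 0.8 ratio threshold is an arbitrary choice
    for illustration.
    """
    fwd, rev = assigned["stranded"], assigned["reverse"]
    if min(fwd, rev) / max(fwd, rev) > 0.8:
        return "unstranded"
    return "stranded" if fwd > rev else "reverse"
```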

7. Transcript-level counts (optional)

Summarisation of read counts to the transcript-level is performed with Salmon.

Module: salmon.smk

  • Input: results/trim and/or results/merge (FASTQ)
  • Output: results/salmon (One directory per sample)

Other features

Quality control (optional)

Quality reports are produced with FastQC for raw, trimmed and aligned data.

Module: fastqc.smk

Reference files & indexing

Genome, transcriptome and annotation files are downloaded from Ensembl. The user may instead provide their own reference files by copying them into the resources/ directory, with filenames genome.fa, transcriptome.fa and annotation.gtf respectively. The workflow builds reference indices for both STAR and Salmon as required.

Module: refs.smk

Single- and paired-end data compatibility

Both single-end and paired-end data are compatible with this workflow. The data type is specified by how one configures the config/units.tsv file (see config/README.md).
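As a purely hypothetical illustration (the authoritative column names and layout are documented in config/README.md, not here), a units table for one paired-end sample split across two lanes might look like:

```
sample   unit  fq1                             fq2
sampleA  L001  reads/sampleA_L001_R1.fastq.gz  reads/sampleA_L001_R2.fastq.gz
sampleA  L002  reads/sampleA_L002_R1.fastq.gz  reads/sampleA_L002_R2.fastq.gz
```

In this sketch, single-end data would leave the second FASTQ column empty.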

Standardised usage

Standardised usage of this workflow is described in the Snakemake Workflow Catalog.

However, Snakemake standardised usage requires internet access which is commonly unavailable in an HPC environment. If the intention is to run the workflow in an offline environment, please see Recommended usage.

Recommended usage

For compatibility across environments, the source code of this workflow is available via Releases.

  1. Download and extract the workflow's latest release
  2. Follow the instructions in config/README.md to modify config/samples.tsv and config/units.tsv
  3. Follow the comments in config/config.yaml to configure the workflow parameters
  4. Use the example profile as a guide to fine-tune workflow-specific resource configuration
    • NOTE: the example profile has been designed for compatibility with my SLURM profile
  5. Execute the workflow: `snakemake`

Testing

Example data and configurations are available in the .test directory for testing this workflow. The example data is small, so the workflow can be executed quickly with the test workflow profile. To keep and examine intermediate files, specify the --notemp flag.

```bash
# Test paired-end
snakemake --configfile .test/config_pe/config.yaml --workflow-profile workflow/profiles/test --notemp

# or single-end
snakemake --configfile .test/config_se/config.yaml --workflow-profile workflow/profiles/test --notemp
```

Owner

  • Name: Lachlan Baer
  • Login: baerlachlan
  • Kind: user
  • Location: Adelaide, South Australia
  • Company: University of Adelaide

GitHub Events

Total
  • Release event: 11
  • Watch event: 1
  • Push event: 16
  • Fork event: 1
  • Create event: 11
Last Year
  • Release event: 11
  • Watch event: 1
  • Push event: 16
  • Fork event: 1
  • Create event: 11