https://github.com/baerlachlan/smk-rnaseq-counts
Snakemake workflow for estimating read counts from RNA-seq data
Statistics
- Stars: 0
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Releases: 21
Metadata Files
README.md
Snakemake workflow for estimating read counts from RNA-seq data
This Snakemake workflow implements the pre-processing steps to estimate gene- and transcript-level read counts from raw RNA-seq data.
Workflow summary
1. Raw data
Raw RNA-seq data is expected in FASTQ format.
The raw data may be stored in any location, but that location must be specified in config/config.yaml.
2. Trim
Trimming is performed with fastp.
Module: trim.smk
- Input: Raw data at chosen location (FASTQ)
- Output: results/trim (FASTQ)
3. Merge (optional)
Merging involves concatenating files of the same sample.
This step will only be performed when multiple sequencing units are specified for the same sample identifier in config/units.tsv.
A common example for multiple sequencing units is when a sample is split across multiple lanes.
Module: merge.smk
- Input: results/trim (FASTQ)
- Output: results/merge (FASTQ)
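As a rough illustration of what concatenating files of the same sample means in practice, the sketch below joins several gzipped FASTQ files with plain `cat`, which works because concatenated gzip members form a valid gzip stream. The helper name `merge_units` and the file names in the usage note are hypothetical; the rule actually implemented in merge.smk may differ.

```bash
# Hypothetical helper (not part of the workflow): concatenate all
# sequencing units of one sample into a single FASTQ file.
# Usage: merge_units <output> <unit1> [<unit2> ...]
merge_units() {
    local out="$1"
    shift
    # Create the output directory if it does not exist yet.
    mkdir -p "$(dirname "$out")"
    # Gzip files can be joined byte-for-byte; the result decompresses
    # to the inputs in the order given.
    cat "$@" > "$out"
}
```

For a sample split across two lanes, a call might look like `merge_units results/merge/sampleA_R1.fastq.gz results/trim/sampleA_L001_R1.fastq.gz results/trim/sampleA_L002_R1.fastq.gz`, with the unit order kept identical for R1 and R2 in paired-end data.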
4. Align
Alignment to the genome is performed with STAR.
Module: align.smk
- Input: results/trim and/or results/merge (FASTQ)
- Output: results/align (BAM)
5. Deduplicate (optional)
Deduplication using Unique Molecular Identifiers (UMIs) and mapping position is performed with UMI-tools.
This is currently only available for gene-level counts with featureCounts.
Module: deduplicate.smk
- Input: results/align (BAM)
- Output: results/deduplicate (BAM)
6. Gene-level counts (optional)
Summarisation of read counts to the gene-level is performed with featureCounts.
Module: featureCounts.smk
- Input: results/align or results/deduplicate (BAM)
- Output: results/featureCounts (TSV)
- NOTE: by default, the workflow will run featureCounts on all samples 3 times. It will produce 3 folders as subdirectories of results/featureCounts: unstranded, stranded and reverse. This is for the purpose of inferring strandedness from the count summary stats. If the strandedness of the library is already known, this can be specified in config/config.yaml, and only a single subdirectory will be produced.
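One possible way to read strandedness off the count summary stats is to compare the number of assigned reads between the stranded and reverse runs. The sketch below assumes the standard featureCounts summary layout (a tab-separated `Status` header row followed by rows such as `Assigned`); the function names, the summary file paths passed in, and the 0.8/0.2 cut-offs are illustrative assumptions, not part of the workflow.

```bash
# Hypothetical helper: total "Assigned" reads across all samples in a
# featureCounts summary file (tab-separated, one column per BAM).
assigned_reads() {
    awk -F '\t' '$1 == "Assigned" {
        sum = 0
        for (i = 2; i <= NF; i++) sum += $i
        print sum
    }' "$1"
}

# Hypothetical heuristic: compare assigned reads from the stranded and
# reverse runs. The 0.8 / 0.2 thresholds are assumptions for illustration.
# Usage: infer_strandedness <stranded.summary> <reverse.summary>
infer_strandedness() {
    local s r
    s=$(assigned_reads "$1")
    r=$(assigned_reads "$2")
    awk -v s="$s" -v r="$r" 'BEGIN {
        if (s + r == 0) { print "unknown"; exit }
        frac = s / (s + r)   # share of reads assigned in the stranded run
        if (frac > 0.8)      print "stranded"
        else if (frac < 0.2) print "reverse"
        else                 print "unstranded"
    }'
}
```

When the two runs assign a similar number of reads, the library is most likely unstranded; a strong imbalance points to the run with more assigned reads. Borderline values are worth inspecting by hand rather than trusting a fixed threshold.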
7. Transcript-level counts (optional)
Summarisation of read counts to the transcript-level is performed with Salmon.
Module: salmon.smk
- Input: results/trim and/or results/merge (FASTQ)
- Output: results/salmon (One directory per sample)
Other features
Quality control (optional)
Quality reports are produced with FastQC for raw, trimmed and aligned data.
Module: fastqc.smk
Reference files & indexing
Genome, transcriptome and annotation files are downloaded from Ensembl.
The user may provide their own reference files by copying them into the resources/ directory, with filenames genome.fa, transcriptome.fa and annotation.gtf respectively.
The workflow produces reference indices for both STAR and Salmon as required.
Module: refs.smk
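Staging custom reference files under the expected names could look like the sketch below. The helper name `stage_references` and the source paths in the usage note are hypothetical; only the destination filenames (genome.fa, transcriptome.fa, annotation.gtf) come from the description above.

```bash
# Hypothetical helper (not part of the workflow): copy user-supplied
# reference files into the resources directory under the expected names.
# Usage: stage_references <genome> <transcriptome> <annotation> [<dest>]
stage_references() {
    local genome="$1" transcriptome="$2" annotation="$3"
    local dest="${4:-resources}"   # defaults to resources/ in the current directory
    mkdir -p "$dest"
    cp "$genome" "$dest/genome.fa"
    cp "$transcriptome" "$dest/transcriptome.fa"
    cp "$annotation" "$dest/annotation.gtf"
}
```

Run from the workflow's root directory, a call such as `stage_references /path/to/my_genome.fa /path/to/my_transcripts.fa /path/to/my_annotation.gtf` would place the three files where the workflow is described as looking for them.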
Single and paired end data compatibility
Both single-end and paired-end data are compatible with this workflow.
The layout is determined by how the config/units.tsv file is configured (see config/README.md).
Standardised usage
Standardised usage of this workflow is described in the Snakemake Workflow Catalog.
However, Snakemake's standardised usage requires internet access, which is commonly unavailable in an HPC environment. If the intention is to run the workflow in an offline environment, please see Recommended usage.
Recommended usage
For compatibility across environments, the source code of this workflow is available via Releases.
- Download and extract the workflow's latest release
- Follow the instructions in config/README.md to modify config/samples.tsv and config/units.tsv
- Follow the comments in config/config.yaml to configure the workflow parameters
- Use the example profile as a guide to fine-tune workflow-specific resource configuration
  - NOTE: the example profile has been designed for compatibility with my SLURM profile
- Execute the workflow

    snakemake
Testing
Example data and configurations are available in the .test directory for testing this workflow.
The example data is small, so the test workflow profile can be used upon execution.
To keep and examine intermediary files, specify the --notemp flag.
```bash
# Test paired-end
snakemake --configfile .test/config_pe/config.yaml --workflow-profile workflow/profiles/test --notemp
# or single-end
snakemake --configfile .test/config_se/config.yaml --workflow-profile workflow/profiles/test --notemp
```
Owner
- Name: Lachlan Baer
- Login: baerlachlan
- Kind: user
- Location: Adelaide, South Australia
- Company: University of Adelaide
- Twitter: baerlachlan
- Repositories: 9
- Profile: https://github.com/baerlachlan
GitHub Events
Total
- Release event: 11
- Watch event: 1
- Push event: 16
- Fork event: 1
- Create event: 11
Last Year
- Release event: 11
- Watch event: 1
- Push event: 16
- Fork event: 1
- Create event: 11