small-rna-seq-pipeline
A pipeline to annotate miRNAs, phased siRNAs and other types using a reference genome and experimental sRNA-Seq data
Repository
Basic Info
- Host: GitHub
- Owner: BleekerLab
- License: mit
- Language: Python
- Default Branch: master
- Size: 76.7 MB
Statistics
- Stars: 6
- Watchers: 1
- Forks: 2
- Open Issues: 3
- Releases: 5
Metadata Files
README.md
The small RNA-Seq pipeline
Summary
The small RNA-Seq pipeline is a Snakemake pipeline that annotates small RNA loci (miRNAs, phased siRNAs) from experimental small RNA-Seq datasets, using one or more reference genomes.
This pipeline relies heavily on the ShortStack software, which annotates and quantifies small RNAs using a reference genome.
Upon completion, several outputs will be generated for each sample:
- One ShortStack result file called Results.txt. See the description of this file in the ShortStack manual.
- Two FASTA files: one containing the predicted hairpins and one containing the predicted mature microRNAs.
- Two BLAST result files (in tabular format) from BLAST searches of the predicted hairpins and mature miRNAs against miRBase (the miRBase version is specified in the config file). See the miRBase website for releases.
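As an illustration of how the tabular BLAST results could be post-processed, here is a minimal Python sketch that keeps the best-scoring miRBase hit for each predicted sequence. It assumes the files use the default 12-column BLAST+ tabular layout (-outfmt 6), which this README does not confirm. The function name best_hits and the example rows are made up for the illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
# ASSUMPTION: the pipeline's BLAST files use this default layout.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return, for each query sequence, the hit with the highest bitscore."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up rows for illustration only; real files come from the pipeline.
example = io.StringIO(
    "Cluster_1\tath-MIR156a\t100.000\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
    "Cluster_1\tosa-MIR156b\t95.238\t21\t1\t0\t1\t21\t1\t21\t1e-04\t34.2\n"
    "Cluster_7\tsly-MIR166a\t100.000\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
)

hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

To use it on a real result file, open the file in text mode and pass the handle to best_hits instead of the in-memory example.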
Installation
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
Create a Conda environment
This Snakemake pipeline makes use of the conda package manager to install software and dependencies.
1. First, make sure you have conda installed on your system. Use Miniconda3 and follow the installation instructions.
2. Using conda, create a virtual environment called small by executing the following code in a Shell window: conda env create -f environment.yml. This will install Snakemake version 5.20.0 and pandas version 0.25.0 in the new environment (the pipeline requires Snakemake version 5.4.3 or higher).
3. Activate this environment using: conda activate small
4. You can now run the pipeline (see below).
If you have set up conda and created the small environment, that's all you need to do!
Dependencies
- Snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
- NCBI blast+ - A program to perform sequence similarity search. See NCBI Blast webpage for more info.
- ShortStack - Small RNA loci annotation and quantification.
- Trimmomatic - Read trimming for NGS data.
- bioawk - Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names.
A series of custom Python functions are also used and can be found in the helpers.py file.
The versions of each software package are listed in the corresponding environment YAML files in the envs/ folder.
Usage
Example
A small dataset is available in test/ to run some tests rapidly. It will use the genome and miRBase reference fasta files stored in refs/.
To run the test, open a new Shell window and:
1. Activate your working environment: conda activate small
2. Type snakemake -j 1 -np for a dry run. No analysis is run, but Snakemake checks that the directed acyclic graph (DAG) of jobs can be built, that is, that the inputs and outputs of the rules chain together correctly.
3. For the real run, type snakemake --cores N where N is the number of CPUs that you want to use (default = 1).
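If you prefer to launch the two steps above from a Python script, the commands could be assembled as in the sketch below. The helper name snakemake_command is hypothetical and not part of the pipeline. The only Snakemake flags used are the ones shown in the steps above.

```python
def snakemake_command(cores=1, dry_run=False):
    """Build the argument list for a dry run or a real run of the pipeline."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if dry_run:
        # -n: dry run, -p: print the shell commands that would be executed.
        return ["snakemake", "-j", str(cores), "-np"]
    return ["snakemake", "--cores", str(cores)]


print(" ".join(snakemake_command(dry_run=True)))  # snakemake -j 1 -np
print(" ".join(snakemake_command(cores=4)))       # snakemake --cores 4
```

With the small environment activated, the returned list can be passed to subprocess.run(command, check=True) so that a failed run raises an error.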
Samples
A samples.tsv file specifies the sample names, the genomic reference to use for each sample, and the location of each sample's sequencing file.
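The sketch below shows how such a tab-separated sample sheet could be read with Python's standard library. The column names (sample, genome, fastq) and the file paths are assumptions made for this illustration. Check the samples.tsv shipped with the repository for the actual header.

```python
import csv
import io


def load_samples(handle):
    """Map each sample name to its genome reference and sequencing file."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"]  # hypothetical column name
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = {
            "genome": row["genome"],  # hypothetical column name
            "fastq": row["fastq"],    # hypothetical column name
        }
    return samples


# Made-up sample sheet; each sample may point to a different genome.
example = io.StringIO(
    "sample\tgenome\tfastq\n"
    "leaf_1\trefs/genome_A.fasta\ttest/leaf_1.fastq.gz\n"
    "root_1\trefs/genome_B.fasta\ttest/root_1.fastq.gz\n"
)

samples = load_samples(example)
for name, info in samples.items():
    print(name, info["genome"], info["fastq"])
```

Keeping the genome as a per-sample field is what allows a different reference for each sample.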
Configuration
Configuration settings can be changed in the config.yaml file. For instance, one could modify the minimal coverage required by ShortStack to discover sRNA loci.
Genomic references
A different genomic reference can be used for each sample. Simply provide the genomic reference that corresponds to each sample.
Authors
Contributors
- Marc Galland - Initial work - GitHub profile
- Michelle van der Gragt - Initial work - GitHub profile
Maintainers
- Marc Galland - Initial work - GitHub profile
Citation
...as soon as we have published this software!
License
This project is licensed under the MIT License. See the LICENSE.md file for details.
Versioning
SemVer is used for versioning. For the versions available, see the releases on this repository.
Acknowledgments
References
- Bioawk tutorial: https://isugenomics.github.io/bioinformatics-workbook/Appendix/bioawk-basics
- Vienna RNAfold tutorial: https://www.tbi.univie.ac.at/RNA/tutorial/#sec3
- miRTop: from BAM files to GFF3 files (and conversion to other formats such as Fasta etc.): https://academic.oup.com/bioinformatics/article/36/3/698/5556118
Owner
- Name: Petra Bleeker laboratory
- Login: BleekerLab
- Kind: organization
- Email: P.M.Bleeker@uva.nl
- Location: University of Amsterdam
- Repositories: 6
- Profile: https://github.com/BleekerLab
Laboratory of Petra Bleeker at University of Amsterdam
Dependencies
- bioawk 1.0.*
- biopython 1.74.*
- blast 2.9.*
- bowtie 1.2.0.*
- fastp 0.20.0.*
- multiqc 1.9.*
- pandas 0.25.0.*
- samtools 1.11.*
- shortstack 3.8.5.*
- snakemake 5.20.0.*
- viennarna 2.4.14.*