small-rna-seq-pipeline

A pipeline to annotate miRNAs, phased siRNAs and other types using a reference genome and experimental sRNA-Seq data

https://github.com/bleekerlab/small-rna-seq-pipeline

Basic Info

Host: GitHub
Owner: BleekerLab
License: mit
Language: Python
Default Branch: master
Size: 76.7 MB

Statistics

Stars: 6
Watchers: 1
Forks: 2
Open Issues: 3
Releases: 5

Created over 7 years ago · Last pushed almost 4 years ago

The small RNA-Seq pipeline

Summary
Installation
- Create a Conda environment
- Dependencies
Usage
Authors
- Contributors
- Maintainers
Citation
License
Versioning
Acknowledgments
References

Summary

The small RNA-Seq description pipeline is a Snakemake pipeline to annotate small RNA loci (miRNAs, phased siRNAs) using one or more reference genomes and based on experimental small RNA-Seq datasets.
This pipeline heavily relies on the ShortStack software that annotates and quantifies small RNAs using a reference genome.

Upon completion, several outputs will be generated for each sample:
- One ShortStack result file called Results.txt. See the description of this file in the ShortStack manual.
- Two fasta files: one containing the predicted hairpins and one containing the predicted mature microRNAs.
- Two blast result files (in tabular format) based on the blast of predicted hairpins and mature miRNAs against miRBase (the version of miRBase is specified in the config file). See the miRBase website for releases.
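As an illustration of how the tabular blast results could be post-processed, here is a minimal Python sketch that keeps the best-scoring miRBase hit for each predicted sequence. It assumes the files use the 12 default columns of BLAST+ tabular output (-outfmt 6); the pipeline may write a different set of columns, so check a result file first. The function name best_hits and the example records are made up for this sketch.

```python
import csv
import io

# The 12 default columns of BLAST+ tabular output (-outfmt 6).
# Assumption: the pipeline writes these columns, in this order.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return the highest-bitscore hit for each query sequence."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        if row["qseqid"].startswith("#"):  # skip comment lines, if any
            continue
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up records for illustration, not real pipeline output.
example = io.StringIO(
    "cluster_1\tref_hairpin_A\t100.0\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
    "cluster_1\tref_hairpin_B\t95.2\t21\t1\t0\t1\t21\t1\t21\t1e-04\t34.2\n"
    "cluster_2\tref_hairpin_C\t100.0\t21\t0\t0\t1\t21\t1\t21\t2e-06\t42.1\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"])
```

To use it on a real file, replace the io.StringIO object with an open file handle, for example open(path, newline="").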

Installation

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Create a Conda environment

This Snakemake pipeline makes use of the conda package manager to install software and dependencies.
1. First, make sure you have conda installed on your system. Use Miniconda3 and follow the installation instructions.
2. Using conda, create a virtual environment called small by executing the following command in a shell window: conda env create -f environment.yml. This will install snakemake version 5.20.0 and pandas version 0.25.0 in the new environment (the pipeline requires Snakemake version 5.4.3 or higher).
3. Activate this environment using: conda activate small
4. You can now run the pipeline (see below).

If you have set up conda and created the small environment, that's all you need to do!

Dependencies

Snakemake - The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
NCBI blast+ - A program to perform sequence similarity search. See NCBI Blast webpage for more info.
ShortStack - Small RNA loci annotation and quantification.
Trimmomatic - Read trimming for NGS data.
bioawk - Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names.

A series of custom Python functions are also used and can be found in the helpers.py file.
The versions of software and packages are listed in their respective environment YAML files in the envs/ folder.

Usage

Example

A small dataset is available in test/ to run some tests rapidly. It will use the genome and miRBase reference fasta files stored in refs/.
To run the test, open a new shell window and:
1. Activate your working environment: conda activate small
2. Type snakemake -j 1 -np for a dry run. No analysis is run but it checks that the Directed Acyclic Graph of jobs is OK (input and output from each rule chained to each other).
3. For the real run, type snakemake --cores N where N is the number of CPUs that you want to use (default = 1).
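If you prefer to launch the pipeline from a script, the dry run and the real run above could be chained as in the following Python sketch. The helper names snakemake_command and run_pipeline are hypothetical and not part of the repository; only the snakemake flags (--cores, -n, -p) come from Snakemake itself, where --cores is equivalent to -j.

```python
import subprocess


def snakemake_command(cores=1, dry_run=False):
    """Build the snakemake command line for a dry run or a real run."""
    command = ["snakemake", "--cores", str(cores)]
    if dry_run:
        # -n: dry run, -p: print the shell commands of each rule
        command += ["-n", "-p"]
    return command


def run_pipeline(cores=1):
    """Dry run first; start the real run only if the dry run succeeds."""
    subprocess.run(snakemake_command(1, dry_run=True), check=True)
    subprocess.run(snakemake_command(cores), check=True)


# Show the two commands without executing them.
print(" ".join(snakemake_command(1, dry_run=True)))
print(" ".join(snakemake_command(4)))
```

Calling run_pipeline(4) from the repository root, with the small environment activated, would then run both steps; check=True stops the script if the dry run reports a problem.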

Samples

A samples.tsv file can be used to specify sample names, their corresponding genomic reference to use and the location of their sequencing file.
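The sketch below shows one way such a table could be read and checked in Python before starting a run. The column names (sample, genome, fastq) and the example paths are assumptions made for illustration; check the header of the samples.tsv file shipped with the repository for the actual names.

```python
import csv
import io

# Hypothetical column names: adapt them to the header of your samples.tsv.
REQUIRED_COLUMNS = ("sample", "genome", "fastq")


def read_samples(handle):
    """Return {sample name: row dict} from a tab-separated samples table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"samples table lacks column(s): {', '.join(missing)}")
    samples = {}
    for row in reader:
        if row["sample"] in samples:
            raise ValueError(f"duplicate sample name: {row['sample']}")
        samples[row["sample"]] = row
    return samples


# Made-up table content for illustration.
example = io.StringIO(
    "sample\tgenome\tfastq\n"
    "leaf_1\trefs/genome_A.fasta\tfastq/leaf_1.fastq.gz\n"
    "root_1\trefs/genome_B.fasta\tfastq/root_1.fastq.gz\n"
)
samples = read_samples(example)
print(samples["leaf_1"]["genome"])
```

Rejecting duplicate sample names early avoids two samples silently writing to the same output location.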

Configuration

Configuration settings can be changed in the config.yaml file. For instance, one could modify the minimal coverage required by ShortStack to discover sRNA loci.

Genomic references

Different genomic references can be used for each sample. Simply provide a genomic reference corresponding to your sample.
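When several samples share a reference, it can help to see which samples map to which genome. The following Python sketch groups samples by reference; the sample names, the paths and the function group_by_reference are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def group_by_reference(sample_to_genome):
    """Map each genome reference to the sorted list of samples that use it."""
    groups = defaultdict(list)
    for sample, genome in sample_to_genome.items():
        groups[genome].append(sample)
    return {genome: sorted(names) for genome, names in groups.items()}


# Hypothetical sample names and reference paths, for illustration only.
sample_to_genome = {
    "leaf_1": "refs/genome_A.fasta",
    "leaf_2": "refs/genome_A.fasta",
    "root_1": "refs/genome_B.fasta",
}
groups = group_by_reference(sample_to_genome)
for genome, names in groups.items():
    print(genome, names)
```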

Authors

Contributors

Marc Galland - Initial work - Github profile
Michelle van der Gragt - Initial work - Github profile

Maintainers

Marc Galland - Initial work - Github profile

Citation

...as soon as we have published this software!

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Versioning

SemVer is used for versioning. For the versions available, see the releases on this repository.

Acknowledgments

References

Bioawk tutorial: https://isugenomics.github.io/bioinformatics-workbook/Appendix/bioawk-basics
Vienna RNAfold tutorial: https://www.tbi.univie.ac.at/RNA/tutorial/#sec3
miRTop: from BAM files to GFF3 files (and conversion to other formats such as Fasta etc.): https://academic.oup.com/bioinformatics/article/36/3/698/5556118

Owner

Name: Petra Bleeker laboratory
Login: BleekerLab
Kind: organization
Email: P.M.Bleeker@uva.nl
Location: University of Amsterdam

Repositories: 6
Profile: https://github.com/BleekerLab

Laboratory of Petra Bleeker at University of Amsterdam

Dependencies

environment.yml conda

bioawk 1.0.*
biopython 1.74.*
blast 2.9.*
bowtie 1.2.0.*
fastp 0.20.0.*
multiqc 1.9.*
pandas 0.25.0.*
samtools 1.11.*
shortstack 3.8.5.*
snakemake 5.20.0.*
viennarna 2.4.14.*
