https://github.com/erictleung/dada2hpcpipe
:ocean: 16S rRNA microbiome data analysis workflow using DADA2 and R on a high performance cluster using SLURM
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary
Repository
Statistics
- Stars: 3
- Watchers: 2
- Forks: 2
- Open Issues: 13
- Releases: 0
Topics
16s
amplicon-sequencing
dada2
microbiome
microbiome-analysis
performance-cluster
pipeline
slurm
workflow
- Created: over 8 years ago
- Last pushed: over 7 years ago
https://github.com/erictleung/dada2HPCPipe/blob/master/
# dada2HPCPipe
[Build Status](https://travis-ci.org/erictleung/dada2HPCPipe)
16S rRNA microbiome data analysis workflow using DADA2 and R on a high
performance cluster.
This repository essentially contains wrapper functions around DADA2 functions
to streamline the workflow for cluster computing.
The package is meant to serve two purposes: to be an R package and to give
structure to an analysis. The project follows an R package structure, so it can
be downloaded and installed as such. Additionally, the user is expected to
download this repository and run `make` and Slurm commands to run the scripts.
**Table of Contents**
- [Installation](#installation)
- [Description](#description)
- [Overview of Makefile](#overview-of-makefile)
- [Development Setup](#development-setup)
    - [Package Management](#package-management)
    - [Slurm Workload Manager](#slurm-workload-manager)
- [Troubleshooting](#troubleshooting)
## Installation
```R
install.packages("devtools")
devtools::install_github("erictleung/dada2HPCPipe")
```
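As a quick sanity check (assuming the install above completed without errors), you can load the package and list what it exports:
```R
# Attach the package and show its exported functions
library(dada2HPCPipe)
ls("package:dada2HPCPipe")
```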
## Description
This DADA2 workflow stems from the [DADA2 tutorial][dada2tut] and [big data
tutorial][dada2big]. You can find more information about the DADA2 package from
its [publication][nature] or from [GitHub][github].
[dada2tut]: http://benjjneb.github.io/dada2/tutorial.html
[dada2big]: http://benjjneb.github.io/dada2/bigdata.html
[nature]: http://dx.doi.org/10.1038/nmeth.3869
[github]: https://github.com/benjjneb/dada2
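For orientation, below is a minimal sketch of the core DADA2 steps those tutorials walk through; the `test_data/` and `refs/` directories match those used by the Makefile (see the next section), while the filename patterns, filtering parameters, and reference database filename are placeholders rather than values prescribed by this package.
```R
library(dada2)

# Locate paired-end FASTQ files (placeholder directory and filename patterns)
fnFs <- sort(list.files("test_data", pattern = "_R1.fastq", full.names = TRUE))
fnRs <- sort(list.files("test_data", pattern = "_R2.fastq", full.names = TRUE))
filtFs <- file.path("test_data", "filtered", basename(fnFs))
filtRs <- file.path("test_data", "filtered", basename(fnRs))

# Quality filter and trim reads (truncation lengths depend on your read quality)
filterAndTrim(fnFs, filtFs, fnRs, filtRs,
              truncLen = c(240, 160), maxEE = c(2, 2), multithread = TRUE)

# Dereplicate, learn error rates, and denoise into exact sequence variants
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
dadaFs <- dada(derepFs, err = errF, multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, multithread = TRUE)

# Merge read pairs, build the sequence table, and remove chimeras
merged <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(merged)
seqtab <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Assign taxonomy against a downloaded reference (placeholder path; see `make dl-ref-dbs`)
taxa <- assignTaxonomy(seqtab, "refs/silva_train_set.fa.gz", multithread = TRUE)
```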
## Overview of Makefile
```
clean Remove data from test_data/, download/, and refs/
condar Install R and essential packages
dl-ref-dbs Download 16S reference databases (SILVA,RDP,GG)
help Help page for Makefile
install Install and update dada2HPCPipe package in R
setup Setup development environment with Conda
test Run DADA2 workflow with Mothur MiSeq test data
```
## Development Setup
Here are instructions on how to get started on ExaCloud and set up the
development environment needed to run the DADA2 workflow.
### Package Management
**Interactive Session**
To run an interactive session, run the following:
```bash
srun --pty /usr/bin/bash
```
This will allow you to test your code and workflow without worrying about
putting load on the cluster's head node.
**Setup**
Follow the instructions listed in [this document][exacloud] to set up a modern
development environment on the cluster. This isn't necessary if you have root
access on your cluster or if you're running this workflow locally.
Briefly, following the instructions linked above will give you the following:
- Miniconda, for Python package and virtual environment management
- Linuxbrew, for non-root package management on Linux systems
For this R workflow, you only really need to install Miniconda and the R
essentials. Anaconda has built [an `r-essentials` package][condar] with R and
the most commonly used R packages for data science.
Linuxbrew is useful for supplementing these with other commands and software
tools you might want under package management control.
In summary, to set up the dependencies for DADA2, run the following.
```bash
make setup
make condar
```
The `make setup` target runs the following:
```bash
# Download and install Miniconda
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
# Say yes to adding Miniconda to .bash_profile
# Remove install file
rm Miniconda2-latest-Linux-x86_64.sh
```
To see the changes, you'll first need to log out of the cluster and log back in.
The `make condar` target installs R and other essential R packages, as laid out
below:
```bash
# Install R and relevant packages
conda install r-essentials
# For maintenance and update of all packages
conda update r-essentials
# For updating a single particular R package, replace XXXX
conda update r-XXXX
```
[exacloud]: https://github.com/greenstick/bootstrapping-package-management-on-exacloud
[condar]: https://conda.io/docs/r-with-conda.html
### Slurm Workload Manager
Slurm is the resource manager that I'll focus on for this workflow. Slurm
stands for "Simple Linux Utility for Resource Management."
An example submission script might look like this:
```bash
$ cat first_script.sh
#!/bin/bash
# Template for simple SLURM script
#SBATCH --job-name="Job Name"
#SBATCH --partition=exacloud
srun hostname
srun pwd
srun hostinfo
```
The quick answer on `sbatch` vs `srun` can be found [here][srunsbatch].
Below are some useful commands to use within Slurm using the script above.
```bash
# Submit your script, first_script.sh
sbatch first_script.sh
# Look at jobs in the queue
squeue
squeue -u $USER # Take a look at your specific jobs
```
You can use [this website][ceci] to help generate Slurm scripts. It is designed
for another cluster, but it should at least help you draft the initial
submission script you'll want to use.
For more general resources on using Slurm, see the guides [here][vandyslurm],
[here][gatorslurm], and [here][michiganslurm].
[srunsbatch]: https://www.cs.virginia.edu/~csadmin/wiki/index.php/SLURM#Submitting_Jobs
[ceci]: http://www.ceci-hpc.be/scriptgen.html
[vandyslurm]: https://www.vanderbilt.edu/accre/documentation/slurm/
[gatorslurm]: https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts
[michiganslurm]: https://sph.umich.edu/biostat/computing/cluster/slurm.html
**Source**: http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf
## Troubleshooting
**Installing this package fails because it has trouble installing Bioconductor
packages**
There are two solutions for this. From within R, you can run the following
```R
setRepositories(ind=1:2)
```
which will tell R to also include Bioconductor packages in its package
search. See https://stackoverflow.com/a/34617938/6873133 for more information.
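For example, from a fresh R session, the whole sequence might look like this (index 1 is CRAN and index 2 is the Bioconductor software repository):
```R
# Include CRAN and the Bioconductor software repository in the package search
setRepositories(ind = 1:2)

# Retry the installation so Bioconductor dependencies (e.g. dada2) can be resolved
devtools::install_github("erictleung/dada2HPCPipe")
```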
Alternatively, you can install Bioconductor manually by running the following
within R
```R
# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()
```
and then use `biocLite()` to install the missing packages. See
http://bioconductor.org/install/.
**How do I update my packages?**
For regular R packages (i.e., non-Bioconductor packages), use `conda` from the
terminal.
```shell
# XXX is the package name
conda install r-XXX
# For example, installing XML
conda install r-xml
```
But for Bioconductor packages, use `biocLite()` from within R.
```R
# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
# E.g. installing DESeq2
biocLite("DESeq2")
```
Owner
- Name: Eric Leung
- Login: erictleung
- Kind: user
- Location: New York, NY
- Website: https://erictleung.com
- Repositories: 169
- Profile: https://github.com/erictleung
Data science generalist. Sharing knowledge and optimizing tools for learning and growth. Open-source and open-data advocate. Community learner.