obitools_workflow

A Snakemake workflow, based on the obitools suite of programs, that analyzes DNA metabarcoding data.

https://github.com/annesoben/obitools_workflow

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 5 DOI reference(s) in README
✓ Academic publication links: Links to: researchgate.net, zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary

Keywords

bioinformatics-pipeline edna edna-pipeline metabarcoding

Repository

Basic Info

Host: GitHub
Owner: AnneSoBen
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 64.5 KB

Statistics

Stars: 4
Watchers: 1
Forks: 2
Open Issues: 3
Releases: 3

Created almost 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Citation

OBITools workflow

Table of Contents

About
Getting Started
- Installation
- Directories and files structure
- Download your data
Usage
- Configuration

About

This is a Snakemake workflow, based on the obitools suite of programs, that analyzes DNA metabarcoding data.

Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).

Getting started

Installation

Dependencies

In order to run the workflow, the following languages/programs are required:

Please note that the workflow currently runs only on Unix systems.

Install the workflow

Clone the repository:

```sh
git clone https://github.com/AnneSoBen/obitools_workflow.git
```

Directories and files structure

The repository contains five folders:

- config/: contains the configuration file of the Snakemake workflow (config.yaml). This is where the value of the options for the various commands used is defined.
- log/: where log files of each rule are written.
- resources/: where you should download/copy your raw data (cf. Download your data).
- results/: where all output files are written.
- workflow/: contains the Snakemake workflow (Snakefile), the configuration file of the submission parameters on the cluster (cluster.yaml) and the script to submit the workflow on the cluster (sub_smk.sh).

Download your data

Download/copy your data in the resources/ folder. Three files are required:

- forward and reverse fastq files
- the corresponding ngsfilter file

They should be named as follows: prefix_R1.fastq, prefix_R2.fastq, prefix_ngsfilter.tab

They should also be placed in a subfolder whose name is the prefix of the files (see Example).
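As a quick sanity check before running the workflow, the naming convention above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the function names `expected_files` and `missing_files` are hypothetical, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Suffixes taken from the naming convention described above.
REQUIRED_SUFFIXES = ("_R1.fastq", "_R2.fastq", "_ngsfilter.tab")


def expected_files(resources_dir, prefix):
    """Return the three paths expected for one library prefix."""
    library_dir = Path(resources_dir) / prefix
    return [library_dir / f"{prefix}{suffix}" for suffix in REQUIRED_SUFFIXES]


def missing_files(resources_dir, prefix):
    """Return the expected paths that do not exist.

    A symbolic link counts as present only if its target exists.
    """
    return [p for p in expected_files(resources_dir, prefix) if not p.exists()]


if __name__ == "__main__":
    # "wolf_diet" is the prefix used in the Example section.
    for path in missing_files("resources", "wolf_diet"):
        print(f"missing: {path}")
```

If nothing is printed, the three files for that prefix are in place.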

Usage

Configuration

Before running the workflow, the configuration file (config/config.yaml) has to be edited. The parameters that can be set are listed in the table below:

| parameter | description | concerned rule(s) | default value | comment |
|---|---|---|---|---|
| tomerge | whether to merge libraries before dereplication | mergedemultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries that you want to merge |
| resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | splitfastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
| fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolfdiet | must be changed to match your file name prefix |
| mergedfile | prefix of the name of the output files if tomerge=TRUE | mergedemultiplex, splitfasta, derepl, mergederepl, basicfilt, clustering, mergeclust, tabformat | wolfdiet | must be changed to the prefix you want for the merged files |
| splitfastq:nfiles | number of files to create when splitting fastq files for pairing | splitfastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files (useful only on multi-threaded systems) |
| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
| splitfasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
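The numeric filtering parameters can be sanity-checked before a run. The sketch below is illustrative only: `check_config` is a hypothetical helper, the parameter names are copied as they appear in the table (the spelling in the actual config.yaml may differ), and the accepted ranges are assumptions rather than rules enforced by the workflow.

```python
# Default values copied from the table above.
DEFAULTS = {"minlength": 80, "mincount": 1, "minsim": 0.97}


def check_config(config):
    """Return a list of human-readable problems found in a config mapping.

    Missing keys fall back to the documented defaults.
    """
    settings = {**DEFAULTS, **config}
    problems = []
    # Assumption: the clustering threshold is a similarity between 0 and 1.
    if not 0.0 < settings["minsim"] <= 1.0:
        problems.append("minsim: expected a similarity between 0 and 1")
    # Assumption: a sequence length in bp must be at least 1.
    if settings["minlength"] < 1:
        problems.append("minlength: expected a positive length in bp")
    # Assumption: a unique sequence is supported by at least one read.
    if settings["mincount"] < 1:
        problems.append("mincount: expected at least 1 read")
    return problems


if __name__ == "__main__":
    # Example: a percentage entered by mistake instead of a fraction.
    for problem in check_config({"minsim": 97}):
        print(problem)
```

In practice the mapping would come from parsing config/config.yaml; a plain dictionary is used here to keep the sketch free of third-party dependencies.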

If you run the workflow on a SLURM cluster, you must also check the workflow/cluster.yaml file, which sets the resources available for each rule.

Run the workflow

Then, run the workflow:

```sh
cd workflow
conda activate snakemake
snakemake -c1 --use-conda
```

Alternatively, you can run the workflow with a single command on a SLURM cluster by submitting the sub_smk.sh file:

```sh
cd workflow
sbatch sub_smk.sh
```

Example

Download toy data

If you want to test the workflow, download the toy dataset from the obitools tutorial (https://pythonhosted.org/OBITools/wolves.html) in the resources/ folder:

```sh
wget -O resources/wolf_tutorial.zip https://pythonhosted.org/OBITools/_downloads/wolf_tutorial.zip
unzip resources/wolf_tutorial.zip -d resources/
mv resources/wolf_tutorial resources/wolf_diet
rm resources/wolf_tutorial.zip
```

Rename the files to fit the template described above (or create symbolic links):

```sh
cd resources/wolf_diet
ln -s wolf_F.fastq wolf_diet_R1.fastq
ln -s wolf_R.fastq wolf_diet_R2.fastq
ln -s wolf_diet_ngsfilter.txt wolf_diet_ngsfilter.tab
```

You should get this directory and file structure:

```sh
tree

.
├── config
│   └── config.yaml
├── LICENSE
├── log
├── README.md
├── resources
│   └── wolf_diet
│       ├── db_v05_r117.fasta
│       ├── embl_r117.ndx
│       ├── embl_r117.rdx
│       ├── embl_r117.tdx
│       ├── wolf_diet_ngsfilter.tab -> wolf_diet_ngsfilter.txt
│       ├── wolf_diet_ngsfilter.txt
│       ├── wolf_diet_R1.fastq -> wolf_F.fastq
│       ├── wolf_diet_R2.fastq -> wolf_R.fastq
│       ├── wolf_F.fastq
│       └── wolf_R.fastq
├── results
└── workflow
    ├── cluster.yaml
    ├── Snakefile
    └── sub_smk.sh
```

Note that the name of the subfolder containing your source files (fastq and ngsfilter files) should be the prefix of the files.

The config.yaml file is already modified to fit this data.

Run the workflow

Now run the workflow:

```sh
cd ../../workflow/
conda activate snakemake
snakemake -c1 --use-conda
```

Option: merging libraries

You may want to merge libraries, for example if technical replicates are split across different libraries. To enable this, set the value of "tomerge" in the config/config.yaml file to TRUE and list the prefix of each library under "fastqfiles", for example:

```
tomerge: TRUE
resourcesfolder: ../resources/
resultsfolder: ../results/
fastqfiles:
  - myfirstlibfileprefix
  - mysecondlibfileprefix
mergedfile: mymergedlibs
```

The source files of each library should be in separate subfolders. For example:

```
resources
├── myfirstlibfileprefix
│   ├── myfirstlibfileprefix_ngsfilter.tab
│   ├── myfirstlibfileprefix_R1.fastq
│   └── myfirstlibfileprefix_R2.fastq
└── mysecondlibfileprefix
    ├── mysecondlibfileprefix_ngsfilter.tab
    ├── mysecondlibfileprefix_R1.fastq
    └── mysecondlibfileprefix_R2.fastq
```

Two ngsfilter files are required, one per library: resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab and resources/mysecondlibfileprefix/mysecondlibfileprefix_ngsfilter.tab.

:warning: If you want to be able to distinguish your technical replicates in the final output, don't forget to give your samples different names in the ngsfilter files, e.g. for a sample named "sample", you could change its name to "samplea" in the first ngsfilter file and "sampleb" in the second ngsfilter file (if you have two technical replicates).

The value of "mergedfile" corresponds to the prefix of the merged files from the dereplication to the end of the workflow.
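Renaming samples by hand in each ngsfilter file, as suggested in the warning above, is error-prone for large experiments. The sketch below shows one way to automate it. It is not part of the workflow: `suffix_sample_names` is a hypothetical helper, and it assumes that the sample name is the second whitespace-separated column of each ngsfilter line and that comment lines start with "#".

```python
def suffix_sample_names(lines, suffix):
    """Append `suffix` to the sample name of each ngsfilter line.

    Assumption: the sample name is the second whitespace-separated column.
    Blank lines, comment lines and lines with fewer than three columns are
    returned unchanged.
    """
    renamed = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            renamed.append(line)
            continue
        # Split off the first two columns; the rest keeps its original
        # separators and its trailing newline.
        parts = line.split(None, 2)
        if len(parts) < 3:
            renamed.append(line)
            continue
        experiment, sample, rest = parts
        renamed.append(f"{experiment}\t{sample}{suffix}\t{rest}")
    return renamed


if __name__ == "__main__":
    example = ["myexperiment\tsample\taattaac\tFWDPRIMER\tREVPRIMER\tF\t@\n"]
    print(suffix_sample_names(example, "_a")[0], end="")
```

Applying the function with a different suffix to each library's ngsfilter file (for instance "_a" and "_b") keeps technical replicates distinguishable after merging.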

Going further

You may want to clean up potential molecular artifacts: have a look at the R package metabaR!

Acknowledgements

Thanks to Lucie Zinger, Frédéric Boyer, Céline Mercier and Clément Lionnet for their help with the obitools! Also thanks to the ECOFEED project for funding the development of the first version of this workflow.

How to cite this repository

Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.

:triangular_flag_on_post: Don't forget to cite this repository if you use it for your research :slightly_smiling_face:

References

Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.

Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).

Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.

Zinger, L., Lionnet, C., Benoiston, A. S., Donald, J., Mercier, C., & Boyer, F. (2021). metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods in Ecology and Evolution, 12(4), 586-592.

Owner

Name: Anne-Sophie Benoiston
Login: AnneSoBen
Kind: user
Location: Toulouse
Company: Institut de Recherche pour le Développement

Repositories: 2
Profile: https://github.com/AnneSoBen

Bioinformatician at IRD in the EDB lab in Toulouse

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Benoiston
    given-names: Anne-Sophie
    orcid:  https://orcid.org/0000-0001-9446-5703 
title: AnneSoBen/obitools_workflow
version: 1.0.2
publisher: GitHub
year: 2022
howpublished: https://github.com/AnneSoBen/obitools_workflow
commit: 82f5ec5bbd8b3e58d6fd0fd5212bcf1a561ce3cf
