obitools_workflow
A snakemake workflow based on the obitools suite of programs, that analyzes DNA metabarcoding data.
OBITools workflow
About
This is a Snakemake workflow based on the obitools suite of programs, that analyzes DNA metabarcoding data.
Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).
Getting started
Installation
Dependencies
In order to run the workflow, the following languages/programs are required:
Please note that the workflow currently runs only on Unix systems.
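Before going further, it can help to confirm that the command-line tools used in this README are available. The snippet below is a minimal sketch, not part of the workflow: `check_tools` is a hypothetical helper, and the tool names (`git`, `conda`, `snakemake`) are simply those that appear in the commands of this README.

```sh
# Report every requested command that cannot be found on PATH.
# Usage: check_tools cmd1 cmd2 ...
# Returns 0 if all commands are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from the commands used in this README (an assumption,
# not an official dependency list).
check_tools git conda snakemake ||
    echo "Install the missing tools before running the workflow." >&2
```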
Install the workflow
Clone the repository:
```sh
git clone https://github.com/AnneSoBen/obitools_workflow.git
```
Directories and files structure
The repository contains five folders:
- config/: contains the configuration file of the Snakemake workflow (config.yaml). This is where the value of the options for the various commands used is defined.
- log/: where log files of each rule are written.
- resources/: where you should download/copy your raw data (cf. Download your data)
- results/: where all output files are written.
- workflow/: contains the Snakemake workflow (Snakefile), the configuration file of the submission parameters on the cluster (cluster.yaml) and the script to submit the workflow on the cluster (sub_smk.sh).
Download your data
Download or copy your data into the resources/ folder. Three files are required:
- the forward and reverse fastq files
- the corresponding ngsfilter file
They must be named prefix_R1.fastq, prefix_R2.fastq and prefix_ngsfilter.tab, and placed in a subfolder whose name is the prefix of the files (see Example).
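The naming rule above can be checked before launching the workflow. This is a minimal sketch under stated assumptions: `check_inputs` is a hypothetical helper, not part of the repository, and it only tests that the three files exist at the expected paths.

```sh
# Check that a library has the three expected input files:
#   RESOURCES_DIR/PREFIX/PREFIX_R1.fastq
#   RESOURCES_DIR/PREFIX/PREFIX_R2.fastq
#   RESOURCES_DIR/PREFIX/PREFIX_ngsfilter.tab
# Usage: check_inputs PREFIX [RESOURCES_DIR]   (RESOURCES_DIR defaults to "resources")
# Returns 0 if all three files exist, 1 otherwise.
check_inputs() {
    prefix=$1
    resources=${2:-resources}
    status=0
    for suffix in _R1.fastq _R2.fastq _ngsfilter.tab; do
        file="${resources}/${prefix}/${prefix}${suffix}"
        # -e follows symbolic links, so linked files are accepted too
        if [ ! -e "$file" ]; then
            echo "missing: $file" >&2
            status=1
        fi
    done
    return "$status"
}
```

From the root of the repository, `check_inputs myprefix` would then list any file that is missing for the library named myprefix.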
Usage
Configuration
Before running the workflow, the configuration file (config/config.yaml) has to be edited. The parameters that can be set are listed in the table below:
| parameter | description | concerned rule(s) | default value | comment |
|---|---|---|---|---|
| tomerge | whether to merge libraries before dereplication | mergedemultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries that you want to merge |
| resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | splitfastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
| fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolfdiet | must be changed to match the prefix of your file names |
| mergedfile | prefix of the name of the output files if tomerge=TRUE | mergedemultiplex, splitfasta, derepl, mergederepl, basicfilt, clustering, mergeclust, tabformat | wolfdiet | must be changed to the prefix you want for the merged files |
| splitfastq:nfiles | number of files to create when splitting fastq files for pairing | splitfastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files (useful only on multi-threaded systems) |
| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
| splitfasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
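The two nfiles parameters depend on the size of the dataset, so it can be useful to know how many reads a fastq file holds before choosing a value. The sketch below assumes standard fastq records of exactly four lines each (no wrapped sequences); `count_reads` is a hypothetical helper, and the workflow itself does not require it.

```sh
# Print the number of reads in a fastq file.
# Assumption: every record spans exactly four lines.
# Usage: count_reads FILE
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}
```

For example, `count_reads resources/myprefix/myprefix_R1.fastq` prints the read count of a forward file stored under the prefix myprefix.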
If you run the workflow on a SLURM cluster, you must also check the workflow/cluster.yaml file, which sets the resources available for each rule.
Run the workflow
Then, run the workflow:
```sh
cd workflow
conda activate snakemake
snakemake -c1 --use-conda
```
Alternatively, you can run the workflow with a single command on a SLURM cluster by submitting the sub_smk.sh file:
```sh
cd workflow
sbatch sub_smk.sh
```
Example
Download toy data
If you want to test the workflow, download the toy dataset from the obitools tutorial (https://pythonhosted.org/OBITools/wolves.html) in the resources/ folder:
```sh
wget -O resources/wolf_tutorial.zip https://pythonhosted.org/OBITools/_downloads/wolf_tutorial.zip
unzip resources/wolf_tutorial.zip -d resources/
mv resources/wolf_tutorial resources/wolf_diet
rm resources/wolf_tutorial.zip
```
Rename the files to fit the template described above (or create symbolic links):
```sh
cd resources/wolf_diet
ln -s wolf_F.fastq wolf_diet_R1.fastq
ln -s wolf_R.fastq wolf_diet_R2.fastq
ln -s wolf_diet_ngsfilter.txt wolf_diet_ngsfilter.tab
```
You should get this directory and file structure:
```sh
tree
.
├── config
│   └── config.yaml
├── LICENSE
├── log
├── README.md
├── resources
│   └── wolf_diet
│       ├── db_v05_r117.fasta
│       ├── embl_r117.ndx
│       ├── embl_r117.rdx
│       ├── embl_r117.tdx
│       ├── wolf_diet_ngsfilter.tab -> wolf_diet_ngsfilter.txt
│       ├── wolf_diet_ngsfilter.txt
│       ├── wolf_diet_R1.fastq -> wolf_F.fastq
│       ├── wolf_diet_R2.fastq -> wolf_R.fastq
│       ├── wolf_F.fastq
│       └── wolf_R.fastq
├── results
└── workflow
    ├── cluster.yaml
    ├── Snakefile
    └── sub_smk.sh
```
Note that the name of the subfolder containing your source files (fastq and ngsfilter files) should be the prefix of the files.
The config.yaml file is already modified to fit this data.
Run the workflow
Now run the workflow:
```sh
cd ../../workflow/
conda activate snakemake
snakemake -c1 --use-conda
```
Option: merging libraries
You may want to merge libraries, for example if technical replicates are split across different libraries. To do so, set the value of "tomerge" to TRUE in the config/config.yaml file and list the prefix of each library under "fastqfiles", for example:
```
tomerge:
  TRUE
resourcesfolder:
  ../resources/
resultsfolder:
  ../results/
fastqfiles:
  - myfirstlibfileprefix
  - mysecondlibfileprefix
mergedfile:
  mymergedlibs
```
The source files of each library should be in separate subfolders. For example:
```
└── resources
    ├── myfirstlibfileprefix
    │   ├── myfirstlibfileprefix_ngsfilter.tab
    │   ├── myfirstlibfileprefix_R1.fastq
    │   └── myfirstlibfileprefix_R2.fastq
    └── mysecondlibfileprefix
        ├── mysecondlibfileprefix_ngsfilter.tab
        ├── mysecondlibfileprefix_R1.fastq
        └── mysecondlibfileprefix_R2.fastq
```
Two ngsfilter files are required, one per library: resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab and resources/mysecondlibfileprefix/mysecondlibfileprefix_ngsfilter.tab.
:warning: If you want to be able to distinguish your technical replicates in the final output, don't forget to give your samples different names in the ngsfilter files, e.g. for a sample named "sample", you could change its name to "samplea" in the first ngsfilter file and "sampleb" in the second ngsfilter file (if you have two technical replicates).
The value of "mergedfile" corresponds to the prefix of the merged files from the dereplication to the end of the workflow.
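Renaming the samples of technical replicates, as suggested in the warning above, can be scripted rather than done by hand. The sketch below is only an illustration: `suffix_samples` is a hypothetical helper, and it assumes that the ngsfilter file is tab-separated, that the sample name is in the second column, and that comment lines start with `#`. Check these assumptions against your own files before using it.

```sh
# Append SUFFIX to the sample name of every data line of an ngsfilter file.
# Assumptions: tab-separated columns, sample name in column 2,
# comment lines start with '#'.
# Usage: suffix_samples SUFFIX INPUT > OUTPUT
suffix_samples() {
    awk -F '\t' -v OFS='\t' -v sfx="$1" '
        /^#/ || NF < 2 { print; next }   # keep comments and blank lines untouched
        { $2 = $2 sfx; print }           # rename the sample, keep other columns
    ' "$2"
}
```

With two technical replicates, running `suffix_samples a` on the first ngsfilter file and `suffix_samples b` on the second would turn a sample named "sample" into "samplea" and "sampleb". The function writes to standard output, so redirect it to a new file rather than to the input file.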
Going further
You may want to clean up potential molecular artifacts: have a look at the R package metabaR!
Acknowledgements
Thanks to Lucie Zinger, Frédéric Boyer, Céline Mercier and Clément Lionnet for their help with the obitools! Also thanks to the ECOFEED project for funding the development of the first version of this workflow.
How to cite this repository
Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.
:triangular_flag_on_post: Don't forget to cite this repository if you use it for your research :slightly_smiling_face:
References
Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.
Zinger, L., Lionnet, C., Benoiston, A. S., Donald, J., Mercier, C., & Boyer, F. (2021). metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods in Ecology and Evolution, 12(4), 586-592.
Owner
- Name: Anne-Sophie Benoiston
- Login: AnneSoBen
- Kind: user
- Location: Toulouse
- Company: Institut de Recherche pour le Développement
- Repositories: 2
- Profile: https://github.com/AnneSoBen
Bioinformatician at IRD in the EDB lab in Toulouse
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Benoiston
given-names: Anne-Sophie
orcid: https://orcid.org/0000-0001-9446-5703
title: AnneSoBen/obitools_workflow
version: 1.0.2
publisher: GitHub
year: 2022
howpublished: https://github.com/AnneSoBen/obitools_workflow
commit: 82f5ec5bbd8b3e58d6fd0fd5212bcf1a561ce3cf