edna-metabarcoding

Collection of scripts and instructions to run metabarcoding analysis

https://github.com/marciofcmartins/edna-metabarcoding

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.7%) to scientific vocabulary

Keywords

bioinformatics

Last synced: 11 months ago · JSON representation ·

Repository

Collection of scripts and instructions to run metabarcoding analysis

Basic Info

Host: GitHub
Owner: MarcioFCMartins
License: mit
Language: R
Default Branch: main
Homepage:
Size: 220 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

bioinformatics

Created about 4 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

About The Project

This repository holds a my pipeline to determine taxonomic groups present in genetic sequencing data. It was created with the 18S and Eukaryonts in mind, but should be equally applicable to other markers.

Getting Started

Expected data format

To use this pipeline, you should start with paired-end (one file per sequencing direction) demultiplexed (one sample per file) and compressed fastq files. The files should be named according to the CASAVA format:

SAMPLEIDBARCODELxxxRxxxx.fastq.gz

Where SAMPLEID is the name of your sample, BARCODE is the barcode used to mark identify which sample the sequence comes from, Lxxx is the lane number in the sequencing machine where the sample was placed, Rx indicates which read direction the file refers to (R1 means foward, R2 reverse), xxx is the set number.

Prerequisites

This pipeline was only tested on Ubuntu 22.04, but should work on any Linux or Mac based operating system. It might work on Windows, but it will be slower because parallel execution will not work.

The R programming language, version 4.2 or higher is required.

All the heavy work (i.e. math stuff) is done by the dada2 package. My contribution was automating the location and preparation of file lists, as well as diagnostic plots so that the analysis can be ran without much user input.

Installation

All you need to install is the R programming language, all other dependencies are installed on demand when you run the application.

Ubuntu 22.04

First, install R:

``` sh

update indices

sudo apt update -qq

install two helper packages we need

sudo apt install --no-install-recommends software-properties-common dirmngr

add the signing key (by Michael Rutter) for these repos

To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cranubuntukey.asc

Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9

wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutterpubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cranubuntu_key.asc

add the R 4.0 repo from CRAN -- adjust 'focal' to 'groovy' or 'bionic' as needed

sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"

Install R

sudo apt install --no-install-recommends r-base ```

These instructions were copied from the official R repository.

Other operating systems

Download the installer here.

Usage

This pipeline consists of a set of R scripts which are prepared to be used as command line applications. I do not recommend them from a graphical IDE (such as Rstudio, which is how most R users work), because parallel operations will often crash these applications.

Let's imagine we have our project setup as such:

project-folder
|
|- scripts
|   |
|   |- 01-remove-primer.R
|   |- 02-determine-sample-sequences-R
|   `- 03-assign-taxonomy.R
|- data
|   |
|   `- raw-sequencing-data
|       |
|       |- 012522RSeuk1391F-mapping.txt
|       |- mysample1_S1_L001_R1.fastq.gz
|       `- mysample1_S1_L001_R2.fastq.gz
|- outputs

Please note that your forward (R1) and reverse (R2) can be in separate folders.

STEP 1 - Remove primers If your sequences still have the primers in the reads, your first step is to remove them. To do so, you'll need to provide:

data-directory - where your fastq files can be found\
mapping - a test file which specified which primer was used for each sample\
output - directory where you want your new, primer-less, fastq files to be saved

sh cd project-folder Rscript ./scripts/01-remove-primer.R --data-directory ./data/raw-sequencing-data --mapping ./data/raw-sequencing-data/012522RSeuk1391F-mapping.txt --output ./output/sequencing-noprimer

STEP 2 - Clean up sequence data and determine unique sequences This step does a lot of things at once. It will:

Discard any sequence with undetermined base-pairs\
Truncate sequences at a length determined by the user, to avoid low quality sections\
Remove Phix sequences\
Determine what sequences are found often enough to be a true read and which ones are just sequencing errors\
Merge paired-end sequences\
Export a unique sequence table (with and without chimeras) which can then be used to determine taxonomy

You have to provide: - data-directory - where your fastq files can be found (WITHOUT PRIMER)\ - error-function - to determine which sequences are just sequencing errors, dada2 uses models to determine how likely a nucletide switch is to occur. This works fine with sequencing methods such as miseq, where the quality of each bp is reported as a continuous interval. In those cases you can ignore this argument. However, if you have a sequencing method that reports binned qualities (such as novaseq), those models have to be modified. In that case, specify you want to use a custom-loess function. See the section How do I > Choose an error function to see how to decide this\ - output - directory to export your sequence tables, as well as several diagnostic plots

This example assumed you have novaseq data:

sh Rscript ./scripts/02-determine-sample-sequences-R --data-directory ./output/sequencing-noprimer --error-function custom-loess --output ./outputs/sequence-tables

STEP 3 - Determine taxonomies This step is implemented with dada2's Naive Bayes classifier. You require:

sequence-table - Path to the file sequence-table.RDS, which was exported in the previous step.\
reference-sequences - Dataset used to train your classifier. There are some pre-prepared datasets for you to use, maintained by dada2's team\
output - Where should your classification table be saved to?

How do I?

Choose a truncate length?

When you run 02-determine-sample-sequences.R, you will be prompted to provide a to truncate your sequences at. You should also be shown a plot that examined sequence quality vs sequence length. Find a point where the quality seems to drop sharply, or below what is acceptable for you and that's it. See example:

sequence-quality

Choose an error function

If you have binned quality data (see image above, notice how there are 3 distinct gray bands. Those are a heatmap, with clear discrete bins), your error models will be unstable and "wiggly". You can see this in the plot 01-error-model-fit.png. See the comparison below, where the standard error model is on the left and the modified one on the right:

error-models

You want to aim for stable, monotonic lines. Also, be aware that I did not come up with this solution, it was lifted from option #4 in this github comment. The entire thread is worth a read, along with this one.

Roadmap

[ ] Add code to create custom train datasets using virtualPCR - based on Nathan Geraldi's code

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Márcio Martins - \@MarciofcMartins - mfcmartins\@ualg.pt{.email} - ResearchGate

Owner

Name: Márcio Martins
Login: MarcioFCMartins
Kind: user
Location: Faro
Company: CCMAR

Twitter: MarciofcMartins
Repositories: 3
Profile: https://github.com/MarcioFCMartins

Marine biologist at CCMAR (University of Faro, Algarve, Portugal)

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Sedimentary DNA metabarcoding pipeline
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcio
    family-names: Martins
    email: mfcmartins@ualg.pt
    affiliation: Centre of Marine Sciences
    orcid: 'https://orcid.org/0000-0002-6969-2215'
version: 0.1
date-released: 2022-06-02
url: "https://github.com/MarcioFCMartins/eDNA-metabarcoding"