clean

A nextflow pipeline for decontamination of short reads, long reads and contigs

https://github.com/rki-mf1/clean

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: rki-mf1
License: bsd-3-clause
Language: Nextflow
Default Branch: main
Size: 97.7 MB

Statistics

Stars: 47
Watchers: 5
Forks: 5
Open Issues: 7
Releases: 14

Created over 6 years ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License Citation

Clean your data!

A reference-based decontamination workflow for short reads, long reads, and assemblies.

Email: hoelzerm@rki.de, lataretum@rki.de

Objective

Sequencing data is often contaminated with DNA or RNA from other species. This normally unwanted material occurs for biological reasons or is spiked in as a control. For example, this is often the case for Illumina data (phiX phage) or Oxford Nanopore Technologies data (DNA CS (DCS), yeast ENO2). Most tools don't take care of such contamination, and thus we can find it in sequence collections and assemblies (Mukherjee et al. (2015)).

What this workflow does for you

With this workflow you can screen and clean your Illumina, Nanopore, PacBio CLR, or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated. By default, minimap2 is used for aligning your sequences to reference sequences (with the map-ont settings for Nanopore data, map-pb for PacBio CLR data, and sr settings for short-read data activated automatically). However, for short-read data, you may want to switch to BWA (--bwa). As another alternative, we provide bbduk, part of BBTools, as a k-mer-based approach (--bbduk). Note that bbduk produces no mapping file, so some subsequent statistics are not calculated.
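The automatic preset selection described above can be pictured as a simple lookup. The following is a minimal sketch, not CLEAN's actual code: the function name `minimap2_preset` is hypothetical, the input type values are the ones accepted by --input_type, and the PacBio CLR preset is spelled map-pb in minimap2.

```bash
# Illustrative helper (not part of CLEAN): map an --input_type value
# to the minimap2 preset that the text says is activated for it.
minimap2_preset() {
    case "$1" in
        nano)                         echo "map-ont" ;;
        pacbio)                       echo "map-pb" ;;
        illumina|illumina_single_end) echo "sr" ;;
        *) echo "unknown input type: $1" >&2; return 1 ;;
    esac
}

minimap2_preset nano      # prints: map-ont
minimap2_preset illumina  # prints: sr
```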

You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped (or, in the case of bbduk, compared by k-mers) against the specified host, control, and user-defined FASTA files. All reads that match are considered contamination. In the case of Illumina paired-end reads, both mates need to be aligned (singleton files will be produced otherwise).

The read input is defined via --input_type nano for Nanopore, --input_type pacbio for PacBio CLR, and --input_type illumina or --input_type illumina_single_end for Illumina reads. Additional control(s) for decontamination can be defined via --control. If controls are defined, they are concatenated with the host and any own FASTA files for decontamination. We provide auto-download for the following controls: dcs for Nanopore DNA-Seq, eno for Nanopore RNA-Seq, and phix for Illumina data. In general, specified host, control, and user-defined FASTA files are concatenated for decontamination.
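To picture what "concatenated for decontamination" means, here is a minimal sketch that merges plain and gzipped FASTA files into one combined reference. It is not the pipeline's implementation: the function `combine_references` and all file names are hypothetical and exist only for illustration.

```bash
# Illustrative only (not CLEAN's code): write plain or gzipped FASTA
# files to stdout as one uncompressed, concatenated reference.
combine_references() {
    for fasta in "$@"; do
        case "$fasta" in
            *.gz) gzip -dc "$fasta" ;;
            *)    cat "$fasta" ;;
        esac
    done
}

# Demo with two tiny made-up references (hypothetical names and sequences).
workdir=$(mktemp -d)
printf '>host_chr1\nACGTACGT\n' > "$workdir/host.fa"
printf '>spike_in\nTTGGCCAA\n' | gzip -c > "$workdir/control.fa.gz"
combine_references "$workdir/host.fa" "$workdir/control.fa.gz" > "$workdir/combined.fa"
grep -c '>' "$workdir/combined.fa"   # prints: 2 (one header per input sequence)
rm -r "$workdir"
```

A read that matches any sequence in such a combined file would then count as contamination, regardless of which input the sequence came from.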

Filter soft-clipped contamination reads

We saw many soft-clipped reads after the mapping that probably aren't contamination. With --min_clip the user can set a threshold for the number of soft-clipped positions (sum of both ends). If --min_clip is greater than 1, the total number is considered; otherwise, the fraction of soft-clipped positions relative to the read length is used. The output consists of all mapped reads, the soft-clipped reads, and the mapped reads passing the filter.
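The threshold rule can be worked through on a single alignment. The sketch below is an illustration under stated assumptions, not the pipeline's code: the helper names `soft_clip_sum` and `exceeds_min_clip` are hypothetical, the CIGAR parsing ignores hard clips, and whether the real comparison is strict (">") or inclusive is an assumption.

```bash
# Sum the soft-clipped bases at both ends of a CIGAR string, e.g. 10S80M5S -> 15.
# Simplification: hard clips (H) outside the soft clips are not handled.
soft_clip_sum() {
    printf '%s\n' "$1" | awk '{
        total = 0
        if (match($0, /^[0-9]+S/)) total += substr($0, RSTART, RLENGTH - 1)
        if (match($0, /[0-9]+S$/)) total += substr($0, RSTART, RLENGTH - 1)
        print total
    }'
}

# Succeed if the clipping exceeds min_clip: an absolute count when min_clip > 1,
# otherwise a fraction of the read length. The strict ">" is an assumption.
exceeds_min_clip() {   # args: clipped_bases read_length min_clip
    awk -v c="$1" -v l="$2" -v m="$3" 'BEGIN {
        value = (m > 1) ? c : c / l
        exit (value > m) ? 0 : 1
    }'
}

clipped=$(soft_clip_sum 10S80M5S)     # 15 of 95 bases are soft-clipped
exceeds_min_clip "$clipped" 95 10  && echo "over the absolute threshold of 10"
exceeds_min_clip "$clipped" 95 0.5 || echo "under the fractional threshold of 0.5"
```

With 15 of 95 bases clipped, the read exceeds an absolute threshold of 10 but not a fractional threshold of 0.5 (15/95 is roughly 0.16).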

Requirements

Workflow management

Nextflow

Dependencies management

For dependency handling you have to use one of the supported technologies. Docker is used by default; to switch to another technology for dependency handling, e.g., mamba, use -profile mamba.

Run engine

By default we assume you are running the tool on a laptop or workstation (local). You can change the pipeline behaviour, for example when running on an HPC with the SLURM workload manager, via -profile slurm.

Dependencies and run engines can be combined, e.g., to run with Singularity on LSF use -profile singularity,lsf.

Execution examples

Get or update the workflow:

```bash
nextflow pull rki-mf1/clean
```

Get help:

```bash
nextflow run rki-mf1/clean --help
```

We always recommend running a release version. Check for latest releases! In these examples we use release -r v1.1.0:

```bash
# check available release versions and branches
nextflow info rki-mf1/clean

# select a release and run it to show the help
nextflow run rki-mf1/clean -r v1.1.0 --help
```

Clean Nanopore data by filtering against a combined reference of the E. coli genome and the Nanopore DNA CS spike-in.

```bash
# uses Docker per default
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
  --host eco --control dcs

# use mamba instead of Docker
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
  --host eco --control dcs -profile mamba
```

Clean Illumina paired-end data against your own reference FASTA using bbduk instead of minimap2.

```bash
# the glob pattern is quoted so that Nextflow (not the shell) expands it;
# since ~ is not expanded inside quotes, $HOME is written out explicitly
nextflow run rki-mf1/clean -r v1.1.0 --input_type illumina --input $HOME/'.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
  --own ~/.nextflow/assets/rki-mf1/clean/test/ref.fasta.gz --bbduk
```

Supported species and control sequences

Currently supported are:

| flag | species | source |
|------|---------|--------|
| hsa | Homo sapiens | Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly, incl. mtDNA |
| t2t | Homo sapiens | T2T Consortium: T2T-CHM13v2.0 (T2T-CHM13+Y, file name: GCA_009914755.4_T2T-CHM13v2.0_genomic), datasets released along the v2.0 (T2T-CHM13) and the T2T-Y chromosome, see paper, incl. mtDNA |
| mmu | Mus musculus | Ensembl: Mus_musculus.GRCm38.dna.primary_assembly, incl. mtDNA |
| csa | Chlorocebus sabeus | NCBI: GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic, incl. mtDNA |
| gga | Gallus gallus | NCBI: Gallus_gallus.GRCg6a.dna.toplevel, incl. mtDNA |
| cli | Columba livia | NCBI: GCF_000337935.1_Cliv_1.0_genomic, incl. mtDNA |
| eco | Escherichia coli | Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel |
| sc2 | SARS-CoV-2 | ENA Sequence: MN908947.3 (Wuhan-Hu-1 complete genome) web fasta |

Controls included in this repository are:

| flag | recommended usage | control/spike | source |
|------|-------------------|---------------|--------|
| dcs | ONT DNA-Seq reads | 3.6 kb standard amplicon mapping the 3' end of the Lambda genome | https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNACS.txt |
| eno | ONT RNA-Seq reads | yeast ENO2 Enolase II of strain S288C, YHR174W | https://raw.githubusercontent.com/rki-mf1/clean/master/controls/S288CYHR174WENO2coding.fsa |
| phix | Illumina reads | enterobacteria_phage_phix174_sensu_lato_uid14015, NC_001422 | ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/enterobacteriaphagephix174sensulatouid14015/NC001422.fna |

More can be easily added! Just write us, open an issue, or make a pull request.

Workflow

[Figure: workflow chart]

The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).

Results

Running the pipeline will create a directory called results/ (can be changed via --output) in the current directory with some or all of the following directories and files (plus additional files for indices, ...):

```text
results/
├── clean/
│   └── <sample_name>.fastq.gz
├── removed/
│   └── <sample_name>.fastq.gz
├── intermediate/
│   ├── map-to-remove/
│   │   ├── <sample_name>.mapped.fastq.gz
│   │   ├── <sample_name>.unmapped.fastq.gz
│   │   ├── <sample_name>.sorted.bam
│   │   ├── <sample_name>.sorted.bam.bai
│   │   ├── <sample_name>.sorted.flagstat.txt
│   │   ├── <sample_name>.sorted.idxstats.tsv
│   │   ├── strict-dcs/
│   │   │   ├── <sample_name>.no-dcs.bam
│   │   │   ├── <sample_name>.true-dcs.bam
│   │   │   └── <sample_name>.false-dcs.bam
│   │   └── soft-clipped/
│   │       ├── <sample_name>.soft-clipped.bam
│   │       └── <sample_name>.passed-clipped.bam
│   ├── map-to-keep/
│   │   ├── <sample_name>.mapped.fastq.gz
│   │   ├── <sample_name>.unmapped.fastq.gz
│   │   ├── <sample_name>.sorted.bam
│   │   ├── <sample_name>.sorted.bam.bai
│   │   ├── <sample_name>.sorted.flagstat.txt
│   │   ├── <sample_name>.sorted.idxstats.tsv
│   │   ├── strict-dcs/
│   │   │   ├── <sample_name>.no-dcs.bam
│   │   │   ├── <sample_name>.true-dcs.bam
│   │   │   └── <sample_name>.false-dcs.bam
│   │   └── soft-clipped/
│   │       ├── <sample_name>.soft-clipped.bam
│   │       └── <sample_name>.passed-clipped.bam
│   ├── host.fa.fai
│   └── host.fa.gz
├── logs/*.html
└── qc/multiqc_report.html
```

The most important files you are likely interested in are results/clean/<sample_name>.fastq.gz, which contain the "cleaned" reads. These are the input reads that do not map to the host, control, own FASTA, or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the --keep option. Any reads that were removed from your input file are placed in results/removed/<sample_name>.fastq.gz.
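A quick way to sanity-check a run is to compare how many reads ended up in the clean and removed files. The helper below is a minimal sketch and not part of CLEAN: the function name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records.

```bash
# Count reads in a gzipped FASTQ file (assumes the usual 4 lines per record).
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Demo on a tiny made-up file; in practice point this at
# results/clean/<sample_name>.fastq.gz and results/removed/<sample_name>.fastq.gz.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | gzip -c > "$demo"
count_fastq_reads "$demo"   # prints: 2
rm -f "$demo"
```

The counts from the clean and removed files together should add up to the number of input reads.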

For debugging purposes we also provide various intermediate results in the intermediate/ folder. For mapping-based approaches (minimap2, bwa), you will also find a brief summary of mapped/unmapped reads and their proportions.

Acknowledgements

Thanks to Matt Huska (@matthuska) for extensive testing of CLEAN, bug fixing, and reorganizing the output.
Thanks to Ayorinde Afolayan (@ayoraind) for valuable feedback and a pull request adding a simple summary table for mapping-based approaches.

Citations

If you use CLEAN in your work, please consider citing our preprint:

Targeted decontamination of sequencing data with CLEAN. Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer. bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089

Additionally, an extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file. Please consider also citing these tools, because without them there would be no CLEAN!

Owner

Name: RKI MF1 Bioinformatics
Login: rki-mf1
Kind: organization
Location: Germany

Repositories: 9
Profile: https://github.com/rki-mf1

Bioinformatics code of MF1

Citation (CITATIONS.md)

# CLEAN: Citations

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BBMap](https://sourceforge.net/projects/bbmap/)
  
- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [BWA](https://arxiv.org/abs/1303.3997)

  > Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Minimap2](https://pubmed.ncbi.nlm.nih.gov/29750242/)
  > Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191. PMID: 29750242; PMCID: PMC6137996.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [NanoPlot](https://pubmed.ncbi.nlm.nih.gov/37171891/)

  > De Coster W, Rademakers R. NanoPack2: Population scale evaluation of long-read sequencing data. Bioinformatics. 2023 May 12;39(5):btad311. doi: 10.1093/bioinformatics/btad311. Epub ahead of print. PMID: 37171891; PMCID: PMC10196664.

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/27706213/)

  > Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.

- [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)

  > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 6
Release event: 3
Issues event: 12
Watch event: 18
Delete event: 6
Issue comment event: 13
Push event: 25
Pull request review event: 2
Pull request event: 18
Fork event: 2

Last Year

Create event: 6
Release event: 3
Issues event: 12
Watch event: 18
Delete event: 6
Issue comment event: 13
Push event: 25
Pull request review event: 2
Pull request event: 18
Fork event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

hoelzer (7)
matthuska (4)
MarieLataretu (3)
ayoraind (2)
bruzecruise (1)
kp4918 (1)
haotianteng (1)
selmapichot (1)

Science Score: 67.0%