clean
A nextflow pipeline for decontamination of short reads, long reads and contigs
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 4 DOI reference(s) in README -
✓Academic publication links
Links to: ncbi.nlm.nih.gov, nature.com -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (9.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: rki-mf1
- License: bsd-3-clause
- Language: Nextflow
- Default Branch: main
- Size: 97.7 MB
Statistics
- Stars: 47
- Watchers: 5
- Forks: 5
- Open Issues: 7
- Releases: 14
Metadata Files
README.md
Clean your data!
A reference-based decontamination workflow for short reads, long reads, and assemblies.
Email: hoelzerm@rki.de, lataretum@rki.de
Objective
Sequencing data is often contaminated with DNA or RNA from other species. This normally unwanted material occurs for biological reasons or is spiked in as a control. For example, this is often the case for Illumina data (phiX phage) or Oxford Nanopore Technologies data (DNA CS (DCS), yeast ENO2). Most tools do not account for such contamination, and thus it ends up in sequence collections and assemblies (Mukherjee et al. (2015)).
What this workflow does for you
With this workflow you can screen and clean your Illumina, Nanopore, PacBio CLR, or any FASTA-formatted sequence data. The results are the clean sequences and the sequences identified as contaminated. By default, minimap2 is used for aligning your sequences to reference sequences (with the map-ont settings for Nanopore data, map-pb for PacBio CLR data, and sr settings for short-read data activated automatically). For short-read data, you may want to switch to BWA (--bwa). As another alternative, we provide bbduk, part of BBTools, as a k-mer-based approach (--bbduk). However, no mapping file is produced with bbduk, so some downstream statistics are not calculated.
You can simply specify provided hosts and controls for the cleanup or use your own FASTA files. The reads are then mapped (or compared by k-mers in the case of bbduk) against the specified host, control, and user-defined FASTA files. All reads that match are considered contamination. For Illumina paired-end reads, both mates need to be aligned (singleton files will be produced otherwise).
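The paired-end rule can be illustrated with the standard SAM flag bits. The following is a minimal sketch, not CLEAN's implementation: the function names `both_mates_aligned` and `is_singleton` are hypothetical, and calling a pair with exactly one aligned mate a "singleton" is an assumption about how the term is used here.

```python
# SAM flag bits as defined in the SAM specification.
PAIRED = 0x1         # read is part of a pair
READ_UNMAPPED = 0x4  # this read has no alignment
MATE_UNMAPPED = 0x8  # the mate of this read has no alignment


def both_mates_aligned(flag: int) -> bool:
    """True if the read is paired and neither mate is flagged as unmapped."""
    return bool(flag & PAIRED) and not flag & (READ_UNMAPPED | MATE_UNMAPPED)


def is_singleton(flag: int) -> bool:
    """True if exactly one mate of a pair is flagged as unmapped (assumed meaning)."""
    unmapped_bits = flag & (READ_UNMAPPED | MATE_UNMAPPED)
    return bool(flag & PAIRED) and unmapped_bits in (READ_UNMAPPED, MATE_UNMAPPED)
```

With these helpers, a pair would only count as contamination when `both_mates_aligned` holds for its records.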
The read input is defined via --input_type nano for Nanopore, --input_type pacbio for PacBio CLR, and --input_type illumina or --input_type illumina_single_end for Illumina reads. Additional control(s) for decontamination can be defined via --control. If controls are defined, they are selectively concatenated with the host and any user-provided FASTA files for decontamination. We provide auto-download for the following controls: dcs for Nanopore DNA-Seq, eno for Nanopore RNA-Seq, and phix for Illumina data. In general, specified host, control, and user-defined FASTA files are concatenated for decontamination.
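Concatenating several references into one file can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's code: `open_text` and `concat_fasta` are hypothetical helper names, and inputs are assumed to be plain or gzip-compressed FASTA files recognised by a `.gz` suffix.

```python
import gzip
from pathlib import Path
from typing import Iterable


def open_text(path: Path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def concat_fasta(sources: Iterable[Path], target: Path) -> int:
    """Write all records from `sources` into `target`; return the record count."""
    n_records = 0
    with open(target, "wt") as out:
        for source in sources:
            with open_text(source) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_records += 1
                    # Guarantee a trailing newline so records from
                    # different files never fuse into one line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records
```

Mapping against the single combined file means a read is flagged if it matches any of the host, control, or user-defined sequences.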
Filter soft-clipped contamination reads
We saw many soft-clipped reads after the mapping that are probably not contamination. With --min_clip, the user can set a threshold for the number of soft-clipped positions (sum of both ends). If --min_clip is greater than 1, the total number is considered; otherwise, the fraction of soft-clipped positions relative to the read length is used. The output consists of all mapped reads, the soft-clipped reads, and the mapped reads passing the filter.
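The threshold rule can be illustrated with a short Python sketch. It is not CLEAN's implementation: the function names are hypothetical, the input is assumed to be a standard SAM CIGAR string, the read length is taken as the sum of the query-consuming operations (M, I, S, =, X), and the strict `>` comparison is an assumption.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> int:
    """Sum the soft-clipped positions (S operations) over both ends."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def read_length(cigar: str) -> int:
    """Read length as the sum of query-consuming operations."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


def exceeds_min_clip(cigar: str, min_clip: float) -> bool:
    """Return True if the alignment is soft-clipped beyond the threshold.

    min_clip > 1 is treated as an absolute number of positions,
    otherwise as a fraction of the read length (assumed: strict '>').
    """
    clipped = soft_clipped_bases(cigar)
    if min_clip > 1:
        return clipped > min_clip
    length = read_length(cigar)
    return length > 0 and clipped / length > min_clip
```

For a read with CIGAR `10S80M10S`, 20 of 100 positions are soft-clipped, so it would exceed a threshold of 15 positions or a fraction of 0.1, but not a fraction of 0.5.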
Requirements
Workflow management
Dependencies management
For dependency handling, you have to use one of the supported technologies, such as Docker, Singularity, or mamba.
Docker is used by default; to switch to another technology for dependency handling, e.g., mamba, use -profile mamba.
Run engine
By default, we assume you are running the tool on a laptop or workstation (local). You can change the pipeline behaviour, for example when running on an HPC with the SLURM workload manager, via -profile slurm.
Dependencies and run engines can be combined, e.g., to run with Singularity on LSF use -profile singularity,lsf.
Execution examples
Get or update the workflow:
```bash
nextflow pull rki-mf1/clean
```
Get help:
```bash
nextflow run rki-mf1/clean --help
```
We always recommend running a release version. Check for latest releases! In these examples we use release -r v1.1.0:
```bash
# check available release versions and branches
nextflow info rki-mf1/clean
# select a release and run it to show the help
nextflow run rki-mf1/clean -r v1.1.0 --help
```
Clean Nanopore data by filtering against a combined reference of the E. coli genome and the Nanopore DNA CS spike-in.
```bash
# uses Docker per default
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
--host eco --control dcs
# use mamba instead of Docker
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
--host eco --control dcs -profile mamba
```
Clean Illumina paired-end data against your own reference FASTA using bbduk instead of minimap2.
```bash
# we have to define the $HOME specifically here, not sure why
nextflow run rki-mf1/clean -r v1.1.0 --input_type illumina --input $HOME/'.nextflow/assets/rki-mf1/clean/test/illumina*.R{1,2}.fastq.gz' \
--own ~/.nextflow/assets/rki-mf1/clean/test/ref.fasta.gz --bbduk
```
Supported species and control sequences
Currently supported are:
|flag | species | source|
|-----|---------|-------|
|hsa | Homo sapiens | Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly, incl. mtDNA |
|t2t | Homo sapiens | T2T Consortium: T2T-CHM13v2.0 (T2T-CHM13+Y, file name: GCA_009914755.4_T2T-CHM13v2.0_genomic), datasets released along the v2.0 (T2T-CHM13) and the T2T-Y chromosome, see paper, incl. mtDNA |
|mmu | Mus musculus | Ensembl: Mus_musculus.GRCm38.dna.primary_assembly, incl. mtDNA |
|csa | Chlorocebus sabaeus | NCBI: GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic, incl. mtDNA |
|gga | Gallus gallus | NCBI: Gallus_gallus.GRCg6a.dna.toplevel, incl. mtDNA |
|cli | Columba livia | NCBI: GCF_000337935.1_Cliv_1.0_genomic, incl. mtDNA |
|eco | Escherichia coli | Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel |
|sc2 | SARS-CoV-2 | ENA Sequence: MN908947.3 (Wuhan-Hu-1 complete genome) |
Controls included in this repository are:
|flag | recommended usage | control/spike | source |
|-----|-|---------|-------|
| dcs | ONT DNA-Seq reads | 3.6 kb standard amplicon mapping the 3' end of the Lambda genome | https://assets.ctfassets.net/hkzaxo8a05x5/2IX56YmF5ug0kAQYoAg2Uk/159523e326b1b791e3b842c4791420a6/DNACS.txt |
| eno | ONT RNA-Seq reads | yeast ENO2 Enolase II of strain S288C, YHR174W | https://raw.githubusercontent.com/rki-mf1/clean/master/controls/S288CYHR174WENO2coding.fsa |
| phix | Illumina reads | Enterobacteria phage phiX174 sensu lato (uid14015), NC_001422 | ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/enterobacteriaphagephix174sensulatouid14015/NC001422.fna |
... for reasons. More can be easily added! Just write us, add an issue, or make a pull request.
Workflow

The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).
Results
Running the pipeline will create a directory called results/ (can be changed via --output) in the current directory with some or all of the following directories and files (plus additional files for indices, ...):
```text
results/
├── clean/
│   └── <sample_name>.fastq.gz
├── removed/
│   └── <sample_name>.fastq.gz
├── intermediate/
│   ├── map-to-remove/
│   │   ├── <sample_name>.mapped.fastq.gz
│   │   ├── <sample_name>.unmapped.fastq.gz
│   │   ├── <sample_name>.sorted.bam
│   │   ├── <sample_name>.sorted.bam.bai
│   │   ├── <sample_name>.sorted.flagstat.txt
│   │   ├── <sample_name>.sorted.idxstats.tsv
│   │   ├── strict-dcs/
│   │   │   ├── <sample_name>.no-dcs.bam
│   │   │   ├── <sample_name>.true-dcs.bam
│   │   │   └── <sample_name>.false-dcs.bam
│   │   └── soft-clipped/
│   │       ├── <sample_name>.soft-clipped.bam
│   │       └── <sample_name>.passed-clipped.bam
│   ├── map-to-keep/
│   │   ├── <sample_name>.mapped.fastq.gz
│   │   ├── <sample_name>.unmapped.fastq.gz
│   │   ├── <sample_name>.sorted.bam
│   │   ├── <sample_name>.sorted.bam.bai
│   │   ├── <sample_name>.sorted.flagstat.txt
│   │   ├── <sample_name>.sorted.idxstats.tsv
│   │   ├── strict-dcs/
│   │   │   ├── <sample_name>.no-dcs.bam
│   │   │   ├── <sample_name>.true-dcs.bam
│   │   │   └── <sample_name>.false-dcs.bam
│   │   └── soft-clipped/
│   │       ├── <sample_name>.soft-clipped.bam
│   │       └── <sample_name>.passed-clipped.bam
│   ├── host.fa.fai
│   └── host.fa.gz
├── logs/*.html
└── qc/multiqc_report.html
```
The most important files you are likely interested in are results/clean/<sample_name>.fastq.gz, which are the "cleaned" reads. These are the input reads that do not map to the host, control, own FASTA, or rRNA files (or the subset of these that you provided), plus those reads that map to the "keep" sequence if you used the --keep option. Any reads that were removed from your input file are placed in results/removed/<sample_name>.fastq.gz.
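To get a quick feel for how much was removed, the two output files can be compared. The sketch below is not part of the pipeline: `count_fastq_reads` and `removed_fraction` are hypothetical helpers, and it assumes gzip-compressed FASTQ with four lines per record and the `clean/` and `removed/` file names shown above (paired-end runs may use different file names).

```python
import gzip
from pathlib import Path


def count_fastq_reads(path: Path) -> int:
    """Count records in a gzip-compressed FASTQ file (four lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def removed_fraction(results_dir: Path, sample: str) -> float:
    """Fraction of input reads that ended up in removed/ for one sample."""
    clean = count_fastq_reads(results_dir / "clean" / f"{sample}.fastq.gz")
    removed = count_fastq_reads(results_dir / "removed" / f"{sample}.fastq.gz")
    total = clean + removed
    return removed / total if total else 0.0
```

For example, `removed_fraction(Path("results"), "nanopore")` would report the share of reads classified as contamination for a sample named `nanopore`.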
For debugging purposes we also provide various intermediate results in the intermediate/ folder. For mapping-based approaches (minimap2, bwa), you will also find a brief summary of mapped/unmapped reads and their proportions.
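The `<sample_name>.sorted.idxstats.tsv` files follow the `samtools idxstats` layout: tab-separated reference name, sequence length, mapped count, and unmapped count, with a final `*` row for reads without a reference. A minimal sketch for reading such a file is shown here; `mapped_reads_per_reference` is a hypothetical helper, not something shipped with CLEAN.

```python
from typing import Dict


def mapped_reads_per_reference(idxstats_text: str) -> Dict[str, int]:
    """Parse samtools idxstats output into {reference name: mapped count}."""
    counts: Dict[str, int] = {}
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        # The '*' row collects reads without a reference and is skipped.
        if name != "*":
            counts[name] = int(mapped)
    return counts
```

Applied to the text of a file under `intermediate/map-to-remove/`, this shows which host or control sequence attracted the most reads.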
Acknowledgements
- Thanks to Matt Huska (@matthuska) for extensive testing of CLEAN, bug fixing, and reorganizing the output.
- Thanks to Ayorinde Afolayan (@ayoraind) for valuable feedback and a pull request adding a simple summary table for mapping-based approaches.
Citations
If you use CLEAN in your work, please consider citing our preprint:
Targeted decontamination of sequencing data with CLEAN Marie Lataretu, Sebastian Krautwurst, Adrian Viehweger, Christian Brandt, Martin Hölzer bioRxiv 2023.08.05.552089; doi: https://doi.org/10.1101/2023.08.05.552089
Additionally, an extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file. Please consider also citing these tools, because without them there would be no CLEAN!
Owner
- Name: RKI MF1 Bioinformatics
- Login: rki-mf1
- Kind: organization
- Location: Germany
- Repositories: 9
- Profile: https://github.com/rki-mf1
Bioinformatics code of MF1
Citation (CITATIONS.md)
# CLEAN: Citations

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BBMap](https://sourceforge.net/projects/bbmap/)

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [BWA](https://arxiv.org/abs/1303.3997)

  > Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Minimap2](https://pubmed.ncbi.nlm.nih.gov/29750242/)

  > Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191. PMID: 29750242; PMCID: PMC6137996.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [NanoPlot](https://pubmed.ncbi.nlm.nih.gov/37171891/)

  > De Coster W, Rademakers R. NanoPack2: Population scale evaluation of long-read sequencing data. Bioinformatics. 2023 May 12;39(5):btad311. doi: 10.1093/bioinformatics/btad311. Epub ahead of print. PMID: 37171891; PMCID: PMC10196664.

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [SeqKit](https://pubmed.ncbi.nlm.nih.gov/27706213/)

  > Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016 Oct 5;11(10):e0163962. doi: 10.1371/journal.pone.0163962. PMID: 27706213; PMCID: PMC5051824.

- [QUAST](https://pubmed.ncbi.nlm.nih.gov/23422339/)

  > Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Create event: 6
- Release event: 3
- Issues event: 12
- Watch event: 18
- Delete event: 6
- Issue comment event: 13
- Push event: 25
- Pull request review event: 2
- Pull request event: 18
- Fork event: 2
Last Year
- Create event: 6
- Release event: 3
- Issues event: 12
- Watch event: 18
- Delete event: 6
- Issue comment event: 13
- Push event: 25
- Pull request review event: 2
- Pull request event: 18
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hoelzer (7)
- matthuska (4)
- MarieLataretu (3)
- ayoraind (2)
- bruzecruise (1)
- kp4918 (1)
- haotianteng (1)
- selmapichot (1)
Pull Request Authors
- hoelzer (12)
- matthuska (9)
- MarieLataretu (4)
- ayoraind (1)