qc

pipeline for quality checking of both raw and annotated assemblies

https://github.com/henry-schober/qc

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 5 DOI reference(s) in README
✓ Academic publication links: Links to ncbi.nlm.nih.gov, plos.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary

Repository

Basic Info

Host: GitHub
Owner: henry-schober
License: mit
Language: Nextflow
Default Branch: main
Homepage:
Size: 4.35 MB

Statistics

Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Introduction

Argonaut performs automated reads-to-genome operations for de novo assemblies. It is a bioinformatics pipeline that assembles genomes from long and short read data. FASTQ files and input information are fed to the pipeline, which produces final assemblies with completeness, contiguity, and correctness quality checks at each step. The pipeline accepts short reads, long reads, or both.

Pipeline Summary
Quick Start
Output Overview
Credits
Contributions & Support
Citations

Pipeline Summary

Illumina Short Read
1. Read QC, Adaptor Trimming, Contaminant Filtering (FastQC v0.11.9, FastP v0.23.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)

PacBio HiFi Long Read (CCS format)
1. Read QC, Adaptor Trimming, Contaminant Filtering (Nanoplot v1.41.0, CutAdapt v3.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)
2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)

ONT Long Read
1. Read QC and Contaminant Filtering (Nanoplot v1.41.0, KmerFreq, GCE, Centrifuge v1.0.4, Recentrifuge v1.9.1)
2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)

All reads are used for the following steps:
Figure: Argonaut Hybrid Workflow

To the right is a figure detailing the major workflow steps involved in hybrid assembly.

If long read polishing is enabled, the pipeline polishes with PacBio HiFi reads by default and falls back to ONT reads when no PacBio HiFi reads are available. If short reads are also available, they are automatically used to polish the assemblies after long read polishing (or directly after assembly if long read polishing is off).
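The selection order described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not code from the pipeline: the function names are hypothetical, and the read type labels "pb", "ont", and "ill" are borrowed from the samplesheet's read_type column.

```python
from typing import List, Optional, Set


def long_read_polisher_input(available: Set[str]) -> Optional[str]:
    """Pick the read type for long read polishing, or None if there is none.

    PacBio HiFi ("pb") is preferred; ONT ("ont") is the fallback.
    """
    if "pb" in available:
        return "pb"
    if "ont" in available:
        return "ont"
    return None


def polishing_plan(available: Set[str], long_read_polish: bool) -> List[str]:
    """List the polishing stages in the order they would run."""
    plan: List[str] = []
    if long_read_polish:
        long_reads = long_read_polisher_input(available)
        if long_reads is not None:
            plan.append(f"long read polish ({long_reads})")
    if "ill" in available:
        # Short reads polish last, whether or not long read polishing ran.
        plan.append("short read polish (ill)")
    return plan


print(polishing_plan({"pb", "ont", "ill"}, long_read_polish=True))
print(polishing_plan({"ont", "ill"}, long_read_polish=False))
```

With all three read types and long read polishing on, the plan is a PacBio HiFi polish followed by a short read polish; with long read polishing off, only the short read polish remains.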

Purge Haplotigs is the first step of manual curation, as it produces a histogram that must be inspected to choose values for the -l, -m, and -h flags. If purge is activated in your configuration, the pipeline stops at the purge step and waits for you to enter these parameters manually, based on the histogram of your assembly, which can be found in your out directory.
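Before handing the cutoffs back to the pipeline, it can help to sanity-check them. The sketch below assumes the usual reading of the three flags (low cutoff, midpoint between the haploid and diploid peaks, high cutoff), so the values should be strictly increasing. The helper name and the example numbers are hypothetical, and how the pipeline actually receives the values is not shown here.

```python
from typing import List


def purge_cutoff_flags(low: int, mid: int, high: int) -> List[str]:
    """Check histogram cutoffs and format them as -l/-m/-h arguments."""
    if not (0 <= low < mid < high):
        raise ValueError(
            f"expected 0 <= low < mid < high, got {low}, {mid}, {high}"
        )
    return ["-l", str(low), "-m", str(mid), "-h", str(high)]


# Hypothetical cutoffs read off a coverage histogram.
print(" ".join(purge_cutoff_flags(low=5, mid=30, high=190)))
```

A swapped or out-of-order value raises a ValueError before anything is passed on, which is cheaper than discovering the mistake after the purge step has run.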

Quick Start

Installation

Only Nextflow and Singularity need to be installed to run Argonaut. Users who would like to run Centrifuge and/or Kraken2 will need to provide a database. Similar requirements apply to running Recentrifuge and Blobtools with Blast and the NCBI taxdump. Follow the links provided for database download directions. Xanadu users running Argonaut at the University of Connecticut may use the database paths provided in the example params.yaml.

Samplesheets

To get started setting up your run, prepare a samplesheet with your input data as follows:

samplesheet.csv:

sample,fastq_1,fastq_2,single_end,read_type
chr3_gibbon_pb,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_pb.fastq.gz,,TRUE,pb
chr3_gibbon_ont,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ont.fastq.gz,,TRUE,ont
chr3_gibbon_ill,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ill_R1.paired.fastq.gz,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ill_R2.paired.fastq.gz,FALSE,ill

!!! PLEASE ADD "ont", "pb", AND/OR "ill" TO YOUR SAMPLE NAMES !!! Failure to do so may result in assemblers not recognizing your read type and/or outputs being overwritten.

The sample name in your samplesheet serves as the prefix for your output files. Please indicate the read type in the sample name as well as in the read_type column.
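A quick pre-flight check of the samplesheet can catch these naming problems before a run starts. The following is a minimal sketch and not part of Argonaut: the function name is hypothetical, the column names come from the example samplesheet, and the rules it enforces (read type present in the sample name, unique sample names, a second FASTQ for paired-end rows) are an interpretation of the guidance above.

```python
import csv
import io
from typing import List

EXPECTED_COLUMNS = ["sample", "fastq_1", "fastq_2", "single_end", "read_type"]
READ_TYPES = {"ont", "pb", "ill"}


def check_samplesheet(text: str) -> List[str]:
    """Return a list of problems found in samplesheet CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems: List[str] = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        sample, read_type = row["sample"], row["read_type"]
        if read_type not in READ_TYPES:
            problems.append(f"line {line_no}: unknown read_type {read_type!r}")
        elif read_type not in sample:
            problems.append(
                f"line {line_no}: sample {sample!r} does not contain {read_type!r}"
            )
        if sample in seen:
            # Duplicate prefixes could lead to overwritten outputs.
            problems.append(f"line {line_no}: duplicate sample {sample!r}")
        seen.add(sample)
        paired = (row["single_end"] or "").upper() == "FALSE"
        if paired and not row["fastq_2"]:
            problems.append(f"line {line_no}: paired-end row is missing fastq_2")
    return problems


# Hypothetical samplesheet with one good row and one bad row.
example = """sample,fastq_1,fastq_2,single_end,read_type
demo_pb,/data/demo_pb.fastq.gz,,TRUE,pb
demo,/data/demo_R1.fastq.gz,,FALSE,ill
"""
for problem in check_samplesheet(example):
    print(problem)
```

The second row is reported twice: its sample name lacks "ill", and it is marked paired-end without a second FASTQ file.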

After you have your samplesheet, create a params.yaml file to specify the paths to your samplesheet, contaminant databases, etc. You will most likely also need a config file to override the pipeline's default settings. Browse the nextflow.config file for the defaults and set the ones you would like to change in your my_config file. More information is located in the usage section.

Now, you can run the pipeline using:

nextflow run emilytrybulec/argonaut \
  -r main \
  -params-file params.yaml \
  -c my_config \
  -profile singularity,xanadu

Pipeline output

All of the output from the programs run in the pipeline will be located in the out directory specified in params.yaml. The pipeline produces the following labeled directories depending on configurations:

├── 01 READ QC
│   ├── centrifuge
│   ├── fastp
│   ├── fastqc
│   ├── genome size est
│   │   ├── genomescope2
│   │   ├── jellyfish
│   │   ├── ont gce
│   │   ├── ont kmerfreq
│   ├── kraken2
│   ├── nanoplot
│   ├── pacbio cutadapt
├── 02 ASSEMBLY
│   ├── hybrid
│   ├── long read
│   ├── short read
├── 03 POLISH
│   ├── hybrid
│   │   ├── polca
│   ├── long read
│   │   ├── medaka
│   │   ├── racon
├── 04 PURGE
│   ├── align
│   ├── histogram
│   ├── purge haplotigs
│   ├── short read redundans
├── 05 SCAFFOLD
├── ASSEMBLY QC
│   ├── busco
│   ├── bwamem2
│   ├── merqury
│   ├── minimap2
│   ├── quast
│   ├── samtools
├── OUTPUT
│   ├── blobtools visualization
│   ├── coverage
│   ├── genome size estimation
│   ├── *assemblyStats.txt
├── PIPELINE INFO
    └── execution_trace_*.txt

Some output files have labels such as "dc", indicating that the reads have been decontaminated, or "lf", indicating that reads have been length filtered.

Information about interpreting output is located in the output section.

Credits

emilytrybulec/genomeassembly was originally written by Emily Trybulec.

I thank the following people for their extensive assistance in the development of this pipeline:

University of Connecticut:

Biodiversity and Conservation Genomics Center
- Jill Wegrzyn
- Cynthia Webster
- Anthony He
- Laurel Humphrey
- Keertana Chagari
- Amanda Mueller
- Cristopher Guzman
- Harshita Akella

Rachel O'Neill Lab
- Rachel O’Neill
- Michelle Neitzey
- Nicole Pauloski
- Vel Johnston

Computational Biology Core
- Noah Reid
- Gabe Barrett
nf-core Community
Zbigniew Trybulec

Contributions and Support

Development of this pipeline was funded by the University of Connecticut Office of Undergraduate Research through the Summer Undergraduate Research Fund (SURF) Grant.

The Biodiversity and Conservation Genomics Center is a part of the Earth BioGenome Project, working towards capturing the genetic diversity of life on Earth.

Citations

Argonaut is currently unpublished. For now, please use the GitHub URL when referencing.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Login: henry-schober
Kind: user

Repositories: 1
Profile: https://github.com/henry-schober

Citation (CITATIONS.md)

# emilytrybulec/genomeassembly: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [Bioawk](https://github.com/lh3/bioawk)

- [BUSCO](https://busco.ezlab.org/)

  > Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics (Oxford, England), 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351

- [Canu](https://github.com/marbl/canu)

  > Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., & Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5), 722–736. https://doi.org/10.1101/gr.215087.116

- [Centrifuge](http://www.ccb.jhu.edu/software/centrifuge/)

  > Kim, D., Song, L., Breitwieser, F. P., & Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12), 1721–1729. https://doi.org/10.1101/gr.210641.116

- [Fastp](https://github.com/OpenGene/fastp)

  > Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Flye](https://github.com/fenderglass/Flye)

  > Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P. A. (2019). Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5), 540–546. https://doi.org/10.1038/s41587-019-0072-8

- [GCE](https://github.com/fanagislab/GCE)

  > Binghang Liu, Yujian Shi, Jianying Yuan, et al. and Wei Fan*. (2013). Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome project. arXiv.org arXiv: 1308.2012. 

- [GenomeScope2](https://github.com/tbenavi1/genomescope2.0)

  > Ranallo-Benavidez, T. R., Jaron, K. S., & Schatz, M. C. (2020). GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications, 11(1), 1432. https://doi.org/10.1038/s41467-020-14998-3

- [Gzip](https://www.gzip.org/)

- [Jellyfish](https://github.com/gmarcais/Jellyfish)

  > Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 (first published online January 7, 2011) doi:10.1093/bioinformatics/btr011

- [Kmerfreq](https://github.com/fanagislab/kmerfreq)

  > Hengchao Wang, Bo Liu, Yan Zhang, Fan Jiang, Yuwei Ren, Lijuan Yin, Hangwei Liu, Sen Wang, Wei Fan. (2020). Estimation of genome size using k-mer frequencies from corrected long reads. arXiv:2003.11817 [q-bio.GN] 

  > Binghang Liu, Yujian Shi, Jianying Yuan, et al. and Wei Fan*. (2013). Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome project. arXiv.org arXiv: 1308.2012. 

- [Kraken2](https://github.com/DerrickWood/kraken2/tree/master)

  > Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0

- [MaSuRCA](https://github.com/alekseyzimin/masurca)

  > Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics (Oxford, England), 29(21), 2669–2677. https://doi.org/10.1093/bioinformatics/btt476

- [Medaka](https://github.com/nanoporetech/medaka)

- [Merqury](https://github.com/marbl/merqury)

  > Rhie, A., Walenz, B.P., Koren, S. et al. (2020). Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245. https://doi.org/10.1186/s13059-020-02134-9

- [Meryl](https://github.com/marbl/meryl)

  > Rhie, A., Walenz, B.P., Koren, S. et al. (2020). Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245. https://doi.org/10.1186/s13059-020-02134-9

- [Minimap2](https://github.com/lh3/minimap2/tree/master)

  > Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

  > Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

- [Nanoplot](https://github.com/wdecoster/NanoPlot)

  > De Coster, W., D'Hert, S., Schultz, D. T., Cruts, M., & Van Broeckhoven, C. (2018). NanoPack: visualizing and processing long-read sequencing data. Bioinformatics (Oxford, England), 34(15), 2666–2669. https://doi.org/10.1093/bioinformatics/bty149

- [Numfmt](https://github.com/borgar/numfmt)

- [POLCA](https://github.com/alekseyzimin/masurca)

  > Zimin, A. V., & Salzberg, S. L. (2020). The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS computational biology, 16(6), e1007981. https://doi.org/10.1371/journal.pcbi.1007981

- [Purge](https://bitbucket.org/mroachawri/purge_haplotigs/src/master/)

  > Roach, M. J., Schmidt, S. A., & Borneman, A. R. (2018). Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics, 19(1), 460. https://doi.org/10.1186/s12859-018-2485-7

- [PycoQC](https://github.com/a-slide/pycoQC)

  > Leger, Adrien & Leonardi, Tommaso. (2019). pycoQC, interactive quality control for Oxford Nanopore Sequencing. Journal of Open Source Software, 4(34), 1236. https://doi.org/10.21105/joss.01236

- [Quast](https://quast.sourceforge.net/)

  > Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics (Oxford, England), 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086

- [Ragtag](https://github.com/malonge/RagTag)

  > Alonge, M., Lebeigle, L., Kirsche, M., Jenike, K., Ou, S., Aganezov, S., Wang, X., Lippman, Z. B., Schatz, M. C., & Soyk, S. (2022). Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome biology, 23(1), 258. https://doi.org/10.1186/s13059-022-02823-7

- [Recentrifuge](https://github.com/khyox/recentrifuge)

  > Martí J.M. (2019). Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967

- [Samtools](https://github.com/samtools/samtools)

  > Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. (February 2021). Twelve years of SAMtools and BCFtools. GigaScience, Volume 10, Issue 2, giab008, https://doi.org/10.1093/gigascience/giab008


- [Seqkit](https://github.com/shenwei356/seqkit)

  > W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Member event: 1
Push event: 7
Pull request event: 2

Last Year

Member event: 1
Push event: 7
Pull request event: 2

Dependencies

.github/workflows/awsfulltest.yml actions

actions/upload-artifact v3 composite
seqeralabs/action-tower-launch v1 composite

.github/workflows/awstest.yml actions

actions/upload-artifact v3 composite
seqeralabs/action-tower-launch v1 composite

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/checkout v3 composite
nf-core/setup-nextflow v1 composite

.github/workflows/clean-up.yml actions

actions/stale v7 composite

.github/workflows/fix-linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite

.github/workflows/linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
mshick/add-pr-comment v1 composite
nf-core/setup-nextflow v1 composite
psf/black stable composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

modules/local/fastqc2/meta.yml cpan

modules/local/fastqc3/meta.yml cpan

modules/local/seqkit/grep/meta.yml cpan

modules/nf-core/bioawk/meta.yml cpan

modules/nf-core/busco/meta.yml cpan

modules/nf-core/bwamem2/index/meta.yml cpan

modules/nf-core/bwamem2/mem/meta.yml cpan

modules/nf-core/canu/meta.yml cpan

modules/nf-core/centrifuge/centrifuge/meta.yml cpan

modules/nf-core/centrifuge/kreport/meta.yml cpan

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/fastp/meta.yml cpan

modules/nf-core/fastqc/meta.yml cpan

modules/nf-core/flye/meta.yml cpan

modules/nf-core/genomescope2/meta.yml cpan

modules/nf-core/gunzip/meta.yml cpan

modules/nf-core/hifiasm/meta.yml cpan

modules/nf-core/kraken2/kraken2/meta.yml cpan

modules/nf-core/medaka/meta.yml cpan

modules/nf-core/merqury/meta.yml cpan

modules/nf-core/meryl/count/meta.yml cpan

modules/nf-core/minimap2/align/meta.yml cpan

modules/nf-core/minimap2/index/meta.yml cpan

modules/nf-core/multiqc/meta.yml cpan

modules/nf-core/mummer/meta.yml cpan

modules/nf-core/nanoplot/meta.yml cpan

modules/nf-core/purgedups/calcuts/meta.yml cpan

modules/nf-core/purgedups/pbcstat/meta.yml cpan

modules/nf-core/purgedups/purgedups/meta.yml cpan

modules/nf-core/pycoqc/meta.yml cpan

modules/nf-core/quast/meta.yml cpan

modules/nf-core/racon/meta.yml cpan

modules/nf-core/samblaster/meta.yml cpan

modules/nf-core/samtools/index/meta.yml cpan

modules/nf-core/seqkit/grep/meta.yml cpan

modules/nf-core/seqtk/seq/meta.yml cpan

modules/nf-core/racon/environment.yml pypi

pyproject.toml pypi
