qc
pipeline for quality checking of both raw and annotated assemblies
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Introduction
Argonaut performs automated reads-to-genome operations for de novo assemblies: it is a bioinformatics pipeline that assembles genomes from long-read and short-read data. FASTQ files and input information are fed to the pipeline, which produces final assemblies with completeness, contiguity, and correctness quality checks at each step. The pipeline accepts short reads, long reads, or both.
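Contiguity, one of the three quality dimensions mentioned above, is commonly summarized with the N50 statistic, which tools such as QUAST report. The following is a minimal sketch of that definition, not Argonaut's own code; the function name `n50` is illustrative. N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length
    return 0


# Total length is 20; the two longest contigs (6 + 5 = 11) cover at least half.
print(n50([2, 3, 4, 5, 6]))  # 5
```

A higher N50 for the same total length indicates that the assembly is made of fewer, longer contigs.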
Pipeline Summary
Illumina Short Read
1. Read QC, Adaptor Trimming, Contaminant Filtering (FastQC v0.11.9, FastP v0.23.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)
PacBio HiFi Long Read (CCS format)
1. Read QC, Adaptor Trimming, Contaminant Filtering (Nanoplot v1.41.0, CutAdapt v3.4, GenomeScope2 v2.0, Jellyfish v2.2.6, Kraken2 v2.1.2, Recentrifuge v1.9.1)
2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)
ONT Long Read
1. Read QC and Contaminant Filtering (Nanoplot v1.41.0, KmerFreq, GCE, Centrifuge v1.0.4, Recentrifuge v1.9.1)
2. Length Filtering (optional) (Seqkit v2.4.0, Nanoplot v1.41.0)
All reads are used for the following steps:
- Assembly (Flye v2.9, Canu v2.2, Verkko v2.2, Hifiasm v0.19.8, MaSuRCA v4.1.0)
- Polish
- Purge
- Scaffolding
- Quality Checking
- Assembly Visualization
To the right is a figure detailing the major workflow steps involved in hybrid assembly.
If you enable long-read polishing, the pipeline polishes with PacBio HiFi reads by default and falls back to ONT reads when no PacBio HiFi reads are available. If short reads are also available, they are automatically used to polish the assemblies after long-read polishing (or directly after assembly if long-read polishing is off).
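The read-selection rule for polishing can be written as a small decision function. This is only an illustrative sketch of the behaviour described above, not code from the pipeline; the function name, argument names, and step labels are all hypothetical.

```python
from typing import List


def plan_polishing(long_read_polish: bool, has_pb: bool,
                   has_ont: bool, has_ill: bool) -> List[str]:
    """Return the polishing rounds, in order, implied by the available reads."""
    steps: List[str] = []
    if long_read_polish:
        if has_pb:
            # PacBio HiFi is preferred whenever it is present.
            steps.append("long-read polish (PacBio HiFi)")
        elif has_ont:
            # ONT is only used when no PacBio HiFi reads exist.
            steps.append("long-read polish (ONT)")
    if has_ill:
        # Short reads always polish last, or directly after assembly.
        steps.append("short-read polish (Illumina)")
    return steps


print(plan_polishing(True, has_pb=False, has_ont=True, has_ill=True))
# ['long-read polish (ONT)', 'short-read polish (Illumina)']
```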
Purge Haplotigs is the first step of manual curation, as it produces a histogram that must be inspected to choose values for the -l, -m, and -h flags. If purge is activated in your configuration, the pipeline stops at the purge step and waits for you to enter these parameters manually, based on the histogram of your assembly, which can be found in your out directory.
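In Purge Haplotigs, -l is the low read-depth cutoff, -m is the low point between the haploid and diploid peaks, and -h is the high read-depth cutoff. The sketch below only illustrates what those three values mean on an idealized, already smoothed histogram. It is a hypothetical heuristic, not part of Argonaut or Purge Haplotigs: the function name and the `tail_fraction` parameter are assumptions, and real histograms still need to be inspected by eye.

```python
from typing import Sequence, Tuple


def suggest_cutoffs(counts: Sequence[int],
                    tail_fraction: float = 0.1) -> Tuple[int, int, int]:
    """Suggest (low, mid, high) depths from counts[d] = positions at depth d.

    Assumes a smoothed histogram with two clear peaks (haploid and diploid).
    """
    # Local maxima: higher than the left neighbour, not lower than the right one.
    peaks = [d for d in range(1, len(counts) - 1)
             if counts[d] > counts[d - 1] and counts[d] >= counts[d + 1]]
    if len(peaks) < 2:
        raise ValueError("expected at least two peaks in the histogram")
    # Keep the two tallest peaks, ordered by depth.
    first, second = sorted(sorted(peaks, key=lambda d: counts[d], reverse=True)[:2])
    # mid: the lowest point of the valley between the two peaks.
    mid = min(range(first, second + 1), key=lambda d: counts[d])
    # low: walk left from the first peak until the curve falls into its tail.
    low = first
    while low > 0 and counts[low] > tail_fraction * counts[first]:
        low -= 1
    # high: walk right from the second peak until the curve falls into its tail.
    high = second
    while high < len(counts) - 1 and counts[high] > tail_fraction * counts[second]:
        high += 1
    return low, mid, high


toy_histogram = [0, 1, 5, 10, 5, 2, 6, 12, 6, 1, 0]  # made-up data
print(suggest_cutoffs(toy_histogram))  # (1, 5, 9)
```

Values obtained this way are at most a starting point for the manual choice that the pipeline waits for.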
Quick Start
Installation
Only Nextflow and Singularity need to be installed to run Argonaut. Users who would like to run Centrifuge and/or Kraken2 will need to provide a database. Similar requirements apply to running Recentrifuge and Blobtools with Blast and NCBI taxdump. Follow the links provided for database download directions. Xanadu users running Argonaut at the University of Connecticut may use the database paths provided in the example params.yaml.
Samplesheets
To get started setting up your run, prepare a samplesheet with your input data as follows:
samplesheet.csv:
csv
sample,fastq_1,fastq_2,single_end,read_type
chr3_gibbon_pb,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_pb.fastq.gz,,TRUE,pb
chr3_gibbon_ont,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ont.fastq.gz,,TRUE,ont
chr3_gibbon_ill,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ill_R1.paired.fastq.gz,/core/projects/EBP/conservation/gen_assembly_pipeline/hoolock_chrm_3/chr3_ill_R2.paired.fastq.gz,FALSE,ill
!!! PLEASE ADD "ont", "pb", AND/OR "ill" TO YOUR SAMPLE NAMES !!! Failure to do so may result in assemblers not recognizing your read type and/or outputs being overwritten.
The sample name entered in your samplesheet serves as the prefix for your output files. Please indicate the kind of read in both the sample name and the read_type column.
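Because a mislabeled sample name can lead to overwritten outputs, it may help to check the samplesheet before launching a run. The following is a minimal sketch of such a check and is not part of Argonaut. The function name is hypothetical, the file names in the example are made up, and the rule that paired-end rows (single_end set to FALSE) need a fastq_2 entry is inferred from the example samplesheet rather than documented behaviour.

```python
import csv
import io
from typing import List

READ_TYPES = {"ont", "pb", "ill"}


def check_samplesheet(text: str) -> List[str]:
    """Return a list of human-readable problems found in a samplesheet."""
    problems: List[str] = []
    seen = set()
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row.get("sample") or "").strip()
        read_type = (row.get("read_type") or "").strip()
        single_end = (row.get("single_end") or "").strip().upper()
        fastq_2 = (row.get("fastq_2") or "").strip()

        if read_type not in READ_TYPES:
            problems.append(f"line {line_no}: unknown read_type '{read_type}'")
        elif read_type not in sample:
            # Plain substring test; a name such as 'montana' would pass for 'ont'.
            problems.append(
                f"line {line_no}: sample '{sample}' does not contain '{read_type}'")
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sample name '{sample}'")
        seen.add(sample)
        if single_end == "FALSE" and not fastq_2:
            problems.append(f"line {line_no}: paired-end row has no fastq_2")
    return problems


sheet = """sample,fastq_1,fastq_2,single_end,read_type
chr3_gibbon_pb,chr3_pb.fastq.gz,,TRUE,pb
chr3_gibbon,chr3_ont.fastq.gz,,TRUE,ont
"""
for problem in check_samplesheet(sheet):
    print(problem)  # reports that line 3 lacks 'ont' in its sample name
```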
After you have your samplesheet, create a params.yaml file to specify the paths to your samplesheet, contaminant databases, etc. You will most likely also need a config file to modify the pipeline's default settings. Browse the defaults in the nextflow.config file and specify the ones you would like to change in your my_config file. More information is located in the usage section.
Now, you can run the pipeline using:
bash
nextflow run emilytrybulec/argonaut \
-r main \
-params-file params.yaml \
-c my_config \
-profile singularity,xanadu
Pipeline output
All of the output from the programs run in the pipeline will be located in the out directory specified in params.yaml. The pipeline produces the following labeled directories, depending on your configuration:
├── 01 READ QC
│ ├── centrifuge
│ ├── fastp
│ ├── fastqc
│ ├── genome size est
│ │ ├── genomescope2
│ │ ├── jellyfish
│ │ ├── ont gce
│ │ ├── ont kmerfreq
│ ├── kraken2
│ ├── nanoplot
│ ├── pacbio cutadapt
├── 02 ASSEMBLY
│ ├── hybrid
│ ├── long read
│ ├── short read
├── 03 POLISH
│ ├── hybrid
│ │ ├── polca
│ ├── long read
│ │ ├── medaka
│ │ ├── racon
├── 04 PURGE
│ ├── align
│ ├── histogram
│ ├── purge haplotigs
│ ├── short read redundans
├── 05 SCAFFOLD
├── ASSEMBLY QC
│ ├── busco
│ ├── bwamem2
│ ├── merqury
│ ├── minimap2
│ ├── quast
│ ├── samtools
├── OUTPUT
│ ├── blobtools visualization
│ ├── coverage
│ ├── genome size estimation
│ ├── *assemblyStats.txt
├── PIPELINE INFO
└── execution_trace_*.txt
Some output files have labels such as "dc", indicating that the reads have been decontaminated, or "lf", indicating that reads have been length filtered.
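When scanning many output files, these labels can be decoded programmatically. The sketch below is illustrative only: the function name is hypothetical, and it assumes that labels appear as separate tokens delimited by underscores, dots, or hyphens, which the pipeline's actual file naming may not follow. The file name in the example is made up.

```python
import re
from typing import List

# Labels described above; any other token is ignored.
LABELS = {"dc": "decontaminated", "lf": "length filtered"}


def describe_labels(filename: str) -> List[str]:
    """Return the meaning of each known label found in a file name."""
    tokens = re.split(r"[._-]", filename)
    return [LABELS[token] for token in tokens if token in LABELS]


print(describe_labels("chr3_gibbon_ont_dc_lf.fastq.gz"))
# ['decontaminated', 'length filtered']
```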
Information about interpreting output is located in the output section.
Credits
emilytrybulec/genomeassembly was originally written by Emily Trybulec.
I thank the following people for their extensive assistance in the development of this pipeline:
University of Connecticut:
- Biodiversity and Conservation Genomics Center
  - Jill Wegrzyn
  - Cynthia Webster
  - Anthony He
  - Laurel Humphrey
  - Keertana Chagari
  - Amanda Mueller
  - Cristopher Guzman
  - Harshita Akella
- Rachel O'Neill Lab
  - Rachel O'Neill
  - Michelle Neitzey
  - Nicole Pauloski
  - Vel Johnston
- Computational Biology Core
  - Noah Reid
  - Gabe Barrett
nf-core Community
Zbigniew Trybulec
Contributions and Support
Development of this pipeline was funded by the University of Connecticut Office of Undergraduate Research through the Summer Undergraduate Research Fund (SURF) Grant.
The Biodiversity and Conservation Genomics Center is part of the Earth BioGenome Project, which works toward capturing the genetic diversity of life on Earth.
Citations
Argonaut is currently unpublished. For now, please use the GitHub URL when referencing.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Login: henry-schober
- Kind: user
- Repositories: 1
- Profile: https://github.com/henry-schober
Citation (CITATIONS.md)
# emilytrybulec/genomeassembly: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [Bioawk](https://github.com/lh3/bioawk)

- [BUSCO](https://busco.ezlab.org/)

  > Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics (Oxford, England), 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351

- [Canu](https://github.com/marbl/canu)

  > Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., & Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5), 722–736. https://doi.org/10.1101/gr.215087.116

- [Centrifuge](http://www.ccb.jhu.edu/software/centrifuge/)

  > Kim, D., Song, L., Breitwieser, F. P., & Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12), 1721–1729. https://doi.org/10.1101/gr.210641.116

- [Fastp](https://github.com/OpenGene/fastp)

  > Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107

  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Flye](https://github.com/fenderglass/Flye)

  > Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P. A. (2019). Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5), 540–546. https://doi.org/10.1038/s41587-019-0072-8

- [GCE](https://github.com/fanagislab/GCE)

  > Binghang Liu, Yujian Shi, Jianying Yuan, et al. and Wei Fan*. (2013). Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome project. arXiv.org arXiv: 1308.2012.

- [GenomeScope2](https://github.com/tbenavi1/genomescope2.0)

  > Ranallo-Benavidez, T. R., Jaron, K. S., & Schatz, M. C. (2020). GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications, 11(1), 1432. https://doi.org/10.1038/s41467-020-14998-3

- [Gzip](https://www.gzip.org/)

- [Jellyfish](https://github.com/gmarcais/Jellyfish)

  > Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 (first published online January 7, 2011) doi:10.1093/bioinformatics/btr011

- [Kmerfreq](https://github.com/fanagislab/kmerfreq)

  > Hengchao Wang, Bo Liu, Yan Zhang, Fan Jiang, Yuwei Ren, Lijuan Yin, Hangwei Liu, Sen Wang, Wei Fan. (2020). Estimation of genome size using k-mer frequencies from corrected long reads. arXiv:2003.11817 [q-bio.GN]

  > Binghang Liu, Yujian Shi, Jianying Yuan, et al. and Wei Fan*. (2013). Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome project. arXiv.org arXiv: 1308.2012.

- [Kraken2](https://github.com/DerrickWood/kraken2/tree/master)

  > Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0

- [MaSuRCA](https://github.com/alekseyzimin/masurca)

  > Zimin, A. V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S. L., & Yorke, J. A. (2013). The MaSuRCA genome assembler. Bioinformatics (Oxford, England), 29(21), 2669–2677. https://doi.org/10.1093/bioinformatics/btt476

- [Medaka](https://github.com/nanoporetech/medaka)

- [Merqury](https://github.com/marbl/merqury)

  > Rhie, A., Walenz, B.P., Koren, S. et al. (2020). Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245. https://doi.org/10.1186/s13059-020-02134-9

- [Meryl](https://github.com/marbl/meryl)

  > Rhie, A., Walenz, B.P., Koren, S. et al. (2020). Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245. https://doi.org/10.1186/s13059-020-02134-9

- [Minimap2](https://github.com/lh3/minimap2/tree/master)

  > Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

  > Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

- [Nanoplot](https://github.com/wdecoster/NanoPlot)

  > De Coster, W., D'Hert, S., Schultz, D. T., Cruts, M., & Van Broeckhoven, C. (2018). NanoPack: visualizing and processing long-read sequencing data. Bioinformatics (Oxford, England), 34(15), 2666–2669. https://doi.org/10.1093/bioinformatics/bty149

- [Numfmt](https://github.com/borgar/numfmt)

- [POLCA](https://github.com/alekseyzimin/masurca)

  > Zimin, A. V., & Salzberg, S. L. (2020). The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS computational biology, 16(6), e1007981. https://doi.org/10.1371/journal.pcbi.1007981

- [Purge](https://bitbucket.org/mroachawri/purge_haplotigs/src/master/)

  > Roach, M. J., Schmidt, S. A., & Borneman, A. R. (2018). Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics, 19(1), 460. https://doi.org/10.1186/s12859-018-2485-7

- [PycoQC](https://github.com/a-slide/pycoQC)

  > Leger, Adrien & Leonardi, Tommaso. (2019). pycoQC, interactive quality control for Oxford Nanopore Sequencing. Journal of Open Source Software, 4(34), 1236. https://doi.org/10.21105/joss.01236

- [Quast](https://quast.sourceforge.net/)

  > Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics (Oxford, England), 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086

- [Ragtag](https://github.com/malonge/RagTag)

  > Alonge, M., Lebeigle, L., Kirsche, M., Jenike, K., Ou, S., Aganezov, S., Wang, X., Lippman, Z. B., Schatz, M. C., & Soyk, S. (2022). Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome biology, 23(1), 258. https://doi.org/10.1186/s13059-022-02823-7

- [Recentrifuge](https://github.com/khyox/recentrifuge)

  > Martí J.M. (2019). Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967

- [Samtools](https://github.com/samtools/samtools)

  > Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li. (February 2021) GigaScience, Volume 10, Issue 2, giab008, https://doi.org/10.1093/gigascience/giab008

- [Seqkit](https://github.com/shenwei356/seqkit)

  > W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Member event: 1
- Push event: 7
- Pull request event: 2
Last Year
- Member event: 1
- Push event: 7
- Pull request event: 2
Dependencies
- actions/upload-artifact v3 composite
- seqeralabs/action-tower-launch v1 composite
- actions/upload-artifact v3 composite
- seqeralabs/action-tower-launch v1 composite
- mshick/add-pr-comment v1 composite
- actions/checkout v3 composite
- nf-core/setup-nextflow v1 composite
- actions/stale v7 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- mshick/add-pr-comment v1 composite
- nf-core/setup-nextflow v1 composite
- psf/black stable composite
- dawidd6/action-download-artifact v2 composite
- marocchino/sticky-pull-request-comment v2 composite