hivgenopipe

https://github.com/lhri-bioinformatics/hivgenopipe

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 7 DOI reference(s) in README
✓ Academic publication links: Links to ncbi.nlm.nih.gov
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: LHRI-Bioinformatics
License: mit
Language: Python
Default Branch: main
Size: 2.17 MB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 1

Created about 1 year ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License Code of conduct Citation

HIVGenoPipe

HIVGenoPipe Quick Start

Sample data set

The MiSeq data used for testing this pipeline are available under SRA BioProject accession PRJNA1218827.

Create samplesheet

Use fastq_dir_to_samplesheet.py on a FASTQ directory to create a samplesheet.

By default, the script expects the _R1_001.fastq.gz and _R2_001.fastq.gz extensions for paired reads.

Please note that FASTQ files must be gzip-compressed. The file extensions and sample name detection can be changed with script options (use --help to list all options).

Example: python fastq_dir_to_samplesheet.py <path_to_paired_fastq_files>

The default output is <fastq_directory>_<date_stamp>_samplesheet.csv

Samplesheets should follow this example CSV format:
```
sample,fastq_1,fastq_2,sample_type
HivPos,full_path_to_HivPos_R1_001.fastq.gz,full_path_to_HivPos_R2_001.fastq.gz,positive
Sample-1,full_path_to_Sample-1_R1_001.fastq.gz,full_path_to_Sample-1_R2_001.fastq.gz,test
```

Sample types may be test, positive, or negative. Users may identify sample types within the directory using the `--positive_control_identifier` or `--negative_control_identifier` flags. Samples labeled as positive or negative controls generate separate statistics for reporting. Samples marked as positive controls may optionally be analyzed against a control reference (e.g. NL4-3).
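The pairing and labeling rules described above can be sketched in a few lines of Python. This is an illustration only and not the code of fastq_dir_to_samplesheet.py: the function name `build_rows` and the parameters `positive_id` and `negative_id` are hypothetical, and the sketch assumes a control is recognized by a substring of the sample name.

```python
from pathlib import Path

# Default paired-read suffixes named in this README.
R1_SUFFIX = "_R1_001.fastq.gz"
R2_SUFFIX = "_R2_001.fastq.gz"


def build_rows(fastq_paths, positive_id=None, negative_id=None):
    """Pair R1/R2 files and assign a sample type to each pair.

    Hypothetical helper: control samples are detected by substring match
    on the sample name, which is an assumption of this sketch.
    """
    paths = set(fastq_paths)
    rows = []
    for r1 in sorted(p for p in paths if p.endswith(R1_SUFFIX)):
        r2 = r1[: -len(R1_SUFFIX)] + R2_SUFFIX
        if r2 not in paths:
            continue  # skip reads without a mate file
        sample = Path(r1).name[: -len(R1_SUFFIX)]
        if positive_id and positive_id in sample:
            sample_type = "positive"
        elif negative_id and negative_id in sample:
            sample_type = "negative"
        else:
            sample_type = "test"
        rows.append([sample, r1, r2, sample_type])
    return rows


if __name__ == "__main__":
    demo = [
        "/data/HivPos_R1_001.fastq.gz",
        "/data/HivPos_R2_001.fastq.gz",
        "/data/Sample-1_R1_001.fastq.gz",
        "/data/Sample-1_R2_001.fastq.gz",
    ]
    for row in build_rows(demo, positive_id="HivPos"):
        print(",".join(row))
```

Each printed line is one samplesheet row in the column order shown above; the header line still has to be written separately.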

Run nextflow

Note: If you are new to Nextflow and nf-core, please refer to the nf-core installation documentation on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Example minimal test run statement, where <path_to_cloned_pipeline_dir> is the cloned repository (e.g. HIVGenoPipe):

    nextflow run <path_to_cloned_pipeline_dir>/main.nf \
        -profile test,docker

This should let you know whether your Nextflow environment has been set up properly. This workflow's samplesheet differs slightly from the typical nf-core samplesheet, so the run will error out at this point, but all modules and run parameters should appear.

Example full run statement:

    nextflow run <path_to_cloned_pipeline_dir>/main.nf \
        --input samplesheet.csv \
        -profile <docker/singularity/conda> \
        --outdir <OUTDIR> \
        --fasta K03455.1_HXB2_737-5219.fasta

Additional run options include:
- --metadata: <.csv> file with sample information; it is merged on the sample_id column and appended to the final report
- --rundir: path to the Illumina run directory; generates run-level statistics in the final report
- --positive_control_ref: performs an additional alignment of positive control samples to this second reference (e.g. NL4-3)
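To illustrate what merging metadata on the sample_id column means, here is a minimal sketch of a left join using only the standard library. It is not the pipeline's reporting code: the function `merge_metadata` and the columns `median_depth` and `collection_date` are made up for the example, and only `sample_id` comes from the description above.

```python
import csv
import io


def merge_metadata(report_rows, metadata_rows, key="sample_id"):
    """Append metadata columns to report rows that share the same key.

    Samples without metadata keep their report values and receive
    empty strings for the metadata columns.
    """
    lookup = {row[key]: row for row in metadata_rows}
    extra_cols = []
    for row in metadata_rows:
        for col in row:
            if col != key and col not in extra_cols:
                extra_cols.append(col)
    merged = []
    for row in report_rows:
        meta = lookup.get(row[key], {})
        out = dict(row)
        for col in extra_cols:
            # Never overwrite a column that the report already has.
            out.setdefault(col, meta.get(col, ""))
        merged.append(out)
    return merged


if __name__ == "__main__":
    # Hypothetical report and metadata contents.
    report_csv = "sample_id,median_depth\nS1,12906\nS2,800\n"
    metadata_csv = "sample_id,collection_date\nS1,2024-01-05\n"
    report = list(csv.DictReader(io.StringIO(report_csv)))
    metadata = list(csv.DictReader(io.StringIO(metadata_csv)))
    for row in merge_metadata(report, metadata):
        print(row)
```

A left join keeps every sample in the report even when the metadata file has no matching row, which is the behavior one would usually want for a run-level report.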

Introduction

HIVGenoPipe is an HIV drug resistance genotyping pipeline that combines several well-known bioinformatics tools with custom scripts to analyze NGS data (FASTQ) and return consensus sequences and relevant statistics. De novo assembly (Trinity) is used to create preliminary contigs from cleaned FASTQ read data. A hybrid consensus is created by filling in any gaps between contigs using a reference sequence of the user's choosing. The cleaned reads are aligned to the hybrid consensus to form a secondary consensus, and the reads are then aligned again to that secondary consensus. This two-alignment method better retains minor mutations that may inform drug resistance calls. The second alignment is used to determine consensus calls at different ambiguity sensitivities (by default 5%, 15%, and a majority consensus). These consensus sequences are then submitted to the Stanford HIV drug resistance database to identify relevant mutations. Comprehensive reports and statistics are available as outputs.
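As a rough illustration of what an ambiguity sensitivity means, the sketch below calls one consensus base from per-base read counts. It assumes that a base is reported when its frequency is at or above the threshold and that mixtures are written as IUPAC codes; whether pysamstats_parse.py applies exactly this rule (for example around deletions or minimum depth) is not stated here, so treat `call_base` as a hypothetical helper.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(counts, threshold=None):
    """Call one consensus base from read counts for A, C, G and T.

    threshold is a fraction (0.05 means 5%); None requests a majority call.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage at this position
    if threshold is None:
        # Majority call; ties resolve to the alphabetically first base.
        return max(sorted(counts), key=lambda base: counts[base])
    kept = frozenset(b for b, n in counts.items() if n / total >= threshold)
    return IUPAC.get(kept, "N")


if __name__ == "__main__":
    # 88% A, 10% G, 2% C: a minor G variant visible at 5% but not at 15%.
    counts = {"A": 880, "C": 20, "G": 100, "T": 0}
    print(call_base(counts, 0.05))
    print(call_base(counts, 0.15))
    print(call_base(counts))
```

With these counts the 5% call is the mixture code R (A or G), while the 15% and majority calls are both A, which shows why a lower threshold retains minor variants.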

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

HIVGP Workflow

Pipeline summary

Input check and quality control (FastQC, Trimmomatic)
De Novo assembly (Trinity)
Clean non-HIV contigs (BBDuk)
Align contigs with reference and create hybrid consensus (MAFFT, custom script consensusFromMAFFT.py)
Align reads to hybrid consensus (bwa)
Create secondary consensus (samtools consensus)
Align reads to secondary consensus (bwa)
Create final consensus sequences (pysamstats, custom script pysamstats_parse.py)
Submit consensus to HIV drug resistance db (custom scripts stanford_json.py, stanford_summary.py)
Gather summary statistics and reports
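The hybrid consensus step can be pictured with a small sketch that works on sequences that are already aligned to the same length (as in a MAFFT output). It is a simplification and not the logic of consensusFromMAFFT.py: the function `hybrid_consensus` is hypothetical, every contig gap is treated as uncovered, and the first contig covering a column wins.

```python
def hybrid_consensus(aligned_ref, aligned_contigs, gap="-"):
    """Prefer contig bases; use the reference where no contig covers a column.

    Simplifying assumption: a gap character in a contig always means
    "no coverage", so true deletions are not distinguished from gaps
    between contigs.
    """
    if any(len(c) != len(aligned_ref) for c in aligned_contigs):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for i, ref_base in enumerate(aligned_ref):
        contig_bases = [c[i] for c in aligned_contigs if c[i] != gap]
        base = contig_bases[0] if contig_bases else ref_base
        if base != gap:  # drop columns that are gaps in every sequence used
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    contigs = [
        "ACCT------",  # covers the first four columns
        "-------TAG",  # covers the last three columns
    ]
    print(hybrid_consensus(ref, contigs))
```

In the demo the three uncovered middle columns are filled from the reference, so the result keeps the contig differences at both ends while remaining full length.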

Pipeline output

    outdir
    ├── bbmap
    ├── final_consensus - all ambiguous/majority fastas
    ├── drug_resistance
    │   ├── stanford_json
    │   └── stanford_report - drug resistance, mutation, and protein seq reports
    ├── fastqc
    ├── filter_reads - HIV filtering
    ├── hybrid_align_bwa - first alignment
    ├── mafft
    ├── misc_reports
    │   ├── check_coverage
    │   ├── ngs_stats
    │   └── samtools_stats
    ├── multiqc
    ├── pipeline_info - nextflow run reporting
    ├── pysamstats_reports - in/del reports, read stats files
    │   ├── indel_report
    │   └── read_stats_files
    ├── ref_align
    │   ├── ref_align_bwa
    │   └── ref_align_depth - read depth per position in alignment to ref
    ├── report
    │   └── final_report.tsv - comprehensive report for all samples
    ├── sample_similarity_matrix - sample to sample alignment and matrix per % consensus
    ├── samtools_consensus
    ├── samtools_consensus_align
    │   ├── sam_cons_align_bwa - final alignment to sample-specific intermediate consensus
    │   └── sam_cons_align_depth - read depth per position in alignment to sample-specific intermediate consensus
    ├── trimmomatic - trimmed reads
    └── trinity - de novo assembled contigs
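The per-position depth files listed above can be summarized with a few lines of Python. The sketch assumes the standard three-column, tab-separated output of samtools depth (reference name, position, depth) and computes an unscaled median absolute deviation; the function `depth_summary` is hypothetical, and the exact file names and the statistics the pipeline reports may differ.

```python
from statistics import median


def depth_summary(depth_text):
    """Return (median depth, median absolute deviation) from depth text.

    Expects tab-separated lines of reference name, position and depth.
    If more depth columns are present, only the first one is used.
    """
    depths = []
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        _ref, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth records found")
    med = median(depths)
    mad = median(abs(d - med) for d in depths)
    return med, mad


if __name__ == "__main__":
    # Made-up depth values for five positions.
    demo = "ref\t1\t100\nref\t2\t200\nref\t3\t300\nref\t4\t400\nref\t5\t1000\n"
    print(depth_summary(demo))
```

The median and MAD are less sensitive to a few very deep positions than the mean and standard deviation, which suits amplicon data with uneven coverage.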

Customizing the workflow

Positive control reference subworkflow:

This positive control subworkflow is enabled with the --positive_control_ref <.fasta> parameter. This subworkflow will align any positive control samples (as labeled in the samplesheet) to an additional control reference sequence, separate from the reference used for consensus determination. For example, HXB2 is commonly used as a reference for consensus determination while NL4-3 can be used as a positive control reference. This subworkflow was designed to help validate positive control samples as they are used alongside patient test samples.

Example of positive control reference subworkflow output:

| QC Metric | QC Status | Value | QC Range |
| --------- | --------- | ----- | -------- |
| Total HIV Read1 | PASSED | 106724 | >=5000 |
| Total HIV Read1 (% of raw) | PASSED | 85.699373 | 50-100 |
| Median Depth (reads) | PASSED | 12906 | 200-100000000 |
| Depth MAD (reads) | PASSED | 2928.737 | 200-100000000 |
| Contig Coverage (%) | PASSED | 100% | 100% |
| Error Rate (Read Variation %) | PASSED | 0.002454332 | <1% |
| Contig Error Rate at 5% (%) | PASSED | 0 | <0.1% |
| Contig Error Rate at 15% (%) | PASSED | 0 | <0.1% |
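The QC Range column uses three notations: a comparison (>=5000, <1%), an interval (50-100), and a single required value (100%). The sketch below shows one way such strings could be evaluated. It is an assumption about how the ranges are meant to be read (interval bounds inclusive, percent signs ignored) and not the pipeline's own QC code; `in_range` is a hypothetical helper.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"


def in_range(value, spec):
    """Check a numeric value against a range string from the QC table."""
    spec = spec.replace("%", "").strip()
    match = re.fullmatch(rf"(>=|<=|>|<)\s*({NUMBER})", spec)
    if match:
        op, bound = match.group(1), float(match.group(2))
        if op == ">=":
            return value >= bound
        if op == "<=":
            return value <= bound
        if op == ">":
            return value > bound
        return value < bound
    match = re.fullmatch(rf"({NUMBER})\s*-\s*({NUMBER})", spec)
    if match:
        # Interval bounds are treated as inclusive (assumption).
        return float(match.group(1)) <= value <= float(match.group(2))
    return value == float(spec)  # a bare number means an exact match


if __name__ == "__main__":
    checks = [
        (106724, ">=5000"),
        (85.699373, "50-100"),
        (0.002454332, "<1%"),
        (150, "200-100000000"),
    ]
    for value, spec in checks:
        status = "PASSED" if in_range(value, spec) else "FAILED"
        print(value, spec, status)
```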

Module settings can be adjusted here:

conf/modules.config

For example, you can set the arguments of the consensus calling scripts in the following sections:
```groovy
withName: PYSAMSTATS {
    ext.args = '--min-baseq=15'
}

withName: PYSAMSTATS_PARSER {
    ext.args = '-a 5 15 -d 200 -i -t -c'
}
```
- pysamstats `--min-baseq` can be tuned to be more or less strict
- pysamstats parser script setting `-a` or `--ambiguity` controls the threshold for ambiguous base calls

Please see help messages for individual scripts or documentation of other pipeline tools for additional settings.

Computational resource settings can be directly adjusted here:

nextflow.config

Or as params in the run statement (e.g. --max_memory 64.GB)

    // Max resource options
    // Defaults only, expecting to be overwritten
    max_memory = '128.GB'
    max_cpus = 16
    max_time = '240.h'

Credits

We thank the following people for their extensive assistance in the development of this pipeline:

LHRI Bioinformatics
VISL
Nf-core community

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: LHRI-Bioinformatics
Login: LHRI-Bioinformatics
Kind: organization

Repositories: 1
Profile: https://github.com/LHRI-Bioinformatics

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use `nf-core tools` in your work, please cite the `nf-core` publication"
authors:
  - family-names: Ewels
    given-names: Philip
  - family-names: Peltzer
    given-names: Alexander
  - family-names: Fillinger
    given-names: Sven
  - family-names: Patel
    given-names: Harshil
  - family-names: Alneberg
    given-names: Johannes
  - family-names: Wilm
    given-names: Andreas
  - family-names: Ulysse Garcia
    given-names: Maxime
  - family-names: Di Tommaso
    given-names: Paolo
  - family-names: Nahnsen
    given-names: Sven
title: "The nf-core framework for community-curated bioinformatics pipelines."
version: 2.4.1
doi: 10.1038/s41587-020-0439-x
date-released: 2022-05-16
url: https://github.com/nf-core/tools
preferred-citation:
  type: article
  authors:
    - family-names: Ewels
      given-names: Philip
    - family-names: Peltzer
      given-names: Alexander
    - family-names: Fillinger
      given-names: Sven
    - family-names: Patel
      given-names: Harshil
    - family-names: Alneberg
      given-names: Johannes
    - family-names: Wilm
      given-names: Andreas
    - family-names: Ulysse Garcia
      given-names: Maxime
    - family-names: Di Tommaso
      given-names: Paolo
    - family-names: Nahnsen
      given-names: Sven
  doi: 10.1038/s41587-020-0439-x
  journal: nature biotechnology
  start: 276
  end: 278
  title: "The nf-core framework for community-curated bioinformatics pipelines."
  issue: 3
  volume: 38
  year: 2020
  url: https://dx.doi.org/10.1038/s41587-020-0439-x

GitHub Events

Total

Release event: 3
Push event: 14
Fork event: 1
Create event: 1

Last Year

Release event: 3
Push event: 14
Fork event: 1
Create event: 1

Dependencies

modules/local/last/lastal/meta.yml cpan

modules/local/last/mafconvert/meta.yml cpan

modules/local/trimmomatic/meta.yml cpan

modules/nf-core/modules/bbmap/bbduk/meta.yml cpan

modules/nf-core/modules/bbmap/bbnorm/meta.yml cpan

modules/nf-core/modules/bwa/index/meta.yml cpan

modules/nf-core/modules/bwa/mem/meta.yml cpan

modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/modules/fastqc/meta.yml cpan

modules/nf-core/modules/mafft/meta.yml cpan

modules/nf-core/modules/multiqc/meta.yml cpan

modules/nf-core/modules/samtools/depth/meta.yml cpan

modules/nf-core/modules/samtools/index/meta.yml cpan

modules/nf-core/modules/samtools/stats/meta.yml cpan

modules/nf-core/modules/trimmomatic/meta.yml cpan

modules/nf-core/modules/trinity/meta.yml cpan

modules/nf-core/modules/bbmap/bbduk/environment.yml pypi

modules/nf-core/modules/bbmap/bbnorm/environment.yml pypi

pyproject.toml pypi
