Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README
- ✓ Academic publication links: Links to ncbi.nlm.nih.gov
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: LHRI-Bioinformatics
- License: mit
- Language: Python
- Default Branch: main
- Size: 2.17 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
HIVGenoPipe
HIVGenoPipe Quick Start
Sample data set
The MiSeq data used for testing this pipeline are available under SRA BioProject accession PRJNA1218827.
Create samplesheet
Use `fastq_dir_to_samplesheet.py` on a FASTQ directory to create a samplesheet.
By default, the script expects `_R1_001.fastq.gz` and `_R2_001.fastq.gz` extensions for paired reads.
Please note that FASTQ files must be gzip-compressed.
The file extensions and sample name detection options can be changed with script options (use `--help` for all options).
Example:
```bash
python fastq_dir_to_samplesheet.py <path_to_paired_fastq_files>
```
The default output is <fastq_directory>_<date_stamp>_samplesheet.csv
Samplesheets should follow this example CSV format:
```
sample,fastq_1,fastq_2,sample_type
HivPos,full_path_to_HivPos_R1_001.fastq.gz,full_path_to_HivPos_R2_001.fastq.gz,positive
Sample-1,full_path_to_Sample-1_R1_001.fastq.gz,full_path_to_Sample-1_R2_001.fastq.gz,test
```
Sample types may be `test`, `positive`, or `negative`. Users may identify sample types within the directory using the `--positive_control_identifier` or `--negative_control_identifier` options.
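As a rough illustration of the pairing and sample-type logic described above, the following is a minimal sketch and not the code of `fastq_dir_to_samplesheet.py`. The function name `build_rows` and its parameters are hypothetical; only the default read extensions and the three sample types come from this README.

```python
from pathlib import Path


def build_rows(fastq_paths, r1_ext="_R1_001.fastq.gz", r2_ext="_R2_001.fastq.gz",
               positive_id=None, negative_id=None):
    """Pair R1/R2 files by shared prefix and assign a sample type to each pair."""
    paths = set(fastq_paths)
    rows = []
    for r1 in sorted(p for p in paths if p.endswith(r1_ext)):
        prefix = r1[: -len(r1_ext)]
        r2 = prefix + r2_ext
        if r2 not in paths:
            continue  # skip reads without a mate
        sample = Path(prefix).name
        # Assumption: a control is recognised by a substring of the sample name.
        if positive_id and positive_id in sample:
            sample_type = "positive"
        elif negative_id and negative_id in sample:
            sample_type = "negative"
        else:
            sample_type = "test"
        rows.append((sample, r1, r2, sample_type))
    return rows


example_files = [
    "/data/HivPos_R1_001.fastq.gz",
    "/data/HivPos_R2_001.fastq.gz",
    "/data/Sample-1_R1_001.fastq.gz",
    "/data/Sample-1_R2_001.fastq.gz",
]
# Prepend the header row from the example format when writing a real samplesheet.
for row in build_rows(example_files, positive_id="HivPos"):
    print(",".join(row))
```

The sketch takes a list of paths rather than scanning a directory so that it can be tried without any FASTQ files on disk.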
Run nextflow
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
Example minimal test run statement:
```bash
# <path_to_cloned_pipeline_dir> is the cloned repository, i.e. HIVGenoPipe
nextflow run <path_to_cloned_pipeline_dir>/main.nf \
    -profile test,docker
```
This should let you know whether your Nextflow environment has been set up properly. This workflow's samplesheet is slightly different from the typical nf-core samplesheet, so the test run will error out at this point, but all modules and run parameters should still appear.
Example full run statement:
```bash
# <path_to_cloned_pipeline_dir> is the cloned repository, i.e. HIVGenoPipe
nextflow run <path_to_cloned_pipeline_dir>/main.nf \
    --input samplesheet.csv \
    -profile <docker/singularity/conda> \
    --outdir <OUTDIR> \
    --fasta K03455.1_HXB2_737-5219.fasta
```
Additional run options include:
- `--metadata` - `.csv` file with additional sample information; it is merged on the `sample_id` column and appended to the final report
- `--rundir` - path to the Illumina run directory; generates run-level statistics in the final report
- `--positive_control_ref` - performs an additional alignment of positive control samples to this second reference (e.g. NL4-3)
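The `--metadata` merge can be pictured as a left join on `sample_id`. The sketch below is only an illustration of that idea using the standard library; the function name `append_metadata`, the example columns (`collection_date`, `site`, `median_depth`), and the choice to leave missing metadata blank are assumptions, not the pipeline's actual behavior.

```python
import csv
import io


def append_metadata(report_rows, metadata_rows, key="sample_id"):
    """Left-join metadata columns onto report rows using a shared key column."""
    meta_by_id = {row[key]: row for row in metadata_rows}
    # Collect metadata column names in first-seen order, excluding the key.
    extra_cols = []
    for row in metadata_rows:
        for col in row:
            if col != key and col not in extra_cols:
                extra_cols.append(col)
    merged = []
    for row in report_rows:
        meta = meta_by_id.get(row[key], {})
        out = dict(row)
        for col in extra_cols:
            out[col] = meta.get(col, "")  # blank when a sample has no metadata
        merged.append(out)
    return merged


# Hypothetical metadata file content and report rows.
metadata_csv = "sample_id,collection_date,site\nSample-1,2024-01-15,A\n"
metadata = list(csv.DictReader(io.StringIO(metadata_csv)))
report = [
    {"sample_id": "Sample-1", "median_depth": "12906"},
    {"sample_id": "HivPos", "median_depth": "9000"},
]
for row in append_metadata(report, metadata):
    print(row)
```

Samples without a matching metadata row are kept, so the report never loses a sample because of an incomplete metadata file.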
Introduction
HIVGenoPipe is an HIV drug resistance genotyping pipeline which deploys several well-known bioinformatics tools in combination with custom scripts to analyze NGS data (FASTQ) and return consensus sequences and relevant statistics. De novo assembly (Trinity) is used to create preliminary contigs from cleaned FASTQ read data. A hybrid consensus is created by filling in any gaps between contigs using a reference sequence of the user's choosing. The cleaned reads are aligned to the hybrid consensus to form a secondary consensus. The reads are then aligned again to the secondary consensus. This two-alignment method better allows for the retention of minor mutations that may inform drug resistance calls. The second alignment is used to determine consensus calls at different ambiguity sensitivities (5%, 15%, and a majority consensus by default). These consensus sequences are then queried against the Stanford HIV drug resistance database for relevant mutations. Comprehensive reports and statistics are available as outputs.
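To make the ambiguity sensitivities concrete, here is a minimal sketch of threshold-based consensus calling at a single position. It is an illustration only and not the pipeline's `pysamstats_parse.py`; the function name `call_base` and the `min_depth` parameter are hypothetical. Every base whose frequency reaches the threshold is kept, and the kept set is reported as an IUPAC code.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(counts, threshold=None, min_depth=1):
    """Call one consensus base from per-base read counts (keys A, C, G, T).

    threshold=None gives a majority call; otherwise every base with
    frequency >= threshold contributes to an IUPAC ambiguity code.
    """
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "N"  # not enough coverage to call
    if threshold is None:
        return max(sorted(counts), key=lambda base: counts[base])
    kept = frozenset(b for b, n in counts.items() if n / depth >= threshold)
    return IUPAC.get(kept, "N")


# A position with a minor G variant at 8% frequency.
position = {"A": 900, "C": 20, "G": 80, "T": 0}
print(call_base(position, 0.05), call_base(position, 0.15), call_base(position))
# prints: R A A
```

In this example the 8% G variant is reported as `R` (A or G) at the 5% threshold but is dropped at 15% and in the majority call, which is why the more sensitive consensus can matter for minor drug resistance mutations.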
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
Pipeline summary
- Input check and quality control (`FastQC`, `Trimmomatic`)
- De novo assembly (`Trinity`)
- Clean non-HIV contigs (`BBDuk`)
- Align contigs with reference and create hybrid consensus (`MAFFT`, custom script `consensusFromMAFFT.py`)
- Align reads to hybrid consensus (`bwa`)
- Create secondary consensus (`samtools consensus`)
- Align reads to secondary consensus (`bwa`)
- Create final consensus sequences (`pysamstats`, custom script `pysamstats_parse.py`)
- Submit consensus to HIV drug resistance db (custom scripts `stanford_json.py`, `stanford_summary.py`)
- Gather summary statistics and reports
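The hybrid consensus step can be sketched as follows. This is a simplified illustration of filling gaps between contigs with the reference, assuming the contigs and reference are already aligned to the same length (for example by MAFFT); it is not the code of `consensusFromMAFFT.py`, and the function name `hybrid_consensus` is hypothetical.

```python
def hybrid_consensus(aligned_ref, aligned_contigs):
    """Build a hybrid sequence from an aligned reference and aligned contigs.

    Each column takes the base of the first contig that covers it; columns
    covered by no contig fall back to the reference base.
    """
    if any(len(c) != len(aligned_ref) for c in aligned_contigs):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for i, ref_base in enumerate(aligned_ref):
        contig_bases = [c[i] for c in aligned_contigs if c[i] != "-"]
        if contig_bases:
            out.append(contig_bases[0])
        elif ref_base != "-":
            out.append(ref_base)
        # columns that are gaps in every sequence are dropped
    return "".join(out)


ref = "ACGTACGTACGT"
contigs = [
    "ACCTA-------",  # covers the 5' end, with one mismatch to the reference
    "--------ACGA",  # covers the 3' end
]
print(hybrid_consensus(ref, contigs))
# prints: ACCTACGTACGA
```

Note that this sketch does not distinguish a true deletion inside a contig from an uncovered region; both are filled from the reference here, which a real implementation would likely need to handle differently.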
Pipeline output
outdir
├── bbmap
├── final_consensus - all ambiguous/majority fastas
├── drug_resistance
│ ├── stanford_json
│ └── stanford_report - drug resistance, mutation, and protein seq reports
├── fastqc
├── filter_reads - HIV filtering
├── hybrid_align_bwa - first alignment
├── mafft
├── misc_reports
│ ├── check_coverage
│ ├── ngs_stats
│ └── samtools_stats
├── multiqc
├── pipeline_info - nextflow run reporting
├── pysamstats_reports - in/del reports, read stats files
│ ├── indel_report
│ └── read_stats_files
├── ref_align
│ ├── ref_align_bwa
│ └── ref_align_depth - read depth per position in alignment to ref
├── report
│ └── final_report.tsv - comprehensive report for all samples
├── sample_similarity_matrix - sample to sample alignment and matrix per % consensus
├── samtools_consensus
├── samtools_consensus_align
│ ├── sam_cons_align_bwa - final alignment to sample-specific intermediate consensus
│ └── sam_cons_align_depth - read depth per position in alignment to sample-specific intermediate consensus
├── trimmomatic - trimmed reads
└── trinity - de novo assembled contigs
Customizing the workflow
Positive control reference subworkflow:
This positive control subworkflow is enabled with the --positive_control_ref <.fasta> parameter. This subworkflow will align any positive control samples (as labeled in the samplesheet) to an additional control reference sequence, separate from the reference used for consensus determination. For example, HXB2 is commonly used as a reference for consensus determination while NL4-3 can be used as a positive control reference. This subworkflow was designed to help validate positive control samples as they are used alongside patient test samples.
Example of positive control reference subworkflow output:
| QC Metric | QC Status | Value | QC Range |
| --------- | --------- | ----- | -------- |
| Total HIV Read1 | PASSED | 106724 | >=5000 |
| Total HIV Read1 (% of raw) | PASSED | 85.699373 | 50-100 |
| Median Depth (reads) | PASSED | 12906 | 200-100000000 |
| Depth MAD (reads) | PASSED | 2928.737 | 200-100000000 |
| Contig Coverage (%) | PASSED | 100% | 100% |
| Error Rate (Read Variation %) | PASSED | 0.002454332 | <1% |
| Contig Error Rate at 5% (%) | PASSED | 0 | <0.1% |
| Contig Error Rate at 15% (%) | PASSED | 0 | <0.1% |
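For readers unfamiliar with the depth metrics in this table, the sketch below shows one way a median depth and a median absolute deviation (MAD) could be computed from per-position read depths and checked against a QC range. It is an assumption-laden illustration: the function names are hypothetical, the MAD here is the raw unscaled form (the pipeline may use a different definition), and the `FAILED` label is assumed since the table only shows `PASSED`.

```python
from statistics import median


def depth_summary(depths):
    """Return (median depth, raw median absolute deviation) for a list of depths."""
    med = median(depths)
    mad = median(abs(d - med) for d in depths)
    return med, mad


def qc_status(value, low, high):
    """Label a metric PASSED when it lies inside the inclusive range [low, high]."""
    return "PASSED" if low <= value <= high else "FAILED"


# Hypothetical per-position depths for a very short region.
depths = [100, 250, 300, 400, 10000]
med, mad = depth_summary(depths)
print(med, qc_status(med, 200, 100000000))
print(mad, qc_status(mad, 200, 100000000))
# prints:
# 300 PASSED
# 100 FAILED
```

Using the median rather than the mean keeps a single high-coverage position, such as the 10000 above, from dominating the summary.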
Module settings can be adjusted here:
/conf/modules.config/
For example, you can adjust the settings used for consensus calling in the following sections:
```groovy
withName: PYSAMSTATS {
ext.args = '--min-baseq=15'
}
withName: PYSAMSTATS_PARSER {
ext.args = '-a 5 15 -d 200 -i -t -c'
}
```
- pysamstats `--min-baseq` can be tuned to be more or less strict
- pysamstats parser script setting `-a` or `--ambiguity` controls the threshold for ambiguous base calls
Please see help messages for individual scripts or documentation of other pipeline tools for additional settings.
Computational resource settings can be adjusted directly in `nextflow.config`, or passed as params in the run statement (e.g. `--max_memory 64.GB`):
```groovy
// Max resource options
// Defaults only, expecting to be overwritten
max_memory = '128.GB'
max_cpus = 16
max_time = '240.h'
```
Credits
We thank the following people for their extensive assistance in the development of this pipeline:
- LHRI Bioinformatics
- VISL
- nf-core community
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: LHRI-Bioinformatics
- Login: LHRI-Bioinformatics
- Kind: organization
- Repositories: 1
- Profile: https://github.com/LHRI-Bioinformatics
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use `nf-core tools` in your work, please cite the `nf-core` publication"
authors:
  - family-names: Ewels
    given-names: Philip
  - family-names: Peltzer
    given-names: Alexander
  - family-names: Fillinger
    given-names: Sven
  - family-names: Patel
    given-names: Harshil
  - family-names: Alneberg
    given-names: Johannes
  - family-names: Wilm
    given-names: Andreas
  - family-names: Ulysse Garcia
    given-names: Maxime
  - family-names: Di Tommaso
    given-names: Paolo
  - family-names: Nahnsen
    given-names: Sven
title: "The nf-core framework for community-curated bioinformatics pipelines."
version: 2.4.1
doi: 10.1038/s41587-020-0439-x
date-released: 2022-05-16
url: https://github.com/nf-core/tools
prefered-citation:
  type: article
  authors:
    - family-names: Ewels
      given-names: Philip
    - family-names: Peltzer
      given-names: Alexander
    - family-names: Fillinger
      given-names: Sven
    - family-names: Patel
      given-names: Harshil
    - family-names: Alneberg
      given-names: Johannes
    - family-names: Wilm
      given-names: Andreas
    - family-names: Ulysse Garcia
      given-names: Maxime
    - family-names: Di Tommaso
      given-names: Paolo
    - family-names: Nahnsen
      given-names: Sven
  doi: 10.1038/s41587-020-0439-x
  journal: nature biotechnology
  start: 276
  end: 278
  title: "The nf-core framework for community-curated bioinformatics pipelines."
  issue: 3
  volume: 38
  year: 2020
  url: https://dx.doi.org/10.1038/s41587-020-0439-x
GitHub Events
Total
- Release event: 3
- Push event: 14
- Fork event: 1
- Create event: 1
Last Year
- Release event: 3
- Push event: 14
- Fork event: 1
- Create event: 1