nf-fastq2bam

Nextflow pipeline to convert FASTQ files to BAM for WGS and RNA-seq data

https://github.com/ccicb/nf-fastq2bam

Keywords

alignment bam bioinformatics fastq nextflow pipeline wgs

Last synced: 6 months ago

Basic Info

Host: GitHub
Owner: CCICB
License: mit
Language: Nextflow
Default Branch: main
Homepage:
Size: 225 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created 9 months ago · Last pushed 9 months ago

Introduction

Many bioinformatics tools require a BAM file as input. nf-fastq2bam is a Nextflow workflow that takes FASTQ files as input and outputs a single mapped BAM for both whole-genome sequencing (WGS) and RNA sequencing (RNA-seq) experiments.
nf-fastq2bam runs the following steps from FASTQ files:

TUMOR/NORMAL WGS FASTQ -> BWAMEM2ALIGN -> SAMBAMBAMERGE -> SAMTOOLSSORT -> GATK4MARKDUPLICATES
RNA FASTQ -> STARALIGN -> SAMBAMBAMERGE -> SAMTOOLSSORT -> GATK4MARKDUPLICATES
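
The two chains differ only in the aligner; the remaining steps are shared. A minimal Python sketch of that branching follows. The step names are copied from the chains above, while the dictionary layout, the function name, and the `dna`/`rna` keys (borrowed from the samplesheet's `sequence_type` column) are illustrative assumptions, not the pipeline's own code.

```python
# Step names are copied from the two chains listed above; the data layout
# and function name are hypothetical and exist only for illustration.
SHARED_STEPS = ["SAMBAMBAMERGE", "SAMTOOLSSORT", "GATK4MARKDUPLICATES"]
ALIGNERS = {"dna": "BWAMEM2ALIGN", "rna": "STARALIGN"}


def steps_for(sequence_type: str) -> list[str]:
    """Return the ordered step names for a given sequence type."""
    try:
        aligner = ALIGNERS[sequence_type]
    except KeyError:
        raise ValueError(f"unsupported sequence_type: {sequence_type!r}") from None
    return [aligner, *SHARED_STEPS]


for seq_type in ("dna", "rna"):
    print(seq_type, " -> ".join(steps_for(seq_type)))
```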

Usage

Note: If you are new to Nextflow and nf-core, please refer to the nf-core documentation on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

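Create a samplesheet.csv with your inputs (WGS and RNA-seq FASTQs in this example). Each row describes one lane of paired-end reads, with the R1 and R2 files separated by a semicolon: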

group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:6,/path/2/data/patient1_tumoursample/HH5FYCCXY_6_180321_FD01114733_R1.fastq.gz;/path/2/data/patient1_tumoursample/HH5FYCCXY_6_180321_FD01114733_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:5,/path/2/data/patient1_tumoursample/HH5FYCCXY_5_180321_FD01114733_R1.fastq.gz;/path/2/data/patient1_tumoursample/HH5FYCCXY_5_180321_FD01114733_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_normalsample,normal,dna,fastq,library_id:HH5FYCCXZ_library;lane:7,/path/2/data/patient1_normalsample/HH5FYCCXZ_7_180321_FD01114735_R1.fastq.gz;/path/2/data/patient1_normalsample/HH5FYCCXZ_7_180321_FD01114735_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_rnasample,tumor,rna,fastq,library_id:HH5FYCCXZ_library;lane:9,/path/2/data/patient1_rnasample/HH5FYCCXZ_9_180321_FD01114735_R1.fastq.gz;/path/2/data/patient1_rnasample/HH5FYCCXZ_9_180321_FD01114735_R2.fastq.gz
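
To sanity-check a samplesheet before launching, it can help to see which lanes belong to which sample, since rows sharing a `sample_id` are the ones expected to be merged. The Python sketch below does that with the standard library only. It assumes the `info` column is a `;`-separated list of `key:value` pairs and the `filepath` column holds `R1;R2`, as inferred from the example above; the shortened IDs, paths, and function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, shortened rows in the same column layout as the example above.
EXAMPLE = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
P1_WGTS,P1,P1_tumour,tumor,dna,fastq,library_id:LIB_A;lane:6,/data/t_L6_R1.fastq.gz;/data/t_L6_R2.fastq.gz
P1_WGTS,P1,P1_tumour,tumor,dna,fastq,library_id:LIB_A;lane:5,/data/t_L5_R1.fastq.gz;/data/t_L5_R2.fastq.gz
P1_WGTS,P1,P1_rna,tumor,rna,fastq,library_id:LIB_B;lane:9,/data/r_L9_R1.fastq.gz;/data/r_L9_R2.fastq.gz
"""


def parse_info(info):
    """Split 'library_id:X;lane:6' into {'library_id': 'X', 'lane': '6'}."""
    return dict(item.split(":", 1) for item in info.split(";") if item)


def lanes_by_sample(handle):
    """Group FASTQ pairs by sample_id so multi-lane samples are visible."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        r1, r2 = row["filepath"].split(";")  # assumes exactly one R1;R2 pair
        info = parse_info(row["info"])
        grouped[row["sample_id"]].append(
            {"lane": info["lane"], "library_id": info["library_id"], "r1": r1, "r2": r2}
        )
    return dict(grouped)


lanes = lanes_by_sample(io.StringIO(EXAMPLE))
for sample_id, entries in lanes.items():
    print(sample_id, [entry["lane"] for entry in entries])
```

For a real file, pass `open("samplesheet.csv", newline="")` instead of the `io.StringIO` wrapper.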

Your params.yaml file:

ref_data_genome_fasta: /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta
ref_data_genome_fai: /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai
ref_data_genome_dict: /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict
ref_data_genome_bwa_index_image: /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img
ref_data_genome_gridss_index: /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.gridsscache
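
In the example above, every reference entry is the FASTA path plus a fixed suffix. If your reference files follow the same layout, the entries can be generated rather than typed by hand. The Python sketch below assumes exactly that layout; the parameter names and suffixes are copied from the example, while the function names and the sample FASTA path are hypothetical.

```python
# Parameter names and suffixes are taken from the params.yaml example above.
# Assumption: all companion files sit next to the FASTA with these suffixes.
SUFFIXES = {
    "ref_data_genome_fasta": "",
    "ref_data_genome_fai": ".fai",
    "ref_data_genome_dict": ".dict",
    "ref_data_genome_bwa_index_image": ".img",
    "ref_data_genome_gridss_index": ".gridsscache",
}


def reference_params(fasta):
    """Derive every reference parameter from a single FASTA path."""
    return {name: fasta + suffix for name, suffix in SUFFIXES.items()}


def to_yaml(params):
    """Render flat string parameters as simple 'key: value' YAML lines."""
    return "".join(f"{name}: {value}\n" for name, value in params.items())


# Hypothetical FASTA location, used only to show the output shape.
print(to_yaml(reference_params("/path/2/reference/genome.fasta")), end="")
```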

Your run.config file for HPC environments, e.g. NCI Gadi:

```
params {
    // Tip: use conf/hmf_genomes.config as a guide
    config_profile_description = 'NCI Gadi HPC profile'
    config_profile_url = 'https://opus.nci.org.au/display/Help/Gadi+User+Guide'
    genomes {
        'GRCh38_hmf' {
            fasta = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta"
            fai = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai"
            dict = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict"
            img = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img"
            gridss_index = "/path/2/hmf_resources/38/gridssindex-2.13.2.tar.gz"
            bwamem2_index = "/path/2/hmf_resources/38/bwa-mem2index-2.2.1.tar.gz"
            star_index = "/path/2/hmf_resources/38/starindex-gencode38-2.7.3a.tar.gz"
        }
    }

    ref_data_hmf_data_path   = "/path/2/hmf_resources/38/hmf_pipeline_resources.38_v2.0--3"
    ref_data_panel_data_path = "/path/2/hmf_resources/38/panel/tso500/hmf_panel_resources.tso500.38_v2.1.0--3"
}

singularity {
    enabled = true
    cacheDir = '/path/2/cache'
    autoMounts = true
}

executor {
    queueSize = 200
    pollInterval = '5 min'
    queueStatInterval = '5 min'
    submitRateLimit = '20 min'
}

process {
    resourceLimits = [ cpus: 32, memory: 1020.GB, time: 48.h ]
    errorStrategy = 'retry'
    maxRetries = 3
    executor = 'pbspro'
    project = ''
    module = 'singularity'
    cache = 'lenient'
    stageInMode = 'symlink'
}
```

Your run.config file if running in a local environment:

```
params {
    genomes {
        'GRCh38_hmf' {
            fasta = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta"
            fai = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai"
            dict = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict"
            img = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img"
            gridss_index = "/path/2/hmf_resources/38/gridssindex-2.13.2.tar.gz"
            bwamem2_index = "/path/2/hmf_resources/38/bwa-mem2index-2.2.1.tar.gz"
            star_index = "/path/2/hmf_resources/38/starindex-gencode38-2.7.3a.tar.gz"
        }
    }
    ref_data_hmf_data_path   = "/path/2/hmf_resources/38/hmf_pipeline_resources.38_v2.0--3"
    ref_data_panel_data_path = "/path/2/hmf_resources/38/panel/tso500/hmf_panel_resources.tso500.38_v2.1.0--3"
}

process {
    // Defaults for all processes.
    resourceLimits = [ cpus: 32, memory: 1020.GB, time: 48.h ]
    cpus   = { check_max( 1 * task.attempt, 'cpus' ) }
    memory = { check_max( 6.GB * task.attempt, 'memory' ) }
    time   = { check_max( 4.h * task.attempt, 'time' ) }

    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'

    // The 'local' executor is used by default. Please see here for other executor options https://www.nextflow.io/docs/latest/executor.html#executors
    // such as AWS batch, AZURE batch, PBS, Slurm, etc.
    executor = 'local'
}
```
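
The `errorStrategy` closure in the local configuration retries a task only when its exit status is 104 or falls in the inclusive range 130 to 145, and otherwise lets running tasks finish before stopping. The Python sketch below mirrors that decision so it can be checked against exit codes seen in logs. It models the configuration's logic only and is not Nextflow code; the function and constant names are hypothetical.

```python
# Mirrors: task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish'
# Groovy's 130..145 range is inclusive, hence range(130, 146) here.
RETRY_EXIT_CODES = set(range(130, 146)) | {104}


def error_strategy(exit_status):
    """Return the strategy the configuration above would select."""
    return "retry" if exit_status in RETRY_EXIT_CODES else "finish"


for status in (1, 104, 137, 145, 146):
    print(status, error_strategy(status))
```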

Recommended resources.config:

```
process {
    withName: BAMTOOLS {
        memory = { 24.GB * task.attempt }
        cpus = { 16 * task.attempt }
    }
    withName: 'BWAMEM2ALIGN.*' {
        cpus = 4
        memory = { 64.GB * task.attempt }
    }
    withName: 'FASTP.*' {
        memory = { 64.GB * task.attempt }
        cpus = 16
    }
}
```
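
Values written as `{ N * task.attempt }` grow with each retry, while fixed values stay constant. When a run.config also sets `resourceLimits`, as in the examples above, Nextflow caps requests at those limits. The Python sketch below tabulates the resulting requests per attempt. The base values and limits are copied from the configuration examples; the table layout and function name are hypothetical, and the code only models the arithmetic.

```python
# Limits from the resourceLimits line in the run.config examples above.
RESOURCE_LIMITS = {"cpus": 32, "memory_gb": 1020}

# (base value, scales with task.attempt?) per process selector, copied from
# the recommended resources.config above.
RECOMMENDED = {
    "BAMTOOLS": {"cpus": (16, True), "memory_gb": (24, True)},
    "BWAMEM2ALIGN": {"cpus": (4, False), "memory_gb": (64, True)},
    "FASTP": {"cpus": (16, False), "memory_gb": (64, True)},
}


def resources(process, attempt):
    """Return the capped request for one process on a given attempt."""
    request = {}
    for key, (base, scales) in RECOMMENDED[process].items():
        wanted = base * attempt if scales else base
        request[key] = min(wanted, RESOURCE_LIMITS[key])
    return request


for name in RECOMMENDED:
    for attempt in (1, 2, 3):
        print(name, attempt, resources(name, attempt))
```

Under these assumptions, a third BAMTOOLS attempt would ask for 48 CPUs and be capped at 32, which is worth knowing when choosing `maxRetries`.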

Launch nf-fastq2bam:

git clone git@github.com:CCICB/nf-fastq2bam.git

nextflow run nf-fastq2bam \
  -profile <docker|singularity|...> \
  -c <your run.config and resources.config files here, separated by commas. You may also list other config files.> \
  -params-file <your params.yaml> \
  --mode <wgts|targeted> \
  --genome_type <alt|no_alt> \
  --genome <GRCh37_hmf|GRCh38_hmf> \
  --input samplesheet.csv \
  --outdir output/
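
When launching from a script, it can be convenient to assemble the command as an argument list and reject unsupported option values before anything is submitted. The Python sketch below does this using only the options and allowed values shown in the launch command above. The function name and the example file names are assumptions, and the sketch prints the command rather than running it.

```python
import shlex

# Allowed values copied from the placeholders in the launch command above.
CHOICES = {
    "mode": {"wgts", "targeted"},
    "genome_type": {"alt", "no_alt"},
    "genome": {"GRCh37_hmf", "GRCh38_hmf"},
}


def build_command(profile, configs, params_file, samplesheet, outdir, **options):
    """Assemble the nextflow argument list, validating the choice options."""
    for name, allowed in CHOICES.items():
        if options.get(name) not in allowed:
            raise ValueError(f"--{name} must be one of {sorted(allowed)}")
    command = [
        "nextflow", "run", "nf-fastq2bam",
        "-profile", profile,
        "-c", ",".join(configs),  # config files are comma-separated
        "-params-file", params_file,
    ]
    for name in ("mode", "genome_type", "genome"):
        command += [f"--{name}", options[name]]
    command += ["--input", samplesheet, "--outdir", outdir]
    return command


command = build_command(
    profile="singularity",
    configs=["run.config", "resources.config"],
    params_file="params.yaml",
    samplesheet="samplesheet.csv",
    outdir="output/",
    mode="wgts",
    genome_type="alt",
    genome="GRCh38_hmf",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the printed command looks right.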

Warning: Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see the nf-core configuration documentation.

Pipeline output

The pipeline outputs processed BAM files for each sample type: aligned, merged, sorted, and duplicate-marked BAMs for tumour, normal, and RNA-seq samples.

Citations

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: CCICB
Login: CCICB
Kind: organization

Repositories: 4
Profile: https://github.com/CCICB

Citation (CITATIONS.md)

# nf-fastq2bam: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BCFtools](https://doi.org/10.1093/gigascience/giab008)

> Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

- [BWA](https://doi.org/10.1093/bioinformatics/btp324)

> Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

- [bwa-mem2](https://doi.org/10.1109/IPDPS.2019.00041)

> Vasimuddin, Md., Misra, S., Li, H., & Aluru, S. (2019). Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 314–324. https://doi.org/10.1109/IPDPS.2019.00041

- [fastp](https://doi.org/10.1093/bioinformatics/bty560)

> Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

- [GATK](https://doi.org/10.1101/gr.107524.110)

> McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., & DePristo, M. A. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. https://doi.org/10.1101/gr.107524.110

- [Sambamba](https://doi.org/10.1093/bioinformatics/btv098)

> Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: Fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032–2034. https://doi.org/10.1093/bioinformatics/btv098

- [SAMtools](https://doi.org/10.1093/gigascience/giab008)

> Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

> Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

> Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
