nf-fastq2bam
Nextflow pipeline to convert FASTQ files to BAM for WGS and RNA-seq data
Introduction
Many bioinformatics tools require a BAM file as input. nf-fastq2bam is a Nextflow workflow that takes FASTQ files as input and outputs a single mapped BAM for both whole-genome sequencing (WGS) and RNA sequencing (RNA-seq) experiments.
nf-fastq2bam runs the following steps from FASTQ files:
TUMOR/NORMAL WGS FASTQ -> BWAMEM2ALIGN -> SAMBAMBAMERGE -> SAMTOOLSSORT -> GATK4MARKDUPLICATES
RNA FASTQ -> STARALIGN -> SAMBAMBAMERGE -> SAMTOOLSSORT -> GATK4MARKDUPLICATES
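The two routes above differ only in the aligner. As a minimal Python sketch of that dispatch, assuming the `sequence_type` values `dna` and `rna` used in the samplesheet select the route (the helper name `steps_for` is hypothetical, and the step names are copied as written above):

```python
# Step names copied verbatim from the two routes listed above.
SHARED_STEPS = ["SAMBAMBAMERGE", "SAMTOOLSSORT", "GATK4MARKDUPLICATES"]
ALIGNERS = {"dna": "BWAMEM2ALIGN", "rna": "STARALIGN"}


def steps_for(sequence_type):
    """Return the ordered steps for one sample (hypothetical helper)."""
    if sequence_type not in ALIGNERS:
        raise ValueError(f"unsupported sequence_type: {sequence_type!r}")
    # Only the first step depends on the sequence type.
    return [ALIGNERS[sequence_type]] + SHARED_STEPS


print(" -> ".join(steps_for("dna")))
print(" -> ".join(steps_for("rna")))
```

This is an illustration of the step order only; it says nothing about how the workflow itself wires its processes.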
Usage
[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
Create a samplesheet.csv with your inputs (WGS/WTS FASTQs in this example):
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:6,/path/2/data/patient1_tumoursample/HH5FYCCXY_6_180321_FD01114733_R1.fastq.gz;/path/2/data/patient1_tumoursample/HH5FYCCXY_6_180321_FD01114733_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:5,/path/2/data/patient1_tumoursample/HH5FYCCXY_5_180321_FD01114733_R1.fastq.gz;/path/2/data/patient1_tumoursample/HH5FYCCXY_5_180321_FD01114733_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_normalsample,normal,dna,fastq,library_id:HH5FYCCXZ_library;lane:7,/path/2/data/patient1_normalsample/HH5FYCCXZ_7_180321_FD01114735_R1.fastq.gz;/path/2/data/patient1_normalsample/HH5FYCCXZ_7_180321_FD01114735_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_rnasample,tumor,rna,fastq,library_id:HH5FYCCXZ_library;lane:9,/path/2/data/patient1_rnasample/HH5FYCCXZ_9_180321_FD01114735_R1.fastq.gz;/path/2/data/patient1_rnasample/HH5FYCCXZ_9_180321_FD01114735_R2.fastq.gz
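In the example above, each row describes one lane of one sample: the `info` column packs `key:value` pairs separated by `;`, and `filepath` holds the forward and reverse FASTQ separated by `;`. The following Python sketch unpacks those two columns and groups lanes by `sample_id`. It only illustrates the format; the helper `parse_row` and the shortened placeholder paths are assumptions, not the pipeline's own parser.

```python
import csv
import io
from collections import defaultdict

# Two rows in the format shown above; the FASTQ paths are shortened placeholders.
SAMPLESHEET = """\
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:6,/data/t_6_R1.fastq.gz;/data/t_6_R2.fastq.gz
PATIENT1_WGTS,PATIENT1,PATIENT1_tumoursample,tumor,dna,fastq,library_id:HH5FYCCXY_library;lane:5,/data/t_5_R1.fastq.gz;/data/t_5_R2.fastq.gz
"""


def parse_row(row):
    """Unpack the packed `info` and `filepath` columns (hypothetical helper)."""
    # info: ';'-separated key:value pairs, e.g. library_id:...;lane:6
    info = dict(item.split(":", 1) for item in row["info"].split(";"))
    # filepath: forward and reverse FASTQ separated by ';'
    fwd, rev = row["filepath"].split(";")
    return {
        "sample_id": row["sample_id"],
        "library_id": info["library_id"],
        "lane": info["lane"],
        "fastq_fwd": fwd,
        "fastq_rev": rev,
    }


lanes_by_sample = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLESHEET)):
    record = parse_row(row)
    lanes_by_sample[record["sample_id"]].append(record)

for sample_id, lanes in lanes_by_sample.items():
    print(sample_id, sorted(r["lane"] for r in lanes))
```

Grouping by `sample_id` mirrors the idea that several lanes of the same sample end up in one BAM; how the workflow performs that grouping internally is not described in this README.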
Your params.yaml file:
ref_data_genome_fasta : /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta
ref_data_genome_fai : /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai
ref_data_genome_dict : /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict
ref_data_genome_bwa_index_image : /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img
ref_data_genome_gridss_index : /scratch/df13/resources/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.gridsscache
Your run.config file for HPC environments, e.g. NCI Gadi:
```
params {
    // Tip: use conf/hmf_genomes.config as a guide
    config_profile_description = 'NCI Gadi HPC profile'
    config_profile_url = 'https://opus.nci.org.au/display/Help/Gadi+User+Guide'

    genomes {
        'GRCh38_hmf' {
            fasta = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta"
            fai = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai"
            dict = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict"
            img = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img"
            gridss_index = "/path/2/hmf_resources/38/gridss_index-2.13.2.tar.gz"
            bwamem2_index = "/path/2/hmf_resources/38/bwa-mem2_index-2.2.1.tar.gz"
            star_index = "/path/2/hmf_resources/38/star_index-gencode_38-2.7.3a.tar.gz"
        }
    }

    ref_data_hmf_data_path = "/path/2/hmf_resources/38/hmf_pipeline_resources.38_v2.0--3"
    ref_data_panel_data_path = "/path/2/hmf_resources/38/panel/tso500/hmf_panel_resources.tso500.38_v2.1.0--3"
}

singularity {
    enabled = true
    cacheDir = '/path/2/cache'
    autoMounts = true
}

executor {
    queueSize = 200
    pollInterval = '5 min'
    queueStatInterval = '5 min'
    submitRateLimit = '20 min'
}

process {
    resourceLimits = [ cpus: 32, memory: 1020.GB, time: 48.h ]
    errorStrategy = 'retry'
    maxRetries = 3
    executor = 'pbspro'
    project = '
```
Your run.config file if running in a local environment:
```
params {
    genomes {
        'GRCh38_hmf' {
            fasta = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta"
            fai = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.fai"
            dict = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.dict"
            img = "/path/2/hmf_resources/38/hg38_alts_decoys_phiX_masked_GRC.fasta.img"
            gridss_index = "/path/2/hmf_resources/38/gridss_index-2.13.2.tar.gz"
            bwamem2_index = "/path/2/hmf_resources/38/bwa-mem2_index-2.2.1.tar.gz"
            star_index = "/path/2/hmf_resources/38/star_index-gencode_38-2.7.3a.tar.gz"
        }
    }

    ref_data_hmf_data_path = "/path/2/hmf_resources/38/hmf_pipeline_resources.38_v2.0--3"
    ref_data_panel_data_path = "/path/2/hmf_resources/38/panel/tso500/hmf_panel_resources.tso500.38_v2.1.0--3"
}

process {
    // Defaults for all processes.
    resourceLimits = [ cpus: 32, memory: 1020.GB, time: 48.h ]
    cpus = { check_max( 1 * task.attempt, 'cpus' ) }
    memory = { check_max( 6.GB * task.attempt, 'memory' ) }
    time = { check_max( 4.h * task.attempt, 'time' ) }

    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    // The 'local' executor is used by default. Please see here for other executor options https://www.nextflow.io/docs/latest/executor.html#executors
    // such as AWS batch, AZURE batch, PBS, Slurm, etc.
    executor = 'local'
}
```
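The process defaults in the local config scale each request with `task.attempt` and retry only on selected exit codes. The Python sketch below works through that arithmetic. It assumes the cap behaves as a simple minimum against `resourceLimits`; the function names `requested` and `error_strategy` are hypothetical.

```python
# Values transcribed from the local run.config above.
RESOURCE_LIMITS = {"cpus": 32, "memory_gb": 1020, "time_h": 48}
MAX_RETRIES = 1


def requested(attempt):
    """Base value times the attempt number, capped at the configured limit."""
    return {
        "cpus": min(1 * attempt, RESOURCE_LIMITS["cpus"]),
        "memory_gb": min(6 * attempt, RESOURCE_LIMITS["memory_gb"]),
        "time_h": min(4 * attempt, RESOURCE_LIMITS["time_h"]),
    }


def error_strategy(exit_status):
    """Mirror of: task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish'."""
    # The Groovy range 130..145 is inclusive on both ends.
    retryable = set(range(130, 146)) | {104}
    return "retry" if exit_status in retryable else "finish"


# With maxRetries = 1 a task runs at most twice: attempts 1 and 2.
for attempt in range(1, MAX_RETRIES + 2):
    print(attempt, requested(attempt))

print(error_strategy(137), error_strategy(1))
```

Exit status 137 (128 + 9, i.e. SIGKILL) is what an out-of-memory kill commonly produces, so under this rule such a task is resubmitted with doubled memory, while an ordinary failure such as exit status 1 lets running tasks finish and then stops the run.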
Recommended resources.config:
```
process {
withName: BAMTOOLS {
memory = { 24.GB * task.attempt }
cpus = { 16 * task.attempt }
}
withName: 'BWAMEM2ALIGN.*' {
cpus = 4
memory = { 64.GB * task.attempt }
}
withName: 'FASTP.*' {
memory = { 64.GB * task.attempt }
cpus = 16
}
}
```
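The `withName` selectors in resources.config are patterns, so `'FASTP.*'` applies to every process whose name starts with `FASTP`. The Python sketch below approximates that lookup, assuming a selector must match the whole process name; the `resolve` helper and the process names passed to it are hypothetical.

```python
import re

# Selector -> per-attempt settings, transcribed from resources.config above.
SELECTORS = [
    ("BAMTOOLS", lambda attempt: {"cpus": 16 * attempt, "memory_gb": 24 * attempt}),
    ("BWAMEM2ALIGN.*", lambda attempt: {"cpus": 4, "memory_gb": 64 * attempt}),
    ("FASTP.*", lambda attempt: {"cpus": 16, "memory_gb": 64 * attempt}),
]


def resolve(process_name, attempt):
    """Return the override for a process, or None if no selector matches."""
    for pattern, settings in SELECTORS:
        # Assumption: the pattern has to cover the full process name.
        if re.fullmatch(pattern, process_name):
            return settings(attempt)
    return None  # no override: the process-wide defaults apply


print(resolve("BWAMEM2ALIGN", 2))
print(resolve("BAMTOOLS", 2))
print(resolve("SAMTOOLSSORT", 1))
```

Note that `BAMTOOLS` scales its CPU request as well as its memory, so its second attempt asks for 32 CPUs, which is exactly the `resourceLimits` ceiling used in the run.config examples.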
Launch nf-fastq2bam:
git clone git@github.com:CCICB/nf-fastq2bam.git
nextflow run nf-fastq2bam \
-profile <docker|singularity|...> \
-c <your run.config and resources.config files here, separated by commas. You may also list other config files.> \
-params-file <your params.yaml> \
--mode <wgts|targeted> \
--genome_type <alt|no_alt> \
--genome <GRCh37_hmf|GRCh38_hmf> \
--input samplesheet.csv \
--outdir output/
[!WARNING] Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration except for parameters; see docs.
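The launch command mixes two kinds of flags: single-dash options (`-profile`, `-c`, `-params-file`) are consumed by Nextflow itself, while double-dash options (`--mode`, `--input`, and so on) are pipeline parameters. The Python sketch below assembles the same command as an argument list, joining config files with commas as described above. The helper `build_command` and the concrete values chosen are illustrative assumptions.

```python
import shlex


def build_command(profile, configs, params_file, mode, genome_type,
                  genome, samplesheet, outdir):
    """Assemble the launch command as an argv list (hypothetical helper)."""
    return [
        "nextflow", "run", "nf-fastq2bam",
        # Single dash: options handled by Nextflow itself.
        "-profile", profile,
        "-c", ",".join(configs),
        "-params-file", params_file,
        # Double dash: pipeline parameters.
        "--mode", mode,
        "--genome_type", genome_type,
        "--genome", genome,
        "--input", samplesheet,
        "--outdir", outdir,
    ]


cmd = build_command(
    profile="singularity",
    configs=["run.config", "resources.config"],
    params_file="params.yaml",
    mode="wgts",
    genome_type="alt",
    genome="GRCh38_hmf",
    samplesheet="samplesheet.csv",
    outdir="output/",
)
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems if it is later passed to `subprocess.run(cmd, check=True)`; `shlex.join` is used here only to print a copy-pasteable shell line.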
Pipeline output
The pipeline outputs processed BAM files for each sample type: aligned, merged, sorted, and duplicate-marked BAMs from tumour, normal, and RNA-seq samples.
Citations
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Citation (CITATIONS.md)
# nf-fastq2bam: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BCFtools](https://doi.org/10.1093/gigascience/giab008)

  > Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

- [BWA](https://doi.org/10.1093/bioinformatics/btp324)

  > Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

- [bwa-mem2](https://doi.org/10.1109/IPDPS.2019.00041)

  > Vasimuddin, Md., Misra, S., Li, H., & Aluru, S. (2019). Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 314–324. https://doi.org/10.1109/IPDPS.2019.00041

- [fastp](https://doi.org/10.1093/bioinformatics/bty560)

  > Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

- [GATK](https://doi.org/10.1101/gr.107524.110)

  > McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., & DePristo, M. A. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. https://doi.org/10.1101/gr.107524.110

- [Sambamba](https://doi.org/10.1093/bioinformatics/btv098)

  > Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: Fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032–2034. https://doi.org/10.1093/bioinformatics/btv098

- [SAMtools](https://doi.org/10.1093/gigascience/giab008)

  > Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.