wf-assembly-snps-gcp

https://github.com/phemarajata/wf-assembly-snps-gcp

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.3%) to scientific vocabulary

Last synced: 10 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: PHemarajata
License: apache-2.0
Language: Nextflow
Default Branch: main
Size: 2.79 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License Code of conduct Citation

wf-assembly-snps-gcp

GCP-Optimized Bacterial Genomics SNP Pipeline

This is a GCP-optimized version of the bacterial-genomics/wf-assembly-snps pipeline, specifically designed for cost-effective processing of hundreds of FASTA files on Google Cloud Platform.

🚀 Key Optimizations

70-80% Cost Reduction through Spot instances and algorithm optimization
5-10x Faster Execution with ClonalFrameML alternative to Gubbins
Enhanced Resource Management with adaptive parameter selection
Comprehensive Debugging with detailed logging and monitoring
Scalable Architecture for 50-1000+ genomes

🏃 Quick Start

Prerequisites

Google Cloud Project with billing enabled
Nextflow ≥22.04.3
Google Cloud SDK
Docker or Singularity

Basic Usage

```bash

Set your GCP project

export GOOGLECLOUDPROJECT="your-project-id"

Run with Google Batch (recommended)

nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcb \ --input gs://your-bucket/input-fastas/ \ --outdir gs://your-bucket/results/ \ --gcpworkdir gs://your-bucket/nextflowwork \ --recombination clonalframeml \ --usespot true \ --estimatedgenomecount 200 \ --maxestimatedcost 50.0 ```

⚙️ GCP Setup

Automated Setup

```bash

Run the interactive setup script

chmod +x bin/setupgcp.sh ./bin/setupgcp.sh ```

Manual Setup

```bash

Enable required APIs

gcloud services enable compute.googleapis.com storage.googleapis.com lifesciences.googleapis.com

Create storage bucket

gsutil mb gs://your-bucket-name

Set environment variables

export GOOGLECLOUDPROJECT="your-project-id" export NXF_MODE=google ```

💰 Cost Optimization

Dataset Size Recommendations

Small Datasets (<100 genomes)

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcb \ --input gs://your-bucket/input/ \ --outdir gs://your-bucket/results/ \ --recombination clonalframeml \ --use_spot true \ --estimated_genome_count 50 \ --max_estimated_cost 25.0 Expected cost: $5-15

Medium Datasets (100-300 genomes)

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcb \ --input gs://your-bucket/input/ \ --outdir gs://your-bucket/results/ \ --recombination clonalframeml \ --use_spot true \ --estimated_genome_count 200 \ --max_estimated_cost 75.0 Expected cost: $15-50

Large Datasets (300+ genomes)

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcb \ --input gs://your-bucket/input/ \ --outdir gs://your-bucket/results/ \ --recombination clonalframeml \ --use_spot true \ --gubbins_cpus 8 \ --parsnp_cpus 8 \ --estimated_genome_count 500 \ --max_estimated_cost 200.0 Expected cost: $50-150

Algorithm Comparison

| Method | Speed | Cost | Accuracy | Best For | |--------|-------|------|----------|----------| | ClonalFrameML | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Most use cases | | Gubbins | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | High-precision studies | | None | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | N/A | Exploratory analysis |

📊 Usage Examples

Test Run

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile docker,test \ --outdir test-results

Production Run with Monitoring

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcp_minimal \ --input gs://your-bucket/input/ \ --outdir gs://your-bucket/results/ \ --recombination clonalframeml \ --use_spot true \ --monitor_resource_usage true \ --log_cost_estimates true \ --verbose_logging true

Debug Mode

bash nextflow run bacterial-genomics/wf-assembly-snps-gcp \ -profile gcp_minimal \ --input gs://your-bucket/input/ \ --outdir gs://your-bucket/results/ \ --debug_mode true \ --verbose_logging true \ --monitor_resource_usage true

🔧 GCP-Specific Parameters

Cost Control

--use_spot true - Use 70% cheaper Spot instances (replaces deprecated preemptible)
--use_preemptible true - Deprecated: use --use_spot instead (kept for backward compatibility)
--max_estimated_cost 100.0 - Budget limit in USD
--estimated_genome_count 200 - Help with cost estimation

Resource Optimization

--gubbins_cpus 4 - CPU allocation for Gubbins
--gubbins_memory "16 GB" - Memory allocation
--gubbins_iterations 3 - Reduce iterations for speed
--recombination clonalframeml - Use faster alternative

Monitoring & Debugging

--monitor_resource_usage true - Track performance
--log_cost_estimates true - Monitor spending
--verbose_logging true - Detailed logs
--debug_mode true - Enhanced debugging

🛠️ Available Profiles

gcb - Google Cloud Batch (recommended) - Modern Google Batch executor
gcp_minimal - Legacy GCP profile (updated for Google Batch)
docker - Standard Docker execution
singularity - Singularity container execution
test - Test with sample data

📈 Performance Comparison

Runtime (200 genomes)

| Process | Original | GCP-Optimized | Improvement | |---------|----------|---------------|-------------| | ParSNP | 3-4 hours | 3-4 hours | Same | | Gubbins | 8-12 hours | 4-6 hours | 40-50% faster | | ClonalFrameML | N/A | 1-2 hours | 80-90% faster |

Cost (200 genomes)

| Configuration | Compute Cost | Total Cost | Savings | |---------------|--------------|------------|---------| | Original | $100-150 | $105-160 | - | | GCP-Optimized | $25-35 | $30-45 | 70-75% |

🔍 Troubleshooting

Common Issues

Out of Memory: Use --recombination clonalframeml or increase --gubbins_memory
ClonalFrameML Hanging: Fixed with enhanced error handling and timeout detection
Spot Instance Interruption: Jobs restart automatically, check logs for status
Cost Overruns: Set --max_estimated_cost and monitor billing
Slow Performance: Enable --monitor_resource_usage to identify bottlenecks

Debug Commands

```bash

Check pipeline help

nextflow run bacterial-genomics/wf-assembly-snps-gcp --help

Test GCP connectivity

gcloud auth list gsutil ls gs://your-bucket/

Monitor costs

Visit: https://console.cloud.google.com/billing/

```

📚 Documentation

GCP Optimization Guide - Comprehensive setup and usage guide
GCP Optimization Summary - Quick reference for optimizations
Original Pipeline Documentation - Original pipeline docs

🤝 Contributing

This is a GCP-optimized fork of the original pipeline. For: - GCP-specific issues: Submit issues to this repository - Core pipeline issues: Submit to the original repository

📄 License

Same license as the original pipeline. See LICENSE file.

🙏 Credits

Original Pipeline: Christopher A. Gulvik
GCP Optimizations: Seqera AI
Framework: Nextflow

📞 Support

Check the GCP Optimization Guide
Run the setup script: ./bin/setup_gcp.sh
Use debug mode for troubleshooting
Monitor costs through GCP Console

Note: This is an optimized version specifically designed for Google Cloud Platform. For the original pipeline, visit bacterial-genomics/wf-assembly-snps.

Owner

Login: PHemarajata
Kind: user

Website: https://mstdn.science/@p_h
Repositories: 1
Profile: https://github.com/PHemarajata

Citation (CITATIONS.md)

# wf-assembly-snps: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BioPython](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/)

  > Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. Jun 1 2009;25(11):1422-3. doi:10.1093/bioinformatics/btp163

- [ClonalFrameML](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4326465/)

  > Didelot X, Wilson DJ. ClonalFrameML: efficient inference of recombination in whole bacterial genomes. PLoS Comput Biol. Feb 2015;11(2):e1004041. doi:10.1371/journal.pcbi.1004041

- [Gubbins](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4330336/)

  > Croucher NJ, Page AJ, Connor TR, et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. Feb 18 2015;43(3):e15. doi:10.1093/nar/gku1196

- [MUSCLE](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC390337/)

  > Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792-7. doi:10.1093/nar/gkh340

- [RAxML-NG](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821337/)

  > Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. Nov 01 2019;35(21):4453-4455. doi:10.1093/bioinformatics/btz305

- [Parsnp 2.0](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10862825/)

  > Kille B, Nute MG, Huang V, Kim E, Phillippy AM, Treangen TJ. Parsnp 2.0: Scalable Core-Genome Alignment for Massive Microbial Datasets. bioRxiv. Jan 31 2024;doi:10.1101/2024.01.30.577458

- [snp-dists](https://github.com/tseemann/snp-dists)

  > Seemann T. snp-dists. <https://github.com/tseemann/snp-dists>

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

### Citation Information

Citations were created with the help of EndNote and were exported in AMA format.

GitHub Events

Total

Push event: 5

Last Year

Push event: 5

Dependencies

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/custom/dumpsoftwareversions/environment.yml pypi

wf-assembly-snps-gcp

Science Score: 44.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

wf-assembly-snps-gcp

GCP-Optimized Bacterial Genomics SNP Pipeline

🚀 Key Optimizations

📋 Contents

🏃 Quick Start

Prerequisites

Basic Usage

Set your GCP project

Run with Google Batch (recommended)

⚙️ GCP Setup

Automated Setup

Run the interactive setup script

Manual Setup

Enable required APIs

Create storage bucket

Set environment variables

💰 Cost Optimization

Dataset Size Recommendations

Small Datasets (<100 genomes)

Medium Datasets (100-300 genomes)

Large Datasets (300+ genomes)

Algorithm Comparison

📊 Usage Examples

Test Run

Production Run with Monitoring

Debug Mode

🔧 GCP-Specific Parameters

Cost Control

Resource Optimization

Monitoring & Debugging

🛠️ Available Profiles

📈 Performance Comparison

Runtime (200 genomes)

Cost (200 genomes)

🔍 Troubleshooting

Common Issues

Debug Commands

Check pipeline help

Test GCP connectivity

Monitor costs

Visit: https://console.cloud.google.com/billing/

📚 Documentation

🤝 Contributing

📄 License

🙏 Credits

📞 Support

Owner

Citation (CITATIONS.md)

GitHub Events

Total

Last Year

Dependencies