wgs_bscript

This is a bash script for a basic whole-genome analysis workflow on the Compute Canada platform.

https://github.com/tiffanyfeng08/wgs_bscript

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: TiffanyFeng08
  • License: mit
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 79.1 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

WGS_Bscript

Simple script for bacterial whole-genome sequence analysis

This bash script is prepared for whole genome sequencing data analysis on the Digital Research Alliance of Canada servers. The goal is to use only the tools available as Compute Canada modules, without installing extra libraries or tools, and to create a simple workflow for paired-end short-read sequence analysis, including QC, trimming, assembly, and annotation. The script was tested on the Compute Canada Beluga cluster. (The script header might differ between clusters, e.g. the Niagara cluster.)


Workflow

```mermaid
%%{ init: { 'flowchart': { 'curve': 'stepAfter' } } }%%
flowchart TD

 A[Raw Reads] -- Trimmomatic --> B[Trimmed Reads] -- SPAdes assembler --> C[Assembly Contigs]--Filter contigs < 1kb-->F[Filtered Contigs]-- Prokka annotation --> D[Annotated Genome]

```


Preparation:

  1. Create a directory for your results and note its path.

  2. Collect or make your own primer file (it should be a '.fa' or '.fasta' file) and note its path.

  3. Modify the bash script.

Download wgs_bscript, then edit the job header. You might need to change the time, memory, and node settings depending on the number of sequences. SPAdes needs at least 150G of memory unless you change the memory setting in SPAdes.

    #!/bin/bash
    #SBATCH --account=the name of your account
    #SBATCH --time=48:00:00
    #SBATCH --job-name=the name of your job
    #SBATCH --output=%x-%j.out
    #SBATCH --mail-user=your email
    #SBATCH --mail-type=ALL
    #SBATCH --mem-per-cpu=186G
    #SBATCH -n 2

Provide the raw data directory at line 26

raw_data_path=/path/to/your/raw/data

Define your working directory at line 29: where you want to put your results

working_path=/path/to/your/working/directory/

Define your primer path and file at line 32.

primer_path=/path/to/your/primer/yourprimerfilename.fa
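Before submitting the job, it can help to confirm that the three paths above exist. The following is a minimal sketch, not part of WGS_Bscript: the function name `check_paths` is hypothetical, and it assumes the raw data and working paths are directories and the primer file is a non-empty regular file.

```bash
# Hypothetical pre-flight check (not part of WGS_Bscript).
# Arguments: raw data directory, working directory, primer file.
check_paths() {
    local raw_data_path=$1 working_path=$2 primer_path=$3
    local status=0

    if [ ! -d "$raw_data_path" ]; then
        echo "Raw data directory not found: $raw_data_path" >&2
        status=1
    fi
    if [ ! -d "$working_path" ]; then
        echo "Working directory not found: $working_path" >&2
        status=1
    fi
    # -s is true only if the file exists and is not empty
    if [ ! -s "$primer_path" ]; then
        echo "Primer file missing or empty: $primer_path" >&2
        status=1
    fi
    return "$status"
}

# Example call with the placeholder paths used above:
# check_paths /path/to/your/raw/data /path/to/your/working/directory/ /path/to/your/primer/yourprimerfilename.fa
```

The function reports every missing path rather than stopping at the first one, so all three settings can be corrected in a single pass.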

If your sequence file names do not end with "_R1_001.fastq.gz" or "_R2_001.fastq.gz", you also need to change the tags (at lines 35 and 36) and the format (at line 39):

```
#Define the tag of your sequence
tag1="R1_001"
tag2="R2_001"

#Define the format of your sequence
fo=fastq.gz
```
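To check that the tag and format settings match your files, you can list the read pairs they would select. This is a sketch under stated assumptions rather than the script's own logic: the function name `list_read_pairs` is hypothetical, and it assumes file names of the form `<sample>_<tag>.<format>`, with the underscore before the tag not included in the tag value.

```bash
# Hypothetical helper (not part of WGS_Bscript): print one line per complete
# read pair as "sample<TAB>R1 file<TAB>R2 file".
# Arguments: raw data directory, forward tag, reverse tag, file format.
list_read_pairs() {
    local dir=$1 tag1=$2 tag2=$3 fo=$4
    local r1 r2 sample

    for r1 in "$dir"/*_"$tag1"."$fo"; do
        [ -e "$r1" ] || continue                  # the glob matched nothing
        # Strip the directory and the "_<tag>.<format>" suffix
        sample=$(basename "$r1" "_${tag1}.${fo}")
        r2="$dir/${sample}_${tag2}.${fo}"
        if [ -e "$r2" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
        else
            echo "No mate found for $r1" >&2
        fi
    done
}

# Example call with the placeholder raw data path:
# list_read_pairs /path/to/your/raw/data R1_001 R2_001 fastq.gz
```

An empty listing suggests that the tags or the format do not match the file names in the raw data directory.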


Notes

  1. You can change the versions of all tools on lines 11 to 20 of the script.

  2. The script renames the raw reads, keeping only the part before the first "_" as the sample name (e.g. "J-D0-22_S35_L001_R2_001.fastq.gz" becomes "J-D0-22_R2_001.fastq.gz"). If you do not want to rename the reads, or your naming system is different, modify or delete lines 44 to 55.

  3. You should adjust the parameters of each tool in the script to suit your needs.
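The renaming rule in note 2 can be illustrated with bash parameter expansion. This is only a sketch of the naming rule, not the code on lines 44 to 55: the function name `short_read_name` is hypothetical, and it assumes the separator is an underscore and that the read tag is passed without its leading underscore.

```bash
# Hypothetical helper (not part of WGS_Bscript): build the shortened file name
# from the text before the first "_" plus the read tag and format.
# Arguments: original file name, read tag, file format.
short_read_name() {
    local file=$1 tag=$2 fo=$3
    local base sample

    base=$(basename "$file")
    sample=${base%%_*}          # remove the longest suffix starting at "_"
    printf '%s_%s.%s\n' "$sample" "$tag" "$fo"
}

short_read_name J-D0-22_S35_L001_R2_001.fastq.gz R2_001 fastq.gz
# prints J-D0-22_R2_001.fastq.gz
```

Sample names that themselves contain an underscore would be truncated by this rule, which is one reason to adapt or remove the renaming step for a different naming system.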


Usage:

You can upload the modified script directly, or use the "nano" command to create the script on the cluster and paste the commands into it:

```
# Go to the working directory or any directory where you want to put your script
nano WGS_Bscript
# After creating the script, copy and paste your modified script and save it with "CTRL + o".
```

If you want to run the pipeline on Compute Canada, execute the following:

sbatch WGS_Bscript

Directory Structure and Results:

  1. Raw Reads Quality Control:
  • FastQC Results: Quality control reports for raw reads.
    • $working_path/result/QC/Rawreads/FastQC_result: Contains FastQC reports for raw reads.
  2. Trimmed Reads:
  • Trimmed Paired-End Reads: Trimmed paired-end reads after adapter removal and quality trimming.
    • $working_path/result/Trim/both_sequence: Contains both paired and unpaired trimmed reads.
    • $working_path/result/Trim/trimmed_paired_sequence: Contains only paired trimmed reads.
  3. Trimmed Reads Quality Control:
  • FastQC Results: Quality control reports for trimmed reads.
    • $working_path/result/QC/TrimQC_result: Contains FastQC reports for trimmed reads.
  4. SPAdes Assembly:
  • Assembly Results: Assembled contigs from SPAdes.
    • $working_path/result/SPAdes/result: Contains SPAdes assembly results for each sample.
    • $working_path/result/SPAdes/all_contigs: Contains all contigs from the SPAdes assembly.
  5. Quality Assessment of Contigs:
  • Quast Results: Quality assessment of assembled contigs using Quast.
    • $working_path/result/QC/Assembly/Quast/Contigs: Contains Quast reports for the assembled contigs.
  6. Filtered Contigs:
  • Filtered Contigs: Contigs filtered to remove those shorter than 1kb.
    • $working_path/result/Assembly/Filtered_Contigs: Contains filtered contigs for each sample.
  7. Quality Assessment of Filtered Contigs:
  • Quast Results: Quality assessment of filtered contigs using Quast.
    • $working_path/result/QC/Assembly/Quast/Filtered_Contigs: Contains Quast reports for the filtered contigs.
  8. Coverage Calculation:
  • BBMap Results: Coverage calculation and pileup reports using BBMap.
    • $working_path/result/QC/Assembly/BBmap: Contains BAM files and pileup reports for each sample.
    • $working_path/result/QC/Assembly/BBmap/pileup_reports: Contains individual pileup reports.
    • $working_path/result/QC/Assembly/BBmap/pileup_reports/pileup_summary.txt: Combined summary of all pileup reports.
    • $working_path/result/QC/Assembly/BBmap/covstats_reports: Contains summary statistics of the coverage for the entire genome.
    • $working_path/result/QC/Assembly/BBmap/covstats_summary.txt: Contains the coverage statistics for all samples.
  9. Annotation:
  • Prokka Results: Annotation results using Prokka.
    • $working_path/result/Prokka: Contains Prokka annotation results for each sample.
    • $working_path/result/QC/Prokka: Contains a summary CSV file with counts of contigs, bases, CDS, tRNA, and tmRNA for each sample.
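
The contig filtering step listed above removes contigs shorter than 1kb. The tool the script uses for this is not described here, so the following is only an illustrative sketch of such a filter in plain awk. The function name `filter_contigs` is hypothetical, and 1kb is assumed to mean 1000 bases.

```bash
# Hypothetical helper (not part of WGS_Bscript): keep FASTA records whose
# sequence has at least min_len bases. Sequences wrapped over several lines
# are joined, so each kept record is printed as a header plus one sequence line.
# Arguments: minimum length, FASTA file.
filter_contigs() {
    local min_len=$1 fasta=$2

    awk -v min="$min_len" '
        # Print the stored record if it is long enough
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                          # accumulate sequence
        END  { emit() }                                # last record
    ' "$fasta"
}

# Example call (hypothetical file names):
# filter_contigs 1000 contigs.fasta > filtered_contigs.fasta
```

Running Quast on both the unfiltered and filtered contigs, as the pipeline does, shows how much of the assembly such a length cutoff removes.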

Expected Output Files:

  • FastQC Reports: .html and .zip files for each raw and trimmed read file.
  • Trimmed Reads: .fastq.gz files for paired and unpaired trimmed reads.
  • SPAdes Assembly: contigs.fasta files for each sample.
  • Quast Reports: .html and other summary files for contig quality assessment.
  • Filtered Contigs: filtered_contigs.fasta files for each sample.
  • BBMap Results: .bam and .txt files for coverage calculation and pileup reports.
  • Prokka Annotation: Various annotation files and a summary CSV file.

Owner

  • Name: Zhixuan (Tiffany) Feng
  • Login: TiffanyFeng08
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: WGS_Bscript
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: 'Zhixuan '
    family-names: Feng
identifiers:
  - type: doi
    value: 10.5281/zenodo.14860248
url: 'https://github.com/TiffanyFeng08/WGS_Bscript/tree/main'
abstract: >-
  This bash script is prepared for whole genome sequencing
  data analysis on the Compute Canada server. The target is
  to use only the tools in the Compute Canada module without
  installing extra libraries or tools and create a simple
  workflow for paired-end short-read sequence analysis,
  including QC, trimming, assembly, and annotation. The test
  was conducted in the Compute Canada Beluga Cluster. (The
  head of the scripts might be differ from different
  cluster, e.g. Niagara Cluster)
license: MIT
version: 1.0.1
date-released: '2025-02-11'

GitHub Events

Total
  • Release event: 1
  • Watch event: 6
  • Push event: 4
  • Public event: 1
  • Fork event: 1
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 6
  • Push event: 4
  • Public event: 1
  • Fork event: 1
  • Create event: 1