https://github.com/asadprodhan/how-to-run-kraken2-on-hpc-using-singularity-container-and-nextflow
An intuitive tutorial on Kraken2 metagenomic analysis
Repository
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
bash
container
hpc-cluster
kraken2
linux-desktop
nextflow
singularity
- Created: about 4 years ago
- Last pushed: almost 3 years ago
https://github.com/asadprodhan/How-to-run-Kraken2-on-HPC-using-Singularity-container-and-Nextflow/blob/main/
# **How to run Kraken2 on HPC using Singularity container and Nextflow?**
Kraken2 is a widely used tool in metagenomic studies. It classifies metagenomic sequences into taxonomic ranks such as Species, Genus, Family, etc.

## How does it work?

Kraken2 builds a database that maps each k-mer to the genomes containing that k-mer (more precisely, to their lowest common ancestor). The metagenomic sequences are broken down into k-mers, and each k-mer is queried against this database to classify the sequences. Metagenomic sequences with no matching k-mer in the database are labelled as unclassified (Wood et al., 2019; Wood and Salzberg, 2014).

## How to run Kraken2 on HPC cluster using Singularity container and Nextflow?

Generally, HPC providers do not allow their users to install software on the HPC. Singularity containers are a great alternative to installing software directly, and they do not even require sudo privileges. Keeping a record of the containers used and their versions also facilitates reproducibility of the workflow. Nextflow, in turn, is a bioinformatics workflow manager that supports containers. Executing Kraken2 (or any job) on an HPC using a Singularity container and Nextflow requires a set of three scripts:

> Job Script: a script written in Nextflow (.nf) that does the actual work

> Config Script: a config script that provides the container links and the computing resource allocations. By default, this script is named nextflow.config. If it is named differently, it needs to be specified in the nextflow run command in the job scheduler script as follows: "nextflow -C XXXXXX.config run"

> Job Scheduler Script: a bash script that submits the job through the SLURM job scheduler

The set of scripts for running Kraken2 on the HPC provided by the Pawsey Supercomputing Centre (https://pawsey.org.au/) is presented below.

#### Job Script:

```
#!/usr/bin/env nextflow

// Data location
params.in = "$PWD/*.fasta"
params.outdir = './results'

// Pair each fasta file with its base name
datasets = Channel
    .fromPath(params.in)
    .map { file -> tuple(file.simpleName, file) }

// Taxonomic classification with Kraken2
process taxonomy {
    tag "$z"
    publishDir "${params.outdir}", mode: 'copy'

    input:
    set datasetID, file(z) from datasets

    output:
    file "${z.baseName}_taxo.tsv" into taxonomy_ch

    script:
    """
    kraken2 --db path/to/the/DB --output ${z.baseName}_taxo.out --report ${z.baseName}_taxo.tsv --threads 28 $z
    """
}
```

#### Notes:

* The above database can be downloaded and built from scratch following the Kraken2 manual (https://github.com/DerrickWood/kraken2)
* Alternatively, a pre-built database can be downloaded and used (https://benlangmead.github.io/aws-indexes/k2); a worked example is given after these notes:
    - create and 'cd' into a directory on your Linux computer to hold the pre-built database
    - go to the pre-built databases website above
    - go to the HTTPS URL column of the collection table
    - right-click on the tar.gz file of the corresponding collection
    - copy the link address
    - run the following command:
    ```
    wget link_address
    ```
    - extract the tar-zipped database as follows:
    ```
    tar -zxvf downloaded_database
    ```
    - now refer to this directory as your Kraken2 database in the kraken2 script
* When running Kraken2, the database needs to be on the same computer where the command will be run (for example, on Zeus or Magnus at Pawsey), preferably in the same directory
* The full path of the database needs to be given in the kraken2 command even if the database is in the same directory
* At least 100 GB of free disk space and 50 GB of RAM are required. Kraken2 loads the database into RAM and uses it from there. Lack of sufficient disk or memory space will result in the error "Error reading the hash table"
* "Error reading the hash table" may also stem from corrupted files in the database. This can happen when transferring the unzipped database across computers. The problem can be resolved by re-extracting the zipped file of the downloaded Kraken2 database
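To make the download steps above concrete, here is a minimal bash sketch of fetching and extracting a pre-built database. The URL and file name below are placeholders, not real links; copy the current link address from the pre-built databases page, since the available releases change over time:

```
# Minimal sketch: download and extract a pre-built Kraken2 database.
# The URL below is a placeholder; copy the actual link address from
# https://benlangmead.github.io/aws-indexes/k2 for your chosen collection.
mkdir -p kraken2_database
cd kraken2_database
wget https://example.com/k2_collection_YYYYMMDD.tar.gz   # placeholder link
tar -zxvf k2_collection_YYYYMMDD.tar.gz
# The full path of this directory is what goes after `kraken2 --db`.
```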
#### Config Script:

```
resume = true

trace {
    fields = 'name,hash,status,exit,realtime,submit'
}

profiles {
    zeus {
        workDir = "$PWD/work"

        process {
            cache = 'lenient'
            stageInMode = 'symlink'
        }

        // Container for the taxonomy process
        process {
            withName: taxonomy {
                container = 'quay.io/biocontainers/kraken2:2.1.2--pl5262h7d875b9_0'
            }
        }

        singularity {
            enabled = true
            autoMounts = true
            //runOptions = '-e TERM=xterm-256color'
            envWhitelist = 'TERM'
        }

        params.slurm_account = 'XXXXX'

        // Resource allocations, submitted through SLURM
        process {
            executor = 'slurm'
            clusterOptions = "--account=${params.slurm_account}"
            queue = 'workq'
            cpus = 1
            time = '1h'
            memory = '10GB'
            withName: 'taxonomy' {
                cpus = 28
                time = '24h'
            }
        }
    }
}
```

#### Job Scheduler Script:

```
#!/bin/bash -l
#SBATCH --job-name=nxf-master
#SBATCH --account=XXXX
#SBATCH --partition=workq
#SBATCH --time=1-00:00:00
#SBATCH --no-requeue
#SBATCH --export=none
#SBATCH --nodes=1

unset SBATCH_EXPORT
module load singularity
module load nextflow

nextflow run nanopore_nextflow.nf -profile zeus -name nxf-${SLURM_JOB_ID} -resume -with-report
```

## How to run Kraken2 on a local Linux computer?

* Install Kraken2
* Add the Kraken2 path to the PATH environment variable
* Download the appropriate Kraken2 database (see the notes above)
* Make a directory for the Kraken2 analysis
* Keep the sequencing reads, the database, and the following script in the Kraken2 directory
* Make a results directory within the Kraken2 directory to collect the results
* Run the script as follows:

```
./kraken2.sh
```

#### The 'kraken2.sh' Script:

```
#!/usr/bin/env bash

# Text formatting
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
reset="$(tput sgr0)"   # turns off all attributes
Bold="$(tput bold)"

# Classify every fastq file in the current directory
for F in *.fastq
do
    baseName=$(basename "$F" .fastq)
    echo "${Red}${Bold}Processing${reset}: ${baseName}"
    kraken2 --db "$PWD/kraken2_database" --threads 64 \
        --output "$PWD/results/${baseName}_taxo.out" \
        --report "$PWD/results/${baseName}_taxo.tsv" "$F"
    echo ""
    echo "${Green}${Bold}Processed and saved as${reset} ${baseName}"
done
```

## Output

The kraken2 report file (written with --report) is tab-delimited. A hypothetical example line:

```
75 250 160 S 211044 Influenza A virus
```

#### The columns, from left to right, are as follows:

* Column 1: Percentage of reads covered by the clade rooted at this taxon
* Column 2: Number of reads covered by the clade rooted at this taxon
* Column 3: Number of reads assigned directly to this taxon
* Column 4: A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply -.
* Column 5: NCBI taxonomy ID
* Column 6: The scientific name
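Because column 4 carries the rank code, the report can be filtered with standard command-line tools as a quick sanity check. A minimal sketch, assuming a report named sample_taxo.tsv (a placeholder; substitute your own file):

```
# Print species-level rows (rank code "S") from a Kraken2 report,
# sorted by the percentage of reads (column 1), highest first.
# "sample_taxo.tsv" is a placeholder file name.
awk -F'\t' '$4 == "S"' sample_taxo.tsv | sort -t$'\t' -k1,1nr | head
```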
### References

Wood, D.E., Lu, J., Langmead, B., 2019. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257. https://doi.org/10.1186/s13059-019-1891-0

Wood, D.E., Salzberg, S.L., 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46. https://doi.org/10.1186/gb-2014-15-3-r46
Owner
- Name: Asad Prodhan
- Login: asadprodhan
- Kind: user
- Location: Perth, Australia
- Company: Department of Primary Industries and Regional Development
- Website: www.linkedin.com/in/asadprodhan
- Twitter: Asad_Prodhan
- Repositories: 2
- Profile: https://github.com/asadprodhan
Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.