https://github.com/asadprodhan/how_to_build_a_customised_kraken2_database




How to Build a Customised Kraken2 Database

M. Asaduzzaman Prodhan*

DPIRD Diagnostics and Laboratory Services
Department of Primary Industries and Regional Development
3 Baron-Hay Court, South Perth, WA 6151, Australia
*Correspondence: prodhan82@gmail.com


License: GPL 3.0


Step 1: Download the genomes

👉 How to Retrieve Public Data

Keep these genome files (.fna or .fa) in a single directory.
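
As a minimal sketch of gathering the genomes in one place, the helper below copies every .fna and .fa file found under a source directory into a single target directory and prints how many genome files it now holds. The function name collect_genomes and the directory names in the usage line are hypothetical and not part of the original workflow.

```shell
# Hypothetical helper: copy all .fna/.fa files found under SRC into one directory DEST.
# Files that share a name overwrite each other, so keep genome file names unique.
# DEST should sit outside SRC so that find does not pick up its own copies.
collect_genomes() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    # -exec passes each path as a single argument, so names with spaces are safe
    find "$src" -type f \( -name "*.fna" -o -name "*.fa" \) -exec cp {} "$dest"/ \;
    # Report how many genome files are now in DEST
    find "$dest" -maxdepth 1 -type f \( -name "*.fna" -o -name "*.fa" \) | wc -l
}

# Example (hypothetical paths):
# collect_genomes downloads genomes
```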


Step 2: Download the taxonomy

Kraken2 requires taxonomy files from NCBI to classify sequences.

Below is a Slurm batch script for downloading taxonomy files on an HPC cluster:

```
#!/bin/bash --login

#SBATCH --job-name=K2DB       # Name of the job
#SBATCH --account=XXXX        # Project account
#SBATCH --partition=work      # Partition (queue) to run the job
#SBATCH --time=1-00:00:00     # Time limit (1 day)
#SBATCH --ntasks=1            # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1             # Request 1 node
#SBATCH --exclusive           # Request exclusive access to the entire node (all CPUs), alternative --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate the Kraken2 Conda environment
conda activate kraken2

# Define DB directory
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR    # <-- important fix

# Create DB directory and download the taxonomy files
kraken2-build --db "$DBDIR" --download-taxonomy --threads 120 --use-ftp

# End of script
```

Explanation:

  • conda activate kraken2 → loads Kraken2 environment

  • export KRAKENDBNAME=$DBDIR → sets the database name in the environment; together with --db $DBDIR, this tells Kraken2 where to store the database

  • kraken2-build --download-taxonomy → fetches NCBI taxonomy files

  • --threads 120 → requests 120 CPU threads (the download itself is mostly limited by network speed)

  • --use-ftp → retrieves the NCBI files via FTP instead of rsync, the default
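
Before moving on, it can help to confirm that the download completed. The sketch below checks that names.dmp and nodes.dmp exist and are non-empty in the taxonomy subdirectory of the database. The function name check_taxonomy is hypothetical; the database directory is passed in as an argument.

```shell
# Hypothetical helper: confirm the taxonomy download left non-empty
# names.dmp and nodes.dmp files under <dbdir>/taxonomy.
check_taxonomy() {
    local dbdir="$1" missing=0 f
    for f in names.dmp nodes.dmp; do
        if [ ! -s "$dbdir/taxonomy/$f" ]; then
            echo "missing or empty: $dbdir/taxonomy/$f" >&2
            missing=1
        fi
    done
    # Return 0 when both files are present, 1 otherwise
    return "$missing"
}

# Example:
# check_taxonomy kraken2DB && echo "taxonomy looks complete"
```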


Step 3: Add the genomes to the library

Once the genomes are downloaded, add them to the library of your Kraken2 database.

```
#!/bin/bash --login

#SBATCH --job-name=K2DB
#SBATCH --account=XXXX
#SBATCH --partition=work
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Add the genome(s) to the library
find . -name "*.fna" -print0 | \
    xargs -0 -I {} kraken2-build --db "$DBDIR" --add-to-library "{}" --threads 120
```

Explanation:

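  • find . -name "*.fna" → looks for all .fna files in the current directory and its subdirectories (files ending in .fa are not matched, so rename them or adjust the pattern)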

  • xargs ... kraken2-build --add-to-library → adds each genome file into the Kraken2 library

  • --threads 120 → speeds up processing when handling many genomes
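
A dry run can show which files would be added before any compute time is spent. The sketch below matches both .fna and .fa files and prints the kraken2-build commands instead of running them. The function names list_genomes and preview_add_to_library are hypothetical.

```shell
# Hypothetical helper: list genome files with either extension, NUL-separated.
list_genomes() {
    find "${1:-.}" -type f \( -name "*.fna" -o -name "*.fa" \) -print0
}

# Hypothetical helper: print (not run) one kraken2-build command per genome file.
preview_add_to_library() {
    local dbdir="$1" genome_dir="${2:-.}"
    list_genomes "$genome_dir" |
        xargs -0 -I {} echo kraken2-build --db "$dbdir" --add-to-library "{}"
}

# Example:
# preview_add_to_library kraken2DB .
```

Because find searches recursively, the preview also reveals files that were not meant to be added, such as copies sitting in other subdirectories of the search path.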


Step 4: Generate the seqid2taxid.map file

Kraken2 needs a mapping file between genome sequence IDs and taxonomy IDs. This ensures correct classification.

Step 4.1

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

Explanation:

  • NCBI provides accession2taxid files mapping every accession to its taxonomy ID

Step 4.2

gunzip nucl_gb.accession2taxid.gz

Step 4.3

cut -f2,3 nucl_gb.accession2taxid > seqid2taxid.map

Explanation:

  • Kraken2 needs a slimmed-down seqid2taxid.map

  • Only column 2 (accession.version) and column 3 (taxid) are kept
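
The accession2taxid files begin with a header row (accession, accession.version, taxid, gi), which cut copies into the map along with the data. As a hedged alternative, the sketch below keeps the same two columns but skips that first row. The function name make_seqid2taxid is hypothetical.

```shell
# Hypothetical helper: write accession.version<TAB>taxid for every data row,
# skipping the header row on line 1.
make_seqid2taxid() {
    local in="$1" out="$2"
    awk -F'\t' 'NR > 1 { print $2 "\t" $3 }' "$in" > "$out"
}

# Example:
# make_seqid2taxid nucl_gb.accession2taxid seqid2taxid.map
```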

Step 4.4

Copy seqid2taxid.map into the top level of the Kraken2 database directory (kraken2DB in the scripts here), not into its taxonomy or library subdirectory.
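
Since classification depends on every sequence ID having a taxonomy ID, a quick cross-check before building may save a failed run. The sketch below prints the sequence IDs from the given FASTA files that have no entry in the map. The function name missing_seqids is hypothetical, and it assumes the sequence ID is the first word of each FASTA header.

```shell
# Hypothetical helper: print FASTA sequence IDs that are absent from the map.
# Usage: missing_seqids MAP FASTA [FASTA...]
missing_seqids() {
    local map="$1"
    shift
    # Pass 1 (stdin): remember the ID of every FASTA header, without the ">".
    # Pass 2 (map):   forget each ID that appears in column 1 of the map.
    # Whatever is left at the end has no taxonomy ID.
    grep -h '^>' "$@" | awk '
        pass == 1 { ids[substr($1, 2)] = 1; next }
        pass == 2 && ($1 in ids) { delete ids[$1] }
        END { for (id in ids) print id }
    ' pass=1 - pass=2 "$map"
}

# Example (hypothetical file name):
# missing_seqids kraken2DB/seqid2taxid.map genome1.fna
```

Only the FASTA IDs are held in memory while the map is streamed once, which matters because the full accession2taxid map is very large.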


Step 5: Build the DB

Now that taxonomy and genome libraries are in place, the final step is to compile the database.

```
#!/bin/bash --login

#SBATCH --job-name=K2DB       # Name of the job
#SBATCH --account=xxxx        # Pawsey project account
#SBATCH --partition=highmem   # Partition (queue) to run the job
#SBATCH --time=1-00:00:00     # Time limit (1 day)
#SBATCH --ntasks=1            # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1             # Request 1 node
#SBATCH --exclusive           # Request exclusive access to the entire node (all CPUs), alternative --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Build the database
kraken2-build --db "$DBDIR" --build --threads 120

# End of script
```

Explanation:

  • kraken2-build --build → processes taxonomy + genomes into a searchable DB

  • --threads 120 → uses multiple cores for faster DB building

  • DBDIR → a descriptive directory name (e.g., kraken2DB) makes it easier to keep track of different database versions

At this stage, your custom Kraken2 database is ready for classification!
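
A finished Kraken2 database contains three .k2d files: hash.k2d, opts.k2d and taxo.k2d. As a minimal sanity check, the sketch below confirms that all three exist and are non-empty. The function name check_built_db is hypothetical.

```shell
# Hypothetical helper: confirm the build produced the three .k2d files.
check_built_db() {
    local dbdir="$1" missing=0 f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$dbdir/$f" ]; then
            echo "missing or empty: $dbdir/$f" >&2
            missing=1
        fi
    done
    # Return 0 when all three files are present, 1 otherwise
    return "$missing"
}

# Example:
# check_built_db kraken2DB && echo "database files are in place"
```

For a closer look at what the database contains, kraken2-inspect --db kraken2DB prints a per-taxon summary.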

