https://github.com/asadprodhan/how_to_build_a_customised_kraken2_database

How_to_build_a_Customised_Kraken2_Database

Repository

How_to_build_a_Customised_Kraken2_Database

Basic Info

Host: GitHub
Owner: asadprodhan
License: gpl-3.0
Default Branch: main
Size: 42 KB


How to Build a Customised Kraken2 Database

M. Asaduzzaman Prodhan*

DPIRD Diagnostics and Laboratory Services

Department of Primary Industries and Regional Development

3 Baron-Hay Court, South Perth, WA 6151, Australia

*Correspondence: prodhan82@gmail.com

Step 1: Download the genomes

👉 How to Retrieve Public Data

Keep these genome files (.fna or .fa) in a single directory.
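Before moving on, it can help to confirm that each genome file is readable and contains at least one sequence. The following is a minimal sketch, assuming the genomes sit in a directory called `genomes` (a hypothetical name) and end in `.fna` or `.fa`; the helper `count_sequences` is illustrative and not part of Kraken2.

```shell
#!/bin/bash
# count_sequences DIR: print "<file><TAB><number of FASTA records>" for each
# .fna/.fa file directly inside DIR (no recursion into subdirectories).
count_sequences() {
    local dir="$1" f
    for f in "$dir"/*.fna "$dir"/*.fa; do
        [ -e "$f" ] || continue    # skip a glob pattern that matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

# "genomes" is a hypothetical directory name; adjust it to your own layout.
if [ -d genomes ]; then
    count_sequences genomes
fi
```

A file that reports 0 records is probably truncated or not in FASTA format and is worth re-downloading before it is added to the library.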

Step 2: Download the taxonomy

Kraken2 requires taxonomy files from NCBI to classify sequences.

Below is a Slurm batch script for downloading taxonomy files on an HPC cluster:

```
#!/bin/bash --login

#SBATCH --job-name=K2DB          # Name of the job
#SBATCH --account=XXXX           # Project account
#SBATCH --partition=work         # Partition (queue) to run the job
#SBATCH --time=1-00:00:00        # Time limit (1 day)
#SBATCH --ntasks=1               # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1                # Request 1 node
#SBATCH --exclusive              # Request exclusive access to the entire node (all CPUs); alternative: --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate the Kraken2 Conda environment
conda activate kraken2

# Define the DB directory (two separate statements, so that $DBDIR is set before it is exported)
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR       # <-- important fix

# Create the DB directory and download the taxonomy
kraken2-build --db "$DBDIR" --download-taxonomy --threads 120 --use-ftp

# End of script
```

Explanation:

conda activate kraken2 → loads Kraken2 environment
export KRAKENDBNAME=$DBDIR → tells Kraken2 where to store the database
kraken2-build --download-taxonomy → fetches NCBI taxonomy files
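--threads 120 → sets the thread count for kraken2-build; the taxonomy download itself is limited mainly by network speed, so this setting matters more in the later steps
--use-ftp → retrieves the NCBI files via FTP instead of rsync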
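Once the job finishes, a quick check that the taxonomy dump arrived can save a failed build later. The following is a minimal sketch, assuming the database directory is `kraken2DB` as in the script above and that the download unpacked `names.dmp` and `nodes.dmp` into its `taxonomy` subdirectory; the helper name `check_taxonomy` is hypothetical.

```shell
#!/bin/bash
# check_taxonomy DBDIR: succeed only if names.dmp and nodes.dmp exist and are
# non-empty inside DBDIR/taxonomy.
check_taxonomy() {
    local dbdir="$1" f
    for f in names.dmp nodes.dmp; do
        if [ ! -s "$dbdir/taxonomy/$f" ]; then
            echo "missing or empty: $dbdir/taxonomy/$f" >&2
            return 1
        fi
    done
    echo "taxonomy looks complete in $dbdir/taxonomy"
}

# kraken2DB is the directory name used in the download script.
check_taxonomy kraken2DB || echo "taxonomy download may not have finished"
```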

Step 3: Add the genomes to library

Once the genomes are downloaded, add them to the library of your Kraken2 database.

```
#!/bin/bash --login

#SBATCH --job-name=K2DB
#SBATCH --account=XXXX
#SBATCH --partition=work
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables (one statement per line)
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Add the genome(s) to the library
find . -name "*.fna" -print0 | \
    xargs -0 -I {} kraken2-build --db "$DBDIR" --add-to-library "{}" --threads 120
```

Explanation:

find . -name "*.fna" → looks for all .fna files in the current directory and its subdirectories
xargs ... kraken2-build --add-to-library → adds each genome file into the Kraken2 library
--threads 120 → speeds up processing when handling many genomes
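Step 1 allows both `.fna` and `.fa` files, while the `find` pattern in the script matches only `.fna`. The following is a minimal sketch of a search that covers both extensions; the helper name `list_genomes` is hypothetical.

```shell
#!/bin/bash
# list_genomes DIR: print every regular file under DIR whose name ends in
# .fna or .fa, NUL-separated so that unusual file names survive the pipe.
list_genomes() {
    find "$1" -type f \( -name '*.fna' -o -name '*.fa' \) -print0
}

# Show the matches one per line for inspection (tr turns NULs into newlines).
list_genomes . | tr '\0' '\n' | sort
```

After checking the list, the NUL-separated output of `list_genomes .` can be piped into the same `xargs -0 -I {} kraken2-build ... --add-to-library "{}"` call used in the script.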

Step 4: Generate the seqid2taxid.map file

Kraken2 needs a mapping file between genome sequence IDs and taxonomy IDs. This ensures correct classification.

Step 4.1

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

Explanation:

NCBI provides accession2taxid files mapping every accession to its taxonomy ID

Step 4.2

gunzip nucl_gb.accession2taxid.gz

Step 4.3

cut -f2,3 nucl_gb.accession2taxid > seqid2taxid.map

Explanation:

Kraken2 needs a slimmed-down seqid2taxid.map
Only column 2 (accession.version) and column 3 (taxid) are kept
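To see what the column selection does, it can be tried on a tiny stand-in file first. The NCBI file is tab-separated and begins with a header line (`accession`, `accession.version`, `taxid`, `gi`), which `cut` alone would carry into the map, so this sketch drops it first. The two data rows are made up for illustration, and `make_seqid_map` is a hypothetical helper name.

```shell
#!/bin/bash
# Build a small sample with the same four-column layout as
# nucl_gb.accession2taxid. The accessions and taxids below are invented.
sample=$(mktemp)
printf 'accession\taccession.version\ttaxid\tgi\n'  > "$sample"
printf 'AB000001\tAB000001.1\t1111\t0\n'           >> "$sample"
printf 'AB000002\tAB000002.1\t2222\t0\n'           >> "$sample"

# make_seqid_map FILE: skip the header line, then keep columns 2 and 3
# (accession.version and taxid).
make_seqid_map() {
    tail -n +2 "$1" | cut -f2,3
}

make_seqid_map "$sample"
```

On the real file the same function would be called as `make_seqid_map nucl_gb.accession2taxid > seqid2taxid.map`.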

Step 4.4

Copy seqid2taxid.map into the top level of the database directory (kraken2DB in the scripts above), not into its taxonomy or library subdirectory.
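The copy can be wrapped in a small guard so that a missing file or directory is reported instead of passing silently. The following is a minimal sketch, assuming the map is in the current directory and the database directory is `kraken2DB`; `install_map` is a hypothetical helper name.

```shell
#!/bin/bash
# install_map MAP DBDIR: copy MAP to DBDIR/seqid2taxid.map, refusing to
# continue if the map is missing/empty or the directory does not exist.
install_map() {
    local map="$1" dbdir="$2"
    [ -s "$map" ]   || { echo "no such map file: $map" >&2; return 1; }
    [ -d "$dbdir" ] || { echo "no such database directory: $dbdir" >&2; return 1; }
    cp "$map" "$dbdir/seqid2taxid.map"
}

# Only act when both inputs are present in the current directory.
if [ -f seqid2taxid.map ] && [ -d kraken2DB ]; then
    install_map seqid2taxid.map kraken2DB
fi
```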

Step 5: Build the DB

Now that taxonomy and genome libraries are in place, the final step is to compile the database.

```
#!/bin/bash --login

#SBATCH --job-name=K2DB          # Name of the job
#SBATCH --account=xxxx           # Pawsey project account
#SBATCH --partition=highmem      # Partition (queue) to run the job
#SBATCH --time=1-00:00:00        # Time limit (1 day)
#SBATCH --ntasks=1               # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1                # Request 1 node
#SBATCH --exclusive              # Request exclusive access to the entire node (all CPUs); alternative: --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables (one statement per line)
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Build the database
kraken2-build --db "$DBDIR" --build --threads 120

# End of script
```

Explanation:

kraken2-build --build → processes taxonomy + genomes into a searchable DB
--threads 120 → uses multiple cores for faster DB building
A descriptive DBDIR name (e.g., kraken2DB, optionally with a date or version suffix) makes it easier to tell database builds apart

At this stage, your custom Kraken2 database is ready for classification!
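A finished Kraken2 database consists of three files, `hash.k2d`, `opts.k2d` and `taxo.k2d`, at the top level of the database directory. The following is a minimal sketch of a post-build check, assuming the directory is `kraken2DB`; `check_built_db` is a hypothetical helper name.

```shell
#!/bin/bash
# check_built_db DBDIR: succeed only if the three .k2d files exist and are
# non-empty in DBDIR.
check_built_db() {
    local dbdir="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        [ -s "$dbdir/$f" ] || { echo "missing or empty: $dbdir/$f" >&2; return 1; }
    done
    echo "database files present in $dbdir"
}

# kraken2DB is the directory name used in the build script.
check_built_db kraken2DB || echo "build may not have completed"
```

If the check passes, the database can be tried on a small read set, for example `kraken2 --db kraken2DB --threads 8 --report sample.report --output sample.kraken reads.fastq`, where the thread count and the three file names are placeholders.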

