https://github.com/asadprodhan/how_to_build_a_customised_kraken2_database
How_to_build_a_Customised_Kraken2_Database
Repository
How_to_build_a_Customised_Kraken2_Database
Basic Info
- Host: GitHub
- Owner: asadprodhan
- License: gpl-3.0
- Default Branch: main
- Size: 42 KB
How to Build a Customised Kraken2 Database
M. Asaduzzaman Prodhan*
Step 1: Download the genomes
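Download the genomes you want in the database and keep the FASTA files (.fna or .fa) in a single directory. Note that the Step 3 script searches only for .fna files, so rename any .fa files or adjust its find pattern.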
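Before moving on, it can help to take a quick inventory of the genome directory. The sketch below lists every .fna and .fa file with its number of sequences. The function name `count_sequences` is a hypothetical helper introduced here for illustration and is not part of Kraken2.

```shell
#!/bin/bash
# Sketch: list each genome FASTA file in a directory with its sequence count.
# count_sequences is a hypothetical helper name, not part of Kraken2.

count_sequences() {
    local genome_dir="$1" f
    find "$genome_dir" -maxdepth 1 -type f \( -name "*.fna" -o -name "*.fa" \) | sort |
        while IFS= read -r f; do
            # grep -c counts header lines, i.e. lines starting with ">"
            printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
        done
}

# Usage: count_sequences /path/to/genomes
```

A file reporting zero sequences is probably truncated or not a FASTA file and is worth checking before Step 3.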
Step 2: Download the taxonomy
Kraken2 requires taxonomy files from NCBI to classify sequences.
Below is a Slurm batch script for downloading taxonomy files on an HPC cluster:
```
#!/bin/bash --login

#SBATCH --job-name=K2DB            # Name of the job
#SBATCH --account=XXXX             # Project account
#SBATCH --partition=work           # Partition (queue) to run the job
#SBATCH --time=1-00:00:00          # Time limit (1 day)
#SBATCH --ntasks=1                 # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1                  # Request 1 node
#SBATCH --exclusive                # Request exclusive access to the entire node (all CPUs); alternative: --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate the Kraken2 Conda environment
conda activate kraken2

# Define the DB directory
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR   # <-- important fix

# Create the DB directory and download the taxonomy
kraken2-build --db "$DBDIR" --download-taxonomy --threads 120 --use-ftp

# End of script
```
Explanation:
conda activate kraken2 → loads Kraken2 environment
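export KRAKENDBNAME=$DBDIR → sets the database name in the environment (the location kraken2-build writes to is given by --db $DBDIR)
kraken2-build --download-taxonomy → fetches the NCBI taxonomy files into the taxonomy subdirectory of the database
--threads 120 → requests 120 CPU threads; the download itself is largely network-bound, so expect little speed-up at this step
--use-ftp → downloads the NCBI files via FTP instead of the default rsync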
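Since a failed or partial download only shows up later as a build error, a quick check after this job finishes can save a resubmission. The sketch below tests that `names.dmp` and `nodes.dmp` exist and are non-empty in the `taxonomy` subdirectory. `check_taxonomy` is a hypothetical helper name, not a Kraken2 command.

```shell
#!/bin/bash
# Sketch: confirm the taxonomy download left the core NCBI dump files in place.
# check_taxonomy is a hypothetical helper, not a Kraken2 command.

check_taxonomy() {
    local dbdir="$1" missing=0 f
    for f in names.dmp nodes.dmp; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dbdir/taxonomy/$f" ]; then
            echo "missing or empty: $dbdir/taxonomy/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_taxonomy kraken2DB && echo "taxonomy looks complete"
```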
Step 3: Add the genomes to library
Once you have genomes downloaded, you need to add them to your Kraken2 database.
```
#!/bin/bash --login

#SBATCH --job-name=K2DB
#SBATCH --account=XXXX
#SBATCH --partition=work
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Add the genome(s) to the library, one kraken2-build call per file
find . -name "*.fna" -print0 | \
    xargs -0 -I {} kraken2-build --db "$DBDIR" --add-to-library "{}" --threads 120
```
Explanation:
find . -name "*.fna" → looks for all .fna files in the current directory and its subdirectories
xargs ... kraken2-build --add-to-library → adds each genome file into the Kraken2 library
--threads 120 → speeds up processing when handling many genomes
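Step 1 allows both .fna and .fa files, while the find pattern above matches only .fna. A minimal sketch of a variant that picks up both extensions is shown below. The names `add_genomes`, `DRY_RUN` and `THREADS` are assumptions introduced for illustration. With `DRY_RUN=1` (the default in this sketch) the `kraken2-build` commands are printed rather than executed, so the file selection can be reviewed first.

```shell
#!/bin/bash
# Sketch: add every .fna and .fa file under a directory to the Kraken2 library.
# add_genomes, DRY_RUN and THREADS are illustrative names, not Kraken2 features.

add_genomes() {
    local genome_dir="$1" dbdir="$2"
    local threads="${THREADS:-1}"
    local runner=()
    # DRY_RUN=1 prefixes each command with echo, so nothing is executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        runner=(echo)
    fi
    # NUL-delimited names keep paths with spaces intact; -n 1 passes one file per call.
    find "$genome_dir" -type f \( -name "*.fna" -o -name "*.fa" \) -print0 |
        sort -z |
        xargs -0 -r -n 1 "${runner[@]}" \
            kraken2-build --db "$dbdir" --threads "$threads" --add-to-library
}

# Preview:  DRY_RUN=1 add_genomes . kraken2DB
# Execute:  DRY_RUN=0 THREADS=120 add_genomes . kraken2DB
```

The sketch relies on the GNU versions of `sort -z` and `xargs -r`, which are standard on Linux clusters.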
Step 4: Generate the seqid2taxid.map file
Kraken2 needs a mapping file between genome sequence IDs and taxonomy IDs. This ensures correct classification.
Step 4.1
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
Explanation:
- NCBI provides accession2taxid files mapping every accession to its taxonomy ID
Step 4.2
gunzip nucl_gb.accession2taxid.gz
Step 4.3
cut -f2,3 nucl_gb.accession2taxid > seqid2taxid.map
Explanation:
Kraken2 needs a slimmed-down seqid2taxid.map
Only column 2 (accession.version) and column 3 (taxid) are kept
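The NCBI accession2taxid tables are tab-separated and begin with a header row (accession, accession.version, taxid, gi), so a plain `cut` also copies the column names into the map. The sketch below skips that row and runs on a tiny synthetic file. The accessions and taxids are made up for illustration, and `make_seqid_map` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: build a two-column seqid2taxid.map from an accession2taxid table,
# dropping the header row. make_seqid_map is a hypothetical helper name.

make_seqid_map() {
    local acc2taxid="$1" out="$2"
    # Skip line 1 (column names), then keep accession.version and taxid.
    tail -n +2 "$acc2taxid" | cut -f2,3 > "$out"
}

# Tiny synthetic input in the same four-column, tab-separated layout.
# The accessions and taxids below are invented for this demonstration.
demo=$(mktemp -d)
printf 'accession\taccession.version\ttaxid\tgi\n' > "$demo/toy.accession2taxid"
printf 'XX000001\tXX000001.1\t111\t0\n' >> "$demo/toy.accession2taxid"
printf 'XX000002\tXX000002.1\t222\t0\n' >> "$demo/toy.accession2taxid"

make_seqid_map "$demo/toy.accession2taxid" "$demo/seqid2taxid.map"
cat "$demo/seqid2taxid.map"
```

On the real file the call would be `make_seqid_map nucl_gb.accession2taxid seqid2taxid.map`.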
Step 4.4
Copy seqid2taxid.map into the top level of the database directory ($DBDIR, i.e. kraken2DB), not into its taxonomy or library subdirectory.
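The full map covers far more accessions than a custom library contains. As an optional refinement that is not part of the original workflow, the map can be reduced to the sequence IDs that actually occur in the genome files before it is copied. The sketch below assumes each FASTA header starts with `>` followed directly by the accession.version as its first whitespace-delimited token. `filter_seqid_map` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: keep only map rows whose sequence ID appears in the genome FASTA headers.
# filter_seqid_map is a hypothetical helper. Assumption: the first token of each
# header, after ">", is the accession.version used in the map.

filter_seqid_map() {
    local full_map="$1" out="$2"
    shift 2
    local ids
    ids=$(mktemp)
    # Collect the unique header IDs from all genome files given as arguments.
    cat "$@" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u > "$ids"
    # First file: remember the IDs. Second file: print map rows with a known ID.
    awk -F'\t' 'NR == FNR { keep[$1]; next } $1 in keep' "$ids" "$full_map" > "$out"
    rm -f "$ids"
}

# Usage (paths are placeholders):
#   filter_seqid_map seqid2taxid.full.map kraken2DB/seqid2taxid.map genomes/*.fna
```

Comparing the number of rows in the filtered map with the number of sequences in the library shows whether any genome lacks a taxonomy assignment.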
Step 5: Build the DB
Now that taxonomy and genome libraries are in place, the final step is to compile the database.
```
#!/bin/bash --login

#SBATCH --job-name=K2DB            # Name of the job
#SBATCH --account=xxxx             # Pawsey project account
#SBATCH --partition=highmem        # Partition (queue) to run the job
#SBATCH --time=1-00:00:00          # Time limit (1 day)
#SBATCH --ntasks=1                 # Single task/job
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1                  # Request 1 node
#SBATCH --exclusive                # Request exclusive access to the entire node (all CPUs); alternative: --cpus-per-task=128
#SBATCH --no-requeue
#SBATCH --export=none

# Activate Kraken2 environment
conda activate kraken2

# Fix Kraken2 environment variables
DBDIR=kraken2DB
export KRAKENDBNAME=$DBDIR
export KRAKEN_DIR=/path/to/miniconda3/envs/kraken2/libexec

# Build the database
kraken2-build --db "$DBDIR" --build --threads 120

# End of script
```
Explanation:
kraken2-build --build → processes taxonomy + genomes into a searchable DB
--threads 120 → uses multiple cores for faster DB building
Giving DBDIR a consistent, descriptive name (e.g., kraken2DB) helps keep track of different database versions
At this stage, your custom Kraken2 database is ready for classification!
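A finished Kraken2 database consists of three files in the database directory: `hash.k2d`, `opts.k2d` and `taxo.k2d`. The sketch below checks that all three exist and are non-empty before the database is used. `check_kraken2_db` is a hypothetical helper name, and the file names in the commented classification command are placeholders.

```shell
#!/bin/bash
# Sketch: verify that the build produced the three Kraken2 database files.
# check_kraken2_db is a hypothetical helper, not a Kraken2 command.

check_kraken2_db() {
    local dbdir="$1" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -s "$dbdir/$f" ]; then
            echo "missing or empty: $dbdir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Usage, followed by a classification run (reads.fastq and the output
# names are placeholders):
#   check_kraken2_db kraken2DB &&
#       kraken2 --db kraken2DB --threads 8 \
#           --report sample.report --output sample.kraken reads.fastq
```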
Owner
- Name: Asad Prodhan
- Login: asadprodhan
- Kind: user
- Location: Perth, Australia
- Company: Department of Primary Industries and Regional Development
- Website: www.linkedin.com/in/asadprodhan
- Twitter: Asad_Prodhan
- Repositories: 2
- Profile: https://github.com/asadprodhan
Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.