prok-snptree

https://github.com/rknx/prok-snptree

Repository

Basic Info

Host: GitHub
Owner: rknx
Language: Shell
Default Branch: main
Size: 28.3 KB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 2

Created over 4 years ago · Last pushed over 2 years ago

Prok-SNPTree

Prok-SNPTree is a pipeline that generates a phylogenetic tree from core substitutions between a reference genome and whole-genome Illumina sequencing reads.
Prok-SNPTree was originally designed and optimized for small prokaryotic genomes, but it can perform reasonably well with larger genomes of up to 100 Mb.

Prok-SNPTree comes with a helper sbatch file for providing the required input information. The sbatch file has preset options for running under the SLURM scheduler, but it should also run without SLURM.

Installation

Place the helper batch file and the script in the working directory, then make the script executable: chmod +x prok-snptree.sh

Dependencies

The following tools must be installed, and their executables must be available in the environment (for example, on PATH).

| Function | Tools/scripts |
| --- | --- |
| Parallelized sample processing | GNU Parallel |
| Quality check for raw reads | FastQC, MultiQC |
| Adapter identification and trimming | cutadapt, trim_galore |
| Genome indexing and read alignment | bwa |
| Binary conversion and sorting | Samtools |
| Variant calling and selection | GATK |
| Phylogenetic tree | RAxML |
| Pairwise SNP count | FastaToSNPCount.sh |
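
Before a run, it can help to confirm that the tools above are reachable. The following is a minimal sketch, not part of Prok-SNPTree itself. The function name check_tools is hypothetical, and the executable names are assumptions that may differ on your system (RAxML, for instance, is often installed as raxmlHPC-PTHREADS).

```shell
#!/usr/bin/env bash
# Report which executables are reachable on PATH.
# check_tools is a hypothetical helper, not part of the pipeline.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

# Executable names below are assumptions; adjust them to your installation.
check_tools parallel fastqc multiqc cutadapt trim_galore bwa samtools gatk raxmlHPC \
    || echo "Install the missing tools before running the pipeline." >&2
```

FastaToSNPCount.sh is a script rather than an installed program, so it is left out of this check.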

Input files

All paired, gzipped fastq files can be placed in the working directory or in a subdirectory named fastq. The pipeline does not currently support singletons (unpaired reads).
The reference genome should be in the refs subdirectory and named genome.fna. Symbolic links are accepted. Alternatively, the script can download it automatically (with wget) if a direct link is provided (see arguments below).
The reference annotation is not currently used. For future-proofing, it may be provided in the refs subdirectory as genes.gtf. See the alternative method in arguments.
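
The layout described above can be checked with a short script before submitting a job. This is only a sketch: the helper names has_reads and check_inputs are hypothetical, and the .fastq.gz and .fq.gz extensions are assumptions, since the exact file naming expected by the pipeline is not stated here.

```shell
#!/usr/bin/env bash
# Return success if at least one gzipped fastq file exists in the
# working directory or in its fastq/ subdirectory.
# The two extensions tested are assumptions.
has_reads() {
    for f in "$1"/*.fastq.gz "$1"/*.fq.gz "$1"/fastq/*.fastq.gz "$1"/fastq/*.fq.gz; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Check reads and reference genome under a working directory (default: .).
check_inputs() {
    dir=${1:-.}
    status=0
    if ! has_reads "$dir"; then
        echo "No gzipped fastq files found in $dir or $dir/fastq" >&2
        status=1
    fi
    # -s follows symbolic links, so a linked genome.fna passes this test.
    if [ ! -s "$dir/refs/genome.fna" ]; then
        echo "refs/genome.fna is missing or empty (expected if refgenome is a URL)" >&2
        status=1
    fi
    return "$status"
}

# Usage: check_inputs /path/to/workdir
```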

Arguments

The script accepts the following arguments, which are supplied from the helper sbatch file.

refgenome (Reference genome)
Direct link (URL) to the gzipped reference genome. This is for convenience; the intended goal is to be able to download the reference genome from NCBI or similar sources. Leave the value empty if the genome will be supplied manually (see input files above).
refannotation (Annotation file for reference genome)
Direct link (URL) to the gzipped reference annotation, supplied in the same way as refgenome. This option is reserved for future functionality and may be left empty for now.
minDP, minQD, maxRDP, and minADP (VCF filtration parameters)

- minDP: Minimum sequencing depth (Positions that fail are considered absent)
- minQD: Minimum depth-normalized quality (SNPs that fail are ignored)
- maxRDP: Maximum reference allele depth for SNPs to be accepted as real.
- minADP: Minimum alternate allele depth for SNPs to be accepted as real.
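
To make the four thresholds concrete, here is a minimal sketch of how a single site could be classified. It is not the pipeline's actual filter, which relies on GATK. The function classify_site, the threshold values, the "rejected" label, and the choice of inclusive comparisons are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical example thresholds; real values come from the helper sbatch file.
minDP=10
minQD=5
maxRDP=2
minADP=8

# classify_site DP QD REF_DEPTH ALT_DEPTH
# Prints one of: absent, ignored, rejected, accepted.
classify_site() {
    awk -v dp="$1" -v qd="$2" -v rdp="$3" -v adp="$4" \
        -v minDP="$minDP" -v minQD="$minQD" \
        -v maxRDP="$maxRDP" -v minADP="$minADP" '
        BEGIN {
            if (dp + 0 < minDP + 0)        print "absent"    # too little coverage
            else if (qd + 0 < minQD + 0)   print "ignored"   # low depth-normalized quality
            else if (rdp + 0 > maxRDP + 0) print "rejected"  # too many reference reads
            else if (adp + 0 < minADP + 0) print "rejected"  # too few alternate reads
            else                           print "accepted"
        }'
}

classify_site 5 20 0 5      # depth 5 is below minDP
classify_site 30 25.1 1 29  # passes all four checks
```

awk handles the comparisons because QD is usually a decimal value, which shell integer tests cannot compare.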

ncpu and nmem (Parallelization parameters)
Number of CPUs and total memory to use. If fewer than 8 CPUs are available, only one sample is processed at a time, using all available CPUs. Otherwise, 8 CPUs are used per sample, and samples are processed in parallel according to the total number of CPUs.
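
The CPU rule above can be sketched as a small function. The name plan_parallel is hypothetical, and rounding the number of simultaneous samples down to ncpu / 8 is an assumption; the script may divide resources differently.

```shell
#!/usr/bin/env bash
# plan_parallel NCPU
# Prints "<samples at once> <threads per sample>".
plan_parallel() {
    ncpu=$1
    if [ "$ncpu" -lt 8 ]; then
        # Fewer than 8 CPUs: one sample at a time with every CPU.
        njobs=1
        threads=$ncpu
    else
        # Otherwise 8 CPUs per sample; integer division is an assumption.
        threads=8
        njobs=$((ncpu / 8))
    fi
    echo "$njobs $threads"
}

plan_parallel 4    # prints "1 4"
plan_parallel 32   # prints "4 8"
```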
nboot (Bootstrap)
Number of bootstrap replicates to use when building the phylogenetic tree with RAxML.
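
For orientation, the sketch below prints a RAxML command in which the bootstrap count appears as the -# option. It is not necessarily the command the pipeline runs. The function raxml_cmd, the file and run names, the GTRGAMMA model, and the seeds are illustrative assumptions. The command is only printed, not executed.

```shell
#!/usr/bin/env bash
# raxml_cmd ALIGNMENT NBOOT RUN_NAME
# Prints a rapid-bootstrap RAxML command line (-f a) without running it.
raxml_cmd() {
    # -p and -x are random seeds; -# is the number of bootstrap replicates.
    printf 'raxmlHPC -f a -m GTRGAMMA -p 12345 -x 12345 -# %s -s %s -n %s\n' \
        "$2" "$1" "$3"
}

# Hypothetical alignment file and run name.
raxml_cmd core_snps.fasta 100 prok_tree
```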

Running the program

If SLURM is available, edit the resource parameters in the sbatch file and run it as sbatch slurm.batch.
If running without SLURM, run it as bash slurm.batch. This has not been thoroughly tested and is not officially supported in the current version.
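
The choice between the two launch modes can be automated. The following is a small sketch with a hypothetical helper, pick_launcher, that selects sbatch when it is on PATH and falls back to bash otherwise. It only prints the command it would run.

```shell
#!/usr/bin/env bash
# Print "sbatch" if SLURM's submit command is available, otherwise "bash".
pick_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo sbatch
    else
        echo bash
    fi
}

launcher=$(pick_launcher)
echo "Would run: $launcher slurm.batch"
```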

Optimizations

The pipeline is written to allow resuming or rerunning. The main output files are named systematically and are used as checkpoints.
Some examples:
- Trimming: non-empty <sample>.og files in the fastq subdirectory.
- Alignment: non-empty .bam file in the align/<sample> subdirectory.
- Variant calling: non-empty .vcf file in the variants/<sample> subdirectory.

If some samples fail, just rerun the pipeline after making changes to the inputs. Completed samples with valid outputs are not processed again.

If the pipeline exits in the middle of an operation, just rerun it, and the pipeline will pick up from the last completed operation.

If you add new samples, just rerun the pipeline to process them.
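
The checkpoint idea described in this section comes down to testing whether an output file exists and is non-empty before running a step. Below is a minimal sketch of that pattern. The helpers is_done, run_step, and make_output are hypothetical and are not taken from the script.

```shell
#!/usr/bin/env bash
# A step counts as complete when its output file exists and is non-empty.
is_done() {
    [ -s "$1" ]
}

# run_step CHECKPOINT_FILE COMMAND [ARGS...]
# Runs the command only if the checkpoint file is missing or empty.
run_step() {
    out=$1
    shift
    if is_done "$out"; then
        echo "skip: $out"
    else
        echo "run: $out"
        "$@"
    fi
}

# Demonstration with a stand-in for a real pipeline step.
workdir=$(mktemp -d)
make_output() { echo "data" > "$1"; }
run_step "$workdir/sample1.bam" make_output "$workdir/sample1.bam"  # runs the step
run_step "$workdir/sample1.bam" make_output "$workdir/sample1.bam"  # skips it
rm -rf "$workdir"
```

An empty output left behind by a failed run fails the -s test, so that step runs again on the next attempt.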

Citation

A peer-reviewed paper is pending publication. For now, please cite the Zenodo record as follows:

Sharma A. 2022. rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes (v0.1b). Zenodo. https://doi.org/10.5281/zenodo.7445133

Owner

Name: Anuj Sharma
Login: rknx
Kind: user
Location: Gainesville, FL
Company: University of Florida

Website: anujs.com.np
Repositories: 32
Profile: https://github.com/rknx

A little bit of Javascript, little bit of android and a lot of Photoshop. Plant pathologist by training.

Citation (CITATION.cff)

cff-version: 0.2b
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sharma
    given-names: Anuj
    orcid: https://orcid.org/0000-0000-0000-0000
title: rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes
doi: https://doi.org/10.5281/zenodo.7445133
version: v0.2b
date-released: 2022-12-16
