prok-snptree
Repository
Basic Info
- Host: GitHub
- Owner: rknx
- Language: Shell
- Default Branch: main
- Size: 28.3 KB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
Prok-SNPTree
Prok-SNPTree is a pipeline that generates a phylogenetic tree from core substitutions (SNPs) between a reference genome and whole-genome Illumina sequencing reads.
Prok-SNPTree was originally designed and optimized for small prokaryotic genomes, but it can perform reasonably well with larger genomes of up to 100 Mb.
Prok-SNPTree comes with a helper sbatch file for providing the required input information. The helper file has preset resource settings for running through the SLURM scheduler, but it should also run without SLURM.
Installation
Place the helper batch file and the script in the working directory, then make the script executable.
chmod +x prok-snptree.sh
Dependencies
The following tools must be installed, and their executables must be available in the environment (on PATH).
| Function | Tools/scripts |
| --- | --- |
| Parallelized sample processing | GNU Parallel ⇨ Source · Website |
| Quality check for raw reads | FᴀsᴛQC ⇨ Source · Website<br>MᴜʟᴛɪQC ⇨ Source · Reference · Website |
| Adapter identification and trimming | ᴄᴜᴛᴀᴅᴀᴘᴛ ⇨ Source · Reference<br>ᴛʀɪᴍ_ɢᴀʟᴏʀᴇ ⇨ Source · Reference · Website |
| Genome indexing and read alignment | ʙᴡᴀ ⇨ Source · Reference · Reference · Website |
| Binary conversion and sorting | Sᴀᴍᴛᴏᴏʟs ⇨ Source · Reference · Reference · Website |
| Variant calling and selection | GATK ⇨ Source · Reference · Website |
| Phylogenetic tree | RAxML ⇨ Source · Reference · Website |
| Pairwise SNP count | FᴀsᴛᴀTᴏSNPCᴏᴜɴᴛ.sʜ ⇨ Source |
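A quick way to confirm that the dependencies are reachable is to look each one up on PATH before starting. The sketch below is not part of Prok-SNPTree; the function name `missing_tools` is hypothetical and the executable names are assumptions that may differ on your system (RAxML in particular is distributed under several binary names).

```bash
# Print every name from the argument list that is not an executable on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; adjust them to match your installation.
tools="parallel fastqc multiqc cutadapt trim_galore bwa samtools gatk raxmlHPC"

# Word splitting of $tools is intentional here.
missing=$(missing_tools $tools)
if [ -n "$missing" ]; then
    printf 'Missing from PATH:\n%s\n' "$missing"
else
    echo "All listed tools were found."
fi
```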
Input files
- All paired gzipped fastq files can be placed in the working directory or in a subdirectory named `fastq`. The pipeline does not currently support singletons.
- The reference genome should be in the `refs` subdirectory and named `genome.fna`. Symbolic links are accepted. Alternatively, the script can download it automatically (with wget) if a direct link is provided (see arguments below).
- Reference annotation is not currently used. For future-proofing, it may be provided inside the `refs` subdirectory as `genes.gtf`. See alternative methods in arguments.
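The layout described above can be checked before a run with a few file tests. This is a minimal sketch, not code from the pipeline: `check_inputs` is a hypothetical helper, and the `*.fastq.gz` and `*.fq.gz` patterns are assumptions about how the read files are named.

```bash
# Report problems with the expected input layout under a working directory.
# Returns 0 when the layout looks usable, 1 otherwise.
check_inputs() {
    local workdir=${1:-.} status=0 found=0 f
    # Reference genome: refs/genome.fna (-s follows symbolic links).
    if [ ! -s "$workdir/refs/genome.fna" ]; then
        echo "missing or empty: refs/genome.fna"
        status=1
    fi
    # Reads: gzipped fastq in the working directory or in fastq/.
    # The file name patterns are assumptions made for this sketch.
    for f in "$workdir"/*.fastq.gz "$workdir"/*.fq.gz \
             "$workdir"/fastq/*.fastq.gz "$workdir"/fastq/*.fq.gz; do
        if [ -e "$f" ]; then
            found=1
            break
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no gzipped fastq files found"
        status=1
    fi
    return "$status"
}

check_inputs . || echo "Fix the items above before starting the pipeline."
```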
Arguments
The script accepts the following arguments, which are supplied from the helper sbatch file.
refgenome (Reference genome)
Direct link (URL) to the gzipped reference genome. This is for convenience; the intended goal is to be able to download the reference genome from NCBI etc. Leave its value empty if the genome will be supplied manually (see input files above).
refannotation (Annotation file for reference genome)
Direct link (URL) to the gzipped reference annotation, used in the same way as refgenome. This option is here for future functionality and may be left empty for now.
minDP, minQD, maxRDP, and minADP (VCF filtration parameters)
- minDP: Minimum sequencing depth (Positions that fail are considered absent)
- minQD: Minimum depth-normalized quality (SNPs that fail are ignored)
- maxRDP: Maximum reference allele depth for SNPs to be accepted as real.
- minADP: Minimum alternate allele depth for SNPs to be accepted as real.
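To make the roles of the four thresholds concrete, the sketch below classifies a single site from its depth and quality values. It only illustrates the decision logic as described in the list above; the pipeline itself performs filtering with GATK, and the threshold values, the order of the checks, the function name `classify_site`, and the "rejected" label are all assumptions.

```bash
# Placeholder thresholds for illustration only, not the pipeline defaults.
minDP=10 minQD=2 maxRDP=2 minADP=8

# Usage: classify_site DEPTH QD REF_ALLELE_DEPTH ALT_ALLELE_DEPTH
classify_site() {
    awk -v dp="$1" -v qd="$2" -v rdp="$3" -v adp="$4" \
        -v minDP="$minDP" -v minQD="$minQD" \
        -v maxRDP="$maxRDP" -v minADP="$minADP" '
    BEGIN {
        if (dp < minDP)        print "absent"    # position too shallow to call
        else if (qd < minQD)   print "ignored"   # SNP quality too low
        else if (rdp > maxRDP) print "rejected"  # too many reference reads
        else if (adp < minADP) print "rejected"  # too few alternate reads
        else                   print "snp"       # accepted as a real SNP
    }'
}

classify_site 5 30 0 5     # absent
classify_site 20 1.5 0 20  # ignored
classify_site 20 30 5 15   # rejected
classify_site 20 30 0 20   # snp
```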
ncpu and nmem (Parallelization parameters)
Number of CPUs and total memory to use. If fewer than 8 CPUs are available, only one sample is processed at a time, using all available CPUs. Otherwise, 8 CPUs are used per sample, and samples are processed in parallel according to the number of CPUs.
nboot (Bootstrap)
Number of bootstraps to be used while preparing the phylogenetic tree with RAxML.
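The ncpu rule described above can be written out as a small calculation. This is a sketch under assumptions rather than the script's actual code: `plan_jobs` is a hypothetical name, and integer division is assumed for CPU counts that are not a multiple of 8.

```bash
# Print "<samples in parallel> <threads per sample>" for a given CPU count.
plan_jobs() {
    local ncpu=$1
    if [ "$ncpu" -lt 8 ]; then
        # Fewer than 8 CPUs: one sample at a time, using all of them.
        echo "1 $ncpu"
    else
        # 8 threads per sample; leftover CPUs are assumed to stay idle.
        echo "$(( ncpu / 8 )) 8"
    fi
}

plan_jobs 4    # 1 4
plan_jobs 20   # 2 8
```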
Running the program
If SLURM is available, edit the resource parameters in the sbatch file and run it as sbatch slurm.batch.
If running without SLURM, run it as bash slurm.batch. This has not been thoroughly tested and is not officially supported in the current version.
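If the same launch command has to work on machines with and without SLURM, the choice can be made by checking whether `sbatch` is on PATH. The helper name `pick_launcher` is hypothetical; the snippet only prints the command it would use.

```bash
# Choose how to launch the helper file: sbatch if SLURM is on PATH, else bash.
pick_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo sbatch
    else
        echo bash
    fi
}

echo "Would run: $(pick_launcher) slurm.batch"
```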
Optimizations
The pipeline is written to allow resuming or rerunning. The main output files are named systematically and are used as checkpoints.
Some examples:
- Trimming: non-empty <sample>.og files in fastq subdirectory.
- Alignment: non-empty .bam file in align/<sample> subdirectory.
- Variant calling: non-empty .vcf file in variants/<sample> subdirectory.
If some samples fail, rerun the pipeline after correcting the inputs. Completed samples with valid outputs are not processed again.
If the pipeline exits in the middle of an operation, rerun it, and it will pick up from the last completed operation.
If you add new samples, rerun the pipeline to process them.
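The checkpoint idea above amounts to testing whether an output file exists and is non-empty before running a step. The following is a minimal sketch of that pattern, not the pipeline's own code; `needs_run` is a hypothetical helper and the file name inside align/<sample>/ is an assumption.

```bash
# Succeed when a step still has to run, i.e. its output file is missing or
# empty; fail when a non-empty output already exists.
needs_run() {
    [ ! -s "$1" ]
}

# Sample names are taken from the command line; with none given, the loop
# does nothing.
for sample in "$@"; do
    if needs_run "align/$sample/$sample.bam"; then
        echo "align: $sample"   # the real alignment command would go here
    else
        echo "skip:  $sample"   # checkpoint found, nothing to do
    fi
done
```

Because `-s` is false for both missing and zero-length files, an output truncated to nothing by an interrupted run is treated as not done. A partially written, non-empty file would still pass this test.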
Citation
A peer-reviewed paper is pending publication. For now, please cite the Zenodo record as follows:
Sharma A. 2022. rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes (v0.1b). Zenodo. https://doi.org/10.5281/zenodo.7445133
Owner
- Name: Anuj Sharma
- Login: rknx
- Kind: user
- Location: Gainesville, FL
- Company: University of Florida
- Website: anujs.com.np
- Repositories: 32
- Profile: https://github.com/rknx
A little bit of Javascript, little bit of android and a lot of Photoshop. Plant pathologist by training.
Citation (CITATION.cff)
cff-version: 0.2b
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sharma
    given-names: Anuj
    orcid: https://orcid.org/0000-0000-0000-0000
title: rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes
doi: https://doi.org/10.5281/zenodo.7445133
version: v0.2b
date-released: 2022-12-16