cocci-call
Repository
Basic Info
- Host: GitHub
- Owner: ksw9
- Language: Nextflow
- Default Branch: main
- Size: 3.16 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Coccidioides species and variant identification pipeline
Pipeline for species and variant identification of Coccidioides immitis and C. posadasii from short-read data, intended for epidemiology and phylogenetics. Briefly, the pipeline takes raw short-read data, pre-processes the reads (adapter trimming and taxonomic filtering), maps them to a provided reference genome, calls variants, and outputs consensus FASTA sequences for downstream applications.
The pipeline is designed to run on High Performance Computing (HPC) systems that use the Slurm Workload Manager. Tasks are parallelized by the Nextflow workflow system, and the pipeline runs its tools in containers (Docker, Podman, or Singularity) to enhance reproducibility and portability.
/// ---------------------------------------- ///
Overview:
RESOURCESPREP workflow
Run this workflow by selecting the 'download_refs' profile (see 'RUN COMMAND' section below).
Reference FASTA and GFF download
Masking (optional) with RepeatMasker, or Nucmer, or both
FASTA indexing
BWA index generation
GATK dictionary generation
snpEff download and database generation
Kraken2 database generation (bacteria, fungi, viral, Homo sapiens, and EuPathDB46 genomes)
VARIANTCALLING workflow
Run this workflow by selecting the 'variant_calling' profile (see 'RUN COMMAND' section below).
Read trimming and QC with TrimGalore
Kraken2 filtering and species determination (immitis vs posadasii) based on number of unique minimizers
Mapping with BWA
Marking duplicates with GATK
Variant calling with GATK and (optionally) LoFreq
Annotation of vcf files with snpEff
Converting vcf to consensus FASTA
Summarizing run info
Note that the species is assigned using a user-defined threshold on the log2 ratio of unique minimizers for immitis and posadasii. If the species cannot be determined at the Kraken2 step (e.g. because the threshold is set too high), it is assigned at the BWA step as the species whose reference yields the higher mapping percentage.
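The two-stage species assignment described above can be sketched in Python. This is only an illustration, not the pipeline's actual code: the function names, the symmetric use of the threshold, the pseudocount of 1, and the tie-breaking rule are all assumptions introduced for the example.

```python
import math
from typing import Optional


def species_from_minimizers(immitis: int, posadasii: int, threshold: float) -> Optional[str]:
    """Call a species from Kraken2 unique-minimizer counts, or return None if undecided.

    Assumptions (not taken from the pipeline): the ratio is immitis/posadasii,
    the threshold is applied symmetrically, and a pseudocount of 1 avoids
    division by zero when one species has no minimizers.
    """
    log_ratio = math.log2((immitis + 1) / (posadasii + 1))
    if log_ratio >= threshold:
        return "immitis"
    if log_ratio <= -threshold:
        return "posadasii"
    return None  # ratio too close to 0 for the chosen threshold


def assign_species(immitis: int, posadasii: int, threshold: float,
                   immitis_mapped_pct: float, posadasii_mapped_pct: float) -> str:
    """Use the Kraken2 call when available, otherwise fall back to BWA mapping percentage."""
    call = species_from_minimizers(immitis, posadasii, threshold)
    if call is not None:
        return call
    # Fallback: the reference with the higher mapping percentage wins.
    # Resolving ties in favour of immitis is an arbitrary choice for this sketch.
    return "immitis" if immitis_mapped_pct >= posadasii_mapped_pct else "posadasii"


if __name__ == "__main__":
    # Clear Kraken2 signal: the mapping percentages are not consulted.
    print(assign_species(7999, 999, 2.0, 95.0, 96.0))
    # Ambiguous Kraken2 signal: the decision falls back to mapping percentage.
    print(assign_species(1500, 1000, 2.0, 91.2, 97.5))
```

A higher threshold makes the Kraken2 step more conservative, so more samples are decided by the mapping-percentage fallback.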
/// ---------------------------------------- ///
SETUP:
- Clone the GitHub repo.
git clone https://github.com/ksw9/cocci-call.git
- Make sure that Nextflow and your container platform of choice (Docker, Podman, or Singularity) are installed. If using Lmod, load the necessary modules: e.g.
module load docker java nextflow
- Move into the cloned repository and unzip the masking files in the masking_files folder.
cd cocci-call
gzip -d masking_files/*.fa.gz
- Run the RESOURCESPREP workflow to populate your resources directory with the required references and databases. Note that Kraken2 database generation can take days, so make sure to submit a Slurm job with appropriate quality of service specifications.
nextflow run main.nf -profile singularity,download_refs --repeatmasker_mask true --nucmer_mask true --resources_dir "$(pwd)/resources"
- Modify the config file (nextflow.config):
Update resources_dir (full path to the resources directory)
Verify resources paths (relative to resources_dir)
Set clusterOptions parameters according to your HPC settings
- Run pipeline on input data.
nextflow run main.nf -profile singularity,variant_calling --repeatmasker_mask true --nucmer_mask true
/// ---------------------------------------- ///
INPUT:
The variant calling workflow reads sample information from a tab-delimited file placed in "/path/to/resources_dir/input". The file has the following columns:
sample: unique sample identifier
fastq_1: full path to fastq mate 1
fastq_2: full path to fastq mate 2
batch: batch name
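As an illustration of the input format above, the following Python sketch writes and checks a sample sheet with those four columns. The presence of a header row, the helper names, the example paths, and the specific checks (unique sample IDs, absolute FASTQ paths) are assumptions made for the example; the pipeline itself may validate its input differently.

```python
import csv
import io

COLUMNS = ["sample", "fastq_1", "fastq_2", "batch"]


def write_sample_sheet(rows, handle):
    """Write rows (dicts keyed by COLUMNS) as a tab-delimited sheet with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def check_sample_sheet(handle):
    """Return the parsed rows, raising ValueError on the problems checked below."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"expected columns {COLUMNS}, found {reader.fieldnames}")
    seen, rows = set(), []
    for row in reader:
        if row["sample"] in seen:
            raise ValueError(f"duplicate sample identifier: {row['sample']}")
        for column in ("fastq_1", "fastq_2"):
            # The README asks for full paths to both mates.
            if not (row[column] or "").startswith("/"):
                raise ValueError(f"{column} is not a full path for sample {row['sample']}")
        seen.add(row["sample"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical sample; the paths are placeholders.
    example = [{
        "sample": "sampleA",
        "fastq_1": "/data/reads/sampleA_R1.fastq.gz",
        "fastq_2": "/data/reads/sampleA_R2.fastq.gz",
        "batch": "batch_0",
    }]
    buffer = io.StringIO()
    write_sample_sheet(example, buffer)
    print(buffer.getvalue(), end="")
    print(check_sample_sheet(io.StringIO(buffer.getvalue())))
```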
/// ---------------------------------------- ///
OUTPUTS:
RESOURCESPREP workflow
A folder named "resources" is created by default in the project directory. The directory has the following structure:
resources
│
├── immitis_bwa_index
├── immitis_gatk_dictionary
├── immitis_refs
├── input
├── kraken_db
├── posadasii_bwa_index
├── posadasii_gatk_dictionary
├── posadasii_refs
└── snpEff
VARIANTCALLING workflow
All outputs are stored in the results directory, within the project directory. Directory structure mirrors the input reads file, with directories organized by batch, then sample.
results
│
├── batch_0
│   │
│   └── sample
│       │
│       ├── bams
│       ├── fasta
│       ├── kraken
│       ├── stats
│       ├── trim
│       └── vars
│
└── batch_n
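For downstream scripts that collect outputs, the batch/sample layout above can be turned into paths with a small helper. This is a sketch only: the function name and the example batch and sample names are hypothetical, and the file names inside each subdirectory are not covered here.

```python
from pathlib import PurePosixPath

# Per-sample subdirectories listed in the results tree.
SUBDIRS = ("bams", "fasta", "kraken", "stats", "trim", "vars")


def sample_output_dirs(results_dir, batch, sample):
    """Map each per-sample subdirectory name to its path under results/batch/sample."""
    base = PurePosixPath(results_dir) / batch / sample
    return {name: base / name for name in SUBDIRS}


if __name__ == "__main__":
    for name, path in sample_output_dirs("results", "batch_0", "sampleA").items():
        print(f"{name}: {path}")
```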
/// ---------------------------------------- ///
RUN COMMAND:
nextflow run main.nf -profile [PROFILES] [OPTIONS]
PROFILES:
standard: runs the pipeline using Docker
docker: runs the pipeline using Docker
podman: runs the pipeline using Podman
singularity: runs the pipeline using Singularity
download_refs: runs the RESOURCESPREP workflow, which populates the "resources" directory with the necessary files for the analysis workflow
variant_calling: runs the VARIANTCALLING workflow
OPTIONS:
repeatmasker_mask: set to true if you want to mask the genome with RepeatMasker, false otherwise
nucmer_mask: set to true if you want to mask the genome with Nucmer, false otherwise
minimizerslogratio_thr: threshold for the log2 ratio of Kraken2 unique minimizers for C. immitis and C. posadasii
run_lofreq: set to true if you want to run LoFreq, false otherwise. Note that GATK is always run
seq_platform: name of the sequencing platform to be added to reads by Picard AddOrReplaceReadGroups prior to GATK processing
library: name of the library to be added to reads by Picard AddOrReplaceReadGroups prior to GATK processing
vcf_filter: filter for vcf files, based on any header column. The expression specifies which variants are kept, e.g. 'QUAL > 20' discards any variant whose quality is not above 20. Multiple conditions can be combined with the "&&" (i.e. and) or "||" (i.e. or) operators, e.g. 'QD > 2.0 && FS < 60.0 && MQ > 40.0 && DP > 10 && GQ > 50 && QUAL > 20 && QUAL != "." && RGQ > 20 && (TYPE == "SNP" || TYPE == "indel" || TYPE != "insert")'
ploidy: Ploidy to be used by GATK
nextseq: set to true if data comes from an Illumina NextSeq system, false otherwise
nextseqqualthreshold: read quality threshold for Illumina NextSeq data
variants_only: set to true if GATK HaplotypeCaller should only output variant sites, false to also output invariant sites
fastacallsonly: set to true to only include vcf calls in the output fasta, false to modify the reference with vcf variants
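When launching runs from a script, the profile and option syntax above can be assembled programmatically. The sketch below assumes, following the example commands in SETUP, that multiple profiles are joined with commas and that each option is passed as --name value; the helper name is hypothetical.

```python
import shlex


def build_command(profiles, options):
    """Build the argument list for 'nextflow run main.nf -profile ... --option value ...'."""
    command = ["nextflow", "run", "main.nf", "-profile", ",".join(profiles)]
    for name, value in options.items():
        # Booleans are written as lowercase true/false, as in the README examples.
        if isinstance(value, bool):
            value = "true" if value else "false"
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    command = build_command(
        ["singularity", "download_refs"],
        {"repeatmasker_mask": True, "nucmer_mask": True},
    )
    # Print a shell-quoted version; pass the list itself to subprocess.run to execute it.
    print(shlex.join(command))
```

Keeping the command as a list avoids quoting problems when an option value, such as a vcf_filter expression, contains spaces or shell metacharacters.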
/// ---------------------------------------- ///
DEPENDENCIES:
Nextflow 23.10.1+
Container platform, one of
- Docker 20.10.21+
- Podman
- Singularity 3.8.5+
Slurm
/// ---------------------------------------- ///
DAG
RESOURCESPREP workflow
VARIANTCALLING workflow
Owner
- Name: Katharine S. Walter
- Login: ksw9
- Kind: user
- Company: University of Utah
- Website: https://ksw9.github.io/
- Twitter: katwalter7
- Repositories: 4
- Profile: https://github.com/ksw9
Assistant Prof in Epidemiology. Pathogen evolution | ecology | transmission.
Citation (citation.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: cocci-call
message: >-
  Variant identification and species assignment for
  Coccidioides spp.
type: software
authors:
  - given-names: Marco
    family-names: Marchetti
  - given-names: Kimberly
    family-names: Hanson
  - given-names: Bridget
    family-names: Barker
  - given-names: Katharine
    family-names: Walter
    name-particle: S.
    email: katharine.walter@hsc.utah.edu
    orcid: 'https://orcid.org/0000-0003-0065-2204'
identifiers:
  - type: url
    value: 'https://github.com/ksw9/cocci-call'
repository-code: 'https://github.com/ksw9/cocci-call'
url: 'https://github.com/ksw9/cocci-call'
abstract: >-
  Pipeline for Coccidioides immitis and posadasii species
  and variant identification from short-read data for
  epidemiology and phylogenetics. Briefly, this pipeline
  takes raw short-read data, pre-processes reads (read
  adapter trimming and taxonomic filtering), maps them to a
  provided reference genome, calls variants, and outputs
  consensus FASTA sequences for downstream applications.
version: '1.0'
GitHub Events
Total
- Push event: 9
- Fork event: 1
Last Year
- Push event: 9
- Fork event: 1