poolwgs2snp

Workflow for processing whole genome pool-seq data and identifying SNPs

https://github.com/sfeng666/poolwgs2snp

Repository

Basic Info
  • Host: GitHub
  • Owner: Sfeng666
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 5.84 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

poolWGS2SNP: a high-performance workflow to identify SNPs from pool-seq data


A bioinformatic workflow that calls SNPs from FASTQ files of raw, pool-sequenced, paired-end DNA reads.

Related publication:

Feng S, DeGrey SP, Guédot C, Schoville SD, Pool JE. 2024. Genomic Diversity Illuminates the Environmental Adaptation of Drosophila suzukii. Genome Biology and Evolution 16:evae195.

Features

  • Built and optimized for pool-sequenced data.
  • The alignment step can run hundreds or even thousands of times faster, depending on the number of threads available on your computer or cluster. Rather than processing each sample as a whole, the workflow automatically splits each FASTQ file into files of a given size, aligns them to the reference genome in parallel, and then combines the results into one aligned BAM file for downstream processing.
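
The splitting step behind the parallel alignment can be illustrated with a minimal Python sketch. It is not the workflow's own code: the function name `split_fastq`, the chunk naming scheme, and splitting by read count (rather than by file size) are assumptions made for illustration. It also assumes an uncompressed FASTQ file in which every record spans exactly four lines.

```python
import itertools
from pathlib import Path


def split_fastq(fastq_path, out_dir, reads_per_chunk):
    """Split a FASTQ file into chunk files of at most `reads_per_chunk` reads.

    Hypothetical helper: returns the chunk paths in their original order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(fastq_path) as handle:
        for index in itertools.count():
            # Each FASTQ record is assumed to span exactly four lines.
            lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{index:04d}.fastq"
            chunk_path.write_text("".join(lines))
            chunk_paths.append(chunk_path)
    return chunk_paths
```

For paired-end data, applying the same `reads_per_chunk` to both mate files keeps the chunks in sync, so chunk *i* of read 1 can be aligned together with chunk *i* of read 2 before the per-chunk alignments are merged.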

Environment

This workflow was built on DAGMan (Directed Acyclic Graph Manager), and is primarily designed to run through the HTCondor job scheduler (set up to run on UW-Madison's CHTC).
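
To give a feel for how a DAGMan description expresses dependencies between jobs, here is a small Python sketch that generates the text of a DAG input file for a split, align, merge layout. The node names (`split`, `align_0`, `merge`), the submit file names (`split.sub`, `align.sub`, `merge.sub`), and the `chunk` macro are hypothetical and do not correspond to the files shipped in this repository. Only the `JOB`, `VARS`, and `PARENT ... CHILD` keywords are standard DAGMan syntax.

```python
def build_dag(n_chunks):
    """Return DAGMan input text for one split job, n align jobs, one merge job.

    All node and submit-file names are illustrative placeholders.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    lines = ["JOB split split.sub"]
    align_jobs = [f"align_{i}" for i in range(n_chunks)]
    for i, job in enumerate(align_jobs):
        lines.append(f"JOB {job} align.sub")
        # VARS defines a per-node macro that the submit file can read as $(chunk).
        lines.append(f'VARS {job} chunk="{i}"')
    lines.append("JOB merge merge.sub")
    # Every align job waits for the split; the merge waits for every align job.
    lines.append(f"PARENT split CHILD {' '.join(align_jobs)}")
    lines.append(f"PARENT {' '.join(align_jobs)} CHILD merge")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag(3), end="")
```

A file written this way would be submitted with `condor_submit_dag`, after which HTCondor runs the independent align nodes concurrently as slots become available.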

However, the shell script for each step can still be run independently, as long as the required input is provided.

To install the software environment, you could use conda: `conda env create -n WGS_analysis --file WGS_analysis.yml`

If you have not installed conda, run the following commands:

```
# download miniconda
curl -sL \
  "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" > \
  "Miniconda3.sh"

# install miniconda
bash Miniconda3.sh
```

Input

  • a table of paths to FASTQ files containing raw paired-end (PE) whole genome sequencing (WGS) reads (example);
  • the reference genome sequence (.fasta) and a FASTA index (.fai) to allow fast access to the genome;
  • a bwa index of the reference genome sequence;
  • annotation of the reference genome in .gff and .gtf formats.
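
Before launching a run, it can help to confirm that the reference files listed above are in place. The sketch below is a hypothetical helper, not part of the workflow. It assumes the indexes sit next to the FASTA file under the default names written by `samtools faidx` (`<fasta>.fai`) and `bwa index` (`<fasta>.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). If the indexes were built with a custom prefix, the expected names would differ.

```python
from pathlib import Path

# Default suffixes appended to the FASTA name by `bwa index` (assumed layout).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_reference_files(fasta_path):
    """Return the expected reference-related files that do not exist."""
    fasta = Path(fasta_path)
    expected = [fasta, Path(str(fasta) + ".fai")]
    expected += [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not path.exists()]
```

An empty list means the FASTA file, its `.fai` index, and all five bwa index files were found.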

Output

  • a table of SNPs in VCF format with an additional column of variant annotations;
  • (optional) a report of the depth and coverage of clean mapped reads;
  • (optional) a full report on the number and proportion of reads remaining after each step of quality control and filtering, as well as the depth and coverage of clean mapped reads.
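
As a quick sanity check on the SNP table, the following sketch counts SNP records per contig in a VCF file. It is an illustrative helper rather than part of the workflow. It assumes an uncompressed, tab-separated VCF and reads only the standard fixed columns (`CHROM`, `REF`, `ALT`), so it makes no assumption about how the annotation column is laid out.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def count_snps_per_contig(vcf_path):
    """Count records whose REF and every ALT allele are single bases."""
    counts = Counter()
    with open(vcf_path) as handle:
        for line in handle:
            # Skip meta-information lines, the column header, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom = fields[0]
            ref = fields[3].upper()
            alts = fields[4].upper().split(",")
            if ref in BASES and all(alt in BASES for alt in alts):
                counts[chrom] += 1
    return counts
```

Multi-allelic sites are counted once, and records with an indel or symbolic allele are skipped.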

Workflow diagram

The analysis pipeline for calling SNPs from raw pool-seq reads. Grey shading and bold text mark the three major parts of the pipeline. Inputs and outputs are shown in elliptical boxes. Required steps are shown as rectangular boxes and arrows with solid lines. Optional steps are shown as boxes and arrows with dashed lines. Names and versions of the software used are shown in blue.

Reference

DrosEU pipeline: Kapun, M., Barrón, M. G., Staubach, F., Obbard, D. J., Wiberg, R. A. W., Vieira, J., Goubert, C., Rota-Stabelli, O., Kankare, M., Bogaerts-Márquez, M., Haudry, A., Waidele, L., Kozeretska, I., Pasyukova, E. G., Loeschcke, V., Pascual, M., Vieira, C. P., Serga, S., Montchamp-Moreau, C., … González, J. (2020). Genomic Analysis of European Drosophila melanogaster Populations Reveals Longitudinal Structure, Continent-Wide Selection, and Previously Unknown DNA Viruses. Molecular Biology and Evolution, 37(9), 2661–2678. https://doi.org/10.1093/molbev/msaa120

fastp: Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

bwa mem: Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (arXiv:1303.3997). arXiv. https://doi.org/10.48550/arXiv.1303.3997

Samtools: Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352

Picard

GATK: Auwera, G. A. V. der, & O’Connor, B. D. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. O’Reilly Media, Inc.

bamdst

PoolSNP

SnpEff: Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695


Owner

  • Name: Siyuan Feng
  • Login: Sfeng666
  • Kind: user
  • Location: Madison, WI
  • Company: University of Wisconsin-Madison

Evolutionary biologist working on population genomics & comparative transcriptomics

Citation (CITATION)

To reference poolWGS2SNP in publications, please cite:

@article {Feng2023.07.03.547576,
        author = {Siyuan Feng and Samuel P. DeGrey and Christelle Gu{\'e}dot and Sean D. Schoville and John E. Pool},
        title = {Genomic Diversity Illuminates the Species History and Environmental Adaptation of Drosophila suzukii},
        elocation-id = {2023.07.03.547576},
        year = {2023},
        doi = {10.1101/2023.07.03.547576},
        publisher = {Cold Spring Harbor Laboratory},
        URL = {https://www.biorxiv.org/content/early/2023/07/03/2023.07.03.547576},
        eprint = {https://www.biorxiv.org/content/early/2023/07/03/2023.07.03.547576.full.pdf},
        journal = {bioRxiv}
}
