ena-spike-ntd-repdel-analysis
A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. Manuscript under review.
https://github.com/pathogenomics-lab/ena-spike-ntd-repdel-analysis
README.md
ENA spike NTD repaired deletion analysis
A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. The workflow processes sequencing data retrieved from the ENA Portal API, performing quality filtering, read mapping, variant calling, and classification of the deletion repair genotype. This pipeline was developed as part of a larger study.
Results generated with this pipeline are available via DOI: 10.20350/digitalCSIC/17032. We ran Snakemake v8.25.3 with Python v3.12.7.
Workflow summary
This pipeline fetches and processes SARS-CoV-2 "read run" ENA records with a sample collection date between 1 November 2021 and 1 August 2022, filtered for NCBI taxonomy code 2697049 (SARS-CoV-2) and a Homo sapiens host. Records from the DNBseq, Element, and capillary sequencing platforms are excluded, as are records with RNAseq, transcriptomic, metagenomic, or metatranscriptomic library strategies. The following steps are then run for each resulting record:
- FASTQ retrieval via the ENA metadata FTP URLs.
- Read preprocessing and quality filtering using `fastp` v0.23.4.
- Read mapping with `minimap2` v2.28 against a BA.1 reference genome (GenBank: OZ070629.1) using the recommended presets depending on the run sequencing platform.
- Consensus genome generation with `samtools` v1.20 and `iVar` v1.4.3.
- Lineage assignment using `pangolin` v4.3.
- Variant calling with `iVar` v1.4.3, annotated with `SnpEff` v5.2 and filtered with `SnpSift` v5.2.
- Classification in three "haplotypes", based on the presence or absence of S gene deletions ΔH69/V70 and ΔV143/Y145. Alleles are encoded as insertions in HGVS nomenclature, given the reference genome:
  - `Rep_69_70`: repair of S:ΔH69/V70 (`S:p.Val67_Ile68dup` detected, `S:p.Asp140_His141insValTyrTyr` absent).
  - `Rep_143_145`: repair of S:ΔV143/Y145 (`S:p.Asp140_His141insValTyrTyr` detected, `S:p.Val67_Ile68dup` absent).
  - `Rep_Both`: repair of both deletions (`S:p.Val67_Ile68dup` and `S:p.Asp140_His141insValTyrTyr` detected).
- Data summarization and visualization using `ape` v5.8, `Rsamtools` v2.18.0, `tidyverse` v2.0.0, and `ggpubr` v0.6.0 in R v4.3.3.
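
The classification rule in the list above can be illustrated with a minimal Python sketch. It is not the code used in the workflow: the function name, the input format (a collection of S gene HGVS protein annotations without the `S:` prefix), and the `None` return value for records carrying neither allele are assumptions made for illustration.

```python
from typing import Iterable, Optional

# HGVS protein-level alleles named in the workflow summary
REPAIR_69_70 = "p.Val67_Ile68dup"
REPAIR_143_145 = "p.Asp140_His141insValTyrTyr"


def classify_haplotype(spike_variants: Iterable[str]) -> Optional[str]:
    """Return the haplotype label for a set of S gene HGVS.p annotations.

    Returns None when neither repair allele is detected (a hypothetical
    convention; the README does not name this case).
    """
    observed = set(spike_variants)
    has_69_70 = REPAIR_69_70 in observed
    has_143_145 = REPAIR_143_145 in observed
    if has_69_70 and has_143_145:
        return "Rep_Both"
    if has_69_70:
        return "Rep_69_70"
    if has_143_145:
        return "Rep_143_145"
    return None
```

For example, `classify_haplotype(["p.Val67_Ile68dup"])` returns `"Rep_69_70"`, while a record with both alleles is labeled `"Rep_Both"`.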
Usage
This repository contains a Snakemake workflow for processing sequencing data from FASTQ retrieval to classification and result summarization. The pipeline is organized in two main sections: (1) an independent, linear processing pipeline for each record, and (2) summarization tasks that aggregate results and generate reports. Because the dataset is large, a LIGHT configuration flag is available to execute only the first section of the DAG, reducing computational load. Setting this flag also enables a batcher rule that allows execution using Snakemake batches if needed.
1. Data retrieval and chunking
- `00a_run_search.sh`: queries the ENA Portal API to retrieve sequencing records.
- `00b_generate_chunks.sh`: splits survey results into manageable chunks for processing via SLURM job arrays. The dataset is divided into 16 groups, each containing up to 5000 chunks, with each chunk holding 16 records. This approach addressed Snakemake limitations when handling large DAGs at the time of execution. Chunk settings were set considering our HPC resource limits.
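
As a rough illustration of the search step, the following Python sketch builds an ENA Portal API search URL from the selection criteria given in the workflow summary. It is not a transcription of `00a_run_search.sh`. The query field names (`host_tax_id`, `collection_date`, `instrument_platform`), the platform vocabulary values, the returned fields, and the inclusive date bounds are assumptions that should be checked against the ENA Portal API documentation. The helper names are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_query(tax_id, host_tax_id, start, end, exclusions):
    """Join the selection criteria into a single ENA query string.

    `exclusions` maps a field name to the values to exclude. Field names
    and values are assumptions, not taken from the repository.
    """
    clauses = [
        f"tax_eq({tax_id})",
        f"host_tax_id={host_tax_id}",
        f"collection_date>={start}",
        f"collection_date<={end}",
    ]
    for field, values in exclusions.items():
        clauses.extend(f'{field}!="{value}"' for value in values)
    return " AND ".join(clauses)


def build_search_url(query, fields):
    """Return a URL that requests matching "read run" records as TSV."""
    params = {
        "result": "read_run",
        "query": query,
        "fields": ",".join(fields),
        "format": "tsv",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


# SARS-CoV-2 (2697049) reads from a Homo sapiens host (NCBI taxonomy 9606).
# Library strategy exclusions could be added to the same dictionary.
EXAMPLE_QUERY = build_query(
    2697049, 9606, "2021-11-01", "2022-08-01",
    {"instrument_platform": ["DNBSEQ", "ELEMENT", "CAPILLARY"]},
)
EXAMPLE_URL = build_search_url(
    EXAMPLE_QUERY, ["run_accession", "fastq_ftp", "instrument_platform"]
)
```

Keeping the exclusions in a dictionary makes it easy to adjust the filter without editing the query template.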
2. Haplotype classification
- `01_run_haplotypes_array_chunked.sh <group number>`: runs the analysis for a specified chunk group. Each execution launches up to 5000 SLURM jobs through a job array. This step must be run for each group. For the final manuscript, only a subset of groups was analyzed. This step is executed with `LIGHT=True`. Most parameters can be tweaked via the Snakemake `config.yaml` file.
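
The chunk layout from section 1 (16 groups, up to 5000 chunks per group, 16 records per chunk) and its relation to the job arrays can be sketched in Python as follows. This is an illustration under assumptions rather than the logic of the shell scripts: the function names are hypothetical, and the zero-based mapping of a job array index to a chunk within its group is assumed.

```python
from typing import List, Sequence

RECORDS_PER_CHUNK = 16
CHUNKS_PER_GROUP = 5000
N_GROUPS = 16

# Upper bound on records covered by this layout: 16 * 5000 * 16
MAX_RECORDS = N_GROUPS * CHUNKS_PER_GROUP * RECORDS_PER_CHUNK


def make_groups(
    records: Sequence[str],
    records_per_chunk: int = RECORDS_PER_CHUNK,
    chunks_per_group: int = CHUNKS_PER_GROUP,
) -> List[List[List[str]]]:
    """Split records into chunks, then bundle consecutive chunks into groups."""
    chunks = [
        list(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]
    return [
        chunks[i:i + chunks_per_group]
        for i in range(0, len(chunks), chunks_per_group)
    ]


def select_chunk(groups, group_number: int, array_task_id: int) -> List[str]:
    """Return the records one array job would process (zero-based, assumed)."""
    return groups[group_number][array_task_id]
```

With these settings the layout covers at most 1,280,000 records, and each job array task handles a single chunk of up to 16 records.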
3. Summary and reporting
- `02_run_complete.sh`: executes the full workflow to generate summary tables, visualizations, and reports. This step is executed with `LIGHT=False`. Given computational constraints, the final manuscript analysis used `scripts/summarize_results.py` instead, which parses result files to produce a summary table with key measurements. Data visualizations were created manually using the same code integrated into the workflow.
Citation
Álvarez-Herrera M, Ruiz-Rodriguez P, Navarro-Domínguez B, Zulaica J, Grau B, Bracho MA, Guerreiro M, Aguilar-Gallardo C, González-Candelas F, Comas I, Geller R & Coscollá M (2025). Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein. Virus Evolution, 11(1), veaf015. https://doi.org/10.1093/ve/veaf015
See also CITATION.cff.
Owner
- Name: PathoGenOmics Lab
- Login: PathoGenOmics-Lab
- Kind: organization
- Location: Spain
- Website: https://www.uv.es/pathogenomic
- Twitter: gen_UV
- Repositories: 1
- Profile: https://github.com/PathoGenOmics-Lab
Citation (CITATION.cff)
cff-version: 1.2.0
message: "This pipeline was developed as part of a larger study. If you use this software, please cite it as below."
authors:
  - family-names: "Álvarez-Herrera"
    given-names: "Miguel"
    orcid: "https://orcid.org/0000-0002-7922-3180"
  - family-names: "Ruiz-Rodriguez"
    given-names: "Paula"
    orcid: "https://orcid.org/0000-0003-0727-5974"
title: "ENA spike NTD repaired deletion analysis"
version: 1.0.0
date-released: 2025-01-30
url: "https://github.com/PathoGenOmics-Lab/ena-spike-ntd-repdel-analysis"
preferred-citation:
  type: article
  authors:
    - family-names: "Álvarez-Herrera"
      given-names: "Miguel"
    - family-names: "Ruiz-Rodriguez"
      given-names: "Paula"
    - family-names: "Navarro-Domínguez"
      given-names: "Beatriz"
    - family-names: "Zulaica"
      given-names: "Joao"
    - family-names: "Grau"
      given-names: "Brayan"
    - family-names: "Bracho"
      given-names: "María Alma"
    - family-names: "Guerreiro"
      given-names: "Manuel"
    - family-names: "Aguilar-Gallardo"
      given-names: "Cristóbal"
    - family-names: "González-Candelas"
      given-names: "Fernando"
    - family-names: "Comas"
      given-names: "Iñaki"
    - family-names: "Geller"
      given-names: "Ron"
    - family-names: "Coscollá"
      given-names: "Mireia"
  doi: "10.1093/ve/veaf015"
  journal: "Virus Evolution"
  month: 3
  pages: veaf015
  title: "Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein"
  issue: 1
  volume: 11
  year: 2025