ena-spike-ntd-repdel-analysis

A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. Manuscript under review.

https://github.com/pathogenomics-lab/ena-spike-ntd-repdel-analysis

Keywords

bioinformatics pipeline sars-cov-2 snakemake

Repository

Basic Info

Host: GitHub
Owner: PathoGenOmics-Lab
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 266 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 1

Created over 1 year ago · Last pushed 12 months ago

ENA spike NTD repaired deletion analysis

A Snakemake workflow with associated scripts used for detecting spike NTD repaired deletions in SARS-CoV-2 Omicron BA.1 lineage reads. The workflow processes sequencing data retrieved from the ENA Portal API, performing quality filtering, read mapping, variant calling, and classification of the deletion repair genotype. This pipeline was developed as part of a larger study.

Results generated with this pipeline are available via DOI: 10.20350/digitalCSIC/17032. We ran Snakemake v8.25.3 with Python v3.12.7.

Workflow summary

This pipeline fetches and processes SARS-CoV-2 "read run" records from ENA with a sample collection date between 1 November 2021 and 1 August 2022. Records are restricted to NCBI taxonomy code 2697049 (SARS-CoV-2) and a Homo sapiens host. Records from the DNBseq, Element, and capillary sequencing platforms are excluded, as are RNAseq, transcriptomic, metagenomic, and metatranscriptomic library strategies. The following steps are then run for each remaining record:
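
The selection criteria above can be sketched as a local filter over metadata records. This is a minimal illustration, not the repository's search script: the dictionary keys (`collection_date`, `tax_id`, `host_tax_id`, `instrument_platform`, `library_strategy`, `library_source`), the spellings of the excluded values, and the inclusive date bounds are all assumptions that should be checked against the actual ENA metadata.

```python
from datetime import date

# Selection window and identifiers stated in the workflow summary.
START, END = date(2021, 11, 1), date(2022, 8, 1)  # bounds assumed inclusive
SARS_COV_2_TAXID = "2697049"
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens

# Assumed spellings of the excluded values; verify against real metadata.
EXCLUDED_PLATFORMS = {"DNBSEQ", "ELEMENT", "CAPILLARY"}
EXCLUDED_LIBRARIES = {"RNA-SEQ", "TRANSCRIPTOMIC", "METAGENOMIC", "METATRANSCRIPTOMIC"}


def keep_record(record: dict) -> bool:
    """Return True if a metadata record (hypothetical layout) passes the criteria."""
    try:
        collected = date.fromisoformat(record["collection_date"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or partial dates cannot be placed in the window
    if not (START <= collected <= END):
        return False
    if record.get("tax_id") != SARS_COV_2_TAXID:
        return False
    if record.get("host_tax_id") != HUMAN_TAXID:
        return False
    if record.get("instrument_platform", "").upper() in EXCLUDED_PLATFORMS:
        return False
    # Library strategy and source are checked together against the exclusions.
    library = {
        record.get("library_strategy", "").upper(),
        record.get("library_source", "").upper(),
    }
    return not (library & EXCLUDED_LIBRARIES)
```

In practice the same constraints would more likely be expressed in the API query itself, so that only matching records are downloaded.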

1. FASTQ retrieval via the ENA metadata FTP URLs.
2. Read preprocessing and quality filtering using fastp v0.23.4.
3. Read mapping with minimap2 v2.28 against a BA.1 reference genome (GenBank: OZ070629.1) using the recommended presets depending on the run sequencing platform.
4. Consensus genome generation with samtools v1.20 and iVar v1.4.3.
5. Lineage assignment using pangolin v4.3.
6. Variant calling with iVar v1.4.3, annotated with SnpEff v5.2 and filtered with SnpSift v5.2.
7. Classification into three "haplotypes", based on the presence or absence of S gene deletions ΔH69/V70 and ΔV143/Y145. Alleles are encoded as insertions in HGVS nomenclature, given the reference genome:
   - Rep_69_70: repair of S:ΔH69/V70 (S:p.Val67_Ile68dup detected, S:p.Asp140_His141insValTyrTyr absent).
   - Rep_143_145: repair of S:ΔV143/Y145 (S:p.Asp140_His141insValTyrTyr detected, S:p.Val67_Ile68dup absent).
   - Rep_Both: repair of both deletions (S:p.Val67_Ile68dup and S:p.Asp140_His141insValTyrTyr detected).
8. Data summarization and visualization using ape v5.8, Rsamtools v2.18.0, tidyverse v2.0.0, and ggpubr v0.6.0 in R v4.3.3.
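
The haplotype rules in the classification step can be written as a small decision function. The sketch below assumes the S gene annotations of a sample are available as a collection of HGVS protein strings; the function name and the `None` return value for samples without either repair marker are illustrative choices, since the source only names the three repaired classes.

```python
# HGVS protein-level markers as written in the classification rules.
REPAIR_69_70 = "p.Val67_Ile68dup"
REPAIR_143_145 = "p.Asp140_His141insValTyrTyr"


def classify_haplotype(spike_variants):
    """Map the S gene HGVS.p annotations of one sample to a haplotype label.

    Returns None when neither repair marker is present (an assumption:
    the source does not name a class for that case).
    """
    variants = set(spike_variants)
    has_69_70 = REPAIR_69_70 in variants
    has_143_145 = REPAIR_143_145 in variants
    if has_69_70 and has_143_145:
        return "Rep_Both"
    if has_69_70:
        return "Rep_69_70"
    if has_143_145:
        return "Rep_143_145"
    return None
```

For example, `classify_haplotype(["p.Val67_Ile68dup"])` yields `"Rep_69_70"`, while passing both markers yields `"Rep_Both"`.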

Usage

This repository contains a Snakemake workflow for processing sequencing data from FASTQ retrieval to classification and result summarization. The pipeline is organized into two main sections: (1) an independent, linear processing pipeline for each record, and (2) summarization tasks that aggregate results and generate reports. Due to the large dataset size, a LIGHT configuration flag is available to execute only the first section of the DAG, reducing computational load. Setting the flag also enables a batcher rule that allows execution in Snakemake batches if needed.

1. Data retrieval and chunking

00a_run_search.sh: queries the ENA Portal API to retrieve sequencing records.
00b_generate_chunks.sh: splits survey results into manageable chunks for processing via SLURM job arrays. The dataset is divided into 16 groups, each containing up to 5000 chunks, with each chunk holding 16 records. This approach worked around limitations that Snakemake had with large DAGs at the time of execution. The chunk settings were chosen to fit our HPC resource limits.
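
The chunk layout implies an upper bound of 16 × 5000 × 16 = 1,280,000 records. The sketch below works through that arithmetic and maps a record index to its group, chunk, and position. It assumes chunks are filled sequentially and uses zero-based indices; the actual script may number or order records differently.

```python
N_GROUPS = 16            # chunk groups, one SLURM job array each
CHUNKS_PER_GROUP = 5000  # maximum chunks (array tasks) per group
RECORDS_PER_CHUNK = 16   # records held by one chunk

# Maximum number of records the layout can hold: 1,280,000.
CAPACITY = N_GROUPS * CHUNKS_PER_GROUP * RECORDS_PER_CHUNK


def locate(record_index: int) -> tuple[int, int, int]:
    """Return zero-based (group, chunk, position) for a zero-based record index."""
    if not 0 <= record_index < CAPACITY:
        raise ValueError(f"record index {record_index} outside 0..{CAPACITY - 1}")
    # First split off the position inside the chunk, then the chunk inside the group.
    chunk_global, position = divmod(record_index, RECORDS_PER_CHUNK)
    group, chunk = divmod(chunk_global, CHUNKS_PER_GROUP)
    return group, chunk, position
```

Under these assumptions, each group covers 80,000 records, so record 80,000 is the first record of the second group.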

2. Haplotype classification

01_run_haplotypes_array_chunked.sh <group number>: runs the analysis for a specified chunk group. Each execution launches up to 5000 SLURM jobs through a job array. This step must be run for each group. For the final manuscript, only a subset of groups was analyzed. This step is executed with LIGHT=True. Most parameters can be tweaked via the Snakemake config.yaml file.
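
As a rough illustration of how one job array task might launch the first section of the workflow, the sketch below assembles a Snakemake command line with `LIGHT=True`. Only the `LIGHT` flag comes from this README; the chunk file layout and the `CHUNK_FILE` config key are hypothetical placeholders, and the repository's shell script may work quite differently.

```python
import os


def task_id_from_env(default: int = 1) -> int:
    """Read the array task index that SLURM exports as SLURM_ARRAY_TASK_ID."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def build_snakemake_command(group: int, task_id: int, cores: int = 1) -> list[str]:
    """Build the argument list for one chunk of one group (illustrative only)."""
    # Hypothetical chunk file layout; the repository may organize chunks differently.
    chunk_file = f"chunks/group_{group}/chunk_{task_id}.txt"
    return [
        "snakemake",
        "--cores", str(cores),
        # LIGHT=True restricts execution to the per-record section of the DAG.
        # CHUNK_FILE is a made-up config key used here only for illustration.
        "--config", "LIGHT=True", f"CHUNK_FILE={chunk_file}",
    ]
```

A task could then run something like `subprocess.run(build_snakemake_command(3, task_id_from_env()), check=True)`, one invocation per array index.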

3. Summary and reporting

02_run_complete.sh: executes the full workflow to generate summary tables, visualizations, and reports. This step is executed with LIGHT=False. Given computational constraints, the final manuscript analysis used scripts/summarize_results.py instead. That script parses the result files to produce a summary table with key measurements. Data visualizations were then created manually, using the same code that is integrated into the workflow.

Citation

Álvarez-Herrera M, Ruiz-Rodriguez P, Navarro-Domínguez B, Zulaica J, Grau B, Bracho MA, Guerreiro M, Aguilar-Gallardo C, González-Candelas F, Comas I, Geller R & Coscollá M (2025). Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein. Virus Evolution, 11(1), veaf015. https://doi.org/10.1093/ve/veaf015

Owner

Name: PathoGenOmics Lab
Login: PathoGenOmics-Lab
Kind: organization
Location: Spain

Website: https://www.uv.es/pathogenomic
Twitter: gen_UV
Repositories: 1
Profile: https://github.com/PathoGenOmics-Lab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "This pipeline was developed as part of a larger study. If you use this software, please cite it as below."
authors:
  - family-names: "Álvarez-Herrera"
    given-names: "Miguel"
    orcid: "https://orcid.org/0000-0002-7922-3180"
  - family-names: "Ruiz-Rodriguez"
    given-names: "Paula"
    orcid: "https://orcid.org/0000-0003-0727-5974"
title: "ENA spike NTD repaired deletion analysis"
version: 1.0.0
date-released: 2025-01-30
url: "https://github.com/PathoGenOmics-Lab/ena-spike-ntd-repdel-analysis"
preferred-citation:
  type: article
  authors:
  - family-names: "Álvarez-Herrera"
    given-names: "Miguel"
  - family-names: "Ruiz-Rodriguez"
    given-names: "Paula"
  - family-names: "Navarro-Domínguez"
    given-names: "Beatriz"
  - family-names: "Zulaica"
    given-names: "Joao"
  - family-names: "Grau"
    given-names: "Brayan"
  - family-names: "Bracho"
    given-names: "María Alma"
  - family-names: "Guerreiro"
    given-names: "Manuel"
  - family-names: "Aguilar-Gallardo"
    given-names: "Cristóbal"
  - family-names: "González-Candelas"
    given-names: "Fernando"
  - family-names: "Comas"
    given-names: "Iñaki"
  - family-names: "Geller"
    given-names: "Ron"
  - family-names: "Coscollá"
    given-names: "Mireia"
  doi: "10.1093/ve/veaf015"
  journal: "Virus Evolution"
  month: 3
  pages: veaf015
  title: "Genome data artifacts and functional studies of deletion repair in the BA.1 SARS-CoV-2 spike protein"
  issue: 1
  volume: 11
  year: 2025
