covid19

SARS-CoV-2 analysis pipeline for short-read, paired-end Illumina sequencing

https://github.com/tobiasrausch/covid19

Keywords

consensus covid19 covid19-analysis sars-cov-2 variant-calling whole-genome-sequencing

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: tobiasrausch
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage:
Size: 324 MB

Statistics

Stars: 6
Watchers: 3
Forks: 1
Open Issues: 1
Releases: 5

Created about 5 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Zenodo

SARS-CoV-2 data analysis

SARS-CoV-2 analysis pipeline for short-read, paired-end sequencing.

Installation

The repository includes a Makefile that installs all dependencies using bioconda.

git clone --recursive https://github.com/tobiasrausch/covid19.git

cd covid19

make all

Preparing the reference databases and indexes

A script downloads and indexes the SARS-CoV-2 and GRCh38 reference sequences.

cd ref/ && ./prepareREF.sh

A second script prepares the kraken2 human database, which is used to filter out host reads.

cd kraken2/ && ./prepareDB.sh

Running the data analysis pipeline

The run script performs adapter trimming, host read removal, alignment, variant calling and annotation, consensus calling, and quality control. The last parameter, unique_sample_id, names a unique output directory that is created in the current working directory.

./src/run.sh <read.1.fq.gz> <read.2.fq.gz> <unique_sample_id>

Output

The main output files are:

The adapter-trimmed and host-filtered FASTQ files: ls <unique_sample_id>.filtered.R_[12].fq.gz
The alignment to SARS-CoV-2: ls <unique_sample_id>.srt.bam
The consensus sequence: ls <unique_sample_id>.cons.fa
The annotated variants: ls <unique_sample_id>.variants.tsv
The assigned lineage: ls <unique_sample_id>.lineage.csv
The summary QC report: ls <unique_sample_id>.qc.summary
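As a quick completeness check after a run, the file names listed above can be turned into a small lookup. This is only a sketch: the suffixes are copied from the list, but the layout `<unique_sample_id>/<unique_sample_id>.<suffix>` is an assumption inferred from the `*/*.qc.summary` pattern used for aggregation, and the helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Suffixes copied from the output list above. The directory layout
# (<sample_id>/<sample_id>.<suffix>) is an assumption, not confirmed by the README.
OUTPUT_SUFFIXES = {
    "filtered_r1": "filtered.R_1.fq.gz",
    "filtered_r2": "filtered.R_2.fq.gz",
    "alignment": "srt.bam",
    "consensus": "cons.fa",
    "variants": "variants.tsv",
    "lineage": "lineage.csv",
    "qc_summary": "qc.summary",
}


def expected_outputs(sample_id, workdir="."):
    """Map each output kind to the path where it is assumed to be written."""
    sample_dir = Path(workdir) / sample_id
    return {
        kind: sample_dir / f"{sample_id}.{suffix}"
        for kind, suffix in OUTPUT_SUFFIXES.items()
    }


def missing_outputs(sample_id, workdir="."):
    """Return the sorted output kinds whose file does not exist on disk."""
    return sorted(
        kind
        for kind, path in expected_outputs(sample_id, workdir).items()
        if not path.is_file()
    )
```

Calling `missing_outputs("sampleA")` from the directory where the pipeline was started would then list any main output that was not produced for the hypothetical sample `sampleA`.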

Aggregating results

The pipeline above generates a report for every sample. Because samples are processed independently, it can be parallelized trivially at the sample level. You can then aggregate all the QC information and the lineage and clade assignments using

./src/aggregate.sh outtable */*.qc.summary
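The sample-level parallelization mentioned above could be driven from Python roughly as follows. This is a minimal sketch, assuming the `./src/run.sh <read.1.fq.gz> <read.2.fq.gz> <unique_sample_id>` call shown earlier; `build_command`, `run_samples`, the `runner` parameter and the worker count are hypothetical names chosen for illustration, not part of the repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(read1, read2, sample_id, run_script="./src/run.sh"):
    """Build the argument list for one pipeline run (argument order as in the README)."""
    return [run_script, str(read1), str(read2), sample_id]


def run_samples(samples, max_workers=4, runner=subprocess.run):
    """Run the pipeline for several samples concurrently.

    samples: iterable of (read1, read2, sample_id) tuples.
    runner:  callable that executes a command and returns an object with a
             `returncode` attribute (subprocess.run by default).
    Returns a dict mapping sample_id to the exit code of its run.
    """

    def run_one(sample):
        read1, read2, sample_id = sample
        result = runner(build_command(read1, read2, sample_id))
        return sample_id, result.returncode

    # Threads are sufficient here: each worker only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, samples))
```

A non-zero exit code in the returned dictionary flags a sample whose run failed. Once all samples have finished, the aggregation command above can be run over the resulting directories.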

Estimating cross-contamination

You can estimate cross-contamination based on the allelic frequencies of variant calls using

./src/crosscontam.sh contam */*.bcf

This works best on good-quality consensus sequences, i.e.:

./src/crosscontam.sh contam `grep "RKI pass" */*.qc.summary | sed 's/.qc.summary.*$/.bcf/' | tr '\n' ' '`
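The selection step in the command above (keep samples whose QC summary contains "RKI pass", then switch the file suffix to `.bcf`) can also be sketched in Python. The function name `passing_bcfs` is hypothetical, and the only thing assumed about the QC summary format is that passing samples contain the literal text "RKI pass", as in the grep pattern. Unlike grep, each file is reported at most once.

```python
from pathlib import Path


def passing_bcfs(workdir=".", marker="RKI pass"):
    """Return the .bcf paths of samples whose QC summary contains `marker`.

    Mirrors `grep "RKI pass" */*.qc.summary | sed 's/.qc.summary.*$/.bcf/'`:
    every matching <dir>/<id>.qc.summary is mapped to <dir>/<id>.bcf.
    """
    suffix = ".qc.summary"
    bcfs = []
    for qc_file in sorted(Path(workdir).glob("*/*" + suffix)):
        if marker in qc_file.read_text():
            # Replace the trailing ".qc.summary" with ".bcf" in the same directory.
            bcfs.append(qc_file.with_name(qc_file.name[: -len(suffix)] + ".bcf"))
    return bcfs
```

The returned paths can be joined with spaces and passed to `./src/crosscontam.sh contam` in place of the backtick expression.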

Example

The repository contains an example script using a COG-UK data set.

cd example/ && ./expl.sh

Citation

Evolution of SARS-CoV-2 in the Rhine-Neckar/Heidelberg Region 01/2021 - 07/2023. Infect Genet Evol. 2024 Feb 23:119:105577. DOI: 10.1016/j.meegid.2024.105577

Credits

Many thanks to COG-UK for their commitment to open science; their data sets in ENA were very useful for developing the code. The workflow uses many tools distributed via bioconda. Please see the Makefile for all the dependencies, and of course, thanks to all the developers.

Owner

Name: Tobias Rausch
Login: tobiasrausch
Kind: user
Location: Germany
Company: EMBL

Website: tobiasrausch.com
Twitter: tobias_757
Repositories: 7
Profile: https://github.com/tobiasrausch

Researcher in Computational Genomics

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 3
Total pull requests: 0
Average time to close issues: 4 months
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 0
Average comments per issue: 0.67
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
