https://github.com/cbg-ethz/lollipop

Deconvolution for Wastewater Genomics


Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    2 of 4 committers (50.0%) from academic institutions
  • Institutional organization owner
    Organization cbg-ethz has institutional domain (www.bsse.ethz.ch)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords from Contributors

bioinformatics ngs sars-cov-2
Last synced: 7 months ago · JSON representation

Repository

Deconvolution for Wastewater Genomics

Basic Info
  • Host: GitHub
  • Owner: cbg-ethz
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 29.4 MB
Statistics
  • Stars: 3
  • Watchers: 8
  • Forks: 1
  • Open Issues: 7
  • Releases: 8
Created over 3 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

LolliPop


LolliPop - a tool for Deconvolution for Wastewater Genomics

The LolliPop tool is part of the V-pipe workflow for analysing NGS data of short viral genomes.

Description

Wastewater-based monitoring has become an increasingly important source of information on the spread of SARS-CoV-2 variants since clinical tests are declining and may eventually disappear.

LolliPop was developed to improve wastewater-based genomic surveillance as the number of variants of concern increased, and to account for mutations shared among variants. It relies on a kernel-based deconvolution and leverages the time-series nature of the samples. This approach makes it possible to generate higher-confidence relative abundance curves despite the very high noise and overdispersion present in wastewater samples.

Together with COJAC, it has been integrated into V-pipe, a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. These tools now form the basis of the SARS-CoV-2 wastewater genomic surveillance commissioned by the Swiss Federal Office of Public Health, a cornerstone of the COVID-19 pandemic surveillance in Switzerland. This surveillance covers daily samples at ten wastewater treatment plants across Switzerland from February 2021 onward, and delivers weekly updates of the variants' relative abundance curves.

Usage

Notebooks

LolliPop provides several classes that can be imported in Jupyter notebooks:

```python
from lollipop import *
```

See notebook WwSmoothingKernel.ipynb in directory preprint/

Command line

Here are the available command-line tools:

| command | purpose |
| :--------------------------- | :------ |
| lollipop generate-mutlist | Generate the mutlist used when looking for variant using variant signatures |
| lollipop getmutations from-basecount | Search a single sample for mutations and retrieve frequency from a TSV table of per-position base counts produced by V-pipe |
| lollipop deconvolute | Run the deconvolution on a timeline of mutations |

Use option -h / --help to see available command-line options:

```console
$ lollipop generate-mutlist --help
Usage: lollipop generate-mutlist [OPTIONS] VOC_YAML

  Generate the mutlist used when looking for variant using variant signatures

Options:
  -o, --output, --out TSV         Write results to this output TSV instead of
                                  'mutlist.tsv'
  -p, --out-pangovars, --output-variants-pangolin YAML
                                  Write a YAML mapping shortnames/columnnames
                                  to the Pangolineages (useful to make the
                                  'variants_pangolin' section of deconvolute's
                                  input configuration)
  -g, --genes GFF                 Add 'gene' column to table
  -d, --voc-dir PATH              Scan directory for additional voc YAML files
  -v, --verbose / -V, --no-verbose
                                  Verbose (dumps table on terminal)
  -h, --help                      Show this message and exit.
```

```console
$ lollipop getmutations from-basecount --help
Usage: lollipop getmutations from-basecount [OPTIONS] BASECOUNT

  Search mutations and retrieve frequency from a TSV table produced by V-pipe

Options:
  -o, --outname, --output PATH    Filename of the final output table. If not
                                  provided, it defaults to mutations.txt
  -m, --muttable, --mutationtable PATH
                                  Mutations helper table
  -a, --based INTEGER             Are the positions in the tsv 0-based or
                                  1-based?
  Argument used for simple concatenation:
                                  These options allows subsequently building
                                  simply by concatenation (using xsv, or even
                                  tail & head)
    -l, --location TEXT           Location of this sample
    -d, --date TEXT               Date of this sample
  Argument use for V-pipe integration:
                                  These options help tracking output to the
                                  2-level samples structure used by V-pipe
    -s, --sample, --samplename TEXT
                                  'samplename' as found in the first column of
                                  the V-pipe samples.tsv
    -b, --batch TEXT              'batch'/'date' as in the second column of
                                  the V-pipe samples.tsv
  -h, --help                      Show this message and exit.
```

```console
$ lollipop deconvolute --help
Usage: lollipop deconvolute [OPTIONS] TALLY_TSV

  Deconvolution for Wastewater Genomics

Options:
  -o, --output CSV                Write results to this output CSV instead of
                                  'deconvolved.csv'
  -C, --fmt-columns               Change output CSV format to one column per
                                  variant (normally, variants are each on a
                                  separate line)
  --out-json, --oj JSON           Also write a JSON results for upload to Cov-
                                  spectrum, etc.
  -c, --variants-config, --var YAML
                                  Variants configuration used during
                                  deconvolution  [required]
  --variants-dates, --vd YAML     Variants to scan per periods (as determined
                                  with cojac)
  -k, --deconv-config, --dec YAML
                                  Configuration of parameters for kernel
                                  deconvolution  [required]
  -l, --loc, --location, --wwtp, --catchment NAME
                                  Name(s) of location/wastewater treatment
                                  plant/catchment area to process
  -fl, --filters YAML             List of filters for removing problematic
                                  mutations from tally
  -s, --seed SEED                 Seed the random generator
  -n, --n-cores N                 Cores for parallel processing for multiple
                                  locations, defaults to 1 for sequential
                                  processing
  -nf, --namefield COLUMN         column to use as 'names' for the entries in
                                  tally table. By default, if 'pos' and 'base'
                                  exist a column 'mutations' will be created
                                  and used as name.
  -h, --help                      Show this message and exit.
```

Howto

Input data requirements

Analysis can be performed on virus samples sequenced with most tiled multiplexed PCR amplification protocols. Coverage across the whole genome of the virus increases the chance that some variant-specific mutations are picked up, which increases confidence even if dropouts occur in other regions of the genome (e.g., dropouts on the fragment carrying the binding domain).

Sampling dates are important information to keep track of because LolliPop leverages time series.

Mutations lists

The analysis uses variant description YAMLs that list the mutations to be searched -- the same YAMLs as used by COJAC. COJAC's command cojac sig-generate can help generate exhaustive lists from requests on Cov-Spectrum or from TSV files of Covariants.org, and cojac phe2cojac can import ready-made, manually curated lists from the YAMLs available at PHE Genomic's Standardised Variant Definitions.

Generate a list of mutations to be searched:

```bash
lollipop generate-mutlist --output mutlist.tsv \
    --out-pangovars variants_pangolin.yaml \
    --genes Genes_NC_045512.2.GFF3 -- \
    vocs/delta_mutations_full.yaml vocs/omicron_ba1_mutations_full.yaml vocs/omicron_ba2_mutations_full.yaml
```

  • Annotating the list with a GFF file is optional: LolliPop's deconvolution does not use gene information, but it can be useful for downstream visualizations.
  • --out-pangovars writes a table mapping short names back to full Pangolineages. It can help when writing a variants config, or be used in lieu of one.

Search mutations in a single sample

basecount table

By default, LolliPop searches for the mutations in a basecount TSV, a table that gives the per-position coverage of each base (A, T, C, G) and of deletions. V-pipe generates such a TSV using the aln2basecnt command of smallgenomeutilities; you can use it in your own workflow when starting from alignments:

```bash
aln2basecnt --first 1 --basecnt sample1.basecnt.tsv.gz --coverage sample1.coverage.tsv.gz --name "sample1" sample1.bam
```

  • --first is used to specify if the positions in the TSV are 1-based (like samtools) or 0-based (like pysam).

Then, search this TSV file for the mutations from the list generated above:

```bash
lollipop getmutations from-basecount --based 1 --output sample1.mut.tsv --location "main plant" --date "2023-02-27" -m mutlist.tsv -- sample1.basecnt.tsv.gz
```

  • The options --location and --date are a straightforward way to add the time series information for each sample.
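As a rough illustration of what retrieving a frequency from per-position base counts involves, the sketch below computes the fraction of reads supporting a given base at a given position. The in-memory table layout, the counts and the function name are assumptions made for this example; they are not the V-pipe basecount format or LolliPop's implementation.

```python
# Hypothetical stand-in for a basecount table:
# position (1-based) -> number of reads supporting each base ('-' = deletion).
basecounts = {
    23403: {"A": 12, "C": 0, "G": 36, "T": 0, "-": 0},
    23604: {"A": 0, "C": 50, "G": 0, "T": 0, "-": 0},
    28881: {"A": 0, "C": 0, "G": 0, "T": 0, "-": 0},
}


def mutation_frequency(counts, position, base, based=1):
    """Return (frequency, coverage) for `base` at `position`.

    `based` says whether `position` is 0-based or 1-based; the table
    itself is assumed to use 1-based keys in this sketch.
    """
    key = position + (1 - based)  # shift 0-based queries onto 1-based keys
    per_base = counts.get(key, {})
    coverage = sum(per_base.values())
    if coverage == 0:
        # No reads at this position: the frequency is unknown, not zero.
        return None, 0
    return per_base.get(base, 0) / coverage, coverage


freq, cov = mutation_frequency(basecounts, 23403, "G")
print(f"23403G: frequency {freq:.2f} at coverage {cov}")
```

Returning no frequency when coverage is zero keeps a drop-out distinguishable from a mutation that was looked for and not found.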

VCF and coverage

(a future version of LolliPop will be extended to support VCFs and coverage TSV as a more standard input)

Combine the time series

Once the above step has been run on every single sample of the cohort, combine all individual samples into a single heatmap-like object tracking the mutations over time across all samples. This can be done by concatenating all the per-sample mutation TSVs with a tool such as xsv:

```bash
xsv cat rows --output tallymut.tsv sample*.mut.tsv
```

  • If you have not tagged each individual sample with --location and --date, now would be a good time to add the extra columns to tallymut.tsv, e.g., with a join operation.
  • Note that this file can get quite large. It is possible to compress it on the fly: … | xsv fmt --out-delimiter '\t' | gzip > tallymut.tsv.gz

Note: The xsv utility is not included with LolliPop. You can install it separately using either of the following methods:

  • With Homebrew: brew install xsv
  • With conda: conda install -c conda-forge xsv
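If xsv is not available, the same concatenation can be approximated with the Python standard library. The sketch below is only an illustration: the column names (pos, base, frac, date) and the helper name concat_tsv are assumptions, and in-memory buffers stand in for the per-sample files and the compressed output.

```python
import csv
import gzip
import io


def concat_tsv(tables, out_handle):
    """Concatenate TSV tables that share a header, writing the header once."""
    writer = None
    n_rows = 0
    for handle in tables:
        reader = csv.DictReader(handle, delimiter="\t")
        if writer is None and reader.fieldnames:
            writer = csv.DictWriter(
                out_handle,
                fieldnames=reader.fieldnames,
                delimiter="\t",
                lineterminator="\n",
            )
            writer.writeheader()
        for row in reader:
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Two made-up per-sample tables; real ones would be opened from disk.
sample1 = io.StringIO("pos\tbase\tfrac\tdate\n23403\tG\t0.75\t2023-02-27\n")
sample2 = io.StringIO("pos\tbase\tfrac\tdate\n23403\tG\t0.80\t2023-02-28\n")

# Compress on the fly, here into memory instead of tallymut.tsv.gz.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", newline="") as out:
    n_rows = concat_tsv([sample1, sample2], out)

print(n_rows, "rows,", len(buffer.getvalue()), "compressed bytes")
```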

Run the deconvolution

The deconvolution can now be run on this table.

Kernel deconvolution config

Various aspects of the kernel-based deconvolution can be set with a YAML file: type of kernel (box vs Gaussian) and its parameters (such as bandwidth), regressor used, using bootstrapping to generate confidence value, estimating confidence intervals with Wald, computing the estimates on a logit scale, etc.

Various presets are available in the presets/ subdirectory.

For example:

```yaml
kernel: 'gaussian'
kernel_params:
  bandwidth: 10

regressor: 'robust'

deconv_params:
  min_tol: 1e-3
```
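To make the role of the kernel type and its bandwidth more concrete, here is a minimal sketch of kernel-weighted smoothing over a time series: each observation is weighted by how close its day is to the day being estimated. This is a generic illustration with made-up numbers, and a plain weighted average is swapped in where a regression would be used; it is not LolliPop's deconvolution code, and all function names are hypothetical.

```python
import math


def gaussian_weight(delta_days, bandwidth):
    """Gaussian kernel: the weight decays smoothly with distance in time."""
    return math.exp(-0.5 * (delta_days / bandwidth) ** 2)


def box_weight(delta_days, bandwidth):
    """Box kernel: full weight inside the window, none outside."""
    return 1.0 if abs(delta_days) <= bandwidth else 0.0


def smooth(days, values, at_day, kernel, bandwidth):
    """Kernel-weighted average of `values` around `at_day`."""
    weights = [kernel(day - at_day, bandwidth) for day in days]
    total = sum(weights)
    if total == 0:
        return None  # no observation close enough to estimate anything
    return sum(w * v for w, v in zip(weights, values)) / total


days = [0, 5, 10, 15, 40]
freqs = [0.10, 0.20, 0.15, 0.30, 0.90]  # noisy observed frequencies (made up)

print("box:     ", smooth(days, freqs, at_day=10, kernel=box_weight, bandwidth=10))
print("gaussian:", smooth(days, freqs, at_day=10, kernel=gaussian_weight, bandwidth=10))
```

With the box kernel, the observation at day 40 is ignored entirely when estimating day 10; with the Gaussian kernel it still contributes, but with a very small weight.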

Variants configuration

This file controls the data set that the deconvolution runs on. At minimum, it should have a section mapping the short names back to full Pangolineages. This section can be copied from the file generated with --out-pangovars in the first step (or that file can be reused as-is).

It can also optionally specify time limits (start_date and/or end_date), the subset of variants (variants_list) or locations (locations_list) to run the deconvolution on, variant columns to delete before any further processing (variants_not_reported), whether to ignore deletions (remove_deletions), etc. See the example in config_preprint.yaml.

Variants dates

The deconvolution performs much better if only the variants known to be present in the mixture are considered. For longer-running experiments, it is therefore possible to specify, for different time periods, the list of variants to consider for deconvolution, based on their previous detection with a sensitive tool, e.g., by running COJAC and looking for amplicons carrying mutation combinations that are exclusive to certain variants.

For example:

```yaml
var_dates:
  '2022-06-15':
  - BA.1
  - BA.2
  - BA.4
  - BA.5
  - BA.2.75
  '2022-08-15':
  - BA.4
  - BA.5
  - BA.2.75
  - BQ.1.1
  '2022-11-01':
  - BA.4
  - BA.5
  - BA.2.75
  - BQ.1.1
  - XBB
```

See variants_dates_example.yaml.
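One way to picture how such a mapping could be used is to look up, for each sample date, the variant list of the most recent period. The sketch below assumes that each key marks the first day of a period lasting until the next key; that interpretation and the function name variants_for are assumptions for illustration, not a description of LolliPop's internals.

```python
from bisect import bisect_right
from datetime import date

# Same content as the var_dates example above, as a plain dict.
var_dates = {
    "2022-06-15": ["BA.1", "BA.2", "BA.4", "BA.5", "BA.2.75"],
    "2022-08-15": ["BA.4", "BA.5", "BA.2.75", "BQ.1.1"],
    "2022-11-01": ["BA.4", "BA.5", "BA.2.75", "BQ.1.1", "XBB"],
}


def variants_for(sample_date, periods):
    """Return the variants of the latest period starting on or before sample_date.

    ASSUMPTION: each key is the first day of a period that lasts until the next key.
    """
    starts = sorted(date.fromisoformat(day) for day in periods)
    idx = bisect_right(starts, sample_date)
    if idx == 0:
        return []  # the sample predates every configured period
    return periods[starts[idx - 1].isoformat()]


print(variants_for(date(2022, 9, 1), var_dates))
```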

Filters (optional)

Some mutations might be problematic and need to be taken out. For example, due to drop-outs in the multiplex PCR amplification, they do not show up in the data, and LolliPop could misinterpret this as proof of absence of a variant. This optional file contains a collection of filters. Each filter has a list of statements with the following syntax:

```text
- <column> <op> <value>
```

Valid op are:

  • == on that line, the value in column is exactly value
      • for simple strings this can be omitted: - proto v3 is synonymous with - proto == v3
  • <= the value is less than or equal to value
  • >= the value is greater than or equal to value
  • < the value is less than value
  • > the value is greater than value
  • != the value is not value
  • in the value is found in the list specified in value
  • ~ the value matches the regular expression in value
      • regex can be quoted using / or @
  • !~ the value does not match the regular expression in value

Any arbitrary column found in the input file can be used.

All statements of a filter are combined with a logical and and matching lines are removed from the tally table.

Filters are processed in the order found in the YAML file.

For example:

```yaml
# filter to remove test samples
remove_test:
  - sample ~ /^Test/

# filter to remove an amplicon that has drop-outs
amplicon75:
  - proto v3
  - date > 2021-11-20
  - pos >= 22428
  - pos <= 22785
```

See the example in filters_preprint.yaml.
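The filter semantics described above can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions, not LolliPop's parser: values are compared as numbers when both sides parse as numbers and as strings otherwise (which orders ISO dates correctly), the in operator is left out because the list syntax is not described here, and all function names are hypothetical.

```python
import operator
import re

# Comparison operators named in the filter syntax.
COMPARE = {
    "==": operator.eq, "!=": operator.ne,
    "<=": operator.le, ">=": operator.ge,
    "<": operator.lt, ">": operator.gt,
}
OPS = set(COMPARE) | {"~", "!~"}


def parse_statement(statement):
    """Split '<column> <op> <value>'; a missing op means '=='."""
    column, _, rest = statement.strip().partition(" ")
    rest = rest.strip()
    op, _, value = rest.partition(" ")
    if op not in OPS:  # e.g. 'proto v3'
        return column, "==", rest
    return column, op, value.strip()


def coerce(left, right):
    """Compare as numbers when both sides parse as numbers, else as strings."""
    try:
        return float(left), float(right)
    except ValueError:
        return str(left), str(right)


def matches(row, statement):
    """Evaluate one statement against one row of the tally table."""
    column, op, value = parse_statement(statement)
    cell = row[column]
    if op in ("~", "!~"):
        # The regex may be quoted with / or @.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "/@":
            value = value[1:-1]
        found = re.search(value, str(cell)) is not None
        return found if op == "~" else not found
    return COMPARE[op](*coerce(cell, value))


def apply_filters(rows, filters):
    """Remove rows for which all statements of a filter hold (logical and)."""
    for statements in filters.values():  # filters run in file order
        rows = [r for r in rows if not all(matches(r, s) for s in statements)]
    return rows


filters = {
    "remove_test": ["sample ~ /^Test/"],
    "amplicon75": ["proto v3", "date > 2021-11-20", "pos >= 22428", "pos <= 22785"],
}
tally = [  # made-up rows using the columns from the example above
    {"sample": "Test_01", "proto": "v3", "date": "2021-10-01", "pos": 23403},
    {"sample": "A1", "proto": "v3", "date": "2021-12-01", "pos": 22500},
    {"sample": "A1", "proto": "v3", "date": "2021-12-01", "pos": 23403},
    {"sample": "A2", "proto": "v4", "date": "2021-12-01", "pos": 22500},
]
kept = apply_filters(tally, filters)
print([(row["sample"], row["pos"]) for row in kept])
```

Only the rows that satisfy every statement of a filter are dropped, so the v4 sample at position 22500 survives the amplicon75 filter even though its position falls in the filtered range.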

Running it

```bash
lollipop deconvolute --output=deconvoluted.tsv --out-json=deconvoluted_upload.json --var=variants_conf.yaml --vd=variants_dates.yaml --dec=deconv_linear.yaml --seed=42 --n-cores=8 -- tallymut.tsv
```

Output

The output is tabular:

| location | date | variant | proportion |
| :--------- | :--------- | :------ | ---------: |
| main plant | 2023-02-27 | BA.4 | 0.000 |

Optionally, LolliPop can also package the results in a JSON structure, e.g., to be sent to online dashboards:

```json
{
  "mainplant": {
    "BA.4": {
      "timeseriesSummary": [
        { "date": "2023-02-27", "proportion": 0.000 },
        { "date": " … etc … " }
      ]
    }
  }
}
```
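The relationship between the tabular output and the nested JSON can be sketched as a simple grouping step. The example below is an illustration with made-up rows and a hypothetical helper name (to_nested); it mirrors the structure shown above but is not the code LolliPop uses to write its JSON.

```python
import json
from collections import defaultdict

# Rows shaped like the tabular output (values are made up).
rows = [
    {"location": "main plant", "date": "2023-02-27", "variant": "BA.4", "proportion": 0.0},
    {"location": "main plant", "date": "2023-02-28", "variant": "BA.4", "proportion": 0.01},
    {"location": "main plant", "date": "2023-02-27", "variant": "XBB", "proportion": 0.62},
]


def to_nested(rows):
    """Group rows as location -> variant -> {'timeseriesSummary': [...]}."""
    nested = defaultdict(dict)
    for row in rows:
        series = nested[row["location"]].setdefault(
            row["variant"], {"timeseriesSummary": []}
        )
        series["timeseriesSummary"].append(
            {"date": row["date"], "proportion": row["proportion"]}
        )
    return dict(nested)


print(json.dumps(to_nested(rows), indent=2))
```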

The repository cowwid contains real-world examples of downstream analysis of the output of LolliPop.

Installation

We recommend using bioconda software repositories for easy installation. You can find instructions to setup your bioconda environment at the following address:

  • https://bioconda.github.io/index.html#usage

Prebuilt package

LolliPop and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.

You can install lollipop in its own environment and activate it:

```bash
conda create -n lollipop lollipop
conda activate lollipop
# test it
lollipop --help
```

And to update it to the latest version, run:

```bash
# activate the environment if not already active:
conda activate lollipop
conda update lollipop
```

Or you can add it to the current environment (e.g.: in environment base):

```bash
conda install lollipop
```

Building and deploying yourself

within conda environment

If you want to install the software yourself, you can see the list of dependencies in conda_lollipop_env.yaml.

We recommend using conda to install them:

```bash
conda env create -f conda_lollipop_env.yaml
conda activate lollipop
```

Install lollipop using pip:

```bash
# install both the python module and the cli
pip install '.[cli]'
# (this will autodetect dependencies already installed by conda)
```

The command lollipop should now be accessible from your PATH:

```bash
# activate the environment if not already active:
conda activate lollipop
lollipop --help
```

Remove conda environment

You can remove the conda environment if you do not need it any more:

```bash
# exit the lollipop environment first:
conda deactivate
conda env remove -n lollipop
```

Python poetry

LolliPop has its dependencies in a pyproject.toml managed with poetry and can be installed with it.

```bash
# If not installed system-wide: manually run poetry-dynamic-versioning
poetry-dynamic-versioning
# (this sets the version string from the git currently cloned and checked out)

poetry install --extras "cli"
```

For development, install everything with:

```sh
poetry install --with dev --extras "cli"
poetry run pre-commit install
```

This will ensure you have all tools needed for development, including the pre-commit hook for automatic code formatting with black.

Upcoming features

  • [ ] Support VCFs and coverage TSV as alternative to basecount TSV

Long term goal:

~~- [x] Inputs other than SNVs: can deconvolute COJAC's output tables~~

Contributions

Package developers:

Corresponding author:

Citation

If you use this software in your research, please cite:

  • David Dreifuss, Ivan Topolsky, Pelin Icer Baykal & Niko Beerenwinkel

"Tracking SARS-CoV-2 genomic variants in wastewater sequencing data with LolliPop."

medRxiv; doi:10.1101/2022.11.02.22281825

Contacts

If you experience problems running the software:

Owner

  • Name: Computational Biology Group (CBG)
  • Login: cbg-ethz
  • Kind: organization
  • Location: Basel, Switzerland

Beerenwinkel Lab at ETH Zurich

GitHub Events

Total
  • Create event: 14
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 5
  • Delete event: 10
  • Issue comment event: 29
  • Push event: 59
  • Pull request review comment event: 2
  • Pull request review event: 9
  • Pull request event: 18
Last Year
  • Create event: 14
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 5
  • Delete event: 10
  • Issue comment event: 29
  • Push event: 59
  • Pull request review comment event: 2
  • Pull request review event: 9
  • Pull request event: 18

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 95
  • Total Committers: 4
  • Avg Commits per committer: 23.75
  • Development Distribution Score (DDS): 0.474
Past Year
  • Commits: 56
  • Committers: 3
  • Avg Commits per committer: 18.667
  • Development Distribution Score (DDS): 0.482
Top Committers
Name Email Commits
Ivan Blagoev Topolsky i****y@b****h 50
Gordon J. Köhn g****n@k****t 29
dr-david d****s@g****m 10
Gordon J. Köhn g****n@d****h 6

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 18
  • Average time to close issues: 2 months
  • Average time to close pull requests: 20 days
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 1.5
  • Average comments per pull request: 1.11
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 6
Past Year
  • Issues: 3
  • Pull requests: 18
  • Average time to close issues: 5 months
  • Average time to close pull requests: 20 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.33
  • Average comments per pull request: 1.11
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 6
Top Authors
Issue Authors
  • gordonkoehn (4)
  • skunklem (1)
Pull Request Authors
  • gordonkoehn (14)
  • dependabot[bot] (6)
Top Labels
Issue Labels
enhancement (1) bug (1)
Pull Request Labels
dependencies (6) enhancement (4) github_actions (3) python (3) bug (1)

Dependencies

pyproject.toml pypi
  • black ^22.1.0 develop
  • click ^8.0
  • click-option-group ^0.5
  • numpy >=1.23
  • pandas >=1.5
  • python ^3.10
  • ruamel.yaml >=0.15.80
  • scipy >=1.9
  • strictyaml >=1.7
  • tqdm >=4.64
.github/workflows/main.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • snok/install-poetry v1 composite