ldweaver

Genomewide Epistasis Analysis on Bacteria

https://github.com/sudaraka88/ldweaver

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
✓
Academic publication links
Links to: ncbi.nlm.nih.gov, nature.com, zenodo.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.0%) to scientific vocabulary

Keywords

bacterial-genome-analysis bioinformatics epistasis-analysis linkage-disequilibrium microbial-genomics

Last synced: 9 months ago · JSON representation ·

Repository

Genomewide Epistasis Analysis on Bacteria

Basic Info

Host: GitHub
Owner: Sudaraka88
License: gpl-3.0
Language: R
Default Branch: main
Homepage:
Size: 108 MB

Statistics

Stars: 14
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 8

Topics

bacterial-genome-analysis bioinformatics epistasis-analysis linkage-disequilibrium microbial-genomics

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

Genomewide Co-selection and Epistasis in Bacteria

LDWeaver accepts a sequence alignment (fasta) and its reference annotation (genbank or gff) as inputs and identifies linkage disequilibrium (LD) between pairs of variants (links) that is unusually high given the genomic distance between the pair. This high LD could be the result of co-selection or epistasis. Approximate statistical significance is used to rank outlier links and the output is reported in tsv format, along with several other helpful annotations and figures. Additionally, LDWeaver has functions to assist the detection of genomic regions that have potentially undergone co-selection or epistasis. LDWeaver tsv output can be directly used as input for GWESExplorer for dynamic link visualisation.

Installation

Using `bioconda`

First, you need to install conda.

Next, create a new environment and activate it: conda conda create -n r-ldweaver conda activate r-ldweaver Finally, install the package from bioconda. conda install -c conda-forge -c bioconda r-ldweaver Consider using mamba for a faster installation.

Using `devtools`

The latest version of LDWeaver is available on github. It can be installed with devtools.

r install.packages("devtools") devtools::install_github("Sudaraka88/LDWeaver")

Quick Start

To run LDWeaver, all you need is a fasta alignment (can be gz compressed) and the corresponding genbank (preferred, or gff3 also supported) annotation file with the reference sequence. You can also use a SNP-only alignment (can be generated from snp-sites or FastaR) with a numeric vector of SNP positions as per the reference genome.

A toy dataset is provided in the package itself to confirm that everything is setup correctly.

Note For GWES analysis, it is recommended to use an alignment with a large number of sequences (>500). Such alignments can be several gigabytes in size and not suitable to bundle into an R package. A proper example using real data is available below.

The toy alignment comprises the first 50kb of 400 S. pnuemoniae isolates (randomy selected from the 616 reported here). The reads were aligned against the ATCC 700669 reference genome using snippy. Since this is only a part of an alignment, we set validate_ref_ann_lengths = F, which forces LDWeaver to ignore the mismatch in sequence lengths between the genbank reference sequence and the alignment. Several additional options are also used, see help(package="LDWeaver") for more details.

``` r

devtools::install_github("Sudaraka88/LDWeaver")

library(LDWeaver) dset <- "sample" alnpath <- system.file("extdata", "sample.aln.gz", package = "LDWeaver") gbkpath <- system.file("extdata", "sample.gbk", package = "LDWeaver") snpfiltmethod = "relaxed" LDWeaver(dset = dset, alnpath = alnpath, gbkpath = gbkpath, validaterefannlengths = F, numclustsCDS = 2, SnpEffAnnotate = F, snpfiltmethod = snpfiltmethod) ```

Note If you are using a SNP-only alignment, set aln_has_all_bases = F and provide pos, a numeric vector of SNP positions. Each SNP in the SNP-only alignment must have a unique SNP position.

Citation

Please cite LDWeaver using: Mallawaarachchi, S. et al. Detecting co-selection through excess linkage disequilibrium in bacterial genomes, NAR genom. bioinform., 6(2),lqae061 (2024): https://doi.org/10.1093/nargab/lqae061

Detailed Workthrough using Real Data

The following analysis demonstrates most of the options available in LDWeaver. The alignment with 616 S. pnuemoniae genomes is available here, and the same sample.gbk annotation was used to generate this alignment (also available here).

Note: Alternatively, you can download and use the SNP-only alignment with the accompanying positions file to generate the same results.

For this example, it is assumed that the current working directory is set to ~/LDWeaver_run and the alignment is available in the same folder. Please note that file paths here are written for Linux/macOS operating systems, windows users will need to modify as needed.

The following few lines of code can perform the complete LDWeaver analysis.

``` r library(LDWeaver)

WARNING! Possible runtime > 1h

setwd("~/LDWeaver_run")

dset <- "msch" alnpath <- "spn23fmsch.aln.gz" gbk_path <- system.file("extdata", "sample.gbk", package = "LDWeaver")

LDWeaver::LDWeaver(dset = dset, alnpath = alnpath, gbkpath = gbkpath, saveadditionaloutputs = T) ```

LDWeaver::LDWeaver() one-liner is versatile for most analyses. If previously created outputs are available in <dset>, this function will load those instead of repeating possibly time and resource heavy analysis.

It is also possible to write customised pipelines using available functions. For a full list of available functions and options, run: help(package="LDWeaver").

``` r library(LDWeaver) dset <- "msch" dir.create(dset) # folder to save outputs

alnpath <- "~/LDWeaverrun/spn23fmsch.aln.gz" gbkpath <- system.file("extdata", "sample.gbk", package = "LDWeaver") ncores = parallel::detectCores()

snp.dat = LDWeaver::parsefastaalignment(alnpath = alnpath) # parse the alignment and extract SNPs gbk = LDWeaver::parsegenbankfile(gbkpath = gbkpath, g = snp.dat$g) # parse the annotation cdsvar = LDWeaver::estimatevariationinCDS(gbk = gbk, snp.dat = snp.dat, ncores = ncores, numclustsCDS = 3, clustpltpath = "msch/CDS_clustering.png") ```

``` r hdw = LDWeaver::estimateHammingdistance_weights(snp.dat = snp.dat) # Hamming distance weights

Perform MI computation model fitting and ARACNE - this will take some time...

srlinks = LDWeaver::performMIcomputation(snp.dat = snp.dat, hdw = hdw, cdsvar = cdsvar, ncores = ncores, lrsavepath = "msch/lrlinks.tsv", srsavepath = "msch/srlinks.tsv", pltfolder = dset) ```

``` r

Generate GWES Plots (short-range)

LDWeaver::makegwesplots(srlinks = srlinks, plt_folder = dset) ```

``` r

Identify the top hits by performing snpEff annotations

tophits = LDWeaver::performsnpEffannotations(dsetname = dset, annotationfolder = file.path(getwd(), dset), gbk = gbk, gbkpath = gbkpath, cdsvar = cdsvar, linksdf = srlinks, snp.dat = snp.dat, tophitspath = "msch/srtophits.tsv") ```

This will generate several outputs comprising annotations into the <msch> folder, please see Performing Annotations.

``` r

Generate tanglegram

LDWeaver::createtanglegram(tophits = tophits, gbk = gbk, tanglegramfolder = "msch/SR_Tanglegram") ```

Above line will create 5 tanglegrams in html format, the first one should look like this:

``` r

Generate GWES Explorer outputs

LDWeaver::writeoutputforgwesexplorer(snp.dat = snp.dat, tophits = tophits, gwesexplorerfolder = "msch/SR_GWESExplorer") ```

Above line will create three files in <msch/SRGWESExplorer/> that can be used as inputs for <a href="https://github.com/jurikuronen/GWES-Explorer" target="blank">GWESExplorer. In addition, it is possible to provide a GFF3 annotation file, phenotype data and a core genome phylogeny. The circular GWESExplorer plot should look like this:

Next step is to analyse the long range links ``` r

Analyse long range links

LDWeaver::analyselongrangelinks(dset = dset, lrlinkspath = "msch/lrlinks.tsv", srlinkspath = "msch/srlinks.tsv", SnpEffAnnotate = T, snp.dat = snp.dat, gbkpath = gbkpath, cdsvar = cdsvar) ```

By this time, the msch folder is cluttered with a large number of outputs. LDWeaver::LDWeaver one-liner will automatically tidy up these files in to several folders. This can be done using the cleanup function.

r LDWeaver::cleanup(dset)

It is possible to generate a genomewide LD distribution map using the following: r LDWeaver::genomewide_LDMap(lr_links_path = "msch/Temp/lr_links.tsv", sr_links_path = "msch/Temp/sr_links.tsv", plot_save_path = "msch/GWLD.png")

Note The paths have now updated after running LDWeaver::cleanup().

A network plot for the pbp genes showcasing both the number of links between sites and their magnitude can be generated using:

``` r

Generate the Network Plot for pbp genes

network = LDWeaver::createnetworkforgene("pbp", srannotatedpath = "msch/Annotatedlinks/srlinksannotated.tsv", lrannotatedpath = "msch/Annotatedlinks/lrlinks_annotated.tsv", level = 2)

LDWeaver::createnetwork(network, plottitle = "pbp network", netplotpath = "msch/pbpnetwork.png", plotw = 2000, ploth = 2000) ```

Additional Information

Note With LDWeaver >1.5, you can analyse mega scale datasets with > 2^(32-1) elements. This requires spam and spam64 packages. Set mega_dset=T in LDWeaver::LDWeaver() to use this feature. Warning! This is currently considerably slower than the default mode (mega_dset=F) and only supports single core operations. There will also be minor discrepancies between the two methods due to floating point errors, however, this should only have a minimal impact on the final link ranking.

Key Outputs

If the above steps worked as expected, the following output will be saved to a folder called sample, which should be created in the current working directory. (Working directory can be queried using: getwd()).

Figures

sample/cX_fit.png - shows the distribution and modelling of the background linkage disequilibrium (estimated using weighted Mutual Information) vs. bp-separation within each cluster (X = 1,2 in the example)
sample/CDS_clustering.png - shows the genome partitioning, based on the CDS diversity (compared to the reference sequence)
sample/srgwesclust.png - short-range GWES plot for each cluster (2 in this case)
sample/srgwescombi.png - combined short-range GWES plot (for links with bp positions spanning two clusters, the max srp_value is used)
sample/lrgwes.png - Long range GWES plot (similar to the output from <a href="https://github.com/santeripuranen/SpydrPick" target="blank">SpydrPick)

Outputs

sample/srlinks.tsv - tab separated file containing details on short-range links (i.e. links <= srdist bp apart)
sample/lrlinks.tsv - tab separated file containing details on long-range links (i.e. links > srdist bp apart)

Extra Outputs

Note The default sr_dist value in LDWeaver is 20000bp (user modifiable). This means that by default, links from SNPs <20kb apart are considered short-range.

Additional Outputs (not generated) - can be used to avoid costly re-computations.

AdditionalOutputs/snpACGTN.rds - list comprising sparse SNP data from the alignment
AdditionalOutputs/parsedgbk.rds - GenBankRecord of the genbank annotation data
Additional_Outputs/hdw.rds - named vector comprising Hamming distance weights for each sequence
AdditionalOutputs/cdsvar.rds - list comprising alignment diversity information

Note For very large datsets, the user has the option to set save_additional_outputs=T. When these four files are present in <dset>/Additional_Outputs/, the saved information will be used instead of re-computing.

Performing Annotations

By default, LDWeaver performs detailed annotations on all link SNPs using SnpEff. This will create the following outputs in <dset>. Note that X here refers to sr (short range) or lr (long range).

Outputs

Annotatedlinks/Xlinksannotated.tsv - tab separated file similar to sample/Xlinks.tsv with additional SnpEff annotations and allele distribution information
Tophits/Xtophits.tsv - tab separated file containing the top 250 links (user modifiable with `maxtophipts`) . Several filters are applied to extract the top links from Annotatedlinks/Xlinks_annotated.tsv
SR_Tanglegram - folder compirising html tanglegrams to easily visualise links and the corresponding genomic regions
GWESExplorer/XGWESExplorer - folder containing the outputs necessary to dynamically explore links using <a href="https://github.com/jurikuronen/GWES-Explorer" target="blank">GWESExplorer (X = sr,lr).

Note The default srpcutoff is 3 (i.e., p=0.001). Short-range links with p>0.001 are automatically discarded, this can be modified using the <srpcutoff> option. The default maxtophits value is 250, this can be modified using the <maxtophits> option.

Temporary files created during snpEff annotations. These are all written to <dset>/Temp and can be ignored or safely deleted)

Temp/snpEff_data - data folder for snpEff
Temp/snpEff.config - configuration file for snpEff
Temp/Xannotations.tsv - tab separated file containing full snpEff annotations on each site associated with a short-range GWES link with srpmax > srp_cutoff
Temp/Xannotataedstats.genes.txt - annotations and statistics in tab separated format
Temp/Xannotatedstats.html - annotations and statistics in html format
Temp/Xsnps.vcf, Temp/Xsnps_ann.vcf - input and output from the snpEff annotation pipeline

Owner

Name: Sudaraka Mallawaarachchi
Login: Sudaraka88
Kind: user
Location: Melbourne, Australia
Company: University of Melbourne

Repositories: 1
Profile: https://github.com/Sudaraka88

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: LDWeaver
message: ' If you find LDWeaver helpful, please cite:'
type: software
authors:
  - given-names: Sudaraka
    family-names: Mallawa Arachchi
    email: smallawaarachchi@gmail.com
    orcid: 'https://orcid.org/0000-0001-8899-3323'
    affiliation: University of Oslo and University of Melbourne
identifiers:
  - type: doi
    value: 10.5281/zenodo.10016711
    description: Zenodo
repository-code: 'https://github.com/Sudaraka88/LDWeaver/'
abstract: >-
  LDWeaver accepts a sequence alignment (fasta) and its
  reference annotation (genbank or gff) as inputs and
  identifies linkage disequilibrium (LD) between pairs of
  variants (links) that is unusually high given the genomic
  distance between the pair. This high LD could be the
  result of co-selection or epistasis.
keywords:
  - Microbial Genomics
  - Linkage Disequilibrium
  - Epistasis
  - Gene Interactions
license: GPL-3.0-or-later
version: v1.5
date-released: '2024-01-04'

GitHub Events

Total

Create event: 3
Release event: 1
Issues event: 3
Watch event: 2
Delete event: 2
Issue comment event: 7
Push event: 10
Pull request event: 1

Last Year

Create event: 3
Release event: 1
Issues event: 3
Watch event: 2
Delete event: 2
Issue comment event: 7
Push event: 10
Pull request event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: 23 days
Average time to close pull requests: 29 minutes
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 2.5
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: 23 days
Average time to close pull requests: 29 minutes
Issue authors: 2
Pull request authors: 1
Average comments per issue: 2.5
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

braddmg (1)
cammmer (1)

Pull Request Authors

Sudaraka88 (5)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

Dependencies

.github/workflows/r.yml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

DESCRIPTION cran

R >= 4.0.0 depends
GenomicRanges * imports
Matrix * imports
MatrixExtra * imports
RColorBrewer * imports
Rcpp * imports
RcppArmadillo * imports
ape * imports
chromoMap * imports
data.table * imports
dplyr * imports
fitdistrplus * imports
genbankr * imports
ggnewscale * imports
ggplot2 * imports
ggraph * imports
ggtree * imports
grDevices * imports
heatmap3 * imports
htmlwidgets * imports
igraph * imports
methods * imports
parallel * imports
phytools * imports
plyr * imports
stats * imports
utils * imports

ldweaver

Science Score: 67.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Genomewide Co-selection and Epistasis in Bacteria

Installation

Using bioconda

Using devtools

Quick Start

devtools::install_github("Sudaraka88/LDWeaver")

Citation

Detailed Workthrough using Real Data

WARNING! Possible runtime > 1h

Perform MI computation model fitting and ARACNE - this will take some time...

Generate GWES Plots (short-range)

Identify the top hits by performing snpEff annotations

Generate tanglegram

Generate GWES Explorer outputs

Analyse long range links

Generate the Network Plot for pbp genes

Additional Information

Key Outputs

Extra Outputs

Performing Annotations

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

Using `bioconda`

Using `devtools`