Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
✓Academic publication links
Links to: ncbi.nlm.nih.gov, nature.com, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.0%) to scientific vocabulary
Keywords
Repository
Genomewide Epistasis Analysis on Bacteria
Basic Info
Statistics
- Stars: 14
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Releases: 8
Topics
Metadata Files
README.md
Genomewide Co-selection and Epistasis in Bacteria
LDWeaver accepts a sequence alignment (fasta) and its reference annotation
(genbank or gff) as inputs and identifies linkage disequilibrium (LD) between
pairs of variants (links) that is unusually high given the genomic distance
between the pair. This high LD could be the result of co-selection or epistasis.
Approximate statistical significance is used to rank outlier links and the
output is reported in tsv format, along with several other helpful annotations and figures.
Additionally, LDWeaver has functions to assist the detection of genomic regions
that have potentially undergone co-selection or epistasis. LDWeaver tsv output
can be directly used as input for
GWESExplorer
for dynamic link visualisation.
Installation
Using bioconda
First, you need to install conda.
Next, create a new environment and activate it:
conda
conda create -n r-ldweaver
conda activate r-ldweaver
Finally, install the package from bioconda.
conda install -c conda-forge -c bioconda r-ldweaver
Consider using mamba for a faster installation.
Using devtools
The latest version of LDWeaver is available on github. It can be installed with devtools.
r
install.packages("devtools")
devtools::install_github("Sudaraka88/LDWeaver")
Quick Start
To run LDWeaver, all you need is a fasta alignment (can be gz compressed) and the corresponding genbank (preferred, or gff3 also supported) annotation file with the reference sequence. You can also use a SNP-only alignment (can be generated from snp-sites or FastaR) with a numeric vector of SNP positions as per the reference genome.
A toy dataset is provided in the package itself to confirm that everything is setup correctly.
Note For GWES analysis, it is recommended to use an alignment with a large number of sequences (>500). Such alignments can be several gigabytes in size and not suitable to bundle into an R package. A proper example using real data is available below.
The toy alignment comprises the first 50kb of 400 S. pnuemoniae
isolates (randomy selected from the 616 reported
here).
The reads were aligned against the
ATCC
700669 reference genome using
snippy.
Since this is only a part of an alignment, we set
validate_ref_ann_lengths = F, which forces LDWeaver to ignore the
mismatch in sequence lengths between the genbank reference sequence and the
alignment. Several additional options are also used, see
help(package="LDWeaver") for more details.
``` r
devtools::install_github("Sudaraka88/LDWeaver")
library(LDWeaver) dset <- "sample" alnpath <- system.file("extdata", "sample.aln.gz", package = "LDWeaver") gbkpath <- system.file("extdata", "sample.gbk", package = "LDWeaver") snpfiltmethod = "relaxed" LDWeaver(dset = dset, alnpath = alnpath, gbkpath = gbkpath, validaterefannlengths = F, numclustsCDS = 2, SnpEffAnnotate = F, snpfiltmethod = snpfiltmethod) ```
Note If you are using a SNP-only alignment, set
aln_has_all_bases = Fand providepos, a numeric vector of SNP positions. Each SNP in the SNP-only alignment must have a unique SNP position.
Citation
Please cite LDWeaver using: Mallawaarachchi, S. et al. Detecting co-selection through excess linkage disequilibrium in bacterial genomes, NAR genom. bioinform., 6(2),lqae061 (2024): https://doi.org/10.1093/nargab/lqae061
Detailed Workthrough using Real Data
The following analysis demonstrates most of the options available in LDWeaver. The alignment with 616 S. pnuemoniae genomes is available here, and the same sample.gbk annotation was used to generate this alignment (also available here).
Note: Alternatively, you can download and use the SNP-only alignment with the accompanying positions file to generate the same results.
For this example, it is assumed that the current working directory is set to
~/LDWeaver_run and the
alignment
is available in the same folder. Please note that file paths here are written for Linux/macOS operating systems,
windows users will need to modify as needed.
The following few lines of code can perform the complete LDWeaver analysis.
``` r library(LDWeaver)
WARNING! Possible runtime > 1h
setwd("~/LDWeaver_run")
dset <- "msch" alnpath <- "spn23fmsch.aln.gz" gbk_path <- system.file("extdata", "sample.gbk", package = "LDWeaver")
LDWeaver::LDWeaver(dset = dset, alnpath = alnpath, gbkpath = gbkpath, saveadditionaloutputs = T) ```
LDWeaver::LDWeaver() one-liner is versatile for most
analyses. If previously created outputs are available in <dset>, this
function will load those instead of repeating possibly time and resource heavy analysis.
It is also possible to write customised pipelines using available functions. For a full list of available functions
and options, run: help(package="LDWeaver").
``` r library(LDWeaver) dset <- "msch" dir.create(dset) # folder to save outputs
alnpath <- "~/LDWeaverrun/spn23fmsch.aln.gz" gbkpath <- system.file("extdata", "sample.gbk", package = "LDWeaver") ncores = parallel::detectCores()
snp.dat = LDWeaver::parsefastaalignment(alnpath = alnpath) # parse the alignment and extract SNPs gbk = LDWeaver::parsegenbankfile(gbkpath = gbkpath, g = snp.dat$g) # parse the annotation cdsvar = LDWeaver::estimatevariationinCDS(gbk = gbk, snp.dat = snp.dat, ncores = ncores, numclustsCDS = 3, clustpltpath = "msch/CDS_clustering.png") ```

``` r hdw = LDWeaver::estimateHammingdistance_weights(snp.dat = snp.dat) # Hamming distance weights
Perform MI computation model fitting and ARACNE - this will take some time...
srlinks = LDWeaver::performMIcomputation(snp.dat = snp.dat, hdw = hdw, cdsvar = cdsvar, ncores = ncores, lrsavepath = "msch/lrlinks.tsv", srsavepath = "msch/srlinks.tsv", pltfolder = dset) ```

``` r
Generate GWES Plots (short-range)
LDWeaver::makegwesplots(srlinks = srlinks, plt_folder = dset) ```

``` r
Identify the top hits by performing snpEff annotations
tophits = LDWeaver::performsnpEffannotations(dsetname = dset, annotationfolder = file.path(getwd(), dset), gbk = gbk, gbkpath = gbkpath, cdsvar = cdsvar, linksdf = srlinks, snp.dat = snp.dat, tophitspath = "msch/srtophits.tsv") ```
This will generate several outputs comprising annotations into the <msch> folder, please see Performing Annotations.
``` r
Generate tanglegram
LDWeaver::createtanglegram(tophits = tophits, gbk = gbk, tanglegramfolder = "msch/SR_Tanglegram") ```
Above line will create 5 tanglegrams in html format, the first one
should look like this: 
``` r
Generate GWES Explorer outputs
LDWeaver::writeoutputforgwesexplorer(snp.dat = snp.dat, tophits = tophits, gwesexplorerfolder = "msch/SR_GWESExplorer") ```
Above line will create three files in <msch/SRGWESExplorer/> that can be
used as inputs for
<a href="https://github.com/jurikuronen/GWES-Explorer" target="blank">GWESExplorer.
In addition, it is possible to provide a GFF3 annotation file,
phenotype data and a core genome phylogeny. The circular GWESExplorer plot should look like this:

Next step is to analyse the long range links ``` r
Analyse long range links
LDWeaver::analyselongrangelinks(dset = dset, lrlinkspath = "msch/lrlinks.tsv",
srlinkspath = "msch/srlinks.tsv", SnpEffAnnotate = T,
snp.dat = snp.dat, gbkpath = gbkpath, cdsvar = cdsvar)
```

By this time, the msch folder is cluttered with a large number of outputs.
LDWeaver::LDWeaver one-liner will automatically tidy up these files in to several folders.
This can be done using the cleanup function.
r
LDWeaver::cleanup(dset)
It is possible to generate a genomewide LD distribution map using the following:
r
LDWeaver::genomewide_LDMap(lr_links_path = "msch/Temp/lr_links.tsv",
sr_links_path = "msch/Temp/sr_links.tsv",
plot_save_path = "msch/GWLD.png")
Note The paths have now updated after running LDWeaver::cleanup().

A network plot for the pbp genes showcasing both the number of links between
sites and their magnitude can be generated using:
``` r
Generate the Network Plot for pbp genes
network = LDWeaver::createnetworkforgene("pbp", srannotatedpath = "msch/Annotatedlinks/srlinksannotated.tsv", lrannotatedpath = "msch/Annotatedlinks/lrlinks_annotated.tsv", level = 2)
LDWeaver::createnetwork(network,
plottitle = "pbp network",
netplotpath = "msch/pbpnetwork.png",
plotw = 2000, ploth = 2000)
```

Additional Information
Note With LDWeaver >1.5, you can analyse mega scale datasets with > 2^(32-1) elements. This requires spam and spam64 packages. Set
mega_dset=TinLDWeaver::LDWeaver()to use this feature. Warning! This is currently considerably slower than the default mode (mega_dset=F) and only supports single core operations. There will also be minor discrepancies between the two methods due to floating point errors, however, this should only have a minimal impact on the final link ranking.
Key Outputs
If the above steps worked as expected, the following output will be saved to a
folder called sample, which should be created in the current working directory.
(Working directory can be queried using: getwd()).
- Figures
- sample/cX_fit.png - shows the distribution and modelling of the background linkage disequilibrium (estimated using weighted Mutual Information) vs. bp-separation within each cluster (X = 1,2 in the example)
- sample/CDS_clustering.png - shows the genome partitioning, based on the CDS diversity (compared to the reference sequence)
- sample/srgwesclust.png - short-range GWES plot for each cluster (2 in this case)
- sample/srgwescombi.png - combined short-range GWES plot (for links with bp positions spanning two clusters, the max srp_value is used)
- sample/lrgwes.png - Long range GWES plot (similar to the output from <a href="https://github.com/santeripuranen/SpydrPick" target="blank">SpydrPick)
- Outputs
- sample/srlinks.tsv - tab separated file containing details on short-range links (i.e. links <= srdist bp apart)
- sample/lrlinks.tsv - tab separated file containing details on long-range links (i.e. links > srdist bp apart)
Extra Outputs
Note The default
sr_distvalue in LDWeaver is 20000bp (user modifiable). This means that by default, links from SNPs <20kb apart are considered short-range.
- Additional Outputs (not generated) - can be used to avoid costly re-computations.
- AdditionalOutputs/snpACGTN.rds - list comprising sparse SNP data from the alignment
- AdditionalOutputs/parsedgbk.rds - GenBankRecord of the genbank annotation data
- Additional_Outputs/hdw.rds - named vector comprising Hamming distance weights for each sequence
- AdditionalOutputs/cdsvar.rds - list comprising alignment diversity information
Note For very large datsets, the user has the option to set
save_additional_outputs=T. When these four files are present in <dset>/Additional_Outputs/, the saved information will be used instead of re-computing.
Performing Annotations
By default, LDWeaver performs detailed annotations on all link SNPs using
SnpEff.
This will create the following outputs in <dset>. Note that X here
refers to sr (short range) or lr (long range).
- Outputs
- Annotatedlinks/Xlinksannotated.tsv - tab separated file similar to sample/Xlinks.tsv with additional SnpEff annotations and allele distribution information
- Tophits/Xtophits.tsv - tab separated file containing the top 250 links (user modifiable with `maxtophipts`) . Several filters are applied to extract the top links from Annotatedlinks/Xlinks_annotated.tsv
- SR_Tanglegram - folder compirising html tanglegrams to easily visualise links and the corresponding genomic regions
- GWESExplorer/XGWESExplorer - folder containing the outputs necessary to dynamically explore links using <a href="https://github.com/jurikuronen/GWES-Explorer" target="blank">GWESExplorer (X = sr,lr).
Note The default srpcutoff is 3 (i.e., p=0.001). Short-range links with p>0.001 are automatically discarded, this can be modified using the <srpcutoff> option. The default maxtophits value is 250, this can be modified using the <maxtophits> option.
- Temporary files created during snpEff annotations. These are all written to <dset>/Temp and can be ignored or safely deleted)
- Temp/snpEff_data - data folder for snpEff
- Temp/snpEff.config - configuration file for snpEff
- Temp/Xannotations.tsv - tab separated file containing full snpEff annotations on each site associated with a short-range GWES link with srpmax > srp_cutoff
- Temp/Xannotataedstats.genes.txt - annotations and statistics in tab separated format
- Temp/Xannotatedstats.html - annotations and statistics in html format
- Temp/Xsnps.vcf, Temp/Xsnps_ann.vcf - input and output from the snpEff annotation pipeline
Owner
- Name: Sudaraka Mallawaarachchi
- Login: Sudaraka88
- Kind: user
- Location: Melbourne, Australia
- Company: University of Melbourne
- Repositories: 1
- Profile: https://github.com/Sudaraka88
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: LDWeaver
message: ' If you find LDWeaver helpful, please cite:'
type: software
authors:
- given-names: Sudaraka
family-names: Mallawa Arachchi
email: smallawaarachchi@gmail.com
orcid: 'https://orcid.org/0000-0001-8899-3323'
affiliation: University of Oslo and University of Melbourne
identifiers:
- type: doi
value: 10.5281/zenodo.10016711
description: Zenodo
repository-code: 'https://github.com/Sudaraka88/LDWeaver/'
abstract: >-
LDWeaver accepts a sequence alignment (fasta) and its
reference annotation (genbank or gff) as inputs and
identifies linkage disequilibrium (LD) between pairs of
variants (links) that is unusually high given the genomic
distance between the pair. This high LD could be the
result of co-selection or epistasis.
keywords:
- Microbial Genomics
- Linkage Disequilibrium
- Epistasis
- Gene Interactions
license: GPL-3.0-or-later
version: v1.5
date-released: '2024-01-04'
GitHub Events
Total
- Create event: 3
- Release event: 1
- Issues event: 3
- Watch event: 2
- Delete event: 2
- Issue comment event: 7
- Push event: 10
- Pull request event: 1
Last Year
- Create event: 3
- Release event: 1
- Issues event: 3
- Watch event: 2
- Delete event: 2
- Issue comment event: 7
- Push event: 10
- Pull request event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 1
- Average time to close issues: 23 days
- Average time to close pull requests: 29 minutes
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 1
- Average time to close issues: 23 days
- Average time to close pull requests: 29 minutes
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- braddmg (1)
- cammmer (1)
Pull Request Authors
- Sudaraka88 (5)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
- R >= 4.0.0 depends
- GenomicRanges * imports
- Matrix * imports
- MatrixExtra * imports
- RColorBrewer * imports
- Rcpp * imports
- RcppArmadillo * imports
- ape * imports
- chromoMap * imports
- data.table * imports
- dplyr * imports
- fitdistrplus * imports
- genbankr * imports
- ggnewscale * imports
- ggplot2 * imports
- ggraph * imports
- ggtree * imports
- grDevices * imports
- heatmap3 * imports
- htmlwidgets * imports
- igraph * imports
- methods * imports
- parallel * imports
- phytools * imports
- plyr * imports
- stats * imports
- utils * imports