tadrep

Targeted Detection and Reconstruction of Plasmids

https://github.com/oschwengers/tadrep

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 3 of 3 committers (100.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary

Keywords

bacteria plasmids wgs

Last synced: 6 months ago

Repository

Targeted Detection and Reconstruction of Plasmids

Basic Info

Host: GitHub
Owner: oschwengers
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 2.63 MB

Statistics

Stars: 21
Watchers: 2
Forks: 1
Open Issues: 3
Releases: 4

Topics

bacteria plasmids wgs

Created almost 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Code of conduct Citation

README.md

TaDReP: Targeted Detection and Reconstruction of Plasmids

TaDReP is a tool for the rapid targeted detection and reconstruction of plasmids within bacterial draft genomes.

Description
Installation
Input & Output
Overview
Usage
- Setup
- Database
- Extract
- Characterize
- Cluster
- Detect
- Visualize
Issues & Feature Requests

Description

TaDReP facilitates rapid screening for target plasmids within single draft genomes or cohorts of draft genomes.

It detects and reconstructs reference plasmids within bacterial draft assemblies via BLAST+ alignments of draft genome contigs that are rigorously filtered by coverage and sequence identity thresholds. Finally, reference plasmids are reported as detected and reconstructed only if strict thresholds for plasmid-wise coverage and sequence identity are met.

Installation

TaDReP can be installed via Conda and Pip. However, we encourage using Conda, which automatically installs all required 3rd party dependencies.

Conda

bash conda install -c conda-forge -c bioconda tadrep

Pip

bash $ python3 -m pip install --user tadrep

Input and Output

Input

TaDReP accepts bacterial draft genome assemblies in (zipped) fasta format. Complete reference plasmid sequences are either extracted from (semi-)closed genomes or plasmid sequence collections, or created from public plasmid databases (RefSeq / PLSDB). For further information on how to extract plasmid sequences, please read the Extract section below.

Output

For each draft genome, TaDReP writes a TSV summary file listing all detected reference plasmids and aligned genome contigs. For each reference plasmid that was detected in a draft assembly, the ordered and rearranged contigs are exported both as N-merged scaffolds and as individual contigs. Furthermore, for each reconstructed plasmid, the reference plasmid backbone and all contig alignments are visualized (PDF).

<genome>-summary.tsv: detailed per contig alignment summary
<genome>-<plasmid>-contigs.fna: ordered and rearranged contigs of the reconstructed plasmid
<genome>-<plasmid>-pseudo.fna: pseudomolecule sequence of the reconstructed plasmid
<genome>-<plasmid>.pdf: visualization of aligned contigs against the detected reference plasmid

If multiple genomes are provided, TaDReP also provides a presence/absence matrix of all detected plasmids as a cohort analysis, as well as a short summary of the plasmids and the contigs that were matched in each genome.

plasmids.info: plasmid characterization summary
plasmids.tsv: presence/absence table of detected plasmids
summary.tsv: short summary of matched contigs through all genomes
tadrep.log: log-file for debugging
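
To illustrate the idea behind the cohort presence/absence table, here is a minimal Python sketch. It is not TaDReP's code, and the actual column layout of plasmids.tsv may differ. The function name `presence_absence`, the genome and plasmid identifiers, and the 1/0 encoding are all assumptions made for this example.

```python
def presence_absence(detections):
    """Build a presence/absence matrix.

    detections: dict mapping a genome name to the set of plasmid ids
    detected in it (hypothetical input structure, not TaDReP's own).
    Returns the sorted plasmid ids and one row per genome.
    """
    plasmids = sorted({p for found in detections.values() for p in found})
    rows = []
    for genome in sorted(detections):
        found = detections[genome]
        rows.append([genome] + [1 if p in found else 0 for p in plasmids])
    return plasmids, rows


# Made-up example data: three genomes, two reference plasmids.
detections = {"genome_a": {"pA", "pB"}, "genome_b": {"pB"}, "genome_c": set()}
plasmids, rows = presence_absence(detections)

# Print as a tab-separated table (assumed layout).
print("\t".join(["genome"] + plasmids))
for row in rows:
    print("\t".join(str(value) for value in row))
```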

Overview

TaDReP overview

Usage

TaDReP's workflow comprises seven steps implemented as CLI submodules to ease semi-automated multi-step analyses.

```
usage: TaDReP [--help] [--verbose] [--threads THREADS] [--tmp-dir TMP_DIR]
              [--version] [--output OUTPUT] [--prefix PREFIX] ...

Targeted Detection and Reconstruction of Plasmids

General:
  --help, -h            Show this help message and exit
  --verbose, -v         Print verbose information
  --threads THREADS, -t THREADS
                        Number of threads to use (default = number of available CPUs)
  --tmp-dir TMP_DIR     Temporary directory to store blast hits
  --version             show program's version number and exit

General Input / Output:
  --output OUTPUT, -o OUTPUT
                        Output directory (default = current working directory)
  --prefix PREFIX       Prefix for all output files (default = None)

Submodules:

    setup               Download and prepare inc-types
    database            Download and create database for TaDReP
    extract             Extract unique plasmid sequences
    characterize        Identify plasmids with GC content, Inc types, conjugation genes
    cluster             Cluster related plasmids
    detect              Detect and reconstruct plasmids in draft genomes
    visualize           Visualize plasmid coverage of contigs

Citation:
Schwengers et al. (2023)
TaDReP: Targeted Detection and Reconstruction of Plasmids.
GitHub https://github.com/oschwengers/tadrep
```
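
Because the submodules are meant to be run one after another, it can help to see a complete run written out. The following Python sketch only assembles the command lines for a database-based run (so the extract step is skipped) and prints them without executing anything. The helper `build_workflow`, the directory name `results`, and the assumption that `-o` may be passed to every submodule are illustrative; the flags themselves are taken from the usage texts and examples in this README.

```python
import shlex


def build_workflow(output_dir, genomes, db_type="plsdb", db_json=None, inc_types=None):
    """Return the tadrep invocations of a database-based run as argv lists.

    Hypothetical helper for illustration; paths are supplied by the caller.
    """
    base = ["tadrep", "-v", "-o", output_dir]
    characterize = base + ["characterize"]
    if db_json:
        characterize += ["--db", db_json]
    if inc_types:
        characterize += ["--inc-types", inc_types]
    return [
        base + ["setup"],
        base + ["database", "--type", db_type],
        characterize,
        base + ["cluster"],
        base + ["detect", "--genome", *genomes],
        base + ["visualize"],
    ]


commands = build_workflow(
    "results",
    ["draft.fna"],
    db_json="databases/plsdb/plsdb.json",
    inc_types="inc-types/inc-types.fasta",
)
for argv in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(argv))
```

To actually run the steps, each argv list could be passed to `subprocess.run(argv, check=True)`, assuming `tadrep` is on the `PATH`.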

Setup

The setup module downloads external databases, e.g. PlasmidFinder's incompatibility groups, that are required to characterize plasmids.

Example

Verbosely download inc-types:

bash tadrep -v -o <output-path> setup

Database

The database module downloads public plasmid databases (PLSDB / RefSeq) into a reference plasmid file. This creates a subdirectory in the user-specified output directory.

If you downloaded a database, you can skip the extract step and start with the characterization.

```bash
usage: TaDReP database [-h] [--type {refseq,plsdb}] [--force]

options:
  -h, --help            show this help message and exit

Input / Output:
  --type {refseq,plsdb}
                        External DB to import (default = 'refseq')
  --force, -f           Force download and new setup of database
```

Examples

Create refseq database:

bash tadrep -v -o <output-path> database --type refseq

Create PLSDB database:

bash tadrep -v -o <output-path> database --type plsdb

Overwrite existing refseq files with newly downloaded data:

bash tadrep -v -o <output-path> database --type refseq -f

Extract

The extract module extracts reference plasmid sequences from complete genomes, (semi-)draft genomes or plasmid files.

```bash
usage: TaDReP extract [-h] [--type {genome,plasmid,draft}] [--header HEADER]
                      [--files FILES [FILES ...]]
                      [--discard-longest DISCARD_LONGEST]
                      [--max-length MAX_LENGTH]

options:
  -h, --help            show this help message and exit

Input:
  --type {genome,plasmid,draft}, -t {genome,plasmid,draft}
                        Type of input files
  --header HEADER       Template for header description inside input files: e.g.: header: ">pl1234" --> --header "pl"
  --files FILES [FILES ...], -f FILES [FILES ...]
                        File path
  --discard-longest DISCARD_LONGEST, -d DISCARD_LONGEST
                        Discard n longest sequences in output
  --max-length MAX_LENGTH, -m MAX_LENGTH
                        Max sequence length (default = 1000000 bp)
```

For different input types specified via --type:

genome: extracts all sequences except the longest one. The number of discarded sequences can be adjusted via --discard-longest.
draft: extracts only sequences with specific headers. Headers can be specified via --header.
plasmid: extracts all sequences from a given file without any filtering.
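
The three selection rules above can be sketched in a few lines of Python. This is not TaDReP's implementation. The function `select_sequences`, the in-memory `(header, sequence)` tuples, the case-sensitive substring match, and the choice to apply the `--max-length` limit to the genome and draft types are assumptions for illustration. The default header keywords follow the third extract example in this README.

```python
DEFAULT_KEYWORDS = ("plasmid", "complete", "circular=true")


def select_sequences(records, mode, discard_longest=1, header=None, max_length=1_000_000):
    """Select reference plasmid candidates from (header, sequence) tuples."""
    if mode == "plasmid":
        # No filtering at all for plasmid files.
        return list(records)
    if mode == "genome":
        # Drop the n longest sequences (presumably the chromosomes).
        order = sorted(range(len(records)), key=lambda i: len(records[i][1]), reverse=True)
        dropped = set(order[:discard_longest])
        kept = [rec for i, rec in enumerate(records) if i not in dropped]
    elif mode == "draft":
        # Keep sequences whose header contains the template or a default keyword.
        keywords = (header,) if header else DEFAULT_KEYWORDS
        kept = [rec for rec in records if any(k in rec[0] for k in keywords)]
    else:
        raise ValueError(f"unknown type: {mode}")
    # Assumption: the length limit applies to the genome and draft types.
    return [rec for rec in kept if len(rec[1]) <= max_length]


# Made-up toy records; real sequences would be read from a FASTA file.
records = [
    ("chromosome", "A" * 50),
    ("pl1 plasmid", "A" * 20),
    ("pl2", "A" * 10),
    ("contig circular=true", "A" * 5),
]
genome_hits = [h for h, _ in select_sequences(records, "genome", discard_longest=2)]
draft_hits = [h for h, _ in select_sequences(records, "draft", header="pl")]
default_hits = [h for h, _ in select_sequences(records, "draft", max_length=10)]
print(genome_hits, draft_hits, default_hits)
```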

If you extracted references, you can skip the database step and start with the characterization.

Examples

Extract all sequences from file plasmids.fna ignoring the two longest:

bash tadrep -v extract --type genome --discard-longest 2 --files plasmids.fna

Extract all sequences from file plasmids.fna where the header contains pl:

bash tadrep -v extract --type draft --header "pl" --files plasmids.fna

Extract all potential plasmid sequences (one of 'plasmid', 'complete', 'circular=true' in header) from file plasmids.fna ignoring sequences longer than 500000 bp:

bash tadrep -v extract --type draft --max-length 500000 --files plasmids.fna

Extract all sequences from file plasmids.fna:

bash tadrep -v extract --type plasmid --files plasmids.fna

Characterize

The characterize module characterizes all reference plasmids by the following features:

Length
GC content
Incompatibility types
Number of coding sequences

If you downloaded a reference database this is the step to start with.

```bash
usage: TaDReP characterize [-h] [--db DATABASE] [--inc-types INC_TYPES]

optional arguments:
  -h, --help            show this help message and exit

Input:
  --db DATABASE         Import json file from a given database path into working directory
  --inc-types INC_TYPES
                        Import inc-types from given path into working directory
```

Examples

Characterize plasmids in working directory <output-path> and import inc-types from inc-types folder:

bash tadrep -v -o <output-path> characterize --inc-types inc-types/inc-types.fasta

If inc-types is already present inside the working directory, the parameter --inc-types can be omitted:

bash tadrep -v -o <output-path> characterize

If you downloaded a database you can import it into the working directory <output-path> with the --db parameter:

bash tadrep -v -o <output-path> characterize --db databases/plsdb/plsdb.json --inc-types inc-types/inc-types.fasta

Cluster

The cluster module groups plasmids with similar sequences and features.

```bash
usage: TaDReP cluster [-h] [--min-sequence-identity [1-100]]
                      [--max-sequence-length-difference [1-1000000]] [--skip]

options:
  -h, --help            show this help message and exit

Parameter:
  --min-sequence-identity [1-100]
                        Minimal plasmid sequence identity (default = 90%)
  --max-sequence-length-difference [1-1000000]
                        Maximal plasmid sequence length difference in basepairs (default = 1000)
  --skip, -s            Skips clustering, one group for each plasmid
```

Example

bash tadrep -v cluster
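
To show how the two clustering parameters interact, here is a toy greedy grouping in Python. It is only a sketch under stated assumptions and does not reproduce TaDReP's clustering. In particular, `toy_identity` is a position-wise comparison that stands in for a real alignment, and the greedy "join the first matching group" strategy is an assumption. The defaults mirror the documented ones (90 % identity, 1000 bp length difference).

```python
def toy_identity(a, b):
    """Toy stand-in for an alignment: percent of matching positions
    relative to the longer sequence. Not a real sequence identity."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def cluster(plasmids, min_identity=90.0, max_length_diff=1000, skip=False, identity=toy_identity):
    """Greedily group plasmids; the first member of a group is its representative.

    plasmids: dict mapping a plasmid id to its sequence.
    """
    groups = []
    for pid, seq in plasmids.items():
        if skip:
            # Mirrors --skip: one group per plasmid.
            groups.append([pid])
            continue
        for members in groups:
            rep = plasmids[members[0]]
            if abs(len(seq) - len(rep)) <= max_length_diff and identity(seq, rep) >= min_identity:
                members.append(pid)
                break
        else:
            groups.append([pid])
    return groups


# Made-up toy plasmids: p2 differs from p1 in one of 20 positions (95 %).
plasmids = {"p1": "ACGT" * 5, "p2": "ACGT" * 4 + "ACGA", "p3": "TTTT" * 5}
print(cluster(plasmids))
print(cluster(plasmids, skip=True))
```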

Detect

The detect module aligns contigs of bacterial draft genomes to reference plasmids using BLAST+. Each match is evaluated by coverage and sequence identity of the aligned plasmid section; the thresholds can be individually adjusted via --min-plasmid-identity and --min-plasmid-coverage. If several contigs match a plasmid and their combined coverage and identity exceed the configured thresholds, the combination of aligned contigs is saved.
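
The two-stage filtering described here can be sketched as follows. This is an illustration, not TaDReP's code: the hit dictionaries, the 0-based half-open plasmid coordinates, and the length-weighted mean used as combined identity are assumptions. The default thresholds are the ones documented for the detect module (contig coverage 90, contig identity 90, plasmid coverage 80, plasmid identity 90).

```python
def covered_length(intervals):
    """Number of plasmid positions covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    total, last_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, last_end)
        if end > start:
            total += end - start
            last_end = end
    return total


def evaluate_plasmid(hits, plasmid_length, min_contig_coverage=90, min_contig_identity=90,
                     min_plasmid_coverage=80, min_plasmid_identity=90):
    """Return (detected, plasmid_coverage, plasmid_identity) for one reference plasmid."""
    # Stage 1: keep only contig hits passing the per-contig thresholds.
    kept = [h for h in hits
            if h["contig_coverage"] >= min_contig_coverage and h["identity"] >= min_contig_identity]
    if not kept:
        return False, 0.0, 0.0
    # Stage 2: combine the remaining hits per plasmid.
    coverage = 100 * covered_length([(h["start"], h["end"]) for h in kept]) / plasmid_length
    aligned = sum(h["end"] - h["start"] for h in kept)
    # Assumption: combined identity is the alignment-length-weighted mean.
    identity = sum(h["identity"] * (h["end"] - h["start"]) for h in kept) / aligned
    detected = coverage >= min_plasmid_coverage and identity >= min_plasmid_identity
    return detected, coverage, identity


# Made-up hits against a 1000 bp reference plasmid.
hits = [
    {"start": 0, "end": 400, "contig_coverage": 100, "identity": 98},
    {"start": 300, "end": 700, "contig_coverage": 95, "identity": 96},
    {"start": 800, "end": 1000, "contig_coverage": 60, "identity": 99},  # fails stage 1
]
print(evaluate_plasmid(hits, 1000))
print(evaluate_plasmid(hits, 1000, min_plasmid_coverage=70))
```

In this toy case the third hit is discarded for low contig coverage, so only 700 of 1000 bp remain covered and the plasmid is reported only when the plasmid coverage threshold is lowered to 70.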

Each detected plasmid is reconstructed as a pseudo sequence, in which matching contigs are linked by a run of Ns. Information on detected and reconstructed plasmids, and on the draft genomes in which they were found, is provided in a summary and a presence/absence table.
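
A pseudo sequence of this kind can be sketched in Python as shown below. The sketch assumes that contigs are ordered by their start position on the reference and that contigs matching the reverse strand are reverse-complemented before joining; both steps are an interpretation of "ordered and rearranged", not a statement about TaDReP's code. The function names and the hit dictionaries are hypothetical. The default gap length of 10 matches the documented `--gap-sequence-length` default.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Reverse-complement a nucleotide sequence (A, C, G, T, N only)."""
    return sequence.translate(COMPLEMENT)[::-1]


def build_pseudomolecule(hits, gap_length=10):
    """Join matched contigs in reference order, separated by runs of N."""
    ordered = sorted(hits, key=lambda h: h["start"])
    parts = [
        reverse_complement(h["sequence"]) if h["strand"] == "-" else h["sequence"]
        for h in ordered
    ]
    return ("N" * gap_length).join(parts)


# Made-up toy hits: the second contig aligns first and on the forward strand.
hits = [
    {"start": 50, "strand": "-", "sequence": "AAGG"},
    {"start": 0, "strand": "+", "sequence": "ACGT"},
]
pseudo = build_pseudomolecule(hits, gap_length=3)
print(pseudo)
```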

```bash
usage: TaDReP detect [-h] [--genome GENOME [GENOME ...]]
                     [--min-contig-coverage [1-100]]
                     [--min-contig-identity [1-100]]
                     [--min-plasmid-coverage [1-100]]
                     [--min-plasmid-identity [1-100]]
                     [--gap-sequence-length GAP_SEQUENCE_LENGTH]

optional arguments:
  -h, --help            show this help message and exit

Input / Output:
  --genome GENOME [GENOME ...], -g GENOME [GENOME ...]
                        Draft genome path

Annotation:
  --min-contig-coverage [1-100]
                        Minimal contig coverage (default = 90%)
  --min-contig-identity [1-100]
                        Maximal contig identity (default = 90%)
  --min-plasmid-coverage [1-100]
                        Minimal plasmid coverage (default = 80%)
  --min-plasmid-identity [1-100]
                        Minimal plasmid identity (default = 90%)
  --gap-sequence-length GAP_SEQUENCE_LENGTH
                        Gap sequence N length (default = 10)
```

Examples

Detect reference plasmids from directory <output-path> in file draft.fna with default settings:

bash tadrep -v -o <output-path> detect --genome draft.fna

Detect reference plasmids from directory <output-path> in file draft.fna;

75% of contig length has to be covered by a match;

Combined contig matches have to cover at least 95% of reference plasmid length:

bash tadrep -v -o <output-path> detect --genome draft.fna --min-contig-coverage 75 --min-plasmid-coverage 95

Detect reference plasmids from directory <output-path> in file draft.fna;

Contig sequence of a match has to be at least 80% identical to reference plasmid;

Combined contig matches have to sum up to at least 95% identity of reference plasmid sequence:

bash tadrep -v -o <output-path> detect --genome draft.fna --min-contig-identity 80 --min-plasmid-identity 95

Note: --min-contig-coverage / --min-plasmid-identity and --min-contig-identity / --min-plasmid-coverage can be combined as well.

Visualize

The visualize module visualizes matching contigs from draft genomes for each detected plasmid.

By default, contigs are represented by boxes, either above or below the plasmid center line. The position of a box indicates whether the match lies on the forward or the reverse strand, respectively. A colour gradient indicates the identity between contig and plasmid section: a brighter colour implies lower sequence identity. The start of this gradient, where it is brightest, can be set with the --interval-start parameter.
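
One way to picture the gradient is as a set of equally wide identity bins between the interval start and 100 %. The Python sketch below maps an identity value to such a bin, with 0 as the brightest bin. It is a guess at the general idea only: the equal-width binning, the function name `gradient_interval`, and the default values used here are assumptions and are not taken from TaDReP.

```python
def gradient_interval(identity, interval_start=80.0, number_of_intervals=10):
    """Map a sequence identity (percent) to a gradient bin.

    Bin 0 is the brightest colour, the last bin the darkest (100 % identity).
    Identities below interval_start are clamped to the brightest bin.
    """
    if not 0 <= interval_start < 100:
        raise ValueError("interval_start must be in [0, 100)")
    if identity <= interval_start:
        return 0
    fraction = (identity - interval_start) / (100 - interval_start)
    return min(int(fraction * number_of_intervals), number_of_intervals - 1)


# Made-up identities, binned into 4 intervals starting at 80 %.
for value in (79, 85, 90, 100):
    print(value, gradient_interval(value, interval_start=80.0, number_of_intervals=4))
```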

```bash
usage: TaDReP visualize [-h] [--plotstyle {bigarrow,arrow,bigbox,box,bigrbox,rbox}]
                        [--labelcolor LABELCOLOR] [--linewidth LINEWIDTH]
                        [--arrow-shaft-ratio ARROW_SHAFT_RATIO]
                        [--size-ratio SIZE_RATIO] [--labelsize LABELSIZE]
                        [--labelrotation LABELROTATION]
                        [--labelhpos {left,center,right}]
                        [--labelha {left,center,right}]
                        [--interval-start [0-100]]
                        [--number-of-intervals [1-100]]
                        [--omit_ratio [0-100]]

optional arguments:
  -h, --help            show this help message and exit

Style:
  --plotstyle {bigarrow,arrow,bigbox,box,bigrbox,rbox}
                        Contig representation in plot
  --labelcolor LABELCOLOR
                        Contig label color
  --linewidth LINEWIDTH
                        Contig edge linewidth
  --arrow-shaft-ratio ARROW_SHAFT_RATIO
                        Size ratio between arrow head and shaft
  --size-ratio SIZE_RATIO
                        Contig size ratio to track

Label:
  --labelsize LABELSIZE
                        Contig label size
  --labelrotation LABELROTATION
                        Contig label rotation
  --labelhpos {left,center,right}
                        Contig label horizontal position
  --labelha {left,center,right}
                        Contig label horizontal alignment

Gradient:
  --interval-start [0-100]
                        Percentage where gradient should stop
  --number-of-intervals [1-100]
                        Number of gradient intervals

Omit:
  --omit_ratio [0-100]  Omit contigs shorter than X percent of plasmid length from plot
```

Examples

Visualize results from detection in directory <output-path> with default settings:

bash tadrep -v -o <output-path> visualize

Visualize results from detection in directory <output-path>;

Brightest colour of gradient starts at 95.5% sequence identity (darkest colour is always 100% identity);

Surround contig blocks with 1px line:

bash tadrep -v -o <output-path> visualize --interval-start 95.5 --linewidth 1

Issues & Feature Requests

TaDReP is brand new and, as with any software, some bugs are to be expected. If you run into any issues with TaDReP, we'd be happy to hear about them. Please execute it in verbose mode (-v) and do not hesitate to file an issue including as much information as possible:

a detailed description of the issue
command line output
log file (tadrep.log)
a reproducible example of the issue with an input file that you can share if possible

Owner

Name: Oliver Schwengers
Login: oschwengers
Kind: user
Location: Giessen, Germany
Company: @ag-computational-bio - JLU Giessen

Twitter: oschwengers1
Repositories: 6
Profile: https://github.com/oschwengers

Microbial bioinformatics, WGS bacteria, plasmids, PostDoc, father of 2, husband, astrophotographer

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: over 1 year ago

All Time

Total Commits: 109
Total Committers: 3
Avg Commits per committer: 36.333
Development Distribution Score (DDS): 0.128

Past Year

Commits: 2
Committers: 1
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Oliver Schwengers	o**s@c**e	95
aguthman	a**n@h**e	12
Adrian Guthmann	a**n@b**e	2

Committer Domains (Top 20 + Academic)

bioinfsys.uni-giessen.de: 1
hrz.uni-giessen.de: 1
computational.bio.uni-giessen.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 9
Total pull requests: 8
Average time to close issues: 3 months
Average time to close pull requests: 3 days
Total issue authors: 6
Total pull request authors: 2
Average comments per issue: 0.56
Average comments per pull request: 0.0
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

AGuthmann (3)
oschwengers (2)
pavlo888 (1)
elozanoe (1)
lfenske-93 (1)

Pull Request Authors

AGuthmann (7)
oschwengers (1)

Top Labels

Issue Labels

enhancement (5)
bug (2)
help wanted (1)

Pull Request Labels

enhancement (1)

Packages

Total packages: 1
Total downloads:
- pypi 16 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 4
Total maintainers: 1

pypi.org: tadrep

TaDRep: Targeted Detection and Reconstruction of Plasmids

Homepage: https://github.com/oschwengers/tadrep
Documentation: https://github.com/oschwengers/tadrep/blob/main/README.md
License: GPLv3
Latest release: 0.9.2
published almost 2 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 16 Last month

Rankings

Dependent packages count: 10.0%

Stargazers count: 13.3%

Average: 20.1%

Dependent repos count: 21.7%

Forks count: 22.6%

Downloads: 32.9%

Maintainers (1)

oschwengers

Last synced: 6 months ago

Dependencies

setup.py pypi

biopython *
xopen *

.github/workflows/cd-pypi.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite
pypa/gh-action-pypi-publish master composite

.github/workflows/ci-lint.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/ci-package.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/ci-test.yml actions

actions/checkout v2 composite
mamba-org/setup-micromamba v1 composite

tadrep/setup.py pypi

environment.yml conda

biopython >=1.78
blast >=2.12.0
cd-hit >=4.8.1
matplotlib >=3.7
pygenomeviz >=0.4
pyrodigal >=2.1.0
xopen >=1.5.0