TipToft

TipToft: detecting plasmids contained in uncorrected long read sequencing data - Published in JOSS (2019)

https://github.com/andrewjpage/tiptoft

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, joss.theoj.org
  • Committers with academic emails
    2 of 5 committers (40.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bioinformatics bioinformatics-pipeline genomics global-health infectious-diseases kmer long-reads nanopore oxford-nanopore pacbio pathogen plasmid plasmidfinder research uncorrected

Scientific Fields

Biology Life Sciences - 63% confidence

Repository

Predict plasmids from uncorrected long read data

Basic Info
  • Host: GitHub
  • Owner: andrewjpage
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Size: 37.8 MB
Statistics
  • Stars: 40
  • Watchers: 3
  • Forks: 10
  • Open Issues: 1
  • Releases: 4
Created over 7 years ago · Last pushed about 6 years ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation Authors

README.md

TipToft

Given some raw uncorrected long reads, such as those from PacBio or Oxford Nanopore, predict which plasmids should be present. Assemblies of long read data can often miss plasmids, particularly if they are very small or have a copy number which is too high or too low compared to the chromosome. This software gives you an indication of which plasmids to expect, flagging potential issues with an assembly.


Paper


AJ Page, T Seemann (2019). TipToft: detecting plasmids contained in uncorrected long read sequencing data. Journal of Open Source Software, 4(35), 1021, https://doi.org/10.21105/joss.01021

Please remember to cite the PlasmidFinder paper, as their database makes this software work:

Carattoli et al., In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing, Antimicrob Agents Chemother. 2014;58(7):3895–3903.

Installation

The only dependencies are Python 3 and a compiler (gcc, clang, ...), and this should work on Linux or OSX. Cython needs to be installed in advance. Assuming you have Python 3.4+ and pip installed, just run:

    pip3 install cython
    pip3 install tiptoft

or, if you wish to install the latest development version (GitHub no longer serves the unauthenticated git:// protocol, so the HTTPS form is used here):

    pip3 install git+https://github.com/andrewjpage/tiptoft.git

Debian/Ubuntu (Trusty/Xenial)

To install Python 3 on Ubuntu run:

    sudo apt-get update -qq
    sudo apt-get install -y git python3 python3-setuptools python3-biopython python3-pip
    pip3 install cython
    pip3 install tiptoft

Docker

Install Docker. There is a docker container which gets automatically built from the latest version of TipToft. To install it:

docker pull andrewjpage/tiptoft

To use it, run a command such as this (substituting in your filename/directories), using the example file in this repository:

    docker run --rm -it -v /path/to/example_data:/example_data andrewjpage/tiptoft tiptoft /example_data/ERS654932_plasmids.fastq.gz

Homebrew

Install Brew for OSX or LinuxBrew for Linux, then run:

    brew install python # this is python v3
    pip3 install cython
    pip3 install tiptoft

Bioconda

Install Bioconda, then run:

conda install tiptoft

Windows

It has been reported that the software works when using Ubuntu on Windows 10. This is not a supported platform as the authors don't use Windows, so use at your own risk.

Usage

tiptoft_database_downloader script

First of all you need a plasmid database from PlasmidFinder. There is a snapshot bundled with this repository for your convenience, or alternatively you can use the downloader script to get the latest data. You will need internet access for this step. Please remember to cite the PlasmidFinder paper.

```
usage: tiptoft_database_downloader [options] output_prefix

Download PlasmidFinder database

positional arguments:
  output_prefix  Output prefix

optional arguments:
  -h, --help     show this help message and exit
  --verbose, -v  Turn on debugging (default: False)
  --version      show program's version number and exit
```

Just run:

    tiptoft_database_downloader

You will now have a file called 'plasmid_files.fa' which can be used with the main script.

tiptoft script

This is the main script of the application. The only mandatory input is a FASTQ file of long reads, which can optionally be gzipped.

```
usage: tiptoft [options] input.fastq

plasmid incompatibility group prediction from uncorrected long reads

positional arguments:
  input_fastq           Input FASTQ file (optionally gzipped)

optional arguments:
  -h, --help            show this help message and exit

Optional input arguments:
  --plasmid_data PLASMID_DATA, -d PLASMID_DATA
                        FASTA file containing plasmid data from downloader
                        script, defaults to bundled database (default: None)
  --kmer KMER, -k KMER  k-mer size (default: 13)

Optional output arguments:
  --filtered_reads_file FILTERED_READS_FILE, -f FILTERED_READS_FILE
                        Filename to save matching reads to (default: None)
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output file STDOUT
  --print_interval PRINT_INTERVAL, -p PRINT_INTERVAL
                        Print results every this number of reads (default:
                        None)
  --verbose, -v         Turn on debugging [False]
  --version             show program's version number and exit

Optional advanced input arguments:
  --max_gap MAX_GAP     Maximum gap for blocks to be contigous, measured in
                        multiples of the k-mer size (default: 3)
  --margin MARGIN       Flanking region around a block to use for mapping
                        (default: 10)
  --min_block_size MIN_BLOCK_SIZE
                        Minimum block size in bases (default: 130)
  --min_fasta_hits MIN_FASTA_HITS, -m MIN_FASTA_HITS
                        Minimum No. of kmers matching a read (default: 10)
  --min_perc_coverage MIN_PERC_COVERAGE, -c MIN_PERC_COVERAGE
                        Minimum percentage coverage of typing sequence to
                        report (default: 85)
  --min_kmers_for_onex_pass MIN_KMERS_FOR_ONEX_PASS
                        Minimum No. of kmers matching a read in 1st pass
                        (default: 10)
```

Required argument

input_fastq: This is a single FASTQ file. It can be optionally gzipped. Alternatively input can be read from stdin by using the dash character (-) as the input file name. The file must contain long reads, such as those from PacBio or Oxford Nanopore. The quality scores are ignored.
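To make the three accepted input forms concrete (plain FASTQ, gzipped FASTQ, or "-" for stdin), here is a minimal Python sketch. It is an illustration only and not TipToft's own reader: the helper names `open_fastq` and `read_sequences` are hypothetical, and it assumes strict four-line FASTQ records.

```python
import gzip
import sys


def open_fastq(path):
    """Return a text handle for a FASTQ path, a gzipped FASTQ path, or '-' (stdin)."""
    if path == "-":
        return sys.stdin
    # Detect gzip by its two-byte magic number rather than by file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_sequences(handle):
    """Yield (read_name, sequence) pairs, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality scores are read and discarded
        yield header[1:].split()[0], sequence
```

The quality line is skipped on purpose, mirroring the statement above that quality scores are ignored.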

Optional input arguments

plasmid_data: This is a FASTA file containing all of the plasmid typing sequences. This is generated by the tiptoft_database_downloader script. It comes from the PlasmidFinder website, so please be sure to cite their paper (citation gets printed every time you run the script).

kmer: The most important parameter. 13 works well for Nanopore and 15 works well for PacBio, but you may need to experiment with it for your data. Long reads have a high error rate, so if you set this too high, nothing will match (because most k-mers will contain errors). If you set it too low, everything will match, which isn't much use to you. Think about your data: on average, how long a stretch of bases can you get in a read without errors? This is what you should set your k-mer size to. For example, if you have an average of 1 error every 10 bases, then the ideal k-mer size would be 9.
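The rule of thumb above can be written down as a small calculation. The sketch below is not part of TipToft; `suggest_kmer` and `error_free_probability` are hypothetical helpers, and the second one assumes errors occur independently at each base, which real long reads only approximate.

```python
def suggest_kmer(error_rate):
    """Longest k that fits, on average, between two errors (the rule of thumb above)."""
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be between 0 and 1 (exclusive)")
    # One error every 1/error_rate bases leaves 1/error_rate - 1 clean bases.
    return max(1, round(1 / error_rate) - 1)


def error_free_probability(k, error_rate):
    """Chance that a single k-mer has no errors, assuming independent per-base errors."""
    return (1 - error_rate) ** k
```

For example, `suggest_kmer(0.1)` gives 9, matching the worked example in the text. Comparing `error_free_probability` for a few candidate values of k shows how quickly the share of usable k-mers falls as k grows.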

Optional output arguments

filtered_reads_file: Save the reads which contain the rep/inc sequences to a new FASTQ file. This is useful if you want to undertake a further assembly just on the plasmids. This file should not already exist.

output_file: By default the results are printed to STDOUT. If you provide an output filename (which must not exist already), it will print the results to the file.

print_interval: By default the whole file is processed and the final results are printed out. However you can get intermediate results printed after every X number of reads, which is useful if you are doing real time streaming of data into the application and can halt when you have enough information. They are separated by "****".
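When consuming intermediate results from a stream, the output has to be cut at the "****" markers. The following Python sketch shows one way to do that. It assumes the separator appears on a line of its own, which the text does not state explicitly, and `split_snapshots` is a hypothetical helper rather than anything shipped with TipToft.

```python
def split_snapshots(lines, separator="****"):
    """Group output lines into snapshots, one per printed interval."""
    snapshots, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == separator:
            # A separator closes the snapshot collected so far, if any.
            if current:
                snapshots.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        snapshots.append(current)
    return snapshots
```

The last element of the returned list is the most recent set of results, which is usually the one of interest when deciding whether to halt a run early.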

verbose: Enable debugging mode where lots of extra output is printed to STDOUT.

version: Print the version number and exit.

Optional advanced input arguments

max_gap: Maximum gap for blocks to be contiguous, measured in multiples of the k-mer size. This allows for short regions of elevated errors in the reads to be spanned.

margin: Expand the analysis to look at a few bases on either side of where the sequence is predicted to be on the read. This allows for k-mers to overlap the ends.

min_block_size: This is the minimum size of a sub-read region to consider for in-depth analysis after matching k-mers have been identified in the read. This speeds up the analysis quite a bit, but there is the risk that some reads may be missed, particularly if they have partial rep/inc sequences.
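The three block parameters (maximum gap, margin and minimum block size) interact, so a small sketch may help. This is a guess at the general idea, not TipToft's actual algorithm: the function `find_blocks` is hypothetical, and the order of the steps (merge hits, filter by size, then add the margin) is an assumption.

```python
def find_blocks(hit_positions, k=13, max_gap=3, margin=10,
                min_block_size=130, read_length=None):
    """Merge k-mer hit start positions on a read into (start, end) blocks."""
    blocks = []
    for pos in sorted(hit_positions):
        # Join hits whose gap to the current block is at most max_gap k-mer lengths.
        if blocks and pos - blocks[-1][1] <= max_gap * k:
            blocks[-1][1] = pos + k
        else:
            blocks.append([pos, pos + k])

    result = []
    for start, end in blocks:
        if end - start < min_block_size:
            continue  # too short to be worth in-depth analysis
        # Widen the block by the margin, staying inside the read.
        start = max(0, start - margin)
        end = end + margin if read_length is None else min(read_length, end + margin)
        result.append((start, end))
    return result
```

With the defaults, a run of hits spaced one k-mer apart survives as a single block, while an isolated hit far away is dropped by the size filter.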

min_fasta_hits: This is the minimum number of matching k-mers in a read for the read to be considered for analysis. It is a hard minimum threshold to speed up analysis.
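A hard threshold of this kind can be pictured as a cheap set intersection performed before any expensive work. The sketch below is an assumption about the general approach, not TipToft's code: `kmers` and `passes_hit_threshold` are hypothetical names, only distinct k-mers are counted, and reverse complements are ignored for brevity.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def passes_hit_threshold(read, database_kmers, k=13, min_fasta_hits=10):
    """True if the read shares at least min_fasta_hits distinct k-mers with the database."""
    return len(kmers(read, k) & database_kmers) >= min_fasta_hits
```

Reads that fail the check can be discarded immediately, which is where the speed-up comes from.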

min_perc_coverage: Only report rep/inc sequences above this percentage coverage. Coverage in this instance is k-mer coverage of the underlying sequence (rather than depth of coverage).

min_kmers_for_onex_pass: The number of k-mers that must be present in the read during the initial (first) pass over the database for the read to be considered for further analysis. This speeds up the analysis quite a bit, but there is the risk that some reads may be missed, particularly if they have partial rep/inc sequences.

Output

The output is tab delimited and printed to STDOUT by default. You can optionally print it to a file using the '-o' parameter. If you would like to see intermediate results, you can tell it to print every X reads with the '-p' parameter, separated by '****'. An example of the output is:

    GENE      COMPLETENESS  %COVERAGE  ACCESSION    DATABASE       PRODUCT
    rep7.1    Full          100        AB037671     plasmidfinder  rep7.1_repC(Cassette)_AB037671
    rep7.5    Partial       99         AF378372     plasmidfinder  rep7.5_CDS1(pKC5b)_AF378372
    rep7.6    Partial       94         SAU38656     plasmidfinder  rep7.6_ORF(pKH1)_SAU38656
    rep7.9    Full          100        NC007791     plasmidfinder  rep7.9_CDS3(pUSA02)_NC007791
    rep7.10   Partial       91         NC_010284.1  plasmidfinder  rep7.10_repC(pKH17)_NC_010284.1
    rep7.12   Partial       93         GQ900417.1   plasmidfinder  rep7.12_rep(SAP060B)_GQ900417.1
    rep7.17   Full          100        AM990993.1   plasmidfinder  rep7.17_repC(pS0385-1)_AM990993.1
    rep20.11  Full          100        AP003367     plasmidfinder  rep20.11_repA(VRSAp)_AP003367
    repUS14.  Full          100        AP003367     plasmidfinder  repUS14._repA(VRSAp)_AP003367

GENE: The first column is the first part of the product name.

COMPLETENESS: If all of the k-mers in the gene are found in the reads, the completeness is noted as 'Full', otherwise if there are some k-mers missing, it is noted as 'Partial'.

%COVERAGE: The percentage coverage is the percentage of underlying k-mers in the gene for which at least 1 matching k-mer has been found in the reads. 100 indicates that every k-mer in the gene is covered. Low coverage results are not shown (controlled by the --min_perc_coverage parameter).
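The COMPLETENESS and %COVERAGE definitions can be illustrated with a few lines of Python. This is a sketch of the definitions as worded here, not TipToft's implementation: `kmer_coverage` is a hypothetical helper, and how the real tool rounds the percentage for display is not specified in this document.

```python
def kmer_coverage(gene, read_kmers, k=13):
    """Return (completeness, percent) for a typing sequence given k-mers seen in reads."""
    total = len(gene) - k + 1
    if total <= 0:
        raise ValueError("gene is shorter than k")
    # Count gene k-mer positions whose k-mer was observed at least once in the reads.
    covered = sum(1 for i in range(total) if gene[i:i + k] in read_kmers)
    completeness = "Full" if covered == total else "Partial"
    return completeness, 100.0 * covered / total
```

A result would then be reported only if its percentage clears the minimum coverage threshold.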

ACCESSION: This is the accession number from where the typing sequence originates. You can look this up at NCBI or EBI.

DATABASE: This is where the data has come from, which is currently always plasmidfinder.

PRODUCT: This is the full product of the gene as found in the database.
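Because the output is tab delimited with the six columns described above, it is straightforward to load into a script. The following is a minimal sketch: `parse_tiptoft_output` is a hypothetical helper, and it assumes that header rows start with GENE and that interval separators start with "****".

```python
import csv

COLUMNS = ["GENE", "COMPLETENESS", "%COVERAGE", "ACCESSION", "DATABASE", "PRODUCT"]


def parse_tiptoft_output(lines):
    """Turn tab-delimited TipToft output lines into a list of dicts."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines, header rows and interval separators.
        if not fields or fields[0] == "GENE" or fields[0].startswith("****"):
            continue
        row = dict(zip(COLUMNS, fields))
        row["%COVERAGE"] = float(row["%COVERAGE"])
        rows.append(row)
    return rows
```

For instance, passing an open file written with the '-o' parameter returns one dictionary per reported gene, which can then be filtered on `COMPLETENESS` or `%COVERAGE`.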

Example usage

A real test file is bundled in the repository. Download it then run:

tiptoft ERS654932_plasmids.fastq.gz

The expected output is in the repository. This uses the bundled database; if you wish to use the latest up-to-date database, you should run the tiptoft_database_downloader script.

Resource usage

An 800 MB FASTQ file (unzipped) of long reads from an Oxford Nanopore MinION, containing Salmonella, required 80 MB of RAM and took under 1 minute to process.

License

TipToft is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page.

Contribute to the software

If you wish to fix a bug or add new features to the software, we welcome Pull Requests. We use GitHub Flow style development. Please fork the repo, make the change, then submit a Pull Request against our master branch, with details about what the change is and what it fixes/adds. We will then review your changes and merge them, or provide feedback on enhancements.

Owner

  • Name: andrewjpage
  • Login: andrewjpage
  • Kind: user
  • Location: Cambridge, UK
  • Company: Theiagen Genomics

Director of Technical Innovation

JOSS Publication

TipToft: detecting plasmids contained in uncorrected long read sequencing data
Published
March 01, 2019
Volume 4, Issue 35, Page 1021
Authors
Andrew J. Page ORCID
Quadram Institute Bioscience, Norwich Research Park, Norwich, UK.
Torsten Seemann ORCID
Melbourne Bioinformatics, The University of Melbourne, Parkville, Australia.
Editor
Kevin M. Moerman ORCID
Tags
bioinformatics plasmid typing long read sequencing bacteria

Citation (CITATION.cff)

# YAML 1.2
---
abstract: "In the work presented here, we designed and developed two easy-to-use Web tools for in silico detection and characterization of whole-genome sequence (WGS) and whole-plasmid sequence data from members of the family Enterobacteriaceae. These tools will facilitate bacterial typing based on draft genomes of multidrug-resistant Enterobacteriaceae species by the rapid detection of known plasmid types. Replicon sequences from 559 fully sequenced plasmids associated with the family Enterobacteriaceae in the NCBI nucleotide database were collected to build a consensus database for integration into a Web tool called PlasmidFinder that can be used for replicon sequence analysis of raw, contig group, or completely assembled and closed plasmid sequencing data. The PlasmidFinder database currently consists of 116 replicon sequences that match with at least at 80% nucleotide identity all replicon sequences identified in the 559 fully sequenced plasmids. For plasmid multilocus sequence typing (pMLST) analysis, a database that is updated weekly was generated from www.pubmlst.org and integrated into a Web tool called pMLST. Both databases were evaluated using draft genomes from a collection of Salmonella enterica serovar Typhimurium isolates. PlasmidFinder identified a total of 103 replicons and between zero and five different plasmid replicons within each of 49 S. Typhimurium draft genomes tested. The pMLST Web tool was able to subtype genomic sequencing data of plasmids, revealing both known plasmid sequence types (STs) and new alleles and ST variants. In conclusion, testing of the two Web tools using both fully assembled plasmid sequences and WGS-generated draft genomes showed them to be able to detect a broad variety of plasmids that are often associated with antimicrobial resistance in clinically relevant bacterial pathogens."
authors: 
  -
    affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy"
    family-names: Carattoli
    given-names: Alessandra
  -
    affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark"
    family-names: Zankari
    given-names: Ea
  -
    affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy"
    family-names: "García-Fernández"
    given-names: Aurora
  -
    affiliation: "Danish Technical University, Center for Biological Sequence Analysis, Department of Systems Biology, Lyngby, Denmark"
    family-names: "Voldby Larsen"
    given-names: Mette
  -
    affiliation: "Danish Technical University, Center for Biological Sequence Analysis, Department of Systems Biology, Lyngby, Denmark"
    family-names: Lund
    given-names: Ole
  -
    affiliation: "Department of Infectious, Parasitic and Immuno-Mediated Diseases, Istituto Superiore di Sanità, Rome, Italy"
    family-names: Villa
    given-names: Laura
  -
    affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark"
    family-names: Aarestrup
    given-names: "Frank Møller"
  -
    affiliation: "Danish Technical University, National Food Institute, Division for Epidemiology and Microbial Genomics, Lyngby, Denmark"
    family-names: Hasman
    given-names: Henrik
cff-version: "1.0.3"
doi: "10.1128/AAC.02412-14"
message: "Please remember to cite the plasmidFinder paper as their database makes this software work"
repository-code: "https://bitbucket.org/genomicepidemiology/plasmidfinder"
title: "In Silico Detection and Typing of Plasmids using PlasmidFinder and Plasmid Multilocus Sequence Typing"
...

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 80
  • Total Committers: 5
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.075
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Andrew Page a****e@q****k 74
Torsten Seemann t****n 2
Tiago Jesu t****2@g****m 2
Thanh Lê l****k@g****m 1
Andrew Page (QIB) p****a@n****k 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 18
  • Total pull requests: 10
  • Average time to close issues: 19 days
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 6
  • Total pull request authors: 4
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.4
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tseemann (11)
  • ctb (3)
  • Niicii (1)
  • azneto (1)
  • conte1 (1)
  • samlipworth (1)
Pull Request Authors
  • andrewjpage (6)
  • tseemann (2)
  • tiagofilipe12 (1)
  • thanhleviet (1)
Top Labels
Issue Labels
enhancement (2) bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 1
pypi.org: tiptoft

tiptoft: predict which plasmid should be present from uncorrected long read data

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 18 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 10.6%
Forks count: 10.9%
Average: 17.7%
Dependent repos count: 21.5%
Downloads: 35.6%
Maintainers (1)
Last synced: 4 months ago

Dependencies

setup.py pypi
  • biopython *
  • cython *
  • pyfastaq *