Mashtree

Mashtree: a rapid comparison of whole genome sequence files - Published in JOSS (2019)

https://github.com/lskatz/mashtree

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bioperl mash mash-distances tree

Scientific Fields

Biology Life Sciences - 60% confidence
Last synced: 4 months ago

Repository

Create a tree using Mash distances

Basic Info
  • Host: GitHub
  • Owner: lskatz
  • License: gpl-3.0
  • Language: Perl
  • Default Branch: master
  • Homepage:
  • Size: 50.2 MB
Statistics
  • Stars: 170
  • Watchers: 12
  • Forks: 26
  • Open Issues: 18
  • Releases: 46
Created over 9 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog Contributing License

README.md

mashtree

Create a tree using Mash distances.

For simple usage, see mashtree --help. This is an example command:

mashtree *.fastq.gz > tree.dnd

For confidence values, run either mashtree_bootstrap.pl or mashtree_jackknife.pl. Pass --help to either script for usage.

Two modes: fast or accurate

Input files: fastq files are interpreted as raw read files. Fasta, GenBank, and EMBL files are interpreted as genome assemblies. Any of these file types may also be compressed with gz, bz2, or zip.

Output files: a Newick tree (.dnd). If --outmatrix is supplied, a tab-delimited distance matrix is written as well.

See the documentation on the algorithms for more information.

Faster

mashtree --numcpus 12 *.fastq.gz [*.fasta] > mashtree.dnd

More accurate

You can get a more accurate tree with the minimum abundance finder. Simply give --mindepth 0. This step helps ignore very low-abundance k-mers, which are more likely to be read errors.

mashtree --mindepth 0 --numcpus 12 *.fastq.gz [*.fasta] > mashtree.dnd
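
To make the idea of a minimum-abundance filter concrete, here is a minimal Python sketch. It assumes the filter amounts to a plain count threshold on k-mers. The names `kmers` and `filter_kmers` are hypothetical, and this is not the code Mashtree or Mash runs: Mash works on hashed k-mers, and with --mindepth 0 Mashtree chooses the threshold for you.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement handling in this sketch)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def filter_kmers(reads, k=21, mindepth=5):
    """Keep only the k-mers seen at least `mindepth` times across all reads.

    k-mers that occur only once or twice in a read set are more likely to
    come from sequencing errors, so a count threshold discards them.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n >= mindepth}


if __name__ == "__main__":
    # Three identical reads plus one read carrying a single-base error.
    reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
    print(sorted(filter_kmers(reads, k=4, mindepth=2)))
```

The k-mers introduced by the error in the last read occur only once, so they fall below the threshold and are dropped.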

Adding confidence values

Mashtree can add confidence values using jackknifing. For each jackknife tree, 50% of hashes are used. Confidence values are calculated from the jackknife trees using BioPerl. When using this method, you can pass flags to mashtree after a double dash, as in the example below.

Added in version 0.40.

mashtree_jackknife.pl --reps 100 --numcpus 12 *.fastq.gz -- --min-depth 0 > mashtree.jackknife.dnd
mashtree_jackknife.pl --help # additional usage help
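
The resampling idea can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not Mashtree's implementation: sketches are modelled as plain sets of integer hashes, each replicate keeps a random 50% of the hashes pooled from both sketches, and the Jaccard index is converted to a distance with the formula from the Mash paper. The function names `mash_distance` and `jackknife_distances` are hypothetical.

```python
import math
import random


def mash_distance(jaccard, k=21):
    """Convert a Jaccard index to a Mash distance: -1/k * ln(2j / (1 + j))."""
    if jaccard <= 0:
        return 1.0  # no shared hashes: distance is capped at 1
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


def jackknife_distances(sketch_a, sketch_b, reps=100, keep=0.5, k=21, seed=42):
    """Return one distance per replicate, each computed from a random subset of hashes.

    Assumption for this sketch: the subset is drawn from the pooled hashes
    of both sketches, so both samples are reduced in the same way.
    """
    rng = random.Random(seed)
    universe = sorted(sketch_a | sketch_b)
    n_keep = max(1, int(len(universe) * keep))
    distances = []
    for _ in range(reps):
        kept = set(rng.sample(universe, n_keep))
        a, b = sketch_a & kept, sketch_b & kept
        union = a | b
        jaccard = len(a & b) / len(union) if union else 0.0
        distances.append(mash_distance(jaccard, k))
    return distances


if __name__ == "__main__":
    a = set(range(0, 1000))
    b = set(range(500, 1500))
    replicates = jackknife_distances(a, b, reps=5)
    print([round(d, 4) for d in replicates])
```

Each replicate distance would feed one replicate tree. The spread of the replicate values gives a rough sense of how stable a distance is when only half of the hashes are used.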

Bootstrapping was added in version 0.55. This runs mashtree itself multiple times, each with a random seed.

mashtree_bootstrap.pl --reps 100 --numcpus 12 *.fastq.gz -- --min-depth 0 > mashtree.bootstrap.dnd

Usage

Usage: mashtree [options] *.fastq *.fasta *.gbk *.msh > tree.dnd
NOTE: fastq files are read as raw reads;
      fasta, gbk, and embl files are read as assemblies;
      Input files can be gzipped.
--tempdir            ''   If specified, this directory will not be
                          removed at the end of the script and can
                          be used to cache results for future
                          analyses.
                          If not specified, a dir will be made for you
                          and then deleted at the end of this script.
--numcpus            1    This script uses Perl threads.
--outmatrix          ''   If specified, will write a distance matrix
                          in tab-delimited format
--file-of-files           If specified, mashtree will try to read
                          filenames from each input file. The file of
                          files format is one filename per line. This
                          file of files cannot be compressed.
--outtree                 If specified, the tree will be written to
                          this file and not to stdout. Log messages
                          will still go to stderr.
--version                 Display the version and exit

TREE OPTIONS
--truncLength        250  How many characters to keep in a filename
--sort-order         ABC  For neighbor-joining, the sort order can
                          make a difference. Options include:
                          ABC (alphabetical), random, input-order

MASH SKETCH OPTIONS
--genomesize         5000000
--mindepth           5    If mindepth is zero, then it will be
                          chosen in a smart but slower method,
                          to discard lower-abundance kmers.
--kmerlength         21
--sketch-size        10000
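
For the --file-of-files option described above, the input list is one filename per line and must not be compressed. A minimal Python sketch for producing such a list follows. The helper name `write_file_of_files` and the example filenames are hypothetical, and the mashtree command in the comment is only an assumed way of passing the list, based on the usage text.

```python
import tempfile
from pathlib import Path


def write_file_of_files(paths, fofn_path):
    """Write one filename per line to an uncompressed text file and return its path."""
    fofn_path = Path(fofn_path)
    fofn_path.write_text("".join(f"{p}\n" for p in paths))
    return fofn_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fofn = write_file_of_files(
            ["sample1.fastq.gz", "sample2.fasta"],  # hypothetical input files
            Path(tmp) / "inputs.txt",
        )
        print(fofn.read_text(), end="")
        # The list would then stand in for the individual files, e.g.:
        #   mashtree --file-of-files inputs.txt > tree.dnd
```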

Installation

Please see INSTALL.md

Further documentation

For perl library help, run perldoc on a .pm file, e.g., perldoc lib/Mashtree/Db.pm.

For executable help run --help, e.g., mashtree_bootstrap.pl --help.

For more information and help please see the docs folder

For more information on plugins, see the plugins folder. (in development)

For more information on contributions, please see CONTRIBUTING.md.

References

  • Mash: http://mash.readthedocs.io
  • BioPerl: http://bioperl.org

Citation

JOSS

Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H.C., Deng, X., and Carleton, H. A., (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762, https://doi.org/10.21105/joss.01762

Poster

Katz, L. S., Griswold, T., & Carleton, H. A. (2017, October 8-11). Generating WGS Trees with Mashtree. Poster presented at the American Society for Microbiology Conference on Rapid Applied Microbial Next-Generation Sequencing and Bioinformatic Pipelines, Washington, DC. Poster number 27.

Owner

  • Name: Lee Katz
  • Login: lskatz
  • Kind: user
  • Location: Atlanta, GA
  • Company: CDC (work) + personal projects

JOSS Publication

Mashtree: a rapid comparison of whole genome sequence files
Published
December 10, 2019
Volume 4, Issue 44, Page 1762
Authors
Lee S. Katz
Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA; Center for Food Safety, University of Georgia, Griffin, GA, USA
Taylor Griswold
Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
Shatavia S. Morrison
Respiratory Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
Jason A. Caravas
Respiratory Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
Shaokang Zhang
Center for Food Safety, University of Georgia, Griffin, GA, USA
Henk C. den Bakker
Center for Food Safety, University of Georgia, Griffin, GA, USA
Xiangyu Deng
Center for Food Safety, University of Georgia, Griffin, GA, USA
Heather A. Carleton
Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
Editor
Charlotte Soneson
Tags
dendrogram mash sketch tree rapid

GitHub Events

Total
  • Issues event: 6
  • Watch event: 18
  • Issue comment event: 5
  • Fork event: 2
Last Year
  • Issues event: 6
  • Watch event: 18
  • Issue comment event: 5
  • Fork event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 407
  • Total Committers: 4
  • Avg Commits per committer: 101.75
  • Development Distribution Score (DDS): 0.015
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Lee Katz - Aspen g****2@c****v 401
Mohammad S Anwar m****r@y****m 4
Franklin Bristow f****w@g****m 1
Charlotte Soneson c****n@g****m 1
Committer Domains (Top 20 + Academic)
cdc.gov: 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 85
  • Total pull requests: 9
  • Average time to close issues: 4 months
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 49
  • Total pull request authors: 4
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.89
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 0
  • Average time to close issues: 4 months
  • Average time to close pull requests: N/A
  • Issue authors: 6
  • Pull request authors: 0
  • Average comments per issue: 1.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tseemann (19)
  • lskatz (6)
  • mihkelvaher (5)
  • mdiricks (3)
  • samlipworth (2)
  • schultzm (2)
  • karel-brinda (2)
  • noorshu (2)
  • Rob-murphys (2)
  • vaofford (2)
  • JChristopherEllis (1)
  • noaheb98 (1)
  • andrewsanchez (1)
  • hmontenegro (1)
  • caizhangbin (1)
Pull Request Authors
  • manwar (4)
  • lskatz (3)
  • fbristow (1)
  • csoneson (1)
Top Labels
Issue Labels
help wanted (5) wontfix (3) enhancement (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 38
  • Total maintainers: 1
metacpan.org: Mashtree

functions for Mashtree databasing

  • License: gpl_3
  • Latest release: v1.4.6
    published about 2 years ago
  • Versions: 38
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 0.2%
Forks count: 0.7%
Dependent repos count: 1.6%
Average: 8.7%
Dependent packages count: 32.2%
Maintainers (1)
Last synced: 4 months ago