khmer release v2.1

khmer release v2.1: software for biological sequence analysis - Published in JOSS (2017)

https://github.com/dib-lab/khmer

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
    31 of 87 committers (35.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bioinformatics bloom-filter count-min-sketch dna graph-traversal k-mer python

Keywords from Contributors

fracminhash kmer minhash scaled-minhash sketching sourmash taxonomic-classification taxonomic-profiling
Last synced: 4 months ago

Repository

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more

Statistics
  • Stars: 775
  • Watchers: 67
  • Forks: 296
  • Open Issues: 353
  • Releases: 14
Created over 13 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation Authors

README.rst

|Research software impact|
|Supported Python versions|
|khmer build status|
|Test coverage|
|BSD-3 licensed|

khmer
=====

Welcome to khmer: k-mer counting, filtering, and graph traversal FTW!

The official source code repository is at https://github.com/dib-lab/khmer and project documentation is available online at http://khmer.readthedocs.io.
See http://khmer.readthedocs.io/en/stable/introduction.html for an overview of the khmer project.

Getting help
------------

See http://khmer.readthedocs.io/en/stable/user/getting-help.html for more details, but in brief:

-  first point of contact when looking for help:
   https://github.com/dib-lab/khmer/issues
-  mailing list for **discussion**:
   http://lists.idyll.org/listinfo/khmer
-  mailing list for **announcements**:
   http://lists.idyll.org/listinfo/khmer-announce
-  email contact for project maintainers:
   khmer-project@idyll.org

Important note: cite us!
------------------------

khmer is *research software*, so you should cite us when you use it in scientific publications!
Please see the ``CITATION`` file for citation information.

The khmer library is a project of the Lab for Data Intensive Biology at UC Davis, and includes contributions from its members, collaborators, and friends.

Quick install
-------------

::

    pip install khmer
    pytest --pyargs khmer -m 'not known_failing and not jenkins and not huge and not linux'

See https://khmer.readthedocs.io/en/stable/user/install.html for more detailed installation instructions.

Contributing
------------

We welcome contributions to khmer from the community!
If you're interested in modifying khmer or contributing to its ongoing development, see https://khmer.readthedocs.io/en/stable/dev/getting-started.html.

.. |Research software impact| image:: http://depsy.org/api/package/pypi/khmer/badge.svg
   :target: http://depsy.org/package/python/khmer
.. |Supported Python versions| image:: https://img.shields.io/pypi/pyversions/khmer.svg
.. |khmer build status| image:: https://img.shields.io/travis/dib-lab/khmer.svg
   :target: https://travis-ci.org/dib-lab/khmer
.. |Test coverage| image:: https://img.shields.io/codecov/c/github/dib-lab/khmer.svg
   :target: https://codecov.io/github/dib-lab/khmer
.. |BSD-3 licensed| image:: https://img.shields.io/badge/license-BSD%203--Clause-blue.svg
   :target: https://github.com/dib-lab/khmer/blob/master/LICENSE

Owner

  • Name: The Lab for Data Intensive Biology
  • Login: dib-lab
  • Kind: organization
  • Location: University of California, Davis, School of Veterinary Medicine, Davis, California, USA

Previously "Genomics, Evolution, and Development Lab" @ Michigan State University

JOSS Publication

khmer release v2.1: software for biological sequence analysis
Published
July 03, 2017
Volume 2, Issue 15, Page 272
Authors
Daniel Standage ORCID
Lab for Data Intensive Biology; School of Veterinary Medicine; University of California, Davis
Ali Aliyari ORCID
Integrative Genetics and Genomics Graduate Group; University of California, Davis
Lisa J. Cohen ORCID
Lab for Data Intensive Biology; School of Veterinary Medicine; University of California, Davis, Molecular, Cellular, and Integrative Physiology Graduate Group; University of California, Davis
Michael R. Crusoe ORCID
Common Workflow Language Project
Tim Head ORCID
Wild Tree Tech
Luiz Irber ORCID
Lab for Data Intensive Biology; School of Veterinary Medicine; University of California, Davis, Computer Science Graduate Group; University of California, Davis
Shannon Ek Joslin ORCID
Integrative Genetics and Genomics Graduate Group; University of California, Davis
N. B. Kingsley ORCID
Integrative Genetics and Genomics Graduate Group; University of California, Davis
Kevin D. Murray ORCID
ARC Centre of Excellence in Plant Energy Biology; Australian National University
Russell Neches ORCID
Microbiology Graduate Group; University of California, Davis
Camille Scott ORCID
Lab for Data Intensive Biology; School of Veterinary Medicine; University of California, Davis, Computer Science Graduate Group; University of California, Davis
Ryan Shean
University of Washington
Sascha Steinbiss ORCID
Debian Project
Cait Sydney
Google, Inc.
C. Titus Brown ORCID
Lab for Data Intensive Biology; School of Veterinary Medicine; University of California, Davis, Department of Population Health and Reproduction; School of Veterinary Medicine; University of California, Davis
Editor
George Githinji
Tags
bioinformatics, sequence analysis, Bloom filter, Count-Min sketch, de Bruijn graph, assembly, graph traversal, streaming, quality control

Citation (CITATION)

..
   This file is part of khmer, https://github.com/dib-lab/khmer/, and is
   Copyright (C) 2014-2015 Michigan State University
   Copyright (C) 2015 The Regents of the University of California.
   It is licensed under the three-clause BSD license; see LICENSE.
   Contact: khmer-project@idyll.org

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.

    * Neither the name of the Michigan State University nor the names
      of its contributors may be used to endorse or promote products
      derived from this software without specific prior written
      permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

   Contact: khmer-project@idyll.org

.. If you update this file then you may need to update the citations in
   khmer/khmer_args.py as well

*********
Citations
*********

Software Citation
=================

If you use the khmer software, you must cite:

   Crusoe et al., The khmer software package: enabling efficient nucleotide
   sequence analysis. 2015. https://doi.org/10.12688/f1000research.6924.1

.. code-block:: tex

  @article{khmer2015,
     author = "Crusoe, Michael R. and Alameldin, Hussien F. and Awad, Sherine
  and Bucher, Elmar and Caldwell, Adam and Cartwright, Reed and Charbonneau,
  Amanda and Constantinides, Bede and Edvenson, Greg and Fay, Scott and Fenton,
  Jacob and Fenzl, Thomas and Fish, Jordan and Garcia-Gutierrez, Leonor and
  Garland, Phillip and Gluck, Jonathan and González, Iván and Guermond, Sarah
  and Guo, Jiarong and Gupta, Aditi and Herr, Joshua R. and Howe, Adina and
  Hyer, Alex and Härpfer, Andreas and Irber, Luiz and Kidd, Rhys and Lin, David
  and Lippi, Justin and Mansour, Tamer and McA'Nulty, Pamela and McDonald, Eric
  and Mizzi, Jessica and Murray, Kevin D. and Nahum, Joshua R. and Nanlohy,
  Kaben and Nederbragt, Alexander Johan and Ortiz-Zuazaga, Humberto and Ory,
  Jeramia and Pell, Jason and Pepe-Ranney, Charles and Russ, Zachary N and
  Schwarz, Erich and Scott, Camille and Seaman, Josiah and Sievert, Scott and
  Simpson, Jared and Skennerton, Connor T. and Spencer, James and Srinivasan,
  Ramakrishnan and Standage, Daniel and Stapleton, James A. and Stein, Joe and
  Steinman, Susan R and Taylor, Benjamin and Trimble, Will and Wiencko, Heather
  L. and Wright, Michael and Wyss, Brian and Zhang, Qingpeng and zyme, en and
  Brown, C. Titus",
     title = "The khmer software package: enabling efficient nucleotide
  sequence analysis",
     year = "2015",
     month = "08",
     publisher = "F1000",
     url = "https://doi.org/10.12688/f1000research.6924.1"
  }

If you use any of our published scientific methods, you should *also*
cite the relevant paper(s) as directed below. Additionally, some scripts use
the `SeqAn library <http://www.seqan.de>`_ for read parsing; the full citation
for that library is also included below.

To see a quick summary of papers for a given script, run it without any
command-line arguments.

Graph partitioning and/or compressible graph representation
===========================================================

The :program:`load-graph.py`, :program:`partition-graph.py`,
and :program:`find-knots.py` scripts are part of the compressible graph
representation and partitioning algorithms described in:

   Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT.
   Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
   Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7.
   https://doi.org/10.1073/pnas.1121464109.
   PMID: 22847406

.. code-block:: tex

  @article{Pell2012,
      author = "Pell, Jason and Hintze, Arend and Canino-Koning, Rosangela and
  Howe, Adina and Tiedje, James M. and Brown, C. Titus",
      title = "Scaling metagenome sequence assembly with probabilistic de Bruijn
  graphs",
      volume = "109",
      number = "33",
      pages = "13272-13277",
      year = "2012",
      doi = "10.1073/pnas.1121464109",
      abstract = "Deep sequencing has enabled the investigation of a wide range of
  environmental microbial ecosystems, but the high memory requirements for de
  novo assembly of short-read shotgun sequencing data from these complex
  populations are an increasingly large practical barrier. Here we introduce a
  memory-efficient graph representation with which we can analyze the k-mer
  connectivity of metagenomic samples. The graph representation is based on a
  probabilistic data structure, a Bloom filter, that allows us to efficiently
  store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We
  show that this data structure accurately represents DNA assembly graphs in low
  memory. We apply this data structure to the problem of partitioning assembly
  graphs into components as a prelude to assembly, and show that this reduces the
  overall memory requirements for de novo assembly of metagenomes. On one soil
  metagenome assembly, this approach achieves a nearly 40-fold decrease in the
  maximum memory requirements for assembly. This probabilistic graph
  representation is a significant theoretical advance in storing assembly graphs
  and also yields immediate leverage on metagenomic assembly.",
      URL = "http://www.pnas.org/content/109/33/13272.abstract",
      eprint = "http://www.pnas.org/content/109/33/13272.full.pdf+html",
      journal = "Proceedings of the National Academy of Sciences"
  }
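
As a rough illustration of the idea summarized in the abstract above, the
following Python sketch stores k-mers in a Bloom filter and walks the implicit
de Bruijn graph by querying the four possible successors of a k-mer. It is not
khmer's implementation: the class ``KmerBloomFilter``, the helpers ``kmers``
and ``neighbors``, the BLAKE2-based hashing, and the table size are all
assumptions made for this example, and reverse complements are ignored for
brevity.

```python
import hashlib


class KmerBloomFilter:
    """Tiny Bloom filter for k-mer presence (illustrative, not khmer's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True if every bit for this k-mer is set (may be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def neighbors(bloom, kmer):
    """Yield k-mers that may follow `kmer` in the implicit de Bruijn graph."""
    for base in "ACGT":
        candidate = kmer[1:] + base
        if candidate in bloom:
            yield candidate


# Hypothetical toy input: load every 4-mer of one short sequence.
bloom = KmerBloomFilter(num_bits=1 << 16, num_hashes=4)
for kmer in kmers("ACGTACGGTC", 4):
    bloom.add(kmer)

successors = set(neighbors(bloom, "TACG"))
```

A Bloom filter can report false positives but never false negatives, so
``neighbors`` may occasionally yield a k-mer that was never added. That is the
inexactness the paper accepts in exchange for a small memory footprint.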

Digital normalization
=====================

The :program:`normalize-by-median.py` and :program:`count-median.py` scripts
are part of the digital normalization algorithm, described in:

   A Reference-Free Algorithm for Computational Normalization of
   Shotgun Sequencing Data
   Brown CT, Howe AC, Zhang Q, Pyrkosz AB, Brom TH
   arXiv:1203.4802 [q-bio.GN]
   http://arxiv.org/abs/1203.4802

.. code-block:: tex

  @unpublished{diginorm,
      author = "C. Titus Brown and Adina Howe and Qingpeng Zhang and Alexis B.
  Pyrkosz and Timothy H. Brom",
      title = "A Reference-Free Algorithm for Computational Normalization of
  Shotgun Sequencing Data",
      year = "2012",
      eprint = "arXiv:1203.4802",
      url = "http://arxiv.org/abs/1203.4802",
  }
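
The core loop of digital normalization can be sketched in a few lines of
Python: each read is kept only if the median count of its k-mers, among the
reads kept so far, is still below a coverage cutoff. This is a simplified
sketch and not the code in :program:`normalize-by-median.py`; the function
names, the exact ``Counter`` used in place of a probabilistic counting
structure, and the toy reads are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def normalize_by_median(reads, k, cutoff):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()  # exact counts; a stand-in for a probabilistic table
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute counts
            kept.append(read)
    return kept


# Hypothetical toy data: one read seen five times, another seen once.
reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
kept = normalize_by_median(reads, k=4, cutoff=2)
```

With these toy reads, redundant copies of the highly covered read are dropped
once its estimated coverage reaches the cutoff, while the rarely seen read is
retained.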

Efficient k-mer error trimming
==============================

The :program:`trim-low-abund.py` script is described in:

   Crossing the streams: a framework for streaming analysis of short DNA
   sequencing reads
   Zhang Q, Awad S, Brown CT
   https://doi.org/10.7287/peerj.preprints.890v1

.. code-block:: tex

  @unpublished{semistream,
      author = "Qingpeng Zhang and Sherine Awad and C. Titus Brown",
      title = "Crossing the streams: a framework for streaming analysis of
          short DNA sequencing reads",
      year = "2015",
      eprint = "PeerJ Preprints 3:e1100",
      url = "https://doi.org/10.7287/peerj.preprints.890v1"
  }

K-mer counting
==============

The :program:`abundance-dist.py`, :program:`filter-abund.py`, and
:program:`load-into-counting.py` scripts implement the probabilistic k-mer
counting described in:

   These Are Not the K-mers You Are Looking For: Efficient Online K-mer
   Counting Using a Probabilistic Data Structure
   Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT.
   https://doi.org/10.1371/journal.pone.0101271

.. code-block:: tex

  @article{khmer-counting,
      author = "Zhang, Qingpeng AND Pell, Jason AND Canino-Koning, Rosangela
  AND Howe, Adina Chuang AND Brown, C. Titus",
      journal = "PLoS ONE",
      publisher = "Public Library of Science",
      title = "These Are Not the K-mers You Are Looking For: Efficient Online
  K-mer Counting Using a Probabilistic Data Structure",
      year = "2014",
      month = "07",
      volume = "9",
      url = "https://doi.org/10.1371/journal.pone.0101271",
      pages = "e101271",
      abstract = "K-mer abundance analysis is widely used for many purposes in
  nucleotide sequence analysis, including data preprocessing for de novo
  assembly, repeat detection, and sequencing coverage estimation. We present the
  khmer software package for fast and memory efficient \emph{online}
  counting of k-mers in sequencing data sets. Unlike previous methods based on
  data structures such as hash tables, suffix arrays, and trie structures, khmer
  relies entirely on a simple probabilistic data structure, a Count-Min Sketch.
  The Count-Min Sketch permits online updating and retrieval of k-mer counts in
  memory which is necessary to support online k-mer analysis algorithms. On
  sparse data sets this data structure is considerably more memory efficient than
  any exact data structure. In exchange, the use of a Count-Min Sketch introduces
  a systematic overcount for k-mers; moreover, only the counts, and not the
  k-mers, are stored. Here we analyze the speed, the memory usage, and the
  miscount rate of khmer for generating k-mer frequency distributions and
  retrieving k-mer counts for individual k-mers. We also compare the performance
  of khmer to several other k-mer counting packages, including Tallymer,
  Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the
  effectiveness of profiling sequencing error, k-mer abundance trimming, and
  digital normalization of reads in the context of high khmer false positive
  rates. khmer is implemented in C++ wrapped in a Python interface, offers a
  tested and robust API, and is freely available under the BSD license at
  github.com/dib-lab/khmer.",
      number = "7",
      doi = "10.1371/journal.pone.0101271"
  }
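
To make the trade-off described in the abstract concrete, here is a minimal
Count-Min Sketch in Python that counts k-mers online and can only ever
overestimate a count. It is a sketch under stated assumptions rather than
khmer's implementation: the class ``CountMinSketch``, its ``width`` and
``depth`` parameters, the BLAKE2-based hashing, and the toy sequence are all
hypothetical choices for this example.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Fixed-size approximate counter (illustrative, not khmer's code)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row selects the counter used in that row.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._columns(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum never undercounts.
        return min(self.tables[row][col] for row, col in self._columns(item))


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical toy input: count 3-mers and keep exact counts for comparison.
sequence = "ACGTACGTTACG"
sketch = CountMinSketch(width=64, depth=4)
exact = Counter()
for kmer in kmers(sequence, 3):
    sketch.add(kmer)
    exact[kmer] += 1
```

Only the counters are stored, not the k-mers themselves, which is why memory
use is fixed by ``width`` and ``depth`` rather than by the number of distinct
k-mers. The price is that ``sketch.count`` may exceed the true count when
k-mers collide in every row.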

FASTA and FASTQ reading
=======================

Several scripts use the SeqAn library for FASTQ and FASTA reading as described
in:

   SeqAn An efficient, generic C++ library for sequence analysis
   Döring A, Weese D, Rausch T, Reinert K.
   https://doi.org/10.1186/1471-2105-9-11

.. code-block:: tex

  @Article{SeqAn,
    AUTHOR = {Döring, Andreas and Weese, David and Rausch, Tobias and Reinert,
      Knut},
    TITLE = {SeqAn An efficient, generic C++ library for sequence analysis},
    JOURNAL = {BMC Bioinformatics},
    VOLUME = {9},
    YEAR = {2008},
    NUMBER = {1},
    PAGES = {11},
    URL = {http://www.biomedcentral.com/1471-2105/9/11},
    DOI = {10.1186/1471-2105-9-11},
    PubMedID = {18184432},
    ISSN = {1471-2105},
    ABSTRACT = {BACKGROUND: The use of novel algorithmic techniques is pivotal
    to many important problems in life science. For example the sequencing of
    the human genome [1] would not have been possible without advanced assembly
    algorithms. However, owing to the high speed of technological progress and
    the urgent need for bioinformatics tools, there is a widening gap between
    state-of-the-art algorithmic techniques and the actual algorithmic
    components of tools that are in widespread use. RESULTS: To remedy this
    trend we propose the use of SeqAn, a library of efficient data types and
    algorithms for sequence analysis in computational biology. SeqAn comprises
    implementations of existing, practical state-of-the-art algorithmic
    components to provide a sound basis for algorithm testing and development.
    In this paper we describe the design and content of SeqAn and demonstrate
    its use by giving two examples. In the first example we show an application
    of SeqAn as an experimental platform by comparing different exact string
    matching algorithms. The second example is a simple version of the
    well-known MUMmer tool rewritten in SeqAn. Results indicate that our
    implementation is very efficient and versatile to use. CONCLUSION: We
    anticipate that SeqAn greatly simplifies the rapid development of new
    bioinformatics tools by providing a collection of readily usable,
    well-designed algorithmic components which are fundamental for the field of
    sequence analysis. This leverages not only the implementation of new
    algorithms, but also enables a sound analysis and comparison of existing
    algorithms.},
  }

.. vim: set filetype=rst:

GitHub Events

Total
  • Watch event: 21
  • Fork event: 4
Last Year
  • Watch event: 21
  • Fork event: 4

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 5,159
  • Total Committers: 87
  • Avg Commits per committer: 59.299
  • Development Distribution Score (DDS): 0.635
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
C. Titus Brown t****s@i****g 1,882
Michael R. Crusoe m****e@m****u 738
Camille Scott c****w@g****m 432
Tim Head b****m@g****m 299
Jacob Fenton b****f@g****m 249
Daniel Standage d****e@g****m 246
Eric McDonald em@m****u 232
Kevin D. Murray k****y@a****u 186
Luiz Irber l****r@g****m 186
Jason Pell j****l@g****m 70
Qingpeng Zhang q****g@g****m 50
Tamer Mansour d****r@g****m 50
Jordan Fish j****h@g****m 41
Rhys Kidd r****d@g****m 35
Jessica Mizzi m****s@m****u 29
Scott Fay s****y@g****m 25
Michael Wright w****7@g****m 24
Susan R Steinman s****n@g****m 22
Justin Lippi j****i@g****m 21
Adina Howe a****a@i****u 17
Ramakrishnan Srinivasan r****s@n****u 17
Thomas Fenzl t****l@g****t 16
Sarah Guermond s****d@g****m 16
Elmar Bucher b****e@o****u 16
Sherine Awad d****d@u****u 13
Shannon EK Joslin s****n@u****u 12
Andreas Härpfer a****r@g****m 12
Leonor Garcia-Gutierrez l****z@w****k 11
Will Trimble t****e@a****v 9
Benjamin Taylor t****6@m****u 9
and 57 more...
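
The two derived "All Time" figures can be reproduced from the numbers listed
above. The short Python check below assumes the commonly used definition of
the Development Distribution Score, one minus the top committer's share of all
commits; that definition is an assumption here, since this page does not state
how the score is computed.

```python
# Figures copied from the committer statistics above.
total_commits = 5159
total_committers = 87
top_committer_commits = 1882  # C. Titus Brown

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed DDS definition: 1 - (top committer's commits / total commits).
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 3), round(dds, 3))
```

Under that assumption both values round to the figures reported on this page.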

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 70
  • Total pull requests: 48
  • Average time to close issues: 6 months
  • Average time to close pull requests: 11 months
  • Total issue authors: 36
  • Total pull request authors: 19
  • Average comments per issue: 2.66
  • Average comments per pull request: 2.98
  • Merged pull requests: 22
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 4.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • standage (9)
  • ctb (8)
  • olgabot (6)
  • mr-c (5)
  • Canadatechmanufacturing (3)
  • fungs (3)
  • dkoslicki (2)
  • solshiferaw (2)
  • zmunro (2)
  • taranglute (2)
  • khuch123 (2)
  • bzvew (2)
  • ZhuangZK (1)
  • bwlang (1)
  • JuantonioMS (1)
Pull Request Authors
  • standage (18)
  • shokrof (7)
  • mr-c (5)
  • luizirber (3)
  • camillescott (2)
  • ctSkennerton (1)
  • katrinleinweber (1)
  • lakshayg (1)
  • moorepants (1)
  • jts (1)
  • olgabot (1)
  • drtamermansour (1)
  • ctb (1)
  • Gnojoomi (1)
  • sanchestm (1)
Top Labels
Issue Labels
  • low-hanging-fruit (7)
  • Documentation (4)
  • enhancement (2)
  • Python (2)
  • theme:customer-oriented (1)
  • Cython (1)
  • C++ (1)
  • discussion-needed (1)
  • upstream bug (1)
  • theme:best-practices (1)
  • bug (1)
Pull Request Labels
  • Ready For Review and Merge! (12)
  • Cython (2)
  • Documentation (2)
  • C++ (1)
  • Work In Progress (1)
  • bug (1)
  • discussion-needed (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 1,545 last-month
  • Total docker downloads: 45
  • Total dependent packages: 2
    (may contain duplicates)
  • Total dependent repositories: 20
    (may contain duplicates)
  • Total versions: 31
  • Total maintainers: 4
pypi.org: khmer

khmer k-mer counting library

  • Versions: 17
  • Dependent Packages: 2
  • Dependent Repositories: 20
  • Downloads: 1,545 Last month
  • Docker Downloads: 45
Rankings
Docker downloads count: 1.3%
Dependent repos count: 3.3%
Dependent packages count: 4.7%
Average: 6.1%
Downloads: 15.3%
Last synced: 4 months ago
proxy.golang.org: github.com/dib-lab/khmer
  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.5%
Average: 6.7%
Dependent repos count: 6.9%
Last synced: 4 months ago

Dependencies

docker/Dockerfile docker
  • debian stable build
doc/requirements.txt pypi
  • guzzle_sphinx_theme ==0.7.11
  • setuptools >=3.4.1
  • sphinxcontrib-autoprogram >=0.1.4
setup.py pypi