poppunk

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)

https://github.com/bacpop/poppunk

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ✓
    DOI references
    Found 7 DOI reference(s) in README
  • ○
    Academic publication links
  • ✓
    Committers with academic emails
    5 of 14 committers (35.7%) from academic institutions
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary

Keywords

bacteria genomics k-mer population-genetics sketching
Last synced: 4 months ago

Repository

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)

Statistics
  • Stars: 100
  • Watchers: 6
  • Forks: 19
  • Open Issues: 33
  • Releases: 36
Topics
bacteria genomics k-mer population-genetics sketching
Created about 8 years ago · Last pushed 5 months ago
Metadata Files
Readme License Citation Roadmap

README.md

POPulation Partitioning Using Nucleotide Kmers


Description

Links:
  • Documentation
  • Databases
  • Paper

If you find PopPUNK useful, please cite us:

Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Research 29:304-316 (2019). doi:10.1101/gr.241455.118

You can also run your command with --citation to get a list of citations and a suggested methods paragraph.

News and roadmap

The roadmap can be found in the documentation.

2024-08-07

PopPUNK 2.7.0 comes with two changes:
  • Distance matrices <db_name>.dists.npy are no longer required or written when using poppunk_assign, with or without --update-db. These can be very large, especially with many samples, so this saves space and memory in model reuse and distribution. Note that the <db_name>.dists.pkl file is still required (but this is small).
  • We have added a --stable flag to poppunk_assign. Rather than merging hybrid clusters, new samples will simply be assigned to their nearest neighbour. This implies --serial and cannot be run with --update-db. This behaviour mimics the 'stable nomenclature' of schemes such as LIN.
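As a minimal sketch of the flag constraint described above, the hypothetical helper below builds an argument list for poppunk_assign and refuses the combination of --stable and --update-db. The helper name and the file names are illustrative only. The --db, --query and --output options are assumed from the documentation, so check them against poppunk_assign --help.

```python
from typing import List


def build_assign_command(db: str, query: str, output: str,
                         update_db: bool = False,
                         stable: bool = False) -> List[str]:
    """Return an argument list for poppunk_assign (hypothetical helper)."""
    if stable and update_db:
        # The 2.7.0 notes state that --stable cannot be run with --update-db.
        raise ValueError("--stable cannot be combined with --update-db")
    # --db, --query and --output are assumed option names; verify with --help.
    cmd = ["poppunk_assign", "--db", db, "--query", query, "--output", output]
    if update_db:
        cmd.append("--update-db")
    if stable:
        # --stable implies --serial, so --serial is not added explicitly.
        cmd.append("--stable")
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(build_assign_command("my_db", "queries.txt", "assigned",
                                        stable=True)))
```

The returned list can be passed to subprocess.run once the options have been confirmed for the installed version.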

2023-01-18

We have retired the PopPUNK website. Databases have been expanded, and can be found here: https://www.bacpop.org/poppunk-databases/.

2022-08-04

The change in scikit-learn's API in v1.0.0 and above means that HDBSCAN models fitted with sklearn <=v0.24 will give an error when loaded. If you run into this, the solution is one of:
  • Downgrade sklearn to v0.24.
  • Run model refinement to turn your model into a boundary model instead (this will change clusters).
  • Refit your model in an environment with sklearn >=v1.0.
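The version rule above can be written as a small check. This is a sketch only: the function names are hypothetical, version strings are assumed to be plain numeric (for example 0.24.2), and the check covers only the case described in this note (fitted with <=0.24, loaded with >=1.0).

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse the leading 'major.minor' of a plain numeric version string."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def hdbscan_model_needs_action(fitted_with: str, running: str) -> bool:
    """True when a model fitted with sklearn <=0.24 is loaded under >=1.0."""
    return major_minor(fitted_with) <= (0, 24) and major_minor(running) >= (1, 0)


if __name__ == "__main__":
    # Example versions chosen for illustration only.
    print(hdbscan_model_needs_action("0.24.2", "1.3.0"))
```

If the check returns True, one of the three options listed above applies.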

If this is a common problem let us know, as we could write a script to 'upgrade' HDBSCAN models. See issue #213 for more details.

2021-03-15

We have fixed a number of bugs which may affect the use of poppunk_assign with --update-db. We have also fixed a number of bugs with GPU distances. These are 'advanced' features and are not likely to be encountered in most cases, but if you do wish to use either of these features please make sure that you are using PopPUNK >=v2.4.0 with pp-sketchlib >=v1.7.0.
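One way to check the installed versions against these minimums is sketched below with the standard library's importlib.metadata. The distribution names poppunk and pp-sketchlib are assumptions about how the packages are registered in your environment, and the parser assumes purely numeric version components.

```python
from importlib import metadata
from typing import Optional, Tuple


def as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.4.0' into (2, 4, 0), padding short versions such as '2.7'."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(found: Optional[str], minimum: str) -> bool:
    """True when a version was found and is at least the minimum."""
    return found is not None and as_tuple(found) >= as_tuple(minimum)


if __name__ == "__main__":
    # Distribution names are assumed; adjust them to match your environment.
    for name, minimum in (("poppunk", "2.4.0"), ("pp-sketchlib", "1.7.0")):
        found = installed_version(name)
        status = "ok" if meets_minimum(found, minimum) else "upgrade or install"
        print(f"{name}: found {found}, need >={minimum}: {status}")
```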

2020-09-30

We have discovered a bug affecting the interaction of pp-sketchlib and PopPUNK. If you have used PopPUNK >=v2.0.0 with pp-sketchlib <v1.5.1, label order may be incorrect (see issue #95).

Please upgrade to PopPUNK >=v2.2 and pp-sketchlib >=v1.5.1. If this is not possible, you can either:
  • Run scripts/poppunk_pickle_fix.py on your .dists.pkl file and re-run model fits.
  • Create the database with poppunk_sketch directly, rather than PopPUNK --create-db.

Installation

This is for the command line version. For more details see installation in the documentation.

Our (beta) web interface BeeBOP is now also available: https://beebop.dide.ic.ac.uk/

Through conda (recommended)

The easiest way is through conda, which is most easily accessed by first installing miniconda. PopPUNK can then be installed by running:

conda install poppunk

If the package cannot be found you will need to add the necessary channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Quick usage

See the overview first. There are two ways of running:

With a supported species

1) Download an existing database.
2) Run assignment.

With a new species

1) Create sketches of input.
2) Run QC.
3) Build a model.
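Creating sketches starts from a list of input assemblies. The sketch below prepares such a list, assuming the format is one tab-separated line per sample (name, then path) as passed to --create-db through an --r-files option. Both the format and the option name are assumptions to check against the documentation. The helper name and the naming rule (file name up to the first dot) are illustrative only.

```python
from pathlib import Path
from typing import Iterable, List


def rfile_lines(assemblies: Iterable[str]) -> List[str]:
    """Return one 'name<TAB>path' line per assembly (hypothetical helper)."""
    lines = []
    for path in assemblies:
        # Sample name = file name up to the first dot, e.g. sample1.fa.gz -> sample1.
        # Names that contain dots themselves would need a different rule.
        name = Path(path).name.split(".")[0]
        lines.append(f"{name}\t{path}")
    return lines


if __name__ == "__main__":
    # Example paths are placeholders, not files shipped with PopPUNK.
    for line in rfile_lines(["/data/sample1.fa.gz", "genomes/iso2.fasta"]):
        print(line)
```

The printed lines can be redirected to a text file and supplied as the input list for the sketching step.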

Docker image

A Docker image is available:

docker pull mrcide/poppunk:bacpop-20

Owner

  • Name: Bacterial population genetics
  • Login: bacpop
  • Kind: organization
  • Email: contact@bacpop.org
  • Location: United Kingdom

Pathogen Informatics and Modelling @ EMBL-EBI / Bacterial Evolutionary Epidemiology Group @ Imperial College London

Citation (CITATION.bib)

@ARTICLE{Lees2019-tw,
  title    = "Fast and flexible bacterial genomic epidemiology with {PopPUNK}",
  author   = "Lees, John A and Harris, Simon R and Tonkin-Hill, Gerry and
              Gladstone, Rebecca A and Lo, Stephanie W and Weiser, Jeffrey N
              and Corander, Jukka and Bentley, Stephen D and Croucher, Nicholas
              J",
  abstract = "The routine use of genomics for disease surveillance provides the
              opportunity for high-resolution bacterial epidemiology. Current
              whole-genome clustering and multilocus typing approaches do not
              fully exploit core and accessory genomic variation, and they
              cannot both automatically identify, and subsequently expand,
              clusters of significantly similar isolates in large data sets
              spanning entire species. Here, we describe PopPUNK (Population
              Partitioning Using Nucleotide K-mers), a software implementing
              scalable and expandable annotation- and alignment-free methods
              for population analysis and clustering. Variable-length k-mer
              comparisons are used to distinguish isolates' divergence in
              shared sequence and gene content, which we demonstrate to be
              accurate over multiple orders of magnitude using data from both
              simulations and genomic collections representing 10 taxonomically
              widespread species. Connections between closely related isolates
              of the same strain are robustly identified, despite interspecies
              variation in the pairwise distance distributions that reflects
              species' diverse evolutionary patterns. PopPUNK can process
              $10^3$--$10^4$ genomes in a single batch, with minimal memory use and
              runtimes up to 200-fold faster than existing model-based methods.
              Clusters of strains remain consistent as new batches of genomes
              are added, which is achieved without needing to reanalyze all
              genomes de novo. This facilitates real-time surveillance with
              consistent cluster naming between studies and allows for outbreak
              detection using hundreds of genomes in minutes. Interactive
              visualization and online publication is streamlined through the
              automatic output of results to multiple platforms. PopPUNK has
              been designed as a flexible platform that addresses important
              issues with currently used whole-genome clustering and typing
              methods, and has potential uses across bacterial genetics and
              public health research.",
  journal  = "Genome Res.",
  volume   =  29,
  number   =  2,
  pages    = "304--316",
  month    =  jan,
  year     =  2019,
  language = "en"
}

GitHub Events

Total
  • Create event: 14
  • Release event: 4
  • Issues event: 20
  • Watch event: 8
  • Delete event: 5
  • Issue comment event: 39
  • Push event: 96
  • Pull request event: 17
  • Pull request review comment event: 27
  • Pull request review event: 28
  • Fork event: 2
Last Year
  • Create event: 14
  • Release event: 4
  • Issues event: 20
  • Watch event: 8
  • Delete event: 5
  • Issue comment event: 39
  • Push event: 96
  • Pull request event: 17
  • Pull request review comment event: 27
  • Pull request review event: 28
  • Fork event: 2

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 1,944
  • Total Committers: 14
  • Avg Commits per committer: 138.857
  • Development Distribution Score (DDS): 0.457
Past Year
  • Commits: 99
  • Committers: 5
  • Avg Commits per committer: 19.8
  • Development Distribution Score (DDS): 0.525
Top Committers
Name Email Commits
nickjcroucher n****r@i****k 1,055
John Lees l****6@g****m 777
Danderson123 d****1@h****m 24
Croucher n****e@i****e 23
muppi1993 c****0@i****k 18
Bin Zhao 4****5 17
Rich FitzJohn r****n@i****k 9
Daniel Anderson 4****3 6
Harry Hung 4****g 4
muppi1993 7****3 4
Sam Horsfield s****9@i****k 3
Nicholas Croucher n****3@s****k 2
Jason Stajich j****d@g****m 1
Tommi Mäklin t****i@m****i 1
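The all-time summary figures can be reproduced from the commit counts in the table above. The sketch assumes the Development Distribution Score is defined as one minus the top committer's share of commits. The page does not state this definition, but it matches the reported value.

```python
from typing import List

# Commit counts copied from the Top Committers table, in table order.
COMMITS = [1055, 777, 24, 23, 18, 17, 9, 6, 4, 4, 3, 2, 1, 1]


def development_distribution_score(commits: List[int]) -> float:
    """Assumed definition: 1 - (commits by top committer / total commits)."""
    return 1 - max(commits) / sum(commits)


if __name__ == "__main__":
    total = sum(COMMITS)
    print(total)                                              # 1944
    print(round(total / len(COMMITS), 3))                     # 138.857
    print(round(development_distribution_score(COMMITS), 3))  # 0.457
```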

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 7
  • Total pull requests: 9
  • Average time to close issues: 4 days
  • Average time to close pull requests: about 20 hours
  • Total issue authors: 6
  • Total pull request authors: 4
  • Average comments per issue: 0.43
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 9
  • Average time to close issues: 4 days
  • Average time to close pull requests: about 20 hours
  • Issue authors: 6
  • Pull request authors: 4
  • Average comments per issue: 0.43
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • johnlees (3)
  • DOH-JDJ0303 (3)
  • luciagrami (2)
  • erinyoung (2)
  • zjx22018105-coder (1)
  • RyanCFink (1)
  • drhoads (1)
  • ayabtg (1)
  • tanzhizhou (1)
  • Jamesped (1)
  • nermze (1)
  • HarryHung (1)
  • fgonzalez3 (1)
  • RuwiniK (1)
  • rderelle (1)
Pull Request Authors
  • nickjcroucher (16)
  • absternator (6)
  • johnlees (3)
  • samhorsfield96 (2)
  • tgttunstall (1)
  • ERBringHorvath (1)
Top Labels
Issue Labels
question (2) enhancement (1) bug (1) model (1)
Pull Request Labels
enhancement (1)

Dependencies

environment.yml conda
  • boost-cpp
  • cmake >=3.18
  • dendropy >=4.4.0
  • eigen
  • flask
  • flask-apscheduler
  • flask-cors
  • graph-tool >=2.35
  • gunicorn
  • h5py
  • hdbscan
  • libgfortran-ng
  • libgomp
  • matplotlib
  • matplotlib-base
  • networkx
  • numpy
  • openblas
  • pandas
  • pip
  • pp-sketchlib >=1.7.0
  • pybind11
  • python-dateutil
  • rapidnj
  • requests
  • scikit-learn >=0.24
  • scipy
  • tqdm
  • treeswift
  • tzlocal <3.0
  • xorg-libxaw
  • xorg-libxcomposite
  • xorg-libxcursor
  • xorg-libxdamage
  • xorg-libxfixes
  • xorg-libxi
  • xorg-libxinerama
  • xorg-libxpm
  • xorg-libxrandr
docs/requirements.txt pypi
  • Cython >=0.26.1
  • docutils <0.18
.github/workflows/acr_push.yml actions
  • actions/checkout master composite
  • azure/docker-login v1 composite
  • azure/login v1 composite
.github/workflows/azure_ci.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • mamba-org/provision-with-micromamba main composite
.github/workflows/docker_push.yml actions
  • actions/checkout v2 composite
  • docker/build-push-action v2 composite
  • docker/login-action v1 composite
  • docker/setup-buildx-action v1 composite
  • docker/setup-qemu-action v1 composite
boost/Dockerfile docker
  • ubuntu 20.04 build
docker/Dockerfile docker
  • python 3.10 build
setup.py pypi
  • biopython *
  • h5py *
  • hdbscan *
  • mandrake *
  • matplotlib *
  • networkx *
  • pandas *
  • pp-sketchlib *
  • requests *
  • scikit-learn *
  • tqdm *
  • treeswift *