bcgtree

Automatically calculate phylogenetic trees from bacterial core genes

https://github.com/molbiodiv/bcgtree

Keywords

bioinformatics phylogenetics

Repository

Basic Info
  • Host: GitHub
  • Owner: molbiodiv
  • License: mit
  • Language: Perl
  • Default Branch: master
  • Size: 19.2 MB
Statistics
  • Stars: 20
  • Watchers: 4
  • Forks: 9
  • Open Issues: 5
  • Releases: 14
Topics
bioinformatics phylogenetics
Created over 10 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.org

* bcgTree
Automatized phylogenetic tree building from bacterial core genomes.

[[doc/bcgTree.png]]

An article describing bcgTree is published in [[http://www.nrcresearchpress.com/doi/abs/10.1139/gen-2015-0175][Genome]].
If you use bcgTree for your research please cite: [[http://dx.doi.org/10.1139/gen-2015-0175][https://img.shields.io/badge/DOI-10.1139%2Fgen--2015--0175-blue.svg]]

Please also cite the external programs and the source of the HMMs (see the Background section) if you use the default essential.hmm.

If you like the tool consider voting for it on [[https://labworm.com/tool/bcgtree][LabWorm]].

See [[file:reproduce_results.org][this file]] for instructions on how to reproduce results from our article.
** Dependencies
Please note that some of the dependencies have their own Licenses.
*** Perl
To execute bcgTree [[https://www.perl.org/][perl5]] is required along with the following modules (should be part of the core installation, but also available via [[http://www.cpan.org/][cpan]]):
 - Getopt::Long
 - Pod::Usage
 - FindBin
 - File::Path
 - File::Spec
The following modules are included in this repo:
 - [[http://search.cpan.org/~mschilli/Log-Log4perl-1.46/lib/Log/Log4perl.pm][Log::Log4perl]] ([[file:lib/Log-Log4perl-1.46/LICENSE][Artistic License]])
 - [[http://search.cpan.org/~jstenzel/Getopt-ArgvFile-1.11/ArgvFile.pm][Getopt::ArgvFile]] ([[file:lib/Getopt-ArgvFile-1.11/README][Artistic License]])
 - [[https://github.com/BioInf-Wuerzburg/perl5lib-Fasta][perl5lib-Fasta]] ([[file:lib/perl5lib-Fasta/LICENSE][MIT License]])
 - [[https://github.com/BioInf-Wuerzburg/perl5lib-Fastq][perl5lib-Fastq]] ([[file:lib/perl5lib-Fastq/LICENSE][MIT License]])
 - [[https://github.com/BioInf-Wuerzburg/perl5lib-Verbose][perl5lib-Verbose]] ([[file:lib/perl5lib-Verbose/LICENSE][MIT License]])
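Whether the core modules named above can be loaded by the local perl can be checked with a small helper. This is only a sketch; the function name ~check_perl_modules~ is made up for this example and is not part of bcgTree:
#+BEGIN_SRC sh
# Try to load each module given as argument and report the result.
# Returns 0 if all modules load, 1 otherwise.
check_perl_modules() {
    missing=0
    for module in "$@"; do
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "ok      $module"
        else
            echo "missing $module"
            missing=1
        fi
    done
    return $missing
}

# The modules bcgTree expects from the core installation
check_perl_modules Getopt::Long Pod::Usage FindBin File::Path File::Spec \
    || echo "install the missing modules via cpan"
#+END_SRC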
*** Java
To use the graphical user interface a Java Runtime Environment ([[http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html][JRE]], tested with version 8) is required.
The following extra frameworks are used (included):
 - [[https://swingx.java.net/][SwingX]] 1.6.5 ([[http://www.gnu.org/copyleft/lesser.html][LGPL]])
*** External programs
bcgTree is a wrapper around multiple existing tools.
The following external programs are called by bcgTree and have to be installed (~prodigal~ is optional, it is only needed if you provide nucleotide sequences using ~--genome~).
The specified versions are the ones we used for testing (older versions might or might not work).
Newer versions should work (otherwise feel free to open an issue).
 - [[http://hmmer.org/][hmmsearch]] (HMMER version 3.3) - Eddy et al, 2010. HMMER3: a new generation of sequence homology search software.
 - [[https://github.com/rcedgar/muscle/releases/][muscle]] (v5.3) - Edgar RC, 2021. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. [[https://doi.org/10.1101/2021.06.20.449169][bioRxiv 2021.06.20.449169]]. NOTE: please use version 5+ (v3 is no longer supported in bcgTree v1.3.0 and later).
 - [[http://molevol.cmima.csic.es/castresana/Gblocks.html][Gblocks]] (version 0.91b) - Castresana et al, 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552.
 - [[http://sco.h-its.org/exelixis/web/software/raxml/][RAxML]] (version 8.2.12) - Stamatakis et al, 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313.
 - [[https://github.com/hyattpd/Prodigal][prodigal]] (version 2.6.3) - Hyatt et al, 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119
Additionally [[https://github.com/BioInf-Wuerzburg/SeqFilter][SeqFilter]] (Hackl et al, 2014. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004-3011) is needed but it is included in this repo.
*** Data
There are two hmm files included in the data directory.
Both contain HMMs from [[ftp://ftp.tigr.org/pub/data/TIGRFAMs][TIGRFAM]] ([[ftp://ftp.tigr.org/pub/data/TIGRFAMs/COPYRIGHT][LGPL]]) and [[https://pfam.xfam.org/][PFAM]] ([[ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/relnotes.txt][CC0]]).
The ubcg.hmm file is reproduced from [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] as described in [[https://doi.org/10.1007/s12275-018-8014-6][Na, SI., Kim, Y.O., Yoon, SH. et al. J Microbiol. (2018) 56: 280.]]
** Installation
You can download the latest release from [[https://github.com/molbiodiv/bcgTree/releases][here]].
After extraction you can start the graphical interface by double clicking the bcgTree.jar file in the bcgTreeGUI folder.
Please remember to install the required external programs (see above).
Alternatively you can start the perl script bcgTree.pl in the bin folder from the command line.
Using the command line interface is the recommended way.
You can also clone the bcgTree git repository by executing the following commands:
#+BEGIN_SRC sh
git clone https://github.com/molbiodiv/bcgTree.git
# Now you can run bcgTree.pl
bcgTree/bin/bcgTree.pl --help
# Or start the java GUI
cd bcgTree/bcgTreeGUI
java -jar bcgTree.jar
#+END_SRC
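After installation, the ~--check-external-programs~ option (see Usage) can serve as a quick pre-flight check, since it exits with 0 only if all external programs are found. A minimal sketch, where the wrapper name ~check_bcgtree_deps~ and the default path are assumptions for illustration:
#+BEGIN_SRC sh
# Run the built-in check of bcgTree and turn its exit code into a message.
# $1: path to bcgTree.pl (defaults to the location inside a fresh clone)
check_bcgtree_deps() {
    bcgtree=${1:-bcgTree/bin/bcgTree.pl}
    if "$bcgtree" --check-external-programs; then
        echo "all external programs found"
    else
        echo "at least one external program is missing" >&2
        return 1
    fi
}

# Usage:
#   check_bcgtree_deps bcgTree/bin/bcgTree.pl && echo "ready to run"
#+END_SRC
Note that, as described in the help text, this only verifies that the given locations are executable files, not that they are the expected programs.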
** Usage
*** GUI
The graphical user interface is just a convenient way to call bcgTree.
Usage is meant to be self-explanatory; any suggestions for improvement are welcome (please open an [[https://github.com/molbiodiv/bcgTree/issues][issue]]).
It adds no extra functionality: it just collects the parameters and calls the perl script on the command line.
The output is written to a text field as well as a log file (bcgTree.log in the output folder).
In order to preserve replicability all parameters are written in a file called options.txt in the output folder.
The internal call of bcgTree is then just:
#+BEGIN_SRC sh
bcgTree.pl @outdir/options.txt
#+END_SRC
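A config file of this kind can also be written by hand. The following is a minimal sketch, assuming that comment lines start with ~#~; the file names, the seed value and the directory ~bcgTree_example~ are placeholders and not part of bcgTree:
#+BEGIN_SRC sh
# Hypothetical output directory; bcgTree itself defaults to "bcgTree".
outdir=bcgTree_example
mkdir -p "$outdir"

# One option per line, written as it would be typed on the command line.
cat > "$outdir/options.txt" <<'EOF'
# organisms and their peptide fasta files (placeholder names)
--proteome bac1=bacterium1.pep.fa
--proteome bac2=bacterium2.faa

# fixed seeds so that the RAxML step can be repeated with the same values
--raxml-x-rapidBootstrapRandomNumberSeed=12345
--raxml-p-parsimonyRandomSeed=12345
--bootstraps=100
--outdir bcgTree_example
EOF

# The run itself would then be started with this command:
echo "bcgTree.pl @$outdir/options.txt"
#+END_SRC
Keeping the seeds in the file means the same values are passed to RAxML on every run instead of fresh random numbers.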
If you feel comfortable using the command line, calling the perl script directly is still the recommended way to use bcgTree.
Nevertheless here are some screenshots of the GUI:
#+ATTR_HTML: :width 640
[[doc/screenshot0.png]]
#+ATTR_HTML: :width 640
[[doc/screenshot1.png]]
#+ATTR_HTML: :width 640
[[doc/screenshot2.png]]
*** Command line
#+BEGIN_SRC sh
Usage:
      $ bcgTree.pl [@configfile] --proteome bac1=bacterium1.pep.fa --proteome bac2=bacterium2.faa [options]

Options:
    [@configfile]            Optional path to a configfile with @ as prefix.
                             Config files consist of command line parameters
                             and arguments just as passed on the command
                             line. Space and comment lines are allowed (and
                             ignored). Spreading over multiple lines is
                             supported.

    --proteome = [--proteome = ..]
                             Multiple pairs of organism and proteomes as
                             peptide fasta file paths. Attention: If you
                             provide a proteome and genome with the same
                             name, only the genome will be used.

    --genome = [--genome = ..]
                             Multiple pairs of organism and genomes as
                             nucleotide fasta file paths. Attention: If you
                             provide a proteome and genome with the same
                             name, only the genome will be used.

    [--outdir ]      output directory for the generated output files
                             (default: bcgTree)

    [--help]                 show help

    [--version]              show version number of bcgTree and exit

    [--check-external-programs]
                             Check if all of the required external programs
                             can be found and are executable, then exit.
                             Report table with program, status (ok or
                             !fail!) and path. If all external programs are
                             found exit code is 0 otherwise 1. Note that
                             this parameter does not check that the paths
                             belong to the actual programs, it only checks
                             that the given locations are executable files.

    [--hmmsearch-bin=] Path to hmmsearch binary file. Default tries if
                             hmmsearch is in PATH;

    [--muscle-bin=]    Path to muscle binary file. Default tries if
                             muscle is in PATH;

    [--gblocks-bin=]   Path to the Gblocks binary file. Default tries
                             if Gblocks is in PATH;

    [--raxml-bin=]     Path to the raxml binary file. Default tries if
                             raxmlHPC is in PATH;

    [--prodigal-bin=]
        Path to the prodigal binary file. Default tries if prodigal is in
        PATH;

    [--threads=]
        Number of threads to be used (currently only relevant for raxml).
        Default: 2. From the raxml man page: PTHREADS VERSION ONLY! Specify
        the number of threads you want to run. Make sure to set "-T" to at
        most the number of CPUs you have on your machine, otherwise, there
        will be a huge performance decrease!

    [--bootstraps=]
        Number of bootstraps to be used (passed to raxml). Default: 100

    [--min-proteomes=]
        Minimum number of proteomes in which a gene must occur in order to
        be kept. Default: 2. All genes with fewer hits are discarded prior to
        the alignment step. This option is ignored if --all-proteomes is
        set.

    [--all-proteomes]
        Sets --min-proteomes to the total number of proteomes supplied.
        Default: not set. All genes that do not hit all of the proteomes are
        discarded prior to the alignment step. If set --min-proteomes is
        ignored.

    [--hmmfile=]
        Path to HMM file to be used for hmmsearch. Default:
        /data/essential.hmm

    [--raxml-x-rapidBootstrapRandomNumberSeed=]
        Random number seed for raxml (passed through as -x option to raxml).
        Default: Random number in range 1..1000000 (see raxml command in log
        file to find out the actual value). Note: you can abbreviate options
        (as long as they stay unique) so --raxml-x=12345 is equivalent to
        --raxml-x-rapidBootstrapRandomNumberSeed=12345

    [--raxml-p-parsimonyRandomSeed=]
        Random number seed for raxml (passed through as -p option to raxml).
        Default: Random number in range 1..1000000 (see raxml command in log
        file to find out the actual value). Note: you can abbreviate options
        (as long as they stay unique) so --raxml-p=12345 is equivalent to
        --raxml-p-parsimonyRandomSeed=12345

    [--raxml-aa-substitution-model ""]
        The aminoacid substitution model used for the partitions by RAxML.
        Valid options for RAxML 8.x are: DAYHOFF, DCMUT, JTT, MTREV, WAG,
        RTREV, CPREV, VT, BLOSUM62, MTMAM, LG, MTART, MTZOA, PMB, HIVB,
        HIVW, JTTDCMUT, FLU, STMTREV, DUMMY, DUMMY2, AUTO, LG4M, LG4X,
        PROT_FILE, GTR_UNLINKED, GTR. bcgTree will not check whether the
        provided option is valid but rather pass it to RAxML literally.
        Default: AUTO

    [--raxml-args ""]
        Arbitrary options to pass through to RAxML. The ARGS part should be
        in quotes and is appended to the RAxML command as given.

#+END_SRC
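With many organisms, typing one ~--proteome~ pair per file becomes tedious. The following sketch assembles the command line from a directory of peptide fasta files. The function name ~bcgtree_cmd_for_dir~, the ~.faa~ suffix convention and the choice of output directory are assumptions made for this example:
#+BEGIN_SRC sh
# Print a bcgTree.pl command line with one --proteome NAME=PATH pair
# for every *.faa file in the directory given as first argument.
# NAME is the file name without the .faa suffix.
# Note: paths containing spaces would need extra quoting.
bcgtree_cmd_for_dir() {
    dir=$1
    cmd="bcgTree.pl"
    for f in "$dir"/*.faa; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=$(basename "$f" .faa)
        cmd="$cmd --proteome $name=$f"
    done
    # --threads and --bootstraps are set to their documented defaults
    printf '%s\n' "$cmd --threads=2 --bootstraps=100 --outdir $dir.bcgTree"
}

# "proteomes" is a hypothetical directory name
bcgtree_cmd_for_dir proteomes
#+END_SRC
The function only prints the command, so it can be inspected before it is run or copied into a config file.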
** Results
The results all end up in the directory specified via --outdir (or bcgTree if none is specified).
This folder contains lots of intermediate files from all steps.
If the run was successful the most interesting files will be the RAxML files:
 - /RAxML_bestTree.final
 - /RAxML_bipartitionsBranchLabels.final
 - /RAxML_bipartitions.final
 - /RAxML_bootstrap.final
 - /RAxML_info.final
Further the log file (/bcgTree.log) contains all executed commands and their output.
This is useful as a reference, for re-executing steps manually and for debugging in case something went wrong.
All other files are the outputs of different steps of the pipeline.
Their names should be self-explanatory.
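A quick way to see whether a run produced the RAxML files listed above is sketched below. It assumes that these files sit directly in the output directory; the function name ~list_bcgtree_results~ is made up for this example:
#+BEGIN_SRC sh
# Report for each expected RAxML result file whether it exists and is
# non-empty. $1: output directory (defaults to "bcgTree").
# Returns 0 if all files are present, 1 otherwise.
list_bcgtree_results() {
    outdir=${1:-bcgTree}
    status=0
    for name in RAxML_bestTree.final RAxML_bipartitionsBranchLabels.final \
                RAxML_bipartitions.final RAxML_bootstrap.final RAxML_info.final; do
        if [ -s "$outdir/$name" ]; then
            echo "found   $outdir/$name"
        else
            echo "missing $outdir/$name"
            status=1
        fi
    done
    return $status
}

# Usage:
#   list_bcgtree_results bcgTree || less bcgTree/bcgTree.log
#+END_SRC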
** Background
The default essential.hmm is based on 107 essential genes as described in:
Dupont CL, Rusch DB, Yooseph S, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. The ISME Journal. 2012;6(6):1186-1199. doi:10.1038/ismej.2011.189.
Supplementary Table S1 (which is actually an image) contains a list of the used genes and HMMs with cut-offs.

From the manuscript:
"Genome completeness estimates
Using the Comprehensive Microbial Resource as a database, 107 hidden Markov models (HMMs) that hit
only one gene in greater than 95% of bacterial genomes were identified (Supplementary Table S1).
Trusted cutoff scores for the TIGRFAMs and Pfam HMMs were those supplied by the 
TIGRFAMs and Pfam libraries (Haft et al., 2003; Finn et al., 2010)."

In the publication:
Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW and Nielsen PH. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology 31, 533–538 (2013). doi:10.1038/nbt.2579
the authors use the same list of 107 genes (111 HMMs, glyS, pheT, proS and rpoC have two HMMs each)
as above and provide a readily created hmm file via [[https://github.com/MadsAlbertsen/multi-metagenome/][GitHub]].
This file has been used as a starting point but an [[https://github.com/MadsAlbertsen/multi-metagenome/issues/15][error]] had to be fixed.
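The number of models in an hmm file can be checked on the command line, since every model in a HMMER3 file has exactly one ~NAME~ line. A minimal sketch, where the function name ~count_hmms~ is an assumption for this example:
#+BEGIN_SRC sh
# Count the HMMs in a HMMER3 file by counting its NAME lines.
# $1: path to the hmm file
count_hmms() {
    grep -c '^NAME ' "$1"
}

# Usage (path inside a clone of the repository):
#   count_hmms bcgTree/data/essential.hmm
#+END_SRC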

** Logo
The logo has been designed by Markus J. Ankenbrand and Alexander Keller.
Clip art from [[https://openclipart.org][openclipart.org]] has been used:
 - [[https://openclipart.org/detail/188718/oak-tree][Oak Tree]] ([[https://openclipart.org/share][CC-0/public domain]])
 - [[https://openclipart.org/detail/125869/diagramme-de-venn-venn-diagram][Venn Diagram]] ([[https://openclipart.org/share][CC-0/public domain]])
The font is from [[https://fontlibrary.org][fontlibrary.org]]:
 - [[https://fontlibrary.org/en/font/ranchers][Ranchers]] ([[http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL][SIL Open Font License]])

** Related Tools

These are some tools with similar goals and approaches. If you know another one, please open a pull request to add it to the list.

 - [[https://github.com/jvollme/PO_2_MLSA][PO_2_MLSA]] - by [[https://github.com/jvollme][@jvollme]] - Enables the calculation of phylogenies based on single copy core genome gene products, based on bidirectional BLAST results obtained with proteinortho.
 - [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] - by Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol 56. [[https://doi.org/10.1007/s12275-018-8014-6][Paper]]

** Changes
[[https://github.com/molbiodiv/bcgTree/actions/workflows/test_perl.yml][https://github.com/molbiodiv/bcgTree/actions/workflows/test_perl.yml/badge.svg?branch=master]]
*** v1.3.0 <2025-05-13>
 - Update muscle dependency to version 5 - breaking: v3 is no longer supported (#48)
*** v1.2.1 <2024-01-03>
 - Fix issue in GUI if no proteome is provided (#50)
*** v1.2.0 <2021-10-27>
 - Add parameter ~--genome~ with translation via ~prodigal~ (#15)
 - Update and improve documentation (#19, #35, #38, #44)
 - Make GUI independent of working directory (#33)
 - Switch CI from Travis to GitHub Actions (#45)
*** v1.1.0 <2018-07-19>
 - Breaking: the default aa substitution model for RAxML changed from WAG to AUTO.
   This has an impact on performance (it is faster to set this parameter to a fixed value).
   To get the same behaviour as in earlier versions pass ~--raxml-aa-substitution-model=WAG~
 - Add parameter ~--raxml-aa-substitution-model~ (#29)
 - Add HMMs of [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] (#25)
*** v1.0.10 <2017-03-07>
 - Fix GUI, add scrollbar (#23)
 - Add parameter --raxml-args (#22)
*** v1.0.9 <2017-03-03>
 - Add parameters --min-proteomes and --all-proteomes (#21)
*** v1.0.8 <2016-09-07>
 - Set default bootstraps to 100
 - Add description for reproduction of results in paper
*** v1.0.7 <2016-06-16>
 - Add logo to GUI
*** v1.0.6 <2016-03-17>
 - Improve layout (avoid errors with large text fields)
 - Update jar file
*** v1.0.5 <2016-03-17>
 - Add advanced settings and external programs to GUI
 - Add GUI screenshots to README
 - Finish GUI layout
 - Fix outdir bug (manually entered text was ignored)
 - Update documentation in README
 - Improve layout of GUI (proteomes panel)
*** v1.0.4 <2016-02-23>
 - Add parameter to check external programs
 - Fix SeqFilter dependencies
 - Add swingx and own accordion element for GUI
 - Improve GUI design (GridBagLayout)
*** v1.0.3 <2016-02-23>
 - Add log4perl and Getopt::ArgvFile to package (simplify installation)
*** v1.0.2 <2016-02-22>
 - Remove Bioperl dependency
 - Add submodules directly (SeqFilter)
 - Update documentation
*** v1.0.1 <2016-02-22>
 - Add java GUI

Owner

  • Name: Molecular Biodiversity Group University of Würzburg
  • Login: molbiodiv
  • Kind: organization
  • Location: Würzburg, Germany

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite both the software and the paper (see preferred-citation).
authors:
  - family-names: Ankenbrand
    given-names: "Markus J"
    orcid: 0000-0002-6620-807X
    affiliation: "Center for Computational and Theoretical Biology, University of Würzburg, Germany"
    email: markus.ankenbrand@uni-wuerzburg.de
  - family-names: Keller
    given-names: Alexander
    orcid: 0000-0001-5716-3634
    affiliation: "Center for Computational and Theoretical Biology, University of Würzburg, Germany"
title: bcgTree
version: 1.3.0
doi: 10.5281/zenodo.597913
date-released: 2025-05-13
repository-code: https://github.com/molbiodiv/bcgTree
keywords:
  - Phylogeny
  - Alignment
  - Evolution
license: MIT
url: https://github.com/molbiodiv/bcgTree
preferred-citation:
  type: article
  scope: Cite this paper if you used bcgTree in your research
  authors:
    - family-names: Ankenbrand
      given-names: Markus J
    - family-names: Keller
      given-names: Alexander
  title: "bcgTree: automatized phylogenetic tree building from bacterial core genomes"
  year: 2016
  journal: Genome
  volume: 59
  issue: 10
  doi: 10.1139/gen-2015-0175
  url: https://doi.org/10.1139/gen-2015-0175

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 4
  • Push event: 1
Last Year
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 4
  • Push event: 1

Dependencies

lib/Getopt-ArgvFile-1.11/META.yml cpan
  • Test::Harness 1.25
  • Test::More 0.11
  • Text::ParseWords 3.1
lib/Log-Log4perl-1.46/META.json cpan
  • ExtUtils::MakeMaker 0
  • File::Path 2.0606
  • File::Spec 0.82
  • Test::More 0.45
lib/Log-Log4perl-1.46/META.yml cpan
  • File::Path 2.0606
  • File::Spec 0.82
  • Test::More 0.45
.github/workflows/test_perl.yml actions
  • actions/checkout v2 composite
  • shogo82148/actions-setup-perl v1 composite