bcgtree
Automatically calculate phylogenetic trees from bacterial core genes
Basic Info
- Host: GitHub
- Owner: molbiodiv
- License: mit
- Language: Perl
- Default Branch: master
- Size: 19.2 MB
Statistics
- Stars: 20
- Watchers: 4
- Forks: 9
- Open Issues: 5
- Releases: 14
Topics
bioinformatics
phylogenetics
Created over 10 years ago
· Last pushed 10 months ago
README.org
* bcgTree
Automatized phylogenetic tree building from bacterial core genomes.
[[doc/bcgTree.png]]
An article describing bcgTree is published in [[http://www.nrcresearchpress.com/doi/abs/10.1139/gen-2015-0175][Genome]].
If you use bcgTree for your research please cite: [[http://dx.doi.org/10.1139/gen-2015-0175][https://img.shields.io/badge/DOI-10.1139%2Fgen--2015--0175-blue.svg]]
Please also cite the external programs and the source of the HMMs (see Background section) if you use the default essential.hmm.
If you like the tool consider voting for it on [[https://labworm.com/tool/bcgtree][LabWorm]].
See [[file:reproduce_results.org][this file]] for instructions on how to reproduce results from our article.
** Dependencies
Please note that some of the dependencies have their own Licenses.
*** Perl
To execute bcgTree [[https://www.perl.org/][perl5]] is required along with the following modules (should be part of the core installation, but also available via [[http://www.cpan.org/][cpan]]):
- Getopt::Long
- Pod::Usage
- FindBin
- File::Path
- File::Spec
The following modules are included in this repo:
- [[http://search.cpan.org/~mschilli/Log-Log4perl-1.46/lib/Log/Log4perl.pm][Log::Log4perl]] ([[file:lib/Log-Log4perl-1.46/LICENSE][Artistic License]])
- [[http://search.cpan.org/~jstenzel/Getopt-ArgvFile-1.11/ArgvFile.pm][Getopt::ArgvFile]] ([[file:lib/Getopt-ArgvFile-1.11/README][Artistic License]])
- [[https://github.com/BioInf-Wuerzburg/perl5lib-Fasta][perl5lib-Fasta]] ([[file:lib/perl5lib-Fasta/LICENSE][MIT License]])
- [[https://github.com/BioInf-Wuerzburg/perl5lib-Fastq][perl5lib-Fastq]] ([[file:lib/perl5lib-Fastq/LICENSE][MIT License]])
- [[https://github.com/BioInf-Wuerzburg/perl5lib-Verbose][perl5lib-Verbose]] ([[file:lib/perl5lib-Verbose/LICENSE][MIT License]])
*** Java
To use the graphical user interface a Java Runtime Environment ([[http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html][JRE]], tested with version 8) is required.
The following extra frameworks are used (included):
- [[https://swingx.java.net/][SwingX]] 1.6.5 ([[http://www.gnu.org/copyleft/lesser.html][LGPL]])
*** External programs
bcgTree is a wrapper around multiple existing tools.
The following external programs are called by bcgTree and have to be installed (~prodigal~ is optional; it is only needed if you provide nucleotide sequences using ~--genome~).
The specified versions are the ones we used for testing (older versions might or might not work).
Newer versions should work (otherwise feel free to open an issue).
- [[http://hmmer.org/][hmmsearch]] (HMMER version 3.3) - Eddy et al, 2010. HMMER3: a new generation of sequence homology search software.
- [[https://github.com/rcedgar/muscle/releases/][muscle]] (v5.3) - Edgar RC, 2021, MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping, [[https://doi.org/10.1101/2021.06.20.449169][bioRxiv 2021.06.20.449169]]. NOTE: please use Version 5+ (v3 is no longer supported in bcgTree v1.3.0 and later)
- [[http://molevol.cmima.csic.es/castresana/Gblocks.html][Gblocks]] (version 0.91b) - Castresana et al, 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552.
- [[http://sco.h-its.org/exelixis/web/software/raxml/][RAxML]] (version 8.2.12) - Stamatakis et al, 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313.
- [[https://github.com/hyattpd/Prodigal][prodigal]] (version 2.6.3) - Hyatt et al, 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119
Additionally [[https://github.com/BioInf-Wuerzburg/SeqFilter][SeqFilter]] (Hackl et al, 2014. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004-3011) is needed but it is included in this repo.
*** Data
There are two hmm files included in the data directory.
Both contain HMMs from [[ftp://ftp.tigr.org/pub/data/TIGRFAMs][TIGRFAM]] ([[ftp://ftp.tigr.org/pub/data/TIGRFAMs/COPYRIGHT][LGPL]]) and [[https://pfam.xfam.org/][PFAM]] ([[ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/relnotes.txt][CC0]]).
The ubcg.hmm file is reproduced from [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] as described in [[https://doi.org/10.1007/s12275-018-8014-6][Na, SI., Kim, Y.O., Yoon, SH. et al. J Microbiol. (2018) 56: 280.]]
** Installation
You can download the latest release from [[https://github.com/molbiodiv/bcgTree/releases][here]].
After extraction you can start the graphical interface by double clicking the bcgTree.jar file in the bcgTreeGUI folder.
Please remember to install the required external programs (see above).
Alternatively you can start the perl script bcgTree.pl in the bin folder from the command line.
Using the command line interface is the recommended way.
You can also clone the bcgTree git repository by executing the following commands:
#+BEGIN_SRC sh
git clone https://github.com/molbiodiv/bcgTree.git
# Now you can run bcgTree.pl
bcgTree/bin/bcgTree.pl --help
# Or start the java GUI
cd bcgTree/bcgTreeGUI
java -jar bcgTree.jar
#+END_SRC
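Before the first run it can be useful to confirm that the external programs are installed. The following is a minimal sketch, assuming bcgTree was cloned into the current directory as shown above. The ~BCGTREE~ variable and the ~check_tools~ function are names made up for this example. It relies only on the documented exit code of ~--check-external-programs~ (0 if all programs are found, 1 otherwise).
#+BEGIN_SRC sh
# Hypothetical location of the script; adjust to where bcgTree was cloned or extracted.
BCGTREE="${BCGTREE:-bcgTree/bin/bcgTree.pl}"

check_tools() {
    # bcgTree.pl prints a status table and exits with 0 only if every program was found
    if "$BCGTREE" --check-external-programs; then
        echo "all external programs found"
    else
        echo "at least one external program is missing" >&2
        return 1
    fi
}

# Only run the check if the script is present and executable
if [ -x "$BCGTREE" ]; then
    check_tools
else
    echo "bcgTree.pl not found at $BCGTREE" >&2
fi
#+END_SRC
If a program is installed outside of PATH, its location can be passed with the corresponding ~--*-bin~ option described in the Usage section. Note that the check only verifies that the given locations are executable files.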
** Usage
*** GUI
The graphical user interface is just a convenient way to call bcgTree.
It adds no extra functionality: it only collects the parameters and calls the perl script on the command line.
Usage is meant to be self-explanatory; suggestions for improvement are welcome (please open an [[https://github.com/molbiodiv/bcgTree/issues][issue]]).
The output is written to a text field as well as a log file (bcgTree.log in the output folder).
To keep runs reproducible, all parameters are written to a file called options.txt in the output folder.
The internal call of bcgTree is then just:
#+BEGIN_SRC sh
bcgTree.pl @outdir/options.txt
#+END_SRC
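The same mechanism can be used without the GUI. The sketch below writes an options file by hand and shows the matching call. The output directory ~example_run~ and the parameter values are assumptions chosen for illustration, and the proteome file names are taken from the usage line of the help text. The command is only echoed here so that the snippet runs without bcgTree being installed.
#+BEGIN_SRC sh
# Hypothetical output directory; replace with your own.
outdir="example_run"
mkdir -p "$outdir"

# One parameter per line, written just as it would be typed on the command line
cat > "$outdir/options.txt" <<EOF
--proteome bac1=bacterium1.pep.fa
--proteome bac2=bacterium2.faa
--outdir $outdir
--bootstraps=100
--threads=2
EOF

# The @ prefix tells bcgTree.pl to read its parameters from the file
echo "bcgTree.pl @$outdir/options.txt"
#+END_SRC
Keeping the options file next to the results makes it possible to repeat a run later with exactly the same parameters.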
If you feel comfortable using the command line, calling the perl script directly is still the recommended way to use bcgTree.
Nevertheless here are some screenshots of the GUI:
#+ATTR_HTML: :width 640
[[doc/screenshot0.png]]
#+ATTR_HTML: :width 640
[[doc/screenshot1.png]]
#+ATTR_HTML: :width 640
[[doc/screenshot2.png]]
*** Command line
#+BEGIN_SRC sh
Usage:
$ bcgTree.pl [@configfile] --proteome bac1=bacterium1.pep.fa --proteome bac2=bacterium2.faa [options]
Options:
[@configfile] Optional path to a configfile with @ as prefix.
Config files consist of command line parameters
and arguments just as passed on the command
line. Space and comment lines are allowed (and
ignored). Spreading over multiple lines is
supported.
--proteome = [--proteome = ..]
Multiple pairs of organism and proteomes as
peptide fasta file paths. Attention: If you
provide a proteome and genome with the same
name, only the genome will be used.
--genome = [--genome = ..]
Multiple pairs of organism and genomes as
nucleotide fasta file paths. Attention: If you
provide a proteome and genome with the same
name, only the genome will be used.
[--outdir ] output directory for the generated output files
(default: bcgTree)
[--help] show help
[--version] show version number of bcgTree and exit
[--check-external-programs]
Check if all of the required external programs
can be found and are executable, then exit.
Report table with program, status (ok or
!fail!) and path. If all external programs are
found exit code is 0 otherwise 1. Note that
this parameter does not check that the paths
belong to the actual programs, it only checks
that the given locations are executable files.
[--hmmsearch-bin=] Path to hmmsearch binary file. Default tries if
hmmsearch is in PATH;
[--muscle-bin=] Path to muscle binary file. Default tries if
muscle is in PATH;
[--gblocks-bin=] Path to the Gblocks binary file. Default tries
if Gblocks is in PATH;
[--raxml-bin=] Path to the raxml binary file. Default tries if
raxmlHPC is in PATH;
[--prodigal-bin=]
Path to the prodigal binary file. Default tries if prodigal is in
PATH;
[--threads=]
Number of threads to be used (currently only relevant for raxml).
Default: 2. From the raxml man page: PTHREADS VERSION ONLY! Specify
the number of threads you want to run. Make sure to set "-T" to at
most the number of CPUs you have on your machine, otherwise, there
will be a huge performance decrease!
[--bootstraps=]
Number of bootstraps to be used (passed to raxml). Default: 100
[--min-proteomes=]
Minimum number of proteomes in which a gene must occur in order to
be kept. Default: 2. All genes with fewer hits are discarded prior to
the alignment step. This option is ignored if --all-proteomes is
set.
[--all-proteomes]
Sets --min-proteomes to the total number of proteomes supplied.
Default: not set. All genes that do not hit all of the proteomes are
discarded prior to the alignment step. If set, --min-proteomes is
ignored.
[--hmmfile=]
Path to HMM file to be used for hmmsearch. Default:
/data/essential.hmm
[--raxml-x-rapidBootstrapRandomNumberSeed=]
Random number seed for raxml (passed through as -x option to raxml).
Default: Random number in range 1..1000000 (see raxml command in log
file to find out the actual value). Note: you can abbreviate options
(as long as they stay unique) so --raxml-x=12345 is equivalent to
--raxml-x-rapidBootstrapRandomNumberSeed=12345
[--raxml-p-parsimonyRandomSeed=]
Random number seed for raxml (passed through as -p option to raxml).
Default: Random number in range 1..1000000 (see raxml command in log
file to find out the actual value). Note: you can abbreviate options
(as long as they stay unique) so --raxml-p=12345 is equivalent to
--raxml-p-parsimonyRandomSeed=12345
[--raxml-aa-substitution-model ""]
The aminoacid substitution model used for the partitions by RAxML.
Valid options for RAxML 8.x are: DAYHOFF, DCMUT, JTT, MTREV, WAG,
RTREV, CPREV, VT, BLOSUM62, MTMAM, LG, MTART, MTZOA, PMB, HIVB,
HIVW, JTTDCMUT, FLU, STMTREV, DUMMY, DUMMY2, AUTO, LG4M, LG4X,
PROT_FILE, GTR_UNLINKED, GTR. bcgTree will not check whether the
provided option is valid but rather pass it to RAxML literally.
Default: AUTO
[--raxml-args ""]
Arbitrary options to pass through to RAxML. The ARGS part should be
in quotes and is appended to the RAxML command as given.
#+END_SRC
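With many organisms, typing one ~--proteome~ pair per file becomes tedious. The following sketch builds the pairs from a directory of peptide FASTA files and fixes both RAxML seeds so that a run can be repeated. The directory name ~proteomes~, the ~.faa~ extension, the output directory name and the helper ~build_proteome_args~ are assumptions made for this example. File names containing spaces are not handled. The final command is echoed rather than executed.
#+BEGIN_SRC sh
# Print one "--proteome name=path" pair per .faa file in the given directory.
build_proteome_args() {
    dir="$1"
    for f in "$dir"/*.faa; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        name=$(basename "$f" .faa)       # organism name derived from the file name
        printf '%s %s=%s ' --proteome "$name" "$f"
    done
}

args=$(build_proteome_args "proteomes")

# Fixed seeds (abbreviated option names, as described in the help text above)
# make the RAxML steps repeatable between runs.
echo bcgTree.pl $args --outdir bcgTree_run --threads=2 --bootstraps=100 \
    --raxml-x=12345 --raxml-p=12345
#+END_SRC
Nucleotide FASTA files could be collected in the same way and passed with ~--genome~, in which case ~prodigal~ has to be installed.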
** Results
The results all end up in the directory specified via --outdir (or bcgTree if none is specified).
This folder contains lots of intermediate files from all steps.
If the run was successful the most interesting files will be the RAxML files:
- /RAxML_bestTree.final
- /RAxML_bipartitionsBranchLabels.final
- /RAxML_bipartitions.final
- /RAxML_bootstrap.final
- /RAxML_info.final
Further the log file (/bcgTree.log) contains all executed commands and their output.
This is useful as a reference, for re-executing steps manually and for debugging in case something went wrong.
All other files are the outputs of different steps of the pipeline.
Their names should be self-explanatory.
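A quick way to see whether a run finished is to look for the final RAxML files. This is a minimal sketch, assuming the files listed above sit directly inside the output directory. The function name ~check_results~ is made up for this example.
#+BEGIN_SRC sh
# Report which of the final RAxML files are present and non-empty in an output directory.
check_results() {
    outdir="${1:-bcgTree}"               # bcgTree is the documented default for --outdir
    missing=0
    for f in RAxML_bestTree.final RAxML_bipartitionsBranchLabels.final \
             RAxML_bipartitions.final RAxML_bootstrap.final RAxML_info.final; do
        if [ -s "$outdir/$f" ]; then
            echo "ok      $outdir/$f"
        else
            echo "missing $outdir/$f"
            missing=1
        fi
    done
    return "$missing"                    # 0 if all files exist, 1 otherwise
}

check_results bcgTree || echo "run incomplete, check bcgTree/bcgTree.log"
#+END_SRC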
** Background
107 essential genes as described in:
Dupont CL, Rusch DB, Yooseph S, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. The ISME Journal. 2012;6(6):1186-1199. doi:10.1038/ismej.2011.189.
Supplementary Table S1 (which is actually an image) contains a list of the used genes and HMMs with cut-offs.
From the manuscript:
"Genome completeness estimates
Using the Comprehensive Microbial Resource as a database, 107 hidden Markov models (HMMs) that hit
only one gene in greater than 95% of bacterial genomes were identified (Supplementary Table S1).
Trusted cutoff scores for the TIGRFAMs and Pfam HMMs were those supplied by the
TIGRFAMs and Pfam libraries (Haft et al., 2003; Finn et al., 2010)."
In the publication:
M Albertsen, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW and Nielsen PH, Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology 31, 533–538 (2013) doi:10.1038/nbt.2579
the authors use the same list of 107 genes (111 HMMs, glyS, pheT, proS and rpoC have two HMMs each)
as above and provide a readily created hmm file via [[https://github.com/MadsAlbertsen/multi-metagenome/][GitHub]].
This file has been used as a starting point but an [[https://github.com/MadsAlbertsen/multi-metagenome/issues/15][error]] had to be fixed.
** Logo
The logo has been designed by Markus J. Ankenbrand and Alexander Keller.
Cliparts from [[openclipart.org]] have been used:
- [[https://openclipart.org/detail/188718/oak-tree][Oak Tree]] ([[https://openclipart.org/share][CC-0/public domain]])
- [[https://openclipart.org/detail/125869/diagramme-de-venn-venn-diagram][Venn Diagram]] ([[https://openclipart.org/share][CC-0/public domain]])
The font is from [[fontlibrary.org]]:
- [[https://fontlibrary.org/en/font/ranchers][Ranchers]] ([[http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL][SIL Open Font License]])
** Related Tools
These are some tools with similar goals and approaches. If you know another one, please open a pull request to add it to the list.
- [[https://github.com/jvollme/PO_2_MLSA][PO_2_MLSA]] - by [[https://github.com/jvollme][@jvollme]] - Enables the calculation of phylogenies based on single copy core genome gene products, based on bidirectional BLAST results obtained with proteinortho.
- [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] - by Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol 56. [[https://doi.org/10.1007/s12275-018-8014-6][Paper]]
** Changes
[[https://github.com/molbiodiv/bcgTree/actions/workflows/test_perl.yml][https://github.com/molbiodiv/bcgTree/actions/workflows/test_perl.yml/badge.svg?branch=master]]
*** v1.3.0 <2025-05-13>
- Update muscle dependency to version 5 - breaking: v3 is no longer supported (#48)
*** v1.2.1 <2024-01-03>
- Fix issue in GUI if no proteome is provided (#50)
*** v1.2.0 <2021-10-27>
- Add parameter ~--genome~ with translation via ~prodigal~ (#15)
- Update and improve documentation (#19, #35, #38, #44)
- Make GUI independent of working directory (#33)
- Switch CI from Travis to GitHub Actions (#45)
*** v1.1.0 <2018-07-19>
- Breaking: the default aa substitution model for RAxML changed from WAG to AUTO.
This has an impact on performance (it is faster to set this parameter to a fixed value).
To get the same behaviour as in earlier versions pass ~--raxml-aa-substitution-model=WAG~
- Add parameter ~--raxml-aa-substitution-model~ (#29)
- Add HMMs of [[https://www.ezbiocloud.net/tools/ubcg][UBCG]] (#25)
*** v1.0.10 <2017-03-07>
- Fix GUI, add scrollbar (#23)
- Add parameter --raxml-args (#22)
*** v1.0.9 <2017-03-03>
- Add parameters --min-proteomes and --all-proteomes (#21)
*** v1.0.8 <2016-09-07>
- Set default bootstraps to 100
- Add description for reproduction of results in paper
*** v1.0.7 <2016-06-16>
- Add logo to GUI
*** v1.0.6 <2016-03-17>
- Improve layout (avoid errors with large text fields)
- Update jar file
*** v1.0.5 <2016-03-17>
- Add advanced settings and external programs to GUI
- Add GUI screenshots to README
- Finish GUI layout
- Fix outdir bug (manually entered text was ignored)
- Update documentation in README
- Improve layout of GUI (proteomes panel)
*** v1.0.4 <2016-02-23>
- Add parameter to check external programs
- Fix SeqFilter dependencies
- Add swingx and own accordion element for GUI
- Improve GUI design (GridBagLayout)
*** v1.0.3 <2016-02-23>
- Add log4perl and Getopt::ArgvFile to package (simplify installation)
*** v1.0.2 <2016-02-22>
- Remove Bioperl dependency
- Add submodules directly (SeqFilter)
- Update documentation
*** v1.0.1 <2016-02-22>
- Add java GUI
Owner
- Name: Molecular Biodiversity Group University of Würzburg
- Login: molbiodiv
- Kind: organization
- Location: Würzburg, Germany
- Website: http://www.dna-analytics.biozentrum.uni-wuerzburg.de/molecular_biodiversity_group/
- Repositories: 18
- Profile: https://github.com/molbiodiv
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite both the software and the paper (see preferred-citation).
authors:
  - family-names: Ankenbrand
    given-names: "Markus J"
    orcid: 0000-0002-6620-807X
    affiliation: "Center for Computational and Theoretical Biology, University of Würzburg, Germany"
    email: markus.ankenbrand@uni-wuerzburg.de
  - family-names: Keller
    given-names: Alexander
    orcid: 0000-0001-5716-3634
    affiliation: "Center for Computational and Theoretical Biology, University of Würzburg, Germany"
title: bcgTree
version: 1.3.0
doi: 10.5281/zenodo.597913
date-released: 2021-10-27
repository-code: https://github.com/molbiodiv/bcgTree
keywords:
  - Phylogeny
  - Alignment
  - Evolution
license: MIT
url: https://github.com/molbiodiv/bcgTree
preferred-citation:
  type: article
  scope: Cite this paper if you used bcgTree in your research
  authors:
    - family-names: Ankenbrand
      given-names: Markus J
    - family-names: Keller
      given-names: Alexander
  title: "bcgTree: automatized phylogenetic tree building from bacterial core genomes"
  year: 2016
  journal: Genome
  volume: 59
  issue: 10
  doi: 10.1139/gen-2015-0175
  url: https://doi.org/10.1139/gen-2015-0175
GitHub Events
Total
- Issues event: 1
- Watch event: 1
- Issue comment event: 4
- Push event: 1
Last Year
- Issues event: 1
- Watch event: 1
- Issue comment event: 4
- Push event: 1
Dependencies
lib/Getopt-ArgvFile-1.11/META.yml
cpan
- Test::Harness 1.25
- Test::More 0.11
- Text::ParseWords 3.1
lib/Log-Log4perl-1.46/META.json
cpan
- ExtUtils::MakeMaker 0
- File::Path 2.0606
- File::Spec 0.82
- Test::More 0.45
lib/Log-Log4perl-1.46/META.yml
cpan
- File::Path 2.0606
- File::Spec 0.82
- Test::More 0.45
.github/workflows/test_perl.yml
actions
- actions/checkout v2 composite
- shogo82148/actions-setup-perl v1 composite