Recent Releases of gtdbtk
gtdbtk - 2.5.0
Bug Fixes:
- (#644, #641) Fixed compatibility with recent versions of NumPy (≥1.24), which removed the tostring() method from numpy.ndarray.
Minor Changes:
- (#650) Update CLI with an up-to-date taxon.
Major Changes:
- GTDB-Tk now uses Skani exclusively for genome clustering, replacing the previous Mash/Skani hybrid approach. This change simplifies the CLI and removes the dependency on Mash, streamlining installation and execution.
- Python
Published by pchaumeil 6 months ago
gtdbtk - 2.4.1
Bug Fixes:
- (#630) Fixed SyntaxWarning in Python 3.12 by using raw strings for regex in HMMResultsIO.py
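Python 3.12 reports unrecognised escape sequences such as `\d` in ordinary string literals as a SyntaxWarning, which is why regex patterns are best written as raw strings. A minimal sketch of the idea follows; the pattern and function name are hypothetical and are not taken from HMMResultsIO.py:

```python
import re

# Hypothetical pattern for illustration; not the expression used in HMMResultsIO.py.
# Without the r prefix, "\S", "\s" and "\d" are invalid string escapes,
# which Python 3.12 flags with a SyntaxWarning when the module is compiled.
SCORE_LINE = re.compile(r"^(\S+)\s+(\d+\.\d+)$")


def parse_score_line(line):
    """Return (name, score) for a 'name<whitespace>score' line, or None."""
    match = SCORE_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

The raw-string prefix changes nothing about how the regex engine interprets the pattern; it only stops Python from trying to process the backslashes first.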
Minor Changes:
- (#631) The `gtdb_to_ncbi_majority_vote.py` script has been included as part of the release.
- The GTDB-Tk version has been bumped to synchronise its release with GTDB R226.
- Python
Published by pchaumeil 10 months ago
gtdbtk - 2.4.0
Bug Fixes:
- (#576) When all genomes fail the Prodigal step in `classify_wf`, the bac120 summary file is still produced, with all failed genomes listed as 'Unclassified'.
- (#573) When running the three classify steps independently, a genome can be filtered out in the `align` step but still be classified in the `identify` step. To avoid duplicate rows, the genome is classified with a warning.
- (#540) Empty files are skipped during the Mash sketch step; they are then caught in the `prodigal` step and returned as 'Unclassified'.
- (#549) `--force` has been modified to deal with #540. `Prodigal` wasn't returning the empty files as failed genomes; it was only skipping them. These genomes are now returned in the summary file and flagged as 'Unclassified'.
Major Changes:
- `FastANI` has been replaced by `skani` as the primary tool for computing Average Nucleotide Identity (ANI). Users may notice slight variations in the results compared to those obtained using `FastANI`.
- In the generated `summary.tsv` files, several columns have been renamed for clarity and consistency. The following columns are affected:
  - "fastani_reference" column has been renamed to "closest_genome_reference".
  - "fastani_reference_radius" column has been renamed to "closest_genome_reference_radius".
  - "fastani_taxonomy" column has been renamed to "closest_genome_taxonomy".
  - "fastani_ani" column has been renamed to "closest_genome_ani".
  - "fastani_af" column has been renamed to "closest_genome_af".
- These changes have been implemented to improve the readability and understanding of the data within the summary.tsv files. Users should update their scripts or processes accordingly to reflect these renamed column headers.
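For scripts that must read summary files produced both before and after 2.4.0, one option is to normalise the old headers on load. The sketch below uses only the column names listed above; the helper name `read_summary` and the sample row are made up for illustration:

```python
import csv
import io

# Old -> new column names, as listed in the 2.4.0 release notes.
RENAMED_COLUMNS = {
    "fastani_reference": "closest_genome_reference",
    "fastani_reference_radius": "closest_genome_reference_radius",
    "fastani_taxonomy": "closest_genome_taxonomy",
    "fastani_ani": "closest_genome_ani",
    "fastani_af": "closest_genome_af",
}


def read_summary(handle):
    """Read a tab-separated summary file, mapping old column names to new ones."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {RENAMED_COLUMNS.get(key, key): value for key, value in row.items()}
        for row in reader
    ]


# Made-up two-column example in the pre-2.4.0 layout.
old_style = io.StringIO("user_genome\tfastani_ani\ngenome_1\t98.7\n")
rows = read_summary(old_style)
```

Columns that were not renamed pass through unchanged, so the same function can be applied to files from either version.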
- Python
Published by pchaumeil almost 2 years ago
gtdbtk - 2.3.0
Bug Fixes:
- (#508) (#509) If ALL genomes for a specific domain are either filtered out or classified with ANI they are now reported in the summary file.
Minor changes:
- (#491) (#498) Allow GTDB-Tk to show `--help` and `-v` without `GTDBTK_DATA_PATH` being set.
  - WARNING: This is a breaking change if you are importing GTDB-Tk as a library and importing values from `gtdbtk.config.config`; instead, import with `from gtdbtk.config.common import CONFIG` and then access values via `CONFIG.<var>`.
- (#508) The Mash distance has been changed from 0.1 to 0.15. This will increase the number of FastANI comparisons but will cover cases where genomes have a larger Mash distance but a small ANI.
- (#497) Add a `convert_to_species` function in GTDB-Tk to replace GCA/GCF IDs with their GTDB species name.
- Add `--db_version` flag to `check_install` to check the version of previous GTDB-Tk packages.
- Python
Published by pchaumeil almost 3 years ago
gtdbtk - 2.2.6
Bug Fixes:
- (#493) Fix issue with --full-tree flag (related to skipping ANI steps)
Minor changes:
- Change URL for documentation to 'https://ecogenomics.github.io/GTDBTk/installing/index.html'
- Improve portability of the ANI_screen step by regenerating the paths of reference genomes in the current filesystem for mash_db.msh
- Python
Published by pchaumeil almost 3 years ago
gtdbtk - 2.2.5
Bug Fixes:
* gtdbtk.json is now reset when the pipeline is re-run and the status of ani_screen is not 'complete'
Minor changes:
* When using --genes, ANI steps are skipped and warnings are raised to inform the user that classification is less accurate.
* (#486) Environment variables can be used in GTDBTK_DATA_PATH
* is_consistent function in mash.py compares only the filenames, not the full paths
* Add cutoff arguments to PfamScan (thanks @AroneyS for the contribution)
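To illustrate what using environment variables inside a data path setting means in practice, here is a minimal sketch built on the standard library. It is not GTDB-Tk's own implementation; the function name, the `REFDATA_ROOT` variable, and the directory layout are all hypothetical:

```python
import os


def resolve_data_path(raw_value):
    """Expand environment variables and a leading '~' in a path setting."""
    return os.path.expanduser(os.path.expandvars(raw_value))


# Hypothetical variable name and directory layout, for illustration only.
os.environ["REFDATA_ROOT"] = "/data/refs"
resolved = resolve_data_path("${REFDATA_ROOT}/gtdbtk/release")
```

A variable that is not defined is left in the string untouched by `os.path.expandvars`, so a follow-up existence check on the resolved path is a sensible safeguard.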
- Python
Published by pchaumeil almost 3 years ago
gtdbtk - 2.2.4
Bug Fixes:
* (#475) If all genomes are classified using ANI, GTDB-Tk will skip the identify and align steps
Minor changes:
* Add hidden '--skip_pplacer' flag to skip pplacer step (useful for debugging)
* Improve documentation
* Convert stage_logger to a Singleton class
* Use existing ANI results if available
- Python
Published by pchaumeil almost 3 years ago
gtdbtk - 2.2.0
Minor changes:
- (#433) Added additional checks to ensure that the `--outgroup_taxon` cannot be set to a domain (`root`, `de_novo_wf`).
- (#459 / #462) Fix deprecated np.bool in prodigal_biolib.py. Special thanks to @neoformit for his contribution.
- (#466) RED value has been rounded to 5 decimal places.
- (#451) Extra checks have been added when Prodigal fails.
- (#448) Warning has been added when all the genomes are filtered out and not classified.
Bug Fixes:
- (#420) Fixed an issue where GTDB-Tk might hang when classifying TIGRFAM markers (`identify`, `classify_wf`, `de_novo_wf`). Special thanks to @lfenske-93 and @sjaenick for their contribution.
- (#428) Fixed an issue where the `--gtdbtk_classification_file` would raise an error trying to read the `classify` summary (`root`, `de_novo_wf`).
- (#439) Fix the pipeline when using protein files instead of nucleotide files. The symlink now uses an absolute path instead.
- Python
Published by pchaumeil about 3 years ago
gtdbtk - 2.1.0
Major changes:
- GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple class-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 55 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree use the `--full-tree` flag. This is the main change from v2.0.0. The split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (see #383).
- Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the `gtdbtk.bac120.summary.tsv` as 'Unclassified'.
- Genomes filtered out during the alignment step are now reported in the `gtdbtk.bac120.summary.tsv` or `gtdbtk.ar53.summary.tsv` as 'Unclassified Bacteria/Archaea'.
- The `--write_single_copy_genes` flag is now available in the `classify_wf` and `de_novo_wf` workflows.
Features:
- (#392) `--write_single_copy_genes` flag available in workflows.
- (#387) Specific memory requirements set in classify_wf depending on the classification approach.
Important
This version is not backwards compatible with GTDB package R207 v1. This version requires a new reference package
- Python
Published by pchaumeil almost 4 years ago
gtdbtk - 2.0.0
Major changes:
* GTDB-Tk now uses a divide-and-conquer approach where the bacterial reference tree is split into multiple order-level subtrees. This reduces the memory requirements of GTDB-Tk from 320 GB of RAM when using the full GTDB R07-RS207 reference tree to approximately 35 GB. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree use the --full-tree flag.
* Archaeal classification now uses a refined set of 53 archaeal-specific marker genes based on the recent publication by Dombrowski et al., 2020. This set of archaeal marker genes is now used by GTDB for curating the archaeal taxonomy.
* By default, all directories containing intermediate results are now removed at the end of the classify_wf and de_novo_wf pipelines. If you wish to retain these intermediate files use the --keep-intermediates flag.
* All MSA files produced by the align step are now compressed with gzip.
* The classification summary and failed genomes files are now the only files linked in the root directory of classify_wf.
Features:
* convert_to_itol to convert trees into iTOL format (#373)
* Output FASTA files are compressed by default (#369)
* Intermediate files will be removed by default when using classify/de-novo workflows unless specified by --keep_intermediates (#369)
* Add --genes flag for Error (#362)
* A warning will be displayed if pplacer fails to place a genome (#360 / #356)
Important
* This version is not backwards compatible with GTDB release 202.
* This version requires a new reference package.
- Python
Published by aaronmussig almost 4 years ago
gtdbtk - 1.7.0
- (#336) Warn the user if they have provided an incorrectly formatted taxonomy file.
- (#348) Gracefully exit the program if no single copy hits could be identified.
- (#351) Fixed an issue where GTDB-Tk would crash if spaces were present in the reference data path.
- (#354) Added optional --tmpdir argument to set temporary directory (thanks @tr11-sanger ).
- Python
Published by aaronmussig over 4 years ago
gtdbtk - 1.5.0
Changes:
* Updated to use PFAM 33.1 markers.
* Updated to use GTDB R202 taxonomy (note, this will require an update to the reference package https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data)
Fixes:
* Automatic drop of genome leads to error in downstream modules of classify_wf (#312)
* --scratch_dir not working in v 1.4.1 (#311)
- Python
Published by aaronmussig almost 5 years ago
gtdbtk - 1.4.0
- Check if stdout is being piped to a file before adding colour.
- (#283) Significantly improved classify performance (noticeable when running trees > 1,000 taxa).
- Automatically cap pplacer CPUs to 64 unless specifying `--pplacer_cpus` to prevent pplacer from hanging.
- (#262) Added `--write_single_copy_genes` to the identify command. Writes unaligned single-copy AR122/BAC120 marker genes to disk.
- When running -version warn if GTDB-Tk is not running the most up-to-date version (disable via `GTDBTK_VER_CHECK = False` in config.py). If GTDB-Tk encounters an error it will silently continue (3 second timeout).
- (#276) Renamed the column `aa_percent` to `msa_percent` in summary.tsv (produced by classify).
- (#286) Fixed a file not found error when the reference data is a symbolic link (thanks davidealbanese!).
- (#277) Fixed an issue where if the user overrides the translation table using the optional 3rd column in the batchfile, the other coding density would appear as -100. Both translation table densities are now reported.
- The check_install command now also checks that all third party binaries can be found on the system path.
- The align step is now approximately 10x faster.
- (#289) Added `--min_af` to classify and classify_wf which allows the user to specify the minimum alignment fraction for FastANI.
- Added the `--mash_db` command to re-use the GTDB-Tk Mash reference database in ani_rep.
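The first item in the list above (checking whether stdout is piped before adding colour) is commonly handled with `isatty()`. The following is a minimal sketch assuming ANSI escape codes; the function name and colour choice are illustrative and do not reflect GTDB-Tk's actual logging code:

```python
import sys

GREEN = "\033[92m"
RESET = "\033[0m"


def colourise(text, stream=None):
    """Wrap text in ANSI colour codes only when the stream is a terminal."""
    stream = sys.stdout if stream is None else stream
    # Files and pipes report isatty() == False, so redirected output stays plain.
    if hasattr(stream, "isatty") and stream.isatty():
        return f"{GREEN}{text}{RESET}"
    return text
```

Running a script that prints `colourise("done")` with its output redirected to a file would therefore write plain `done` without escape codes.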
- Python
Published by aaronmussig about 5 years ago
gtdbtk - 1.3.0
This version of GTDB-Tk requires a new version of the GTDB-Tk reference package (gtdbtk_r95_data.tar.gz) available here.
Features:
* Updated reference package to use the GTDB Release 95 taxonomy.
* Report if the species-specific ANI circumscription criteria is satisfied in the ani_closest.tsv file output by ani_rep.
* Estimated time until completion has been dampened.
- Python
Published by aaronmussig over 5 years ago
gtdbtk - 1.2.0
Bug fixes:
* (#241) Moved GTDB-Tk entry point to main.py instead of bin/gtdbtk to support execution in some HPC systems (gtdbtk will still be aliased on install).
* (#251) Allow parsing of FastANI v1.0 output files. However, a warning will be displayed to update FastANI.
* (#254) Fixed an issue where --scratch_dir would fail, and not clean-up the mmap file.
Features:
* (#242) Added the decorate command allowing the de novo workflow to be run
* (#244) Added the infer_rank method which establishes the taxonomic ranks of internal nodes of user trees based on RED
* (#248) If the identify command is run on the same directory, genomes which were already processed will be skipped.
* (#248) Improved pplacer output when running the classify command
- Python
Published by aaronmussig over 5 years ago
gtdbtk - 1.1.0
- Bug fixes:
  * In rare cases pplacer would assign an empty taxonomy string which would raise an error.
  * (#229) Genomes using Windows line endings (`\r\n`) would raise an error.
  * (#227) CentOS machines would fail when using `~` in paths.
  * The bac120 symlink was pointing to the archaeal tree when using the `root` command.
- Features:
  * Updated the `gtdb_to_ncbi_majority_vote.py` script for translating taxonomy.
  * (#195) Added the `--pplacer_cpus` argument to specify the number of pplacer threads when running `classify` and `classify_wf`.
  * (#198) The `--debug` flag of `align` outputs aligned markers to disk before trimming.
  * (#225) An optional third column in the `--batchfile` will specify an override to which translation table should be used. Leave blank to automatically determine the translation table (default).
  * (#131) Users can now specify genomes which have NCBI accessions, as long as they are not GTDB-Tk representatives (a warning will be raised).
  * (#191) Added a new command `ani_rep` which calculates the ANI of input genomes to all GTDB representative genomes.
    * This command uses Mash in a pre-filtering step. If pre-filtering is enabled (default) then `mash` will need to be on the system path. To disable pre-filtering use the `--no_mash` flag.
  * (#230) Improved how markers are used in determining the correct domain, and gene selection for the alignment.
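Two of the items above concern text input handling: Windows `\r\n` line endings and the optional third batchfile column. The sketch below shows one way a tab-separated batchfile could be parsed with both in mind. It assumes the first two columns are the FASTA path and the genome ID; the function name and sample paths are hypothetical, and this is not GTDB-Tk's own parser:

```python
def parse_batchfile(text):
    """Parse batchfile text into (fasta_path, genome_id, translation_table) tuples.

    translation_table is None when the third column is absent or blank,
    meaning the table should be determined automatically.
    """
    entries = []
    # splitlines() handles both "\n" and Windows "\r\n" line endings.
    for line in text.splitlines():
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            raise ValueError(f"Expected at least two columns: {line!r}")
        fasta_path, genome_id = columns[0].strip(), columns[1].strip()
        table = columns[2].strip() if len(columns) > 2 else ""
        entries.append((fasta_path, genome_id, int(table) if table else None))
    return entries


# Made-up paths and IDs; both lines use Windows line endings.
example = "/genomes/a.fna\tgenome_a\t4\r\n/genomes/b.fna\tgenome_b\r\n"
entries = parse_batchfile(example)
```

Here `genome_a` carries an explicit translation table override of 4, while `genome_b` leaves the choice to automatic detection.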
- Python
Published by aaronmussig almost 6 years ago
gtdbtk - 1.0.0
- Migrated to Python 3, you must be running at least Python 3.6 or later to use this version.
- `check_install` now does an exhaustive check of the reference data.
- Resolved an issue where gene calling would fail for low quality genomes (#192).
- Improved FastANI multiprocessing performance.
- Third party software versions are reported where possible.
- Python
Published by aaronmussig about 6 years ago
gtdbtk - 0.3.3
- A bug has been fixed which affected classify and classify_wf when using the --batchfile argument with genome IDs that differed from the FASTA filename. This issue resulted in the assigned taxonomy being derived only from tree placement without any ANI calculations being considered. Consequently, in some cases genomes may have been classified as a new species within a genus when they should have been assigned to an existing species. If you have genomes with species assignments this bug did not impact you.
- Progress is now displayed for hmmalign and pplacer.
- Fixed an issue where the root command could not be run independently.
- Improved MSA masking performance.
- Python
Published by aaronmussig over 6 years ago
gtdbtk - 0.3.0
- GTDB-Tk v0.3.0 has been released (we recommend all users update to this version):
- Best translation table displayed in summary file.
- GTDB-Tk now supports gzipped genomes as inputs (--extension .gz).
- By default, GTDB-Tk uses precalculated RED values.
- New option to recalculate RED value during classify step (--recalculate_red).
- New option to export the untrimmed reference MSA files.
- New option to skip_trimming during align step.
- New option to use a custom taxonomy file when rooting a tree.
- New FAQ page available.
- New output structure.
- This version requires a new version of the GTDB-Tk data package (gtdbtk_r89_data.tar.gz) available here
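As an illustration of what reading gzipped genome input involves, here is a minimal sketch using Python's `gzip` module. The helper names are hypothetical and the extension check is a simplifying assumption; GTDB-Tk's own handling may differ:

```python
import gzip


def open_genome(path):
    """Open a FASTA file for text reading, transparently handling gzip."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_sequences(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open_genome(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Opening in text mode (`"rt"`) means downstream code sees the same strings whether or not the file was compressed.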
- Python
Published by aaronmussig over 6 years ago
gtdbtk - 0.2.1
- GTDB-Tk v0.2.1 has been released (we recommend all users update to this version):
- Species classification is now based strictly on the ANI to reference genomes
- The "classify" function now reports the closest reference genome in the summary file even if the ANI is <95%
- The summary.tsv file has 4 new columns: aa_percent, red_values, fastani_reference_radius, and warnings
- By default, the "align" function now performs the same MSA trimming used by the GTDB
- New pplacer support for writing to a scratch file (--mmap-file option)
- Random seed option for MSA trimming has been added to allow for reproducible results
- Configuration of the data directory is now set using the environmental variable GTDBTK_DATA_PATH (see pip installation)
- Perl dependencies have been removed
- Python libraries biolib, mpld3 and jinja have been removed
- This version requires a new version of the GTDB-Tk data package (gtdbtk.r86_v2_data.tar.gz) available here
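The random seed option mentioned above matters because any step that samples alignment columns at random gives different output on each run unless the generator is seeded. The sketch below only illustrates that principle; the function is hypothetical and does not reproduce the trimming procedure GTDB-Tk actually uses:

```python
import random


def sample_columns(n_columns, n_keep, seed=None):
    """Pick n_keep column indices out of n_columns, reproducibly if seeded."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return sorted(rng.sample(range(n_columns), n_keep))


# Same seed, same selection: the two results are identical.
first = sample_columns(1000, 10, seed=42)
second = sample_columns(1000, 10, seed=42)
```

Using a local `random.Random` instance rather than the module-level functions keeps the seeded behaviour isolated from any other code that draws random numbers.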
- Python
Published by pchaumeil almost 7 years ago
gtdbtk - 0.1.6
- The align step in the classify_wf and de_novo_wf functions has been fixed.
- Improved summary file output.
- The "align" function now supports the same custom trimming GTDB will be performing.
- The closest reference genome is now returned in the summary file (even if the ANI is less than 95%).
- Bug fixes.
- Python
Published by pchaumeil about 7 years ago
gtdbtk - 0.1.0
- GTDB-Tk is now using archived (.gz) fna files.
- Optimised for R86 version
- summary.tsv file is now the main output file.
- fastani.tsv file is now combined with summary.tsv.
- red_value.tsv file has been removed.
- Each Pplacer placement on a species branch is now verified by FastANI, and the ANI is compared with all other species in the same genus to check Pplacer accuracy.
- New functionality: "trim_msa" allows trimming of an untrimmed MSA (41155 AA for bac120 and 32675 AA for ar122) based on GTDB-Tk masks
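Mask-based trimming can be pictured as keeping only the alignment columns that a mask marks as retained. The sketch below assumes the mask is a string of `0`/`1` characters with one character per column; that format, the function signature, and the toy alignment are assumptions for illustration rather than a description of GTDB-Tk's mask files:

```python
def trim_msa(sequences, mask):
    """Keep only the alignment columns where the mask character is '1'."""
    trimmed = {}
    for name, seq in sequences.items():
        if len(seq) != len(mask):
            raise ValueError(f"{name}: length {len(seq)} != mask length {len(mask)}")
        trimmed[name] = "".join(res for res, keep in zip(seq, mask) if keep == "1")
    return trimmed


# Toy 6-column alignment and mask, made up for illustration.
msa = {"genome_a": "MK-LVA", "genome_b": "MR-LIA"}
result = trim_msa(msa, "110101")
```

With this mask, columns 3 and 5 are dropped, so every sequence shrinks from six residues to four.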
- Python
Published by pchaumeil over 7 years ago
gtdbtk - 0.0.4-beta
First Beta version of GTDB-Tk
- Python
Published by pchaumeil almost 8 years ago