Recent Releases of veba

veba - VEBA_v2.5.1

[2.5.1] - 2025.04.12

Added

  • Added install-gpu.sh which installs GPU accelerated environments when applicable (i.e., VEBA-binning-prokaryotic_env and VEBA-binning-viral_env)
  • Added Dockerfile-GPU which is experimental

Changed

  • Changed install.sh so it only installs CPU-based environments (Issue #167)
  • Changed containerize_environments.sh so it only installs CPU-based environments (Issue #167)

Deprecated

  • Deprecated VirFinder algorithm in binning-viral.py so now only geNomad is supported

- Python
Published by jolespin about 1 year ago

veba - VEBA_v2.5.0

[2.5.0] - 2025.04.10

Added

  • Added VAMB support to binning-prokaryotic.py (now a default binner) and binning_wrapper.py.
  • Added automatic gzipping of output files based on .gz extension in edgelist_to_clusters.py using pyexeggutor.open_file_writer.
  • Added xxhash dependency to VEBA-binning-prokaryotic_env for bin name reproducibility (Issue #140).
  • Added -e/--exclude and -d/--domain_predictions options to filter_binette_results.py for removing eukaryotic genomes and setting up domain assignments (Issue #153).
  • Added semibin2-[biome] option to binning-prokaryotic.py allowing specification of multiple biomes (e.g., semibin2-global, semibin2-ocean), replacing --semibin2_biome (Issue #155).
  • Added --semibin2_orf_finder option to binning_wrapper.py.
  • Added genome_statistics.tsv.gz, gene_statistics.cds.tsv.gz, gene_statistics.rRNA.tsv.gz, and gene_statistics.tRNA.tsv.gz outputs to essentials.py.
  • Added --identifiers, --index_name, and --no_header options to convert_metabat2_coverage.py for broader applicability, including VAMB.
  • Added -l eukaryota_odb12 as default but also allow --auto-lineage-euk for BUSCO in binning-eukaryotic.py
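
The extension-based gzipping mentioned for edgelist_to_clusters.py can be sketched as follows. This is a minimal illustration, not the pyexeggutor implementation: open_writer is a hypothetical helper, and the file names are made up for the demo.

```python
import gzip
import os
import tempfile


def open_writer(path, encoding="utf-8"):
    """Hypothetical helper: return a text-mode handle that is
    gzip-compressed when the path ends in .gz, plain otherwise."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "wt", encoding=encoding)
    return open(path, "w", encoding=encoding)


# Demo: the same writing code produces plain or compressed output
# depending only on the file name.
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "clusters.tsv")
    gz_path = os.path.join(tmp, "clusters.tsv.gz")
    for path in (plain_path, gz_path):
        with open_writer(path) as handle:
            handle.write("node_a\tcluster_1\n")

    # Gzip files start with the two magic bytes 0x1f 0x8b.
    with open(gz_path, "rb") as handle:
        gz_magic = handle.read(2)
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        roundtrip = handle.read()
```

Keeping the decision in one helper means callers never branch on compression themselves.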

Changed

  • Changed binning-eukaryotic.py behavior to provide a solution to BUSCO Issue #447
  • Changed CHANGELOG.md format to follow the Keep a Changelog best practice
  • Changed prodigal-gv to pyrodigal-gv in multithreaded mode for binning-viral.py for performance.
  • Removed metacoag from the default set of binning algorithms in binning-prokaryotic.py.
  • Updated geNomad to v1.11.0 and geNomad database to v1.8 to resolve numpy import errors (Issue #160).
  • Updated Pyrodigal usage in binning-eukaryotic.py for organelles to allow piping and threading.
  • Updated BUSCO to v5.8.3 and associated databases.
  • Updated Tiara to Tiara-NAL in VEBA-binning-prokaryotic_env and VEBA-binning-eukaryotic_env to enable stdin usage.
  • Updated biosynthetic.py to use antiSMASH v7 (Issue #159).
  • Changed behavior when --taxon fungi is specified: precomputed genes are not used due to formatting issues.
  • Simplified the method for adding headers to Diamond outputs in biosynthetic.py.
  • Changed Dockerfile working directory from /tmp/ to /home/.
  • Integrated Tiara and consensus_domain_classification.py into the binette step of binning-prokaryotic.py.
  • Renamed database identifier from VDB to VEBA-DB.
  • Updated CheckM2 and Binette versions in binning-prokaryotic.py.
  • Updated CheckM2 Diamond database included in VEBA-DB_v9 (Issue #154).
  • Removed usage of precomputed genes in the SemiBin2 wrapper due to SemiBin2/issue-#185.
  • Allowed faulty return codes in iterative mode for binette to permit convergence in genome recovery.

Fixed

  • Fixed CONDA_ENVS_PATH detection in the veba controller executable to correctly handle environments outside the base Conda directory.
  • Fixed bug where VFDB hits were incorrectly counted as MIBiG in biosynthetic.py (Issue #141).
  • Fixed --tta_threshold argument in biosynthetic.py which was previously defined but not connected to the command execution.
  • Removed capitalization from column headers in filter_binette_results.py output.
  • Fixed missing --antismash_options argument connection in biosynthetic.py.

Removed

  • Removed CONCOCT support from binning-eukaryotic.py.

Deprecated

  • Deprecated amplicon.py module in favor of external pipelines like nf-core/ampliseq.

- Python
Published by jolespin about 1 year ago

veba - VEBA_v2.4.2

v2.4.2 fixed a small bug where the de Bruijn graph for MEGAHIT wasn't included in the output directory if the graph was created.

  • [2025.2.1] - Added --megahit_build_de_bruijn_graph to make de-Bruijn graph construction for MEGAHIT optional in assembly.py

- Python
Published by jolespin over 1 year ago

veba - VEBA_v2.4.1

  • [2025.2.1] - Added --megahit_build_de_bruijn_graph to make de-Bruijn graph construction for MEGAHIT optional in assembly.py

- Python
Published by jolespin over 1 year ago

veba - VEBA_v2.4.0

  • [2025.1.24] - Added Initial_bins to Binette results in filter_binette_results.py
  • [2025.1.23] - Added essentials.py module
  • [2025.1.16] - Added --serialized_annotations to append_annotations_to_gff.py to avoid overhead from reparsing the annotations
  • [2025.1.15] - Fixed bug in binning_wrapper.py where script was looking for bins in the wrong directory for MetaCoAG
  • [2025.1.14] - Fixed bug in merge_annotations.py where diamond outputs were queried incorrectly
  • [2025.1.5] - Changed default --busco_completeness from 50 to 30 in binning-eukaryotic.py
  • [2025.1.5] - Added --busco_options and --busco_offline arguments for binning-eukaryotic.py
  • [2024.12.28] - Added --semibin2_sequencing_type to binning_wrapper.py and added functionality for --long_reads. Moved --long_reads argument to parser_io instead of parser_featurecounts
  • [2024.12.27] - Fixed issue in consensus_domain_classification.py where softmax returns a np.array instead of a pd.DataFrame
  • [2024.12.26] - Added support for precomputed coverage for metadecoder in binning_wrapper.py
  • [2024.12.26] - Added support for binette and tiara in updated binning_prokaryotic.py module
  • [2024.12.23] - Added copy_attribute_in_gff.py script which copies attributes to a source and destination attribute
  • [2024.12.17] - Added filter_binette_results.py script
  • [2024.12.16] - Added intermediate directory to metacoag in binning_wrapper.py
  • [2024.12.12] - Added metacoag support and custom HMM support to metadecoder in binning_wrapper.py
  • [2024.12.11] - Added prepend_de-bruijn_path.py script and use this in assembly.py and assembly-long.py to prepend prefix to SPAdes/Flye de Bruijn graph paths.
  • [2024.12.10] - Changed default --minimum_genome_size to 200000 from 150000
  • [2024.12.9] - Added support for SemiBin2 and MetaDecoder in binning_wrapper.py
  • [2024.11.21] - Updated --cluster_label_mode default to md5 instead of numeric to allow for easier cluster updates post hoc. Change reflected in cluster.py, global_clustering.py, local_clustering.py, and update_genome_clusters.py
  • [2024.11.18] - Added update_genome_clusters.py which runs skani against all reference genome clusters. Does not do protein clustering nor does it update the graph, representatives, or proteins.
  • [2024.11.15] - Added --header simple to diamond output in annotate.py and accounted for change in merge_annotations.py
  • [2024.11.11] - Added Enzymes to append_annotations_to_gff.py script
  • [2024.11.9] - Added kofam.enzymes.list and kofam.pathways.list in VDB_v8.1 to provide subsets for pykofamsearch
  • [2024.11.8] - Updating VEBA database VDB_v8 to VDB_v8.1 which adds serialized KOfam with enzyme support
  • [2024.11.8] - Added Enzymes to annotate.py and merge_annotations.py [!untested]
  • [2024.11.7] - Updated pyhmmsearch and pykofamsearch version in VEBA-annotate_env.yml, VEBA-classify-eukaryotic_env.yml, VEBA-database_env, and VEBA-phylogeny_env. Also updated executables in annotate.py, classify-eukaryotic.py, phylogeny.py, and download_databases-annotate.sh.
  • [2024.11.7] - In edgelist_to_clusters.py, added --cluster_label_mode {"numeric", "random", "pseudo-random", "md5", "nodes"} to allow for different types of labels. Added --threshold2 option for a second weight.
  • [2024.11.7] - Added --wrap to fasta_utility.py and split id and descriptions in header so prefix/suffix is only added to id.
  • [2024.11.7] - Added prepend_gff.py to prepend a prefix to contig and attribute identifiers
  • [2024.11.7] - Changed default --skani_minimum_af to 50 from 15 as this is used in GTDB-Tk for determining species-level clusters in cluster.py, global_clustering.py, and local_clustering.py
  • [2024.11.6] - Added append_annotations_to_gff.py script
  • [2024.10.29] - Changed manual mode to metaeuk mode for preexisting metaeuk results
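
As a rough illustration of why the md5 label mode is easier to work with than numeric labels, the sketch below derives a label from a cluster's members, so it does not depend on the order in which clusters are enumerated. What VEBA actually feeds to md5 is not stated in these notes; hashing the sorted, newline-joined node identifiers is an assumption, and md5_cluster_label is a hypothetical name.

```python
import hashlib


def md5_cluster_label(nodes, prefix=""):
    """Hypothetical labeler: hash the sorted member identifiers so the
    label is the same regardless of input order."""
    joined = "\n".join(sorted(nodes))
    return prefix + hashlib.md5(joined.encode("utf-8")).hexdigest()


clusters = [{"genome_b", "genome_a"}, {"genome_c"}]
labels = [md5_cluster_label(members) for members in clusters]
```

Under this assumption a label changes whenever cluster membership changes, so it identifies a specific set of members rather than a position in a list.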

- Python
Published by jolespin over 1 year ago

veba - VEBA_v2.3.0

  • [2024.9.21] - Added KEGG Pathway Profiler to VEBA-database_env and VEBA-annotate_env which replaces MicrobeAnnotator-KEGG for module completion ratios. Replacing ${VEBA_DATABASE}/Annotate/MicrobeAnnotator-KEGG with ${VEBA_DATABASE}/Annotate/KEGG-Pathway-Profiler/ database files. Note: New module completion ratio output does not have class labels for KEGG modules.
  • [2024.8.30] - Added ${N_JOBS} to download scripts with default set to maximum threads available

- Python
Published by jolespin over 1 year ago

veba - VEBA_v2.2.1

  • [2024.8.29] - Added VERSION file created in download_databases.sh
  • [2024.7.11] - Alignment fraction threshold for genome clustering was only applied to the reference but should also apply to the query. Added --af_mode with either relaxed = max([Alignment_fraction_ref, Alignment_fraction_query]) > minimum_af or strict = (Alignment_fraction_ref > minimum_af) & (Alignment_fraction_query > minimum_af) to edgelist_to_clusters.py, global_clustering.py, local_clustering.py, and cluster.py.
  • [2024.7.3] - Added pigz to VEBA-annotate_env which isn't a problem with most conda installations but needed for docker containers.
  • [2024.6.21] - Changed choose_fastest_mirror.py to determine_fastest_mirror.py
  • [2024.6.20] - Added -m/--include_mrna to compile_metaeuk_identifiers.py for Issue #110

- Python
Published by jolespin over 1 year ago

veba - VEBA_v2.2.0

Disclaimer: I made some large updates in this version and I believe everything has been adequately tested, but in case anything has slipped through the cracks you can use v2.1.0, which has been thoroughly tested in accordance with the NAR Espinoza 2024 paper. Benefits of using this version include much faster and more robust prokaryotic classifications and fast/scalable HMM-based annotation modeling.

Large performance updates for this version, including:

  • Updated GTDB-Tk 2.3.0 -> 2.4.0, which means the GTDB needed to be updated from r214.1 -> r220
  • VEBA-classify_env was split up into VEBA-classify-eukaryotic_env, VEBA-classify-prokaryotic_env, and VEBA-prokaryotic_env
  • annotate.py, classify-eukaryotic.py, and phylogeny.py (and their utility scripts) were rewritten to use PyHMMER (pyhmmsearch and pykofamsearch), which is faster than HMMSearch when multithreaded
  • KOFAM was changed to KOfam

- Python
Published by jolespin almost 2 years ago

veba - VEBA_v2.1.0-zen

This is the exact same version as VEBA_v2.1.0. New VEBA releases will now automatically be synced to Zenodo.

- Python
Published by jolespin almost 2 years ago

veba - VEBA_v2.1.0

Official release of VEBA v2.1.0 with updates to address peer reviewers. Mostly documentation but also including the following:

  • [2024.4.30] - Added concatenate_files.py which can concatenate files (and mixed compressed/decompressed files) using either arguments, list file, or glob. Reason for this is that unix has a limit of arguments that can be used (e.g., cat *.fasta > output.fasta where *.fasta results in 50k files will crash)
  • [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
  • [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
  • [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
  • [2024.4.18] - Developed a faster CLI implementation of KofamScan called PyKofamSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.4.18] - Developed a faster CLI implementation of HMMSearch called PyHMMSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
  • [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
  • [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
  • [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
  • [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
  • [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
  • YouTube channel (https://www.youtube.com/@VEBA-Multiomics)
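
The motivation given for concatenate_files.py (a shell glob such as cat *.fasta can exceed the operating system's argument limit) can be illustrated by expanding the pattern inside the process instead. This is a sketch under that idea only; the function names are hypothetical and it is not the script's actual implementation.

```python
import glob
import gzip
import os
import shutil
import tempfile


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing when it ends in .gz."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def concatenate(pattern, output_path):
    """Concatenate every file matching the glob pattern into output_path.
    The pattern is expanded here, so no long argument list reaches the shell."""
    paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for path in paths:
            with open_maybe_gzip(path) as handle:
                shutil.copyfileobj(handle, out)
    return len(paths)


# Demo with one plain and one gzipped FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.fasta"), "wb") as handle:
        handle.write(b">a\nACGT\n")
    with gzip.open(os.path.join(tmp, "b.fasta.gz"), "wb") as handle:
        handle.write(b">b\nTTGA\n")

    output_path = os.path.join(tmp, "combined.fa")
    n_files = concatenate(os.path.join(tmp, "*.fasta*"), output_path)
    with open(output_path, "rb") as handle:
        combined = handle.read()
```

The output name in the demo is chosen so that it does not match the input pattern; otherwise the function would try to read its own output.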

- Python
Published by jolespin about 2 years ago

veba - VEBA_v2.1.0b (pre-release)

Beta release of VEBA v2.1.0b with updates to address peer reviewers. Mostly documentation but also including the following:

  • [2024.4.30] - Added concatenate_files.py which can concatenate files (and mixed compressed/decompressed files) using either arguments, list file, or glob. Reason for this is that unix has a limit of arguments that can be used (e.g., cat *.fasta > output.fasta where *.fasta results in 50k files will crash)
  • [2024.4.29] - Added /volumes/workspace/ directory to Docker containers for situations when your input and output directories are the same.
  • [2024.4.29] - featureCounts can only handle 64 threads at a time so added min(64, opts.n_jobs) for all the modules/scripts that use featureCounts commands.
  • [2024.4.23] - Added uniprot_to_enzymes.py which reformats tables and fasta from https://www.uniprot.org/uniprotkb?query=ec%3A*
  • [2024.4.18] - Developed a faster implementation of KofamScan called PyKofamSearch which leverages PyHMMER. This will be used in future versions of VEBA.
  • [2024.3.26] - Added --metaeuk_split_memory_limit to metaeuk_wrapper.py.
  • [2024.3.26] - Added -d/--genome_identifier_directory_index to scaffolds_to_bins.py for directories that are structured path/to/genomes/bin_a/reference.fasta where you would use -d -2.
  • [2024.3.26] - Added --minimum_af to edgelist_to_clusters.py with an option to accept 4 column inputs [id_1]<tab>[id_2]<tab>[weight]<tab>[alignment_fraction]. global_clustering.py, local_clustering.py, and cluster.py now use this by default --af_threshold 30.0. If you want to retain previous behavior, just use --af_threshold 0.0.
  • [2024.3.18] - edgelist_to_clusters.py only includes edges where both nodes are in identifiers set. If --identifiers are provided, then only those identifiers are used. If not, then it includes all nodes.
  • [2024.3.18] - Added --export_representatives argument for edgelist_to_clusters.py to output table with [id_node]<tab>[id_cluster]<tab>[intra-cluster_connectivity]<tab>[representative]. Also includes this information in nx.Graph objects.
  • [2024.3.18] - Changed singleton weight to np.nan instead of np.inf for edgelist_to_clusters.py to allow for representative calculations.
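
To make the 4-column edge list and the alignment-fraction filter concrete, here is a small sketch that keeps only edges passing both thresholds and then groups nodes into connected components. Treating clusters as connected components, the >= comparisons, and the threshold default of 95.0 are assumptions for illustration; cluster_edgelist is a hypothetical function, not the code in edgelist_to_clusters.py.

```python
def cluster_edgelist(rows, threshold=95.0, minimum_af=30.0, identifiers=None):
    """Cluster (id_1, id_2, weight, alignment_fraction) rows with union-find.
    Edges are used only if both nodes are in the identifier set and both
    the weight and the alignment fraction reach their thresholds."""
    if identifiers is not None:
        nodes = set(identifiers)
    else:
        nodes = {node for row in rows for node in row[:2]}
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for id_1, id_2, weight, alignment_fraction in rows:
        if id_1 not in nodes or id_2 not in nodes:
            continue
        if weight >= threshold and alignment_fraction >= minimum_af:
            root_1, root_2 = find(id_1), find(id_2)
            if root_1 != root_2:
                parent[root_2] = root_1

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


rows = [
    ("g1", "g2", 98.0, 85.0),
    ("g2", "g3", 97.0, 12.0),  # high ANI but low alignment fraction
    ("g3", "g4", 96.5, 60.0),
]
with_af_filter = cluster_edgelist(rows, minimum_af=30.0)
without_af_filter = cluster_edgelist(rows, minimum_af=0.0)
subset_only = cluster_edgelist(rows, identifiers={"g1", "g2", "g3"})
```

Setting minimum_af to 0.0 merges everything into one cluster in this example, which mirrors the note that a threshold of 0.0 retains the previous behavior.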

- Python
Published by jolespin about 2 years ago

veba - VEBA_v2.0.0

  • Changed default assembly algorithm to metaflye instead of flye in assembly-long.py
  • Added number_of_genomes, number_of_genome-clusters, number_of_proteins, and number_of_protein-clusters to feature_compression_ratios.tsv.gz from cluster.py
  • Added -A/--from_antismash in biosynthetic.py to use preexisting antiSMASH results. Also changed -i/--input to -i/--from_genomes.
  • Changed antismash_genbanks_to_table.py to biosynthetic_genbanks_to_table.py for future support of DeepBGC and GECCO
  • Added busco_version parameter to merge_busco_json.py with default set to 5.4.x and additional support for 5.6.x.
  • Added CONDA_ENVS_PATH to update_environment_scripts.sh, update_environment_variables.sh, and check_installation.sh
  • Added CONDA_ENVS_PATH to veba to allow for custom environment locations
  • Changed install.sh to support custom CONDA_ENVS_PATH argument bash install.sh path/to/log path/to/envs/
  • Added merge_counts_with_taxonomy.py

- Python
Published by jolespin about 2 years ago

veba - VEBA_v1.5.0

Warning: For this release, use the https://github.com/jolespin/veba/releases/download/v1.5.0/v1.5.0.zip asset not the "Source code" assets as those are out of date.

Release v1.5.0 Highlights:

  • Added VeryFastTree to phylogeny.py
  • Added --blacklist to compile_eukaryotic_classifications.py
  • Added compatibility for antismash_genbanks_to_table.py to operate on antiSMASH v7 genbanks
  • Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239)
  • Fixed error in annotations.protein_clusters.tsv formatting from annotate.py
  • Fixed situation where unbinned.fasta was not added in binning-prokaryotic.py and bad symlinks were created for GFF, rRNA, and tRNA when no genomes were detected.
  • Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.

Release v1.5.0 Details:

  • Cleaned up installation files
  • Changed veba/src/ to veba/bin/
  • Changed SCRIPT_VERSIONS to VEBA_SCRIPT_VERSIONS which are now in bin/ of conda environment
  • Fixed header being offset in annotations.protein_clusters.tsv where it could not be read with Pandas.
  • Fixed the creation of non-existing symlinks in binning-prokaryotic.py where "'*.gff'", "'*.rRNA'", and "'*.tRNA'" were created.
  • Fixed .strip method on Pandas series in antismash_genbanks_to_table.py for compatibility with antiSMASH 6 and 7
  • Fixed situation where unbinned.fasta is empty in binning-prokaryotic.py when there are no bins that pass qc.
  • Fixed minor error in coverage.py where samtools sort --reference was getting reads_table.tsv and not reference.fasta
  • Changed default behavior from deterministic to not deterministic for increase in speed in assembly-long.py (i.e., --no_deterministic --> --deterministic)
  • Added VeryFastTree as an option to phylogeny.py with FastTree remaining as the default.
  • Changed default --leniency parameter on classify_eukaryotic.py and consensus_genome_classification_ranked.py to 1.0 and added --leniecy_genome_classification as a separate option.
  • Added --blacklist option to compile_eukaryotic_classifications.py with a default value of species:uncultured eukaryote in classify_eukaryotic.py
  • Fixed critical error where classify_eukaryotic.py was trying to access a deprecated database file from MicroEuk_v2.
  • Fixed minor error with eukaryotic_gene_modeling_wrapper.py not allowing for Tiara to run in backend.
  • Added compile_phylogenomic_functional_categories.py script which automates the methodology from Espinoza et al. 2022 (doi:10.1093/pnasnexus/pgac239) (https://academic.oup.com/pnasnexus/article/1/5/pgac239/6762943)

- Python
Published by jolespin over 2 years ago

veba - VEBA_v1.4.2

  • [2023.12.21] - GTDB-Tk changed name of archaea summary file so VEBA was not adding this to final classification. Fixed this in classify-prokaryotic.py.
  • [2023.12.20] - Fixed files not being closed in compile_custom_humann_database_from_annotations.py and added options to use different annotation file formats (i.e., multilevel, header, and no header).

- Python
Published by jolespin over 2 years ago

veba - VEBA_v1.4.1

Release v1.4.1 Highlights:

  • VEBA Modules:

    • Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database for taxonomic abundance.
    • Added long read support for fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
    • Redesigned binning-eukaryotic module to handle custom MetaEuk databases
    • Added new usage syntax veba --module preprocess --params "${PARAMS}" where the Conda environment is abstracted and determined automatically in the backend. Changed all the walkthroughs to reflect this change.
    • Added skani which is the new default for genome-level clustering based on ANI.
    • Added Diamond DeepClust as an alternative to MMSEQS2 for protein clustering.
  • VEBA Database (VDB_v6):

    • Completely rebuilt VEBA's Microeukaryotic Protein Database to produce a clustered database MicroEuk100/90/50 similar to UniRef100/90/50. Available on doi:10.5281/zenodo.10139450.
    • Number of sequences:

      • MicroEuk100 = 79,920,431 (19 GB)
      • MicroEuk90 = 51,767,730 (13 GB)
      • MicroEuk50 = 29,898,853 (6.5 GB)
    • Number of source organisms per dataset:

      • MycoCosm = 2503
      • PhycoCosm = 174
      • EnsemblProtists = 233
      • MMETSP = 759
      • TARA_SAGv1 = 8
      • EukProt = 366
      • EukZoo = 27
      • TARA_SMAGv1 = 389
      • NR_Protists-Fungi = 48217

Release v1.4.0 Details:

  • [2023.12.15] - Added profile-taxonomic.py module which uses sylph to build a sketch database for genomes and queries the genome database similar to Kraken for taxonomic abundance.
  • [2023.12.14] - Removed requirement to have --estimated_assembly_size for Flye per Flye Issue #652 (https://github.com/fenderglass/Flye/issues/652).
  • [2023.12.14] - Added sylph to VEBA-profile_env for abundance profiling of genomes.
  • [2023.12.13] - Dereplicate duplicate contigs in concatenate_fasta.py.
  • [2023.12.12] - Added --reference_gzipped to index.py and mapping.py with new default being that the reference fasta is not gzipped.
  • [2023.12.11] - Added skani as new default for genome clustering in cluster.py, global_clustering.py, and local_clustering.py.
  • [2023.12.11] - Added support for long reads in fastq_preprocessor, preprocess.py, assembly-long.py, coverage-long, and all binning modules.
  • [2023.11.28] - Fixed annotations.protein_clusters.tsv.gz from merge_annotations.py added in patch update of v1.3.1.
  • [2023.11.14] - Added support for missing values in compile_eukaryotic_classifications.py.
  • [2023.11.13] - Added --metaeuk_split_memory_limit argument with (experimental) default set to 36G in binning-eukaryotic.py and eukaryotic_gene_modeling.py.
  • [2023.11.10] - Added --compressed 1 to mmseqs createdb in download_databases.sh installation script.
  • [2023.11.10] - Added a check to check_fasta_duplicates.py and clean_fasta.py to make sure there are no > characters in fasta sequence caused from concatenating fasta files that are missing linebreaks.
  • [2023.11.10] - Added Diamond DeepClust to clustering_wrapper.py, global/local_clustering.py, and cluster.py. Changed mmseqs2_wrapper.py to clustering_wrapper.py. Changed easy-cluster and easy-linclust to mmseqs-cluster and mmseqs-linclust.
  • [2023.11.9] - Fixed viral quality in merge_genome_quality_assessments.py
  • [2023.11.3] - Changed consensus_genome_classification.py to consensus_genome_classification_ranked.py. Also, default behavior to allow for missing taxonomic levels.
  • [2023.11.2] - Fixed the merge_annotations.py resulting in a memory leak when creating the annotations.protein_clusters.tsv.gz output table. However, still need to correct the formatting for empty sets and string lists.

- Python
Published by jolespin over 2 years ago

veba - VEBA_v1.3.0

Release v1.3.0:

  • VEBA Modules:

    • Added profile-pathway.py module and associated scripts for building HUMAnN databases from de novo genomes and annotations. Essentially, a reads-based functional profiling method via HUMAnN using binned genomes as the database.
    • Added marker_gene_clustering.py script which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to only that genome cluster. Clusters in either protein or nucleotide space.
    • Added module_completion_ratios.py script which calculates KEGG module completion ratios for genomes and pangenomes. Automatically run in backend of annotate.py.
    • Updated annotate.py and merge_annotations.py to provide better annotations for clustered proteins.
    • Added merge_genome_quality.py and merge_taxonomy_classifications.py which compiles genome quality and taxonomy, respectively, for all organisms.
    • Added BGC clustering in protein and nucleotide space to biosynthetic.py. Also, produces prevalence tables that can be used for further clustering of BGCs.
    • Added pangenome_core_sequences in cluster.py writes both protein and CDS sequences for each genome cluster.
    • Added PDF visualization of newick trees in phylogeny.py.
  • VEBA Database (VDB_v5.2):

    • Added CAZy
    • Added MicrobeAnnotator-KEGG

Release v1.3.0 Details:

  • Updated annotate.py and merge_annotations.py to handle CAZy. They also properly address clustered protein annotations now.
  • Added module_completion_ratio.py script which is a fork of MicrobeAnnotator ko_mapper.py (https://github.com/cruizperez/MicrobeAnnotator/blob/master/microbeannotator/pipeline/ko_mapper.py). Also included a database, Zenodo: 10020074 (https://zenodo.org/records/10020074), which will be included in VDB_v5.2
  • Added a checkpoint for tRNAscan-SE in binning-prokaryotic.py and eukaryotic_gene_modeling_wrapper.py.
  • Added profile-pathway.py module and VEBA-profile_env environments which is a wrapper around HUMAnN for the custom database created from annotate.py and compile_custom_humann_database_from_annotations.py
  • Added GenoPype version to log output
  • Added merge_genome_quality.py which combines CheckV, CheckM2, and BUSCO results.
  • Added compile_custom_humann_database_from_annotations.py which compiles a HUMAnN protein database table from the output of annotate.py and taxonomy classifications.
  • Added functionality to merge_taxonomy_classifications.py to allow for --no_domain and --no_header which will serve as input to compile_custom_humann_database_from_annotations.py
  • Added marker_gene_clustering.py script which gets core marker genes unique to each SLC (i.e., pangenome). Added average_number_of_copies_per_genome to protein clusters.
  • Added --minimum_core_prevalence in global_clustering.py, local_clustering.py, and cluster.py which sets the prevalence ratio at which protein clusters in an SLC are considered core. Also removed --no_singletons from cluster.py to avoid complications with marker genes. Relabeled --input to --genomes_table in clustering scripts/module.
  • Added a check in coverage.py to see if the mapped.sorted.bam files have been created; if they have, they are skipped. Not yet implemented for GNU parallel option.
  • Changed default representative sequence format from table to fasta for mmseqs2_wrapper.py.
  • Added --nucleotide_fasta_output to antismash_genbank_to_table.py which outputs the actual BGC DNA sequence. Changed --fasta_output to --protein_fasta_output and added output to biosynthetic.py. Changed BGC component identifiers to [bgc_id]_[position_in_bgc]|[start]:[end]([strand]) to match with MetaEuk identifiers. Changed bgc_type to protocluster_type. biosynthetic.py now supports GFF files from MetaEuk (exon and gene features not supported by antiSMASH). Fixed error related to antiSMASH adding CDS (i.e., allorf_[start]_[end]) that are not in GFF so antismash_genbank_to_table.py failed in those cases.
  • Added ete3 to VEBA-phylogeny_env.yml and automatically renders trees to PDF.
  • Added presets for MEGAHIT using the --megahit_preset option.
  • The change for using --mash_db with GTDB-Tk violated the assumption that all prokaryotic classifications had a msa_percent field which caused the cluster-level taxonomy to fail. compile_prokaryotic_genome_cluster_classification_scores_table.py fixes this by using fastani_ani as the weight when genomes were classified using ANI and msa_percent for everything else. Initial error caused unclassified prokaryotic for all cluster-level classifications.
  • Fixed small error where empty gff files with an asterisk in the name were created for samples that didn't have any prokaryotic MAGs.
  • Fixed critical error where descriptions in header were not being removed in eukaryota.scaffolds.list and did not remove eukaryotic scaffolds in seqkit grep so DAS_Tool output eukaryotic MAGs in identifier_mapping.tsv and __DASTool_scaffolds2bin.no_eukaryota.txt
  • Fixed krona.html in biosynthetic.py which was being created incorrectly from compile_krona.py script.
  • Created pangenome_core_sequences in global_clustering.py and local_clustering.py which writes both protein and CDS sequences for each SLC. Also made default in cluster.py to NOT do local clustering, switching --no_local_clustering to --local_clustering.
  • Fixed pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects in biosynthetic.py, raised when Diamond finds multiple regions in one hit that match. Added --sort_by and --ascending to concatenate_dataframes.py along with automatic detection and removal of duplicate indices. Also added --sort_by bitscore in biosynthetic.py.
  • Added core pangenome and singleton hits to clustering output
  • Updated --megahit_memory default from 0.9 to 0.99
  • Fixed error in genomad_taxonomy_wrapper.py where viral_taxonomy.tsv should have been taxonomy.tsv.
  • Fixed minor error in assembly.py, caused by incorrect string formatting, that prevented users from using SPAdes programs other than spades.py, metaspades.py, or rnaspades.py.
  • Updated bowtie2 in preprocess, assembly, and mapping modules. Updated fastp and fastq_preprocessor in preprocess module.
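
The idea behind marker_gene_clustering.py, as described in these notes (protein clusters present in every genome of a genome cluster and absent from all other genome clusters), can be sketched with plain set operations. The function name and the toy data are hypothetical, and the real script may differ in details such as prevalence thresholds.

```python
def marker_protein_clusters(genome_to_proteins, genome_to_slc):
    """Return, per genome cluster (SLC), the protein clusters found in all
    of its genomes and in no genome belonging to a different SLC."""
    slc_to_genomes = {}
    for genome, slc in genome_to_slc.items():
        slc_to_genomes.setdefault(slc, []).append(genome)

    markers = {}
    for slc, genomes in slc_to_genomes.items():
        # Core: shared by every genome in this SLC.
        core = set.intersection(*(set(genome_to_proteins[g]) for g in genomes))
        # Everything observed in genomes outside this SLC.
        outside = set()
        for genome, other_slc in genome_to_slc.items():
            if other_slc != slc:
                outside |= set(genome_to_proteins[genome])
        markers[slc] = core - outside
    return markers


genome_to_proteins = {
    "g1": {"pc1", "pc2", "pc3"},
    "g2": {"pc1", "pc2"},
    "g3": {"pc2", "pc4"},
}
genome_to_slc = {"g1": "SLC-0", "g2": "SLC-0", "g3": "SLC-1"}
markers = marker_protein_clusters(genome_to_proteins, genome_to_slc)
```

In the toy data, pc2 is core to both genome clusters and therefore marks neither, while pc3 is unique to SLC-0 but missing from g2, so it is not core.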

- Python
Published by jolespin over 2 years ago

veba - VEBA_v1.2.0

Release v1.2.0:

  • Fixed minor error in binning-prokaryotic.py where the --veba_database argument wasn't utilized and only the environment variable VEBA_DATABASE could be used.
  • Updated the Docker images to have /volumes/input, /volumes/output, and /volumes/database directories to mount.
  • Replaced prodigal with pyrodigal as it is faster and under active development.
  • Added support for missing classifications in compile_krona.py and consensus_genome_classification.py.
  • Updated GTDB-Tk from version 2.1.3 → 2.3.0 and GTDB from version r202_v2 → r214. Changed ${VEBA_DATABASE}/Classify/GTDBTk → ${VEBA_DATABASE}/Classify/GTDB. Added gtdb_r214.msh to GTDB database for ANI screening.
  • Added pangenome and singularity tables to cluster.py (and associated global/local clustering scripts) to output automatically.
  • Added compile_gff.py to merge CDS, rRNA, and tRNA GFF files. Used in binning-prokaryotic.py and binning-viral.py. binning-eukaryotic.py uses the source of this in the backend of filter_busco_results.py. Includes GC content for contigs and various tags.
  • Updated BUSCO v5.3.2 -> v5.4.3 which changes the json output structure and made the appropriate changes in filter_busco_results.py.
  • Added eukaryotic_gene_modeling_wrapper.py which 1) splits nuclear, mitochondrial, and plastid genomes; 2) performs gene modeling via MetaEuk and Pyrodigal; 3) performs rRNA detection via BARRNAP; 4) performs tRNA detection via tRNAscan-SE; 5) merges processed GFF files; and 6) calculates sequence statistics.
  • Added gene_biotype=protein_coding to P(y)rodigal(-GV) GFF output.
  • Added VFDB to annotate.py and database.
  • Compiled and pushed gtdb_r214.msh mash file to Zenodo:8048187 which is now used by default in classify-prokaryotic.py. It is now included in VDB_v5.1.
  • Cleaned up global and local clustering intermediate files. Added pangenome tables and singleton information to outputs.
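
The first item in this release concerns how the database location is resolved. A minimal sketch of that kind of fallback, assuming the command-line argument takes precedence over the environment variable. The function name resolve_database is hypothetical and this is not the code in binning-prokaryotic.py; only --veba_database and VEBA_DATABASE come from the notes.

```python
import argparse
import os


def resolve_database(argv=None, environ=None):
    """Return the database directory from --veba_database, else VEBA_DATABASE."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--veba_database", default=None)
    args = parser.parse_args(argv)
    # Assumed precedence: explicit argument first, environment variable second
    database = args.veba_database or environ.get("VEBA_DATABASE")
    if database is None:
        raise SystemExit("Provide --veba_database or set VEBA_DATABASE")
    return database
```

Passing argv and environ explicitly keeps the function easy to exercise without touching the real process environment.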

- Python
Published by jolespin almost 3 years ago

veba - VEBA_v1.1.2

Release v1.1.2
  • Created Docker images for all modules
  • Replaced all absolute path symlinks with relative symlinks
  • Changed prokaryotic_taxonomy.tsv and prokaryotic_taxonomy.clusters.tsv in classify-prokaryotic.py (along with eukaryotic and viral) files to taxonomy.tsv and taxonomy.clusters.tsv for uniformity.
  • Updating all symlinks to relative links (also in fastq_preprocessor) to prepare for dockerization and updating all environments to use updated GenoPype 2023.4.13.
  • Changed nr to uniref in annotate.py and added propagate_annotations_from_representatives.py script while simplifying merge_annotations_and_taxonomy.py to merge_annotations.py and excluding taxonomy operations.
  • Changed nr to UniRef90 and UniRef50 in VDB_v5
  • Changed orfs_to_orthogroups.tsv to proteins_to_orthogroups.tsv for consistency with the cluster.py module. Will eventually find some consistency with scaffolds_to_bins/scaffolds_to_mags but this will be later.
  • Added a scaffolds_to_mags.tsv in the clustering output.
  • Added convert_counts_table.py which converts a counts table (and metadata) to Pandas pickle, Anndata h5ad, or Biom hdf5
  • Fixed output directory for mapping.py which now uses output_directory/${NAME} structure like binning-*.py.
  • Removed "python" prefix for script calls and now uses shebang in script for executable. Also added single quotes around the script filepath (e.g., '[script_filepath]') to escape characters/spaces in filepath.
  • Added support for index.py to accept individual --references [file.fasta] and --gene_models [file.gff].
  • Added stdin support for scaffolds_to_bins.py along with the ability to input genome tables [id_genome][filepath]. Also added progress bars.
  • As a result of issues/22, assembly.py, assembly-sequential.py, binning-*.py, and mapping.py will use -p --countReadPairs for featureCounts and updates subread 2.0.1 -> subread 2.0.3. For binning-*.py, long reads can be used with the --long_reads flag.
  • Updated cluster.py and associated global_clustering.py/local_clustering.py scripts to use mmseqs2_wrapper.py which now automatically outputs representative sequences.
  • Added check_fasta_duplicates.py script that gives 0 and 1 exit codes for fasta without and with duplicates, respectively. Added reformat_representative_sequences.py to reformat representative sequences from MMSEQS2 into either a table or fasta file where the identifiers are cluster labels. Removed --dbtype from [global/local]_clustering.py. Removed appended prefix for .graph.pkl and dict.pkl in edgelist_to_clusters.py. Added mmseqs2_wrapper.py and hmmer_wrapper.py scripts.
  • Added an option to merge_generalized_mapping.py to include the sample index in a filepath and also an option to remove empty features (useful for Salmon). Added an executable='/bin/bash' option to the subprocess.Popen calls in GenoPype to address issues/23.
  • Added genbanks/[id_genome]/ to output directory of biosynthetic.py which has symlinks to all the BGC genbanks from antiSMASH.
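
The exit-code convention described for check_fasta_duplicates.py (0 without duplicates, 1 with duplicates) can be sketched as below. This is an illustration under assumptions, not the script itself: the function names are hypothetical, and treating the first whitespace-delimited token of each header as the identifier is an assumption about what counts as a duplicate.

```python
def has_duplicate_ids(lines):
    """Return True if any FASTA identifier (first token after '>') repeats."""
    seen = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            if identifier in seen:
                return True
            seen.add(identifier)
    return False


def exit_code(lines):
    """Map the check onto the 0 (no duplicates) / 1 (duplicates) convention."""
    return 1 if has_duplicate_ids(lines) else 0


# In a command-line script this would typically end with:
#     sys.exit(exit_code(sys.stdin))
```

A 0/1 exit code lets a shell pipeline stop early when a fasta file contains repeated identifiers.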

- Python
Published by jolespin about 3 years ago

veba - VEBA_v1.1.1

Minor updates from v1.1.0.

  • Most important update includes fixing a broken VEBA-binning-viral.yml install recipe which had package conflicts for aria2 https://github.com/jolespin/veba/commit/30e8b0a6aa6612c4db201423b304fc57362f996b.
  • Fixes on conda-related environment variables in the install scripts.
  • Added MIBiG to database and annotate.py
  • Added a composite label for annotations in annotate.py
  • Added --dastool_minimum_score to binning-prokaryotic.py module
  • Added a wrapper around STAR aligner
  • Updated merge_generalized_mapping.py script to take in BAM files instead of being dependent on a specific directory.
  • Added option to have no header in subst_table.py

- Python
Published by jolespin about 3 years ago

veba - VEBA_v1.1.0

Release v1.1.0
  • Modules:

    • annotate.py
      • Added NCBIfam-AMRFinder AMR domain annotations
      • Added AntiFam contamination annotations
      • Uses taxopy instead of ete3 in backend with merge_annotations_and_score_taxonomy.py
    • assembly.py
      • Added a transcripts_to_genes.py script which creates a genes_to_transcripts.tsv table that can be used with TransDecoder.
    • binning-prokaryotic.py
      • Updated CheckM → CheckM2. This removes the dependency on GTDB-Tk and EXTREMELY REDUCES compute resource requirements (e.g., memory and time) as CheckM2 automatically handles candidate phyla radiation. With this, several backend scripts were deprecated. This cleans up the binning pipeline and error messages SUBSTANTIALLY.
      • Uses binning_wrapper.py for all binning. This makes it easier to add new binning algorithms in the future (e.g., VAMB). Also, check out the new multi-split binning functionality described below.
      • Added --skip_concoct in addition to the already existing --skip_maxbin2 option as MaxBin2 takes very long when there's a lot of contigs and CONCOCT takes a long time when there are a lot of samples (i.e., BAM files). MetaBAT2 is not optional.
    • binning-viral.py
      • Complete rewrite of this module which now uses geNomad as the default binning algorithm but still supports VirFinder.
      • If VirFinder is used, the genomad annotate is run via the genomad_taxonomy_wrapper.py script included in the update.
      • Updated Prodigal → Prodigal-GV to handle additional viral genetic codes.
    • biosynthetic.py
      • Introduces component_id and bgc_id which are unique, parseable, and informative. For example, component_id = SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001_1|2-2681(+) contains the unique bgc_id (i.e., SRR17458614__CONCOCT__P.2__9|NODE_3319_length_2682_cov_2.840502|region001), shows that it is the 1st gene in the cluster (the _1 in region001_1), and the gene start/end/strand. The bgc_id is composed of the genome_id|contig_id|region_id.
    • classify-prokaryotic.py
      • Updated GTDB-Tk v2.1.1 → GTDB-Tk v2.2.3. For now, --skip_ani_screen is the only option because of this thread. However, --mash_db may be an option in the near future.
      • Added functionality to classify prokaryotic genomes that were not binned via VEBA which is available with the --genomes option (--prokaryotic_binning_directory is still available which can leverage existing intermediate files).
    • classify-eukaryotic.py
      • Added functionality to classify eukaryotic genomes that were not binned via VEBA which is available with the --genomes option (--eukaryotic_binning_directory is still available which can leverage existing intermediate files). This is implemented by using the eukaryota_odb10 markers from the VEBA Microeukaryotic Database to substantially improve performance and decrease resources required for gene models.
    • classify-viral.py
      • Complete rewrite of this module which does not rely on (deprecated) intermediate files from CheckV.
      • Uses taxonomy generated from geNomad and consensus_genome_classification_unranked.py (a wrapper around taxopy) that can handle the chaotic taxonomy of viruses.
      • Added functionality to classify viral genomes that were not binned via VEBA which is available with the --genomes option (--viral_binning_directory is still available which can leverage existing intermediate files).
    • cluster.py
      • Complete rewrite of this module which now uses MMSEQS2 as the orthogroup detection algorithm instead of OrthoFinder. OrthoFinder is overkill for creating protein clusters and it generates thousands of intermediate files (e.g., fasta, alignments, trees, etc.) which substantially increases the compute time. MMSEQS2 has very similar performance with a fraction of the resources and compute time. Clustered the entire Plastisphere dataset on a local machine in ~30 minutes compared to several days on a HPC.
      • Now that the resources are minimal, clustering is performed at global level as before (i.e., all samples in the dataset) and now at the local level, optionally but ON by default, which clusters all genomes within a sample. Accompanying wrapper scripts are global_clustering.py and local_clustering.py.
      • The genomic and functional feature compression ratios (FCR) (described here) are now calculated automatically. The calculation is 1 - number_of_clusters/number_of_features which can easily be converted into an unsupervised biodiversity metric. This is calculated at the global (original implementation) and local levels.
      • Input is now a table with the following columns: [organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins] and is generated easily with the compile_genomes_table.py script. This allows clustering to be performed for prokaryotes, eukaryotes, and viruses all at the same time.
      • SLC-specific orthogroups (SSO) are now referred to as SLC-specific protein clusters (SSPC).
      • Support zfilling (e.g., zfill=3, SLC7 → SLC007) for genomic and protein clusters.
      • Deprecated fastani_to_clusters.py to now use the more generalizable edgelist_to_clusters.py which is used for both genomic and protein clusters. This also outputs a NetworkX graph and a pickled dictionary {"cluster_a":{"component_1", "component_2", ..., "component_n"}}
    • phylogeny.py
      • Updated MUSCLE to v5 which has -align and -super5 algorithms which are now accessible with --alignment_algorithm. It cannot read from stdin, so the fasta files are no longer gzipped. merge_msa.py now outputs uncompressed fasta by default and can output gzipped with the --gzip flag.
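
The feature compression ratio given under cluster.py above (1 - number_of_clusters/number_of_features) can be written out as a small worked example. The function name and the input checks are assumptions for illustration; only the formula comes from the notes.

```python
def feature_compression_ratio(number_of_features, number_of_clusters):
    """Compute FCR = 1 - number_of_clusters / number_of_features."""
    if number_of_features <= 0:
        raise ValueError("number_of_features must be positive")
    if not 0 <= number_of_clusters <= number_of_features:
        raise ValueError("number_of_clusters must be between 0 and number_of_features")
    return 1 - number_of_clusters / number_of_features


# 100 proteins collapsing into 25 clusters gives a ratio of 0.75;
# if every feature is its own cluster the ratio is 0.0.
```

The same function applies at the global level (all samples) and the local level (one sample), since only the two counts change.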
  • VEBA Database:

    • VDB_v3.1 → VDB_v4
      • Updated CheckV DB v1.0 → CheckV DB v1.5
      • Added geNomad DB v1.2
      • Added CheckM2 DB
      • Removed CheckM DB
      • Removed taxa.sqlite and taxa.sqlite.traverse.pkl
      • Added reference.eukaryota_odb10.list and corresponding MMSEQS2 database (i.e., microeukaryotic.eukaryota_odb10)
      • Added NCBIfam-AMRFinder marker set for annotation
      • Added AntiFam marker set for contamination
      • Marker set HMMs are now all gzipped (previously could not gzip because of the CheckM CPR workflow)
  • Scripts:

    • Added:
      • append_geneid_to_transdecoder_gff.py
      • bowtie2_wrapper.py
      • compile_genomes_table.py
      • consensus_genome_classification_unranked.py
      • cut_table.py
      • cut_table_by_column_labels.py
      • drop_missing_values.py
      • edgelist_to_clusters.py
      • filter_checkm2_results.py
      • genomad_taxonomy_wrapper.py
      • global_clustering.py
      • local_clustering.py
      • partition_multisplit_bins.py
      • scaffolds_to_clusters.py
      • scaffolds_to_samples.py
      • transcripts_to_genes.py
      • transdecoder_wrapper.py (Note: Requires separate environment to run due to dependency conflicts)
    • Updated:
      • antismash_genbanks_to_table.py - Added option to output biosynthetic gene cluster (BGC) fasta. Adds unique (and parseable) BGC identifiers making the output much more useful.
      • binning_wrapper.py - This binning wrapper now includes functionality to use multi-split binning (i.e., concatenated contigs from different assemblies, map all reads to the contigs, bin all together, and then partition bins by sample). This concept AFAIK was first introduced in the VAMB paper.
      • compile_reads_table.py - Minimal change but now the extension excludes the . to make usage more consistent with other tools.
      • consensus_genome_classification.py - Changed the output to match that of consensus_genome_classification_unranked.py.
      • filter_checkv_results.py - Option to use taxonomy and viral summaries generated by geNomad.
      • scaffolds_to_bins.py - Support for getting scaffolds to bins for a list of genomes via --genomes argument while maintaining original support with --binning_directory argument.
      • subset_table.py - Added option to set index column and to drop duplicates.
      • virfinder_wrapper.r - Used to be VirFinder_wrapper.R. This now has an option to use FDR values instead of P values.
      • merge_annotations_and_score_taxonomy.py - Completely rewritten. Uses taxopy instead of ete3.
      • merge_msa.py - Output uncompressed protein fasta files by default and can compress with --gzip flag.
    • Deprecated:
      • adjust_genomes_for_cpr.py
      • filter_checkm_results.py
      • fastani_to_clusters.py
      • partition_orthogroups.py
      • partition_clusters.py
      • compile_viral_classifications.py
      • build_taxa_sqlite.py
  • Miscellaneous:

    • Updated environments and now add versions to environments.
    • Added mamba to installation to speed up.
    • Added transdecoder_wrapper.py which is a wrapper around TransDecoder with direct support for Diamond and HMMSearch homology searches. Also includes append_geneid_to_transdecoder_gff.py which is run in the backend to clean up the GFF file and make them compatible with what is output by Prodigal and MetaEuk runs of VEBA.
    • Added support for using n_jobs -1 to use all available threads (similar to scikit-learn methodology).
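
The component_id layout described under biosynthetic.py above can be unpacked with a short parser. This is a sketch under assumptions, not code from VEBA: the names Component and parse_component_id are hypothetical, and it assumes genome and contig identifiers never contain the "|" character and that the location field uses the start-end(strand) form shown in the example.

```python
import re
from typing import NamedTuple


class Component(NamedTuple):
    bgc_id: str
    genome_id: str
    contig_id: str
    region_id: str
    position_in_bgc: int
    start: int
    end: int
    strand: str


# Matches a location such as "2-2681(+)"
_LOCATION = re.compile(r"^(\d+)-(\d+)\(([+-])\)$")


def parse_component_id(component_id):
    """Split genome_id|contig_id|region_position|start-end(strand) into fields."""
    genome_id, contig_id, region_field, location = component_id.split("|")
    # "region001_1" -> region "region001", gene position 1 within the BGC
    region_id, position = region_field.rsplit("_", 1)
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"Unrecognized location field: {location!r}")
    start, end, strand = match.groups()
    bgc_id = "|".join([genome_id, contig_id, region_id])
    return Component(bgc_id, genome_id, contig_id, region_id,
                     int(position), int(start), int(end), strand)
```

Applied to the example identifier from the notes, the parser recovers the bgc_id quoted there along with position 1, start 2, end 2681, and the "+" strand.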

- Python
Published by jolespin about 3 years ago

veba - VEBA_v1.0.4

Release v1.0.4
  • Added biopython to VEBA-assembly_env which is needed when running MEGAHIT as the scaffolds are rewritten and an error was raised. aea51c3
  • Updated Microeukaryotic protein database to exclude a few higher eukaryotes that were present in database, changed naming scheme to hash identifiers (from cat reference.faa | seqkit fx2tab -s -n > id_to_hash.tsv). Switching database from FigShare to Zenodo. Uses database version VDB_v3 which has the updated microeukaryotic protein database (VDB-Microeukaryotic_v2) 0845ba6
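
The identifier-to-hash table mentioned above was produced with seqkit. A rough Python analogue is sketched below for readers who want to see the idea. It is an assumption-laden illustration: MD5 over the sequence exactly as written is an assumed choice, so the digests are not guaranteed to match the values seqkit reports, and the function names are hypothetical.

```python
import hashlib


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def id_to_hash(lines):
    """Map each identifier to the MD5 hex digest of its sequence (assumed hash)."""
    return {
        identifier: hashlib.md5(sequence.encode("ascii")).hexdigest()
        for identifier, sequence in read_fasta(lines)
    }
```

Hashing the sequence rather than the header means identical proteins receive the same name regardless of how the source database labelled them.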

- Python
Published by jolespin over 3 years ago

veba - VEBA_v1.0.3e

If you have 1.0.3 ≤ version < 1.0.3e, you can update easily on Patch Fix #1

Release v1.0.3e
  • Patch fix for install_veba.sh where install/environments/VEBA-assembly_env.yml raised a compatibility error when creating the VEBA-assembly_env environment c2ab957
  • Patch fix for VirFinder_wrapper.R where __version__ = variable was throwing an R error when running binning-viral.py module. 19e8f38
  • Patch fix for filter_busco_results.py where an error arose that produced empty identifier_mapping.metaeuk.tsv subset tables. 359e4569
  • Patch fix for compile_metaeuk_identifiers.py where a Python error arose when duplicate gene identifiers were present. c248527
  • Patch fix for install_veba.sh where install/environments/VEBA-preprocess_env.yml raised a compatibility error when creating the VEBA-preprocess_env environment 8ed6eea

  • Added biosynthetic.py module which runs antiSMASH and converts genbank files to tabular format. 6c0ed82
  • Added megahit support for assembly.py module (not yet available in assembly-sequential.py). 6c0ed82
  • Changed -P/--spades_program to -P/--program for assembly.py. 6c0ed82
  • Replaced penultimate step in binning-prokaryotic.py to use adjust_genomes_for_cpr.py instead of the extremely long series of bash commands. This will make it easier to diagnose errors in this critical step. 6c0ed82
  • Added support for contig descriptions and added MAG identifier in fasta files in binning-eukaryotic.py. Now uses the metaeuk_wrapper.py script for the MetaEuk step. 6c0ed82
  • Added separate option of --run_metaplasmidspades for assembly-sequential.py instead of making it mandatory (now it just runs biosyntheticSPAdes and metaSPAdes by default). 6c0ed82
  • Added --use_mag_as_description in partition_gene_models.py script to include the MAG identifier in the contig description of the fasta header which is default in binning-prokaryotic.py. 6c0ed82
  • Added adjust_genomes_for_cpr.py script to make it easier to run and understand the CPR adjustment step of binning-prokaryotic.py. 6c0ed82
  • Added support for fasta header descriptions in binning-prokaryotic.py. 6c0ed82
  • Added functionality to replace_fasta_descriptions.py script to be able to use a string for replacing fasta headers in addition to the original functionality. 6c0ed82

- Python
Published by jolespin over 3 years ago

veba - VEBA_v1.0.2a

Release v1.0.2a

Not to be confused with v1.0.2 which is deprecated

  • Updated GTDB-Tk in VEBA-binning-prokaryotic_env from 1.x to 2.x (this version uses much less memory): f3507dd
  • Updated the GTDB-Tk database from R202 to R207_v2 to be compatible with GTDB-Tk v2.x: f3507dd
  • Updated the GRCh38 no-alt analysis set to T2T CHM13v2.0 for the default human reference: 5ccb4e2
  • Added an experimental amplicon.py module for short-read ASV detection via the DADA2 workflow of QIIME2: cd4ed2b
  • Added additional functionality to compile_reads_table.py to handle advanced parsing of samples from fastq directories while also maintaining support for parsing filenames from veba_output/preprocess: cd4ed2b
  • Added sra-tools to VEBA-preprocess_env: f3507dd
  • Fixed symlinks to scripts for install_veba.sh: d1fad03
  • Added missing CHECKM_DATA_PATH environment variable to VEBA-binning-prokaryotic_env and VEBA-classify_env: d1fad03
  • ⚠️ In this version, contigs/scaffolds cannot have descriptions in fasta header for prokaryotic binning (Fixed in versions after 2022.11.07)
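
For releases affected by the warning above, one possible workaround is to strip descriptions from the fasta headers before prokaryotic binning. The sketch below is a hypothetical helper, not part of VEBA, and assumes the description is everything after the first whitespace in a header line.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with header descriptions removed (identifier only)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            # Keep only the identifier, dropping any trailing description
            yield ">" + (fields[0] if fields else "")
        else:
            yield line
```

Sequence lines pass through unchanged, so the output can be written back out one line at a time.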

Module Versions:

amplicon.py __version__ = "2022.10.24"
annotate.py __version__ = "2021.7.8"
assembly.py __version__ = "2022.03.25"
binning-eukaryotic.py __version__ = "2022.10.20"
binning-prokaryotic.py __version__ = "2022.10.25"
binning-viral.py __version__ = "2022.7.13"
classify-eukaryotic.py __version__ = "2022.7.8"
classify-prokaryotic.py __version__ = "2022.06.07"
classify-viral.py __version__ = "2022.7.13"
cluster.py __version__ = "2022.10.16"
coverage.py __version__ = "2022.06.03"
index.py __version__ = "2022.02.17"
mapping.py __version__ = "2022.8.17"
phylogeny.py __version__ = "2022.06.22"
preprocess.py __version__ = "2022.01.19"
scripts/append_geneid_to_prodigal_gff.py __version__ = "2021.06.19"
scripts/binning_wrapper.py __version__ = "2022.04.11"
scripts/build_taxa_sqlite.py __version__ = "2022.04.18"
scripts/check_scaffolds_to_bins.py __version__ = "2021.08.20"
scripts/compile_binning.py __version__ = "2022.03.23"
scripts/compile_eukaryotic_classifications.py __version__ = "2022.7.8"
scripts/compile_metaeuk_identifiers.py __version__ = "2022.03.18"
scripts/compile_reads_table.py __version__ = "2022.10.24"
scripts/compile_scaffold_identifiers.py __version__ = "2022.02.23"
scripts/compile_viral_classifications.py __version__ = "2022.03.08"
scripts/concatenate_dataframes.py __version__ = "2022.03.24"
scripts/concatenate_fasta.py __version__ = "2022.02.17"
scripts/concatenate_gff.py __version__ = "2022.02.17"
scripts/consensus_domain_classification.py __version__ = "2022.02.28"
scripts/consensus_genome_classification.py __version__ = "2022.7.13"
scripts/consensus_orthogroup_annotation.py __version__ = "2022.02.02"
scripts/determine_trim_position.py __version__ = "2022.8.11"
scripts/fasta_to_saf.py __version__ = "2021.04.04"
scripts/fasta_utility.py __version__ = "2021.07.31"
scripts/fastani_to_clusters.py __version__ = "2021.11.16"
scripts/fastq_position_statistics.py __version__ = "2022.10.24"
scripts/filter_busco_results.py __version__ = "2022.04.04"
scripts/filter_checkm_results.py __version__ = "2022.03.28"
scripts/filter_checkv_results.py __version__ = "2021.08.10"
scripts/filter_hmmsearch_results.py __version__ = "2021.06.16"
scripts/genome_coverage_from_spades.py __version__ = "2022.7.14"
scripts/genome_spatial_coverage.py __version__ = "2022.08.17"
scripts/groupby_table.py __version__ = "2022.08.17"
scripts/hmmer_to_proteins.py __version__ = "2021.08.03"
scripts/insert_column_to_table.py __version__ = "2022.03.24"
scripts/merge_annotations_and_score_taxonomy.py __version__ = "2021.08.25"
scripts/merge_busco_json.py __version__ = "2022.03.10"
scripts/merge_contig_mapping.py __version__ = "2022.06.27"
scripts/merge_fastq_statistics.py __version__ = "2022.03.08"
scripts/merge_gtdbtk.py __version__ = "2022.03.24"
scripts/merge_msa.py __version__ = "2022.06.21"
scripts/merge_orf_mapping.py __version__ = "2021.03.27"
scripts/metaeuk_wrapper.py __version__ = "2022.08.27"
scripts/partition_clusters.py __version__ = "2021.08.12"
scripts/partition_gene_models.py __version__ = "2021.08.24"
scripts/partition_hmmsearch.py __version__ = "2022.06.20"
scripts/partition_multisplit_bins.py __version__ = "2022.04.08"
scripts/partition_orthogroups.py __version__ = "2022.04.01"
scripts/partition_unbinned.py __version__ = "2021.08.05"
scripts/replace_fasta_descriptions.py __version__ = "2022.9.1"
scripts/scaffolds_to_bins.py __version__ = "2021.03.26"
scripts/subset_table.py __version__ = "2022.04.20"
scripts/subset_table_by_column.py __version__ = "2022.04.20"

- Python
Published by jolespin over 3 years ago

veba - VEBA_v1.0.1

Small patch fix:
  • Fixed the fatal binning-eukaryotic.py error: https://github.com/jolespin/veba/commit/7c5addf9ed6e8e45502274dd353f20b211838a41
  • Fixed the minor file naming in cluster.py: https://github.com/jolespin/veba/commit/58038451dac0791899aa7fca3f9d79454cb9ed46
  • Removes left-over human genome tar.gz during database download/config: https://github.com/jolespin/veba/commit/58038451dac0791899aa7fca3f9d79454cb9ed46
  • ⚠️ In this version, contigs/scaffolds cannot have descriptions in fasta header for prokaryotic binning (Fixed in versions after 2022.11.07)

Module Versions:

annotate.py __version__ = "2021.7.8"
assembly.py __version__ = "2022.03.25"
binning-eukaryotic.py __version__ = "2022.10.20"
binning-prokaryotic.py __version__ = "2022.7.8"
binning-viral.py __version__ = "2022.7.13"
classify-eukaryotic.py __version__ = "2022.7.8"
classify-prokaryotic.py __version__ = "2022.06.07"
classify-viral.py __version__ = "2022.7.13"
cluster.py __version__ = "2022.10.16"
coverage.py __version__ = "2022.06.03"
index.py __version__ = "2022.02.17"
mapping.py __version__ = "2022.8.17"
phylogeny.py __version__ = "2022.06.22"
preprocess.py __version__ = "2022.01.19"
scripts/append_geneid_to_prodigal_gff.py __version__ = "2021.06.19"
scripts/binning_wrapper.py __version__ = "2022.04.11"
scripts/build_taxa_sqlite.py __version__ = "2022.04.18"
scripts/check_scaffolds_to_bins.py __version__ = "2021.08.20"
scripts/compile_binning.py __version__ = "2022.03.23"
scripts/compile_eukaryotic_classifications.py __version__ = "2022.7.8"
scripts/compile_metaeuk_identifiers.py __version__ = "2022.03.18"
scripts/compile_reads_table.py __version__ = "2021.7.18"
scripts/compile_scaffold_identifiers.py __version__ = "2022.02.23"
scripts/compile_viral_classifications.py __version__ = "2022.03.08"
scripts/concatenate_dataframes.py __version__ = "2022.03.24"
scripts/concatenate_fasta.py __version__ = "2022.02.17"
scripts/concatenate_gff.py __version__ = "2022.02.17"
scripts/consensus_domain_classification.py __version__ = "2022.02.28"
scripts/consensus_genome_classification.py __version__ = "2022.7.13"
scripts/consensus_orthogroup_annotation.py __version__ = "2022.02.02"
scripts/fasta_to_saf.py __version__ = "2021.04.04"
scripts/fasta_utility.py __version__ = "2021.07.31"
scripts/fastani_to_clusters.py __version__ = "2021.06.16"
scripts/filter_busco_results.py __version__ = "2022.04.04"
scripts/filter_checkm_results.py __version__ = "2022.03.28"
scripts/filter_checkv_results.py __version__ = "2021.08.10"
scripts/filter_hmmsearch_results.py __version__ = "2021.06.16"
scripts/genome_coverage_from_spades.py __version__ = "2022.7.14"
scripts/genome_spatial_coverage.py __version__ = "2022.08.17"
scripts/groupby_table.py __version__ = "2022.08.17"
scripts/hmmer_to_proteins.py __version__ = "2021.08.03"
scripts/insert_column_to_table.py __version__ = "2022.03.24"
scripts/merge_annotations_and_score_taxonomy.py __version__ = "2021.08.25"
scripts/merge_busco_json.py __version__ = "2022.03.10"
scripts/merge_contig_mapping.py __version__ = "2022.06.27"
scripts/merge_fastq_statistics.py __version__ = "2022.03.08"
scripts/merge_gtdbtk.py __version__ = "2022.03.24"
scripts/merge_msa.py __version__ = "2022.06.21"
scripts/merge_orf_mapping.py __version__ = "2021.03.27"
scripts/metaeuk_wrapper.py __version__ = "2022.08.27"
scripts/partition_clusters.py __version__ = "2021.08.12"
scripts/partition_gene_models.py __version__ = "2021.08.24"
scripts/partition_hmmsearch.py __version__ = "2022.06.20"
scripts/partition_multisplit_bins.py __version__ = "2022.04.08"
scripts/partition_orthogroups.py __version__ = "2022.04.01"
scripts/partition_unbinned.py __version__ = "2021.08.05"
scripts/scaffolds_to_bins.py __version__ = "2021.03.26"
scripts/subset_table.py __version__ = "2022.04.20"
scripts/subset_table_by_column.py __version__ = "2022.04.20"

- Python
Published by jolespin over 3 years ago

veba - VEBA_v1.0.0

Version released for manuscript submission.

  • ⚠️ In this version, contigs/scaffolds cannot have descriptions in fasta header for prokaryotic binning (Fixed in versions after 2022.11.07)

Module Versions:

annotate.py __version__ = "2021.7.8"
assembly.py __version__ = "2022.03.25"
binning-eukaryotic.py __version__ = "2022.7.8"
binning-prokaryotic.py __version__ = "2022.7.8"
binning-viral.py __version__ = "2022.7.13"
classify-eukaryotic.py __version__ = "2022.7.8"
classify-prokaryotic.py __version__ = "2022.06.07"
classify-viral.py __version__ = "2022.7.13"
cluster.py __version__ = "2022.06.04"
coverage.py __version__ = "2022.06.03"
index.py __version__ = "2022.02.17"
mapping.py __version__ = "2022.8.17"
phylogeny.py __version__ = "2022.06.22"
preprocess.py __version__ = "2022.01.19"
scripts/append_geneid_to_prodigal_gff.py __version__ = "2021.06.19"
scripts/binning_wrapper.py __version__ = "2022.04.11"
scripts/build_taxa_sqlite.py __version__ = "2022.04.18"
scripts/check_scaffolds_to_bins.py __version__ = "2021.08.20"
scripts/compile_binning.py __version__ = "2022.03.23"
scripts/compile_eukaryotic_classifications.py __version__ = "2022.7.8"
scripts/compile_metaeuk_identifiers.py __version__ = "2022.03.18"
scripts/compile_reads_table.py __version__ = "2021.7.18"
scripts/compile_scaffold_identifiers.py __version__ = "2022.02.23"
scripts/compile_viral_classifications.py __version__ = "2022.03.08"
scripts/concatenate_dataframes.py __version__ = "2022.03.24"
scripts/concatenate_fasta.py __version__ = "2022.02.17"
scripts/concatenate_gff.py __version__ = "2022.02.17"
scripts/consensus_domain_classification.py __version__ = "2022.02.28"
scripts/consensus_genome_classification.py __version__ = "2022.7.13"
scripts/consensus_orthogroup_annotation.py __version__ = "2022.02.02"
scripts/fasta_to_saf.py __version__ = "2021.04.04"
scripts/fasta_utility.py __version__ = "2021.07.31"
scripts/fastani_to_clusters.py __version__ = "2021.06.16"
scripts/filter_busco_results.py __version__ = "2022.04.04"
scripts/filter_checkm_results.py __version__ = "2022.03.28"
scripts/filter_checkv_results.py __version__ = "2021.08.10"
scripts/filter_hmmsearch_results.py __version__ = "2021.06.16"
scripts/genome_coverage_from_spades.py __version__ = "2022.7.14"
scripts/genome_spatial_coverage.py __version__ = "2022.08.17"
scripts/groupby_table.py __version__ = "2022.08.17"
scripts/hmmer_to_proteins.py __version__ = "2021.08.03"
scripts/insert_column_to_table.py __version__ = "2022.03.24"
scripts/merge_annotations_and_score_taxonomy.py __version__ = "2021.08.25"
scripts/merge_busco_json.py __version__ = "2022.03.10"
scripts/merge_contig_mapping.py __version__ = "2022.06.27"
scripts/merge_fastq_statistics.py __version__ = "2022.03.08"
scripts/merge_gtdbtk.py __version__ = "2022.03.24"
scripts/merge_msa.py __version__ = "2022.06.21"
scripts/merge_orf_mapping.py __version__ = "2021.03.27"
scripts/metaeuk_wrapper.py __version__ = "2022.08.27"
scripts/partition_clusters.py __version__ = "2021.08.12"
scripts/partition_gene_models.py __version__ = "2021.08.24"
scripts/partition_hmmsearch.py __version__ = "2022.06.20"
scripts/partition_multisplit_bins.py __version__ = "2022.04.08"
scripts/partition_orthogroups.py __version__ = "2022.04.01"
scripts/partition_unbinned.py __version__ = "2021.08.05"
scripts/scaffolds_to_bins.py __version__ = "2021.03.26"
scripts/subset_table.py __version__ = "2022.04.20"
scripts/subset_table_by_column.py __version__ = "2022.04.20"

- Python
Published by jolespin over 3 years ago