Recent Releases of cazy-webscraper

cazy-webscraper - v2.3.0.4

Update URL and fix retrieval of sequences from NCBI-GenBank

- Python
Published by HobnobMancer over 1 year ago

cazy-webscraper - v2.3.0.3

Patch for incomplete NCBI reads

As flagged in issue #120, if the connection to NCBI is interrupted or terminated early, an incomplete or corrupted read error is raised. The try/except blocks were updated to catch these incomplete read errors, and cazy_webscraper now retries the connection until either a successful connection is made or the maximum number of reattempts is reached (whichever comes first).
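
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from cazy_webscraper: the function name fetch_with_retries and its parameters are hypothetical, and it assumes the incomplete read surfaces as the standard library's http.client.IncompleteRead.

```python
import http.client
import time


def fetch_with_retries(fetch, max_retries=3, delay=0.0):
    """Call fetch() until it succeeds or max_retries re-attempts are used up.

    `fetch` is any zero-argument callable that performs the remote query
    (hypothetical stand-in for a call to NCBI).
    """
    attempts = 0
    while True:
        try:
            return fetch()
        except http.client.IncompleteRead:
            attempts += 1
            if attempts > max_retries:
                # Out of re-attempts: let the caller see the error
                raise
            time.sleep(delay)
```

The loop stops at whichever comes first: a successful call, or the re-attempt limit, at which point the last error is re-raised.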

What's Changed

  • Issue 120 ncbi by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/126

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0.2...v2.3.0.3

- Python
Published by HobnobMancer over 2 years ago

cazy-webscraper - v2.3.0.2

Minor patch

Bug Fix

Fixes crashing when retrieving the latest taxonomy data from NCBI for CAZyme records that are associated with multiple taxa in CAZy.

  • Catches and handles RunTime, NotXML and IncompleteRead errors

What's Changed

  • Doc update by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/116
  • Update config.yml by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/117
  • Catch incomplete read error by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/121

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0...v2.3.0.2

- Python
Published by HobnobMancer over 2 years ago

cazy-webscraper - v2.3.0

What's Changed

  • Issue 111 + 112 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/115

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.8...v2.3.0

New in version 2.3.0

  • Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues with older versions of bioservices

    • Uses bioservices mapping to map directly from NCBI protein version accessions to UniProt
    • cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
  • Updated database schema: changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table

  • Retrieve taxonomic classifications from UniProt

    • Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
    • Adds downloaded taxonomic information to the UniprotsTaxs table
  • Improved clarification of deleting old records when using cw_get_uniprot_data

    • Separate arguments to delete Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt, for proteins in the local CAZyme database whose data is downloaded from UniProt
    • New args:
      • --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
      • --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
      • --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
      • --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
  • Retrieve the local db schema

    • New command cw_get_db_schema added.
    • Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
  • Added option to skip retrieving the latest taxonomic classifications from NCBI

    • By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications from NCBI for proteins listed under multiple taxa
    • To reduce scraping time and the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag if the data is not needed (e.g. GTDB taxonomies will be used)
    • When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper adds the first taxon retrieved from CAZy for proteins listed under multiple taxa
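
To illustrate the schema change and the idea behind cw_get_db_schema, here is a minimal sketch using Python's sqlite3 module on an in-memory database. It is not the real cazy_webscraper schema: only the table names and the uniprot_id/genbank_id columns come from the notes above, while the accession columns and the get_schema helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Uniprots (
        uniprot_id INTEGER PRIMARY KEY,
        uniprot_accession TEXT UNIQUE  -- assumed column name
    );
    CREATE TABLE Genbanks (
        genbank_id INTEGER PRIMARY KEY,
        genbank_accession TEXT UNIQUE,  -- assumed column name
        -- Genbanks *--1 Uniprots: the foreign key sits in Genbanks
        uniprot_id INTEGER REFERENCES Uniprots (uniprot_id)
    );
    """
)


def get_schema(connection):
    """Return the CREATE TABLE statements SQLite stores for each table."""
    rows = connection.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# Print the schema to the terminal, one statement per table
for statement in get_schema(conn):
    print(statement)
```

With the foreign key in Genbanks, many GenBank records can point at one UniProt record, which is the *--1 direction described above.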

- Python
Published by HobnobMancer about 3 years ago

cazy-webscraper - v2.2.8

Bugs and improvements

  • Addresses issue of incomplete retrieval of taxonomy data from NCBI
  • Process of retrieving taxonomy data is faster
  • PR #113

What's Changed

  • add not on cwgetuniprot before cwgetpdb by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/112
  • Add batch ncbi tax by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/113

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.7...v2.2.8

- Python
Published by HobnobMancer about 3 years ago

cazy-webscraper - v2.2.7

Fixing bugs when downloading seqs from NCBI:

  • Issue #109
  • Adding missing args to func calls
  • Accept UniProt-style accessions and non-standard NCBI accession formats that are used by NCBI
  • Combine cached seqs with recently downloaded so don't need to manually combine multiple caches if the download is interrupted multiple times
  • Remove unused args from func returns
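
As a rough illustration of combining cached and newly downloaded sequences, the sketch below merges two FASTA texts keyed by accession. The function names are hypothetical and this is not the package's own code; it assumes the accession is the first whitespace-delimited token of each header line, and that a newer download should replace a cached record with the same accession.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {accession: sequence}."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Skip records with an empty header line
            accession = header[0] if header else None
            if accession is not None:
                records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def merge_cached_and_new(cached_text, new_text):
    """Combine cached sequences with new downloads; new records win."""
    merged = parse_fasta(cached_text)
    merged.update(parse_fasta(new_text))
    return merged
```

Because the merge is keyed by accession, re-running an interrupted download several times yields one combined set of sequences rather than several caches to reconcile by hand.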

What's Changed

  • Issue 109 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/110

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.6...v2.2.7

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.6

Fix cazy_webscraper crashing from missing arguments to function calls when retrieving the latest taxonomic classifications for proteins when a batch of protein IDs contains an invalid ID:

Traceback (most recent call last):
  File "...bin/cazy_webscraper", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/cazy_scraper.py", line 246, in main
    get_cazy_data(
  File "...//cazy_webscraper/cazy_scraper.py", line 355, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
                                        ^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 135, in replace_multiple_tax
    cazy_data, success = replace_multiple_tax_with_invalid_ids(cazy_data, args)

What's Changed

  • Fix tax invalid ids by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/108

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.5...v2.2.6

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.5

Fix the import error, introduced in version 2.2.4, that occurred when retrieving protein sequences from NCBI

What's Changed

  • Update imports by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/107

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.4...v2.2.5

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.4

What's Changed

  • Issue 99 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/102

Fix Issue #99: Improve handling of errors raised when retrieving data from NCBI

  1. Separate invalid IDs from IDs affected by failed connections
  2. Parse batches containing invalid IDs separately from, and before, batches with failed connections
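
One way to picture the two steps above is the sketch below. It is an illustration under stated assumptions, not the package's implementation: InvalidIDError, triage_batches, retry_individually and the fetch callable are all hypothetical names, and a dropped connection is assumed to surface as Python's built-in ConnectionError.

```python
class InvalidIDError(Exception):
    """Hypothetical error: the batch contained at least one invalid ID."""


def triage_batches(batches, fetch):
    """Run fetch() on each batch and sort the failures by cause."""
    results, invalid_batches, failed_batches = [], [], []
    for batch in batches:
        try:
            results.extend(fetch(batch))
        except InvalidIDError:
            invalid_batches.append(batch)
        except ConnectionError:
            failed_batches.append(batch)
    return results, invalid_batches, failed_batches


def retry_individually(batch, fetch):
    """Re-query one ID at a time so a single invalid ID cannot sink the batch."""
    results, invalid_ids = [], []
    for protein_id in batch:
        try:
            results.extend(fetch([protein_id]))
        except InvalidIDError:
            invalid_ids.append(protein_id)
    return results, invalid_ids
```

In this sketch a caller would pass each entry of invalid_batches to retry_individually first, and only then re-attempt the batches in failed_batches, mirroring the ordering in step 2.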

Downloaded protein sequences are cached to a FASTA file.

Updated information in the docs on caching.

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.3...v2.2.4

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.3

Address issue #100 with failing to retrieve data from UniProt, owing to changes to the UniProt API.

Also alters the methods for mapping UniProt accessions to GenBank accessions, including a more robust method for assigning data from UniProt to the correct protein in the local CAZyme database.

What's Changed

  • Issue 100 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/103

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.2...v2.2.3

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.2

  • Update closing message information
  • Update documentation
  • Update third party citations
  • Increases unit test coverage

What's Changed

  • Update closing message by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/98

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.1...v2.2.2

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.1

Fix inability to parse Entrez.NotXMLError when retrieving protein sequences from NCBI.

What's Changed

  • Fix inability to parse Entrez.NotXMLError by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/96

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.0...v2.2.1

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.2.0

What's Changed

  • Add getting GTDB taxonomic classifications by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/94

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13.1...v2.2.0

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.3.1

Remove unused imports. Fix import error bug.

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.3...v2.0.13.1

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.3

What's New

  • Retrieve the latest taxonomic information from NCBI
  • Retrieve complete taxonomic lineages from NCBI
  • Removed unused imports
  • Retrieve genomic assembly data from NCBI:

    • Assembly name
    • GenBank Assembly ID
    • GenBank Assembly version accession
    • RefSeq Assembly ID
    • RefSeq Assembly version accession

What's Changed

  • Issue 92 ncbi taxs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/93
  • Get genomes by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/91

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.1...v2.1.3

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.1

What's Changed

New features

  • Ability to retrieve the latest taxonomic classifications from NCBI

Updates

  • Update program architecture: the expand module now contains one submodule per external database, grouping modules by the external database from which data is sourced
  • Increase unit test coverage
  • Simplify using get_db_connection, takes 2 positional args, pathlib.Path object and bool
  • Updated documentation

  • Get ncbi tax lineages by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/90

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13...v2.1.1

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.13

  • Remove unnecessary datafiles
  • Reduce package size

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.12

What's Changed

  • Fix logger by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/89
    • Fixed logger inheritance bug, the --verbose flag should be fully functional
    • Fix error when using CAZy class abbreviations
    • Increased unit test coverage
    • Updated documentation to match the latest CLI

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.10...v2.0.11

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.10

Add missing entry points, and update entry point names.

  • Add extracting seqs from the db entry point
  • Add api entry point

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.8

Update entry point import path and name to cw_extract_db_seqs

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.6

Summary

  • Fixed incomplete retrieval of proteins from the local CAZyme database that match the specified criteria
  • Improved clarity of data included in the Logs table
  • Changed Pdbs 1-1 Genbanks relationship to many-to-many: Pdbs 1-* Genbanks_Pdbs *-1 Genbanks
  • Made retrieval of proteins from the local CAZyme database that match the specified criteria significantly faster
  • Updated documentation
  • Finished API
  • Fixed failed JSON serialisation of data retrieved by the API
  • Add option to add prefix to filenames generated by the API
  • Use the saintBioutils package to handle logging and some file_io operations
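
The Pdbs 1-* Genbanks_Pdbs *-1 Genbanks relationship can be sketched with sqlite3 as below. Only the three table names come from the summary above; the column names, sample accessions and the pdbs_for helper are assumptions made for illustration, not the actual cazy_webscraper schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Genbanks (
        genbank_id INTEGER PRIMARY KEY,
        genbank_accession TEXT UNIQUE
    );
    CREATE TABLE Pdbs (
        pdb_id INTEGER PRIMARY KEY,
        pdb_accession TEXT UNIQUE
    );
    -- Link table: one row per (protein, structure) pair
    CREATE TABLE Genbanks_Pdbs (
        genbank_id INTEGER REFERENCES Genbanks (genbank_id),
        pdb_id INTEGER REFERENCES Pdbs (pdb_id),
        PRIMARY KEY (genbank_id, pdb_id)
    );
    INSERT INTO Genbanks VALUES (1, 'PROT_A'), (2, 'PROT_B');
    INSERT INTO Pdbs VALUES (1, '1ABC'), (2, '2XYZ');
    INSERT INTO Genbanks_Pdbs VALUES (1, 1), (1, 2), (2, 1);
    """
)


def pdbs_for(connection, genbank_accession):
    """Return the PDB accessions linked to one protein, sorted by accession."""
    rows = connection.execute(
        """
        SELECT p.pdb_accession
        FROM Pdbs AS p
        JOIN Genbanks_Pdbs AS gp ON gp.pdb_id = p.pdb_id
        JOIN Genbanks AS g ON g.genbank_id = gp.genbank_id
        WHERE g.genbank_accession = ?
        ORDER BY p.pdb_accession
        """,
        (genbank_accession,),
    ).fetchall()
    return [row[0] for row in rows]
```

The link table lets one protein carry several PDB accessions and one PDB accession belong to several proteins, which a 1-1 relationship cannot represent.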

What's Changed

  • Trouble shoot extract seqs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/75
  • Fix blank pdb accs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/79
  • Fix log table contents by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/81
  • Fix parsing config and selecting candidates of interest by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/83
  • Tidy and update docs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/84

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.5...v2.0.6

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.5

What's Changed

Pull Requests

  • Trouble shoot uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/73
  • Trouble shoot getting data from NCBI by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/74

Details

  • UniProt: cazy_webscraper can now be used successfully for retrieving data from UniProt and adding the data to the local CAZyme database. This includes retrieving:

    • UniProt accessions
    • Protein names
    • Protein sequences
    • EC number annotations
    • PDB accessions
  • GenBank: cazy_webscraper can now be used to automate the retrieval of protein sequences from GenBank for proteins in a local CAZyme database matching the user's specified criteria. These protein sequences are stored in the local CAZyme database, and can be extracted to a FASTA file using cazy_webscraper
  • Caching:

    • More data is cached
    • Cached data can be used to continue data retrievals from UniProt and GenBank when a previous retrieval and/or addition of the data to the database fails
    • Improved default names of cache dirs and subdirs
  • Unit tests: Started rewrite of unit tests to match the new program architecture
  • Documentation: Updated the documentation to include the new flags/options, and added new tutorials for automating the retrieval of data from UniProt, GenBank and PDB

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.3...v2.0.5

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.3

v2.0.3

Beta release of version 2.

Bug fixes

  • Fixes making only the parent dirs of an output database path
  • Fixes not finding the cazy_webscraper module

What's Changed

  • Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/68
  • Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/70
  • update v number by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/71
  • Fix output dir making by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/72

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.0...v2.0.3

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.0

Beta release of version 2.

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - pre_version_1_release

This release:

  • changes the call command for cazy_webscraper from cazy_webscraper.py to cazy_webscraper
  • fixes typos in README and docs
  • cazy_webscraper can now build all parent directories for a user specified output directory

- Python
Published by HobnobMancer about 5 years ago

cazy-webscraper - Installation integration

This release is created for integration into Bioconda and Pypi, with the main section of cazy_webscraper complete (for the retrieval of protein data from CAZy) and prior to the overhaul of the expand module.

- Python
Published by HobnobMancer about 5 years ago

cazy-webscraper - Bioconda integration

- Python
Published by HobnobMancer over 5 years ago

cazy-webscraper - Zenodo citation

New release of package for Zenodo tagging to facilitate citing the programme.

- Python
Published by HobnobMancer over 5 years ago

cazy-webscraper - First release

This is the first release of cazy_webscraper.

Current features include:

  • Producing dataframes of the protein data in CAZy
  • Retrieving protein sequences of scraped CAZymes from GenBank
  • Retrieving protein structures of CAZymes from PDB

- Python
Published by HobnobMancer over 5 years ago