cazy-webscraper - v2.3.0.4

Update URL and fix retrieval of sequences from NCBI-GenBank

- Python
Published by HobnobMancer over 1 year ago

cazy-webscraper - v2.3.0.3

Patch for incomplete NCBI reads

As flagged in issue #120, if the connection to NCBI is interrupted or terminated early, an incomplete or corrupted read error is raised. The try/except blocks were updated to catch these incomplete read errors, and cazy_webscraper now retries the connection until either a connection succeeds or the maximum number of reattempts is reached, whichever comes first.
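
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code used in cazy_webscraper: the helper name `fetch_with_retries`, the `max_retries` parameter and the stand-in `flaky_fetch` callable are all assumptions made for the example. Only `http.client.IncompleteRead` is a real standard-library exception.

```python
from http.client import IncompleteRead


def fetch_with_retries(fetch, max_retries=3):
    """Call fetch() until it succeeds or the reattempts are used up.

    Hypothetical helper: returns the fetched data, or None if every
    attempt ended in an incomplete read.
    """
    # One initial attempt plus up to max_retries reattempts
    for _attempt in range(max_retries + 1):
        try:
            return fetch()
        except IncompleteRead:
            # The connection was cut short; try again if attempts remain
            continue
    return None


# Stand-in for a call to NCBI that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IncompleteRead(b"partial")
    return "complete record"


result = fetch_with_retries(flaky_fetch, max_retries=5)
print(result)
```

Capping the number of reattempts keeps a persistently failing connection from blocking the rest of the run.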

What's Changed

Issue 120 ncbi by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/126

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0.2...v2.3.0.3

- Python
Published by HobnobMancer over 2 years ago

cazy-webscraper - v2.3.0.2

Minor patch

Bug Fix

Fixes crashing when retrieving the latest taxonomy data from NCBI for CAZyme records that are associated with multiple taxa in CAZy.
- Catches and handles RunTime, NotXML and IncompleteRead errors

What's Changed

Doc update by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/116
Update config.yml by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/117
Catch incomplete read error by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/121

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0...v2.3.0.2

- Python
Published by HobnobMancer over 2 years ago

cazy-webscraper - v2.3.0

What's Changed

Issue 111 + 112 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/115

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.8...v2.3.0

New in version 2.3.0
- Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues when using older versions of bioservices
  - Uses bioservices mapping to map directly from NCBI protein version accessions to UniProt
  - cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
- Updated database schema: changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table
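
To illustrate the schema change, the following is a minimal sketch of a many-to-one Genbanks-to-Uniprots relationship in SQLite. It is not the real cazy_webscraper schema: only the table names and the `uniprot_id` and `genbank_id` columns come from the notes above, while the accession columns and the example values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Uniprots (
        uniprot_id INTEGER PRIMARY KEY,
        uniprot_accession TEXT  -- assumed column name
    );
    CREATE TABLE Genbanks (
        genbank_id INTEGER PRIMARY KEY,
        genbank_accession TEXT,  -- assumed column name
        -- The foreign key now lives in Genbanks (Genbanks *--1 Uniprots)
        uniprot_id INTEGER REFERENCES Uniprots (uniprot_id)
    );
    """
)

# Two GenBank records pointing at the same UniProt record (made-up values)
conn.execute("INSERT INTO Uniprots VALUES (1, 'UNIPROT_1')")
conn.executemany(
    "INSERT INTO Genbanks VALUES (?, ?, ?)",
    [(1, "GENBANK_A.1", 1), (2, "GENBANK_B.1", 1)],
)

rows = conn.execute(
    """
    SELECT g.genbank_accession, u.uniprot_accession
    FROM Genbanks AS g
    JOIN Uniprots AS u ON g.uniprot_id = u.uniprot_id
    ORDER BY g.genbank_id
    """
).fetchall()
print(rows)
```

With the key stored in Genbanks, several GenBank records can share one UniProt record without duplicating the UniProt row.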

Retrieve taxonomic classifications from UniProt
- Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
- Adds downloaded taxonomic information to the UniprotsTaxs table
Improved clarification of deleting old records when using cw_get_uniprot_data
- Separate arguments to delete Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt, for proteins in the local CAZyme database whose data is downloaded from UniProt
- New args:
  - --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
  - --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
  - --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
  - --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
Retrieve the local db schema
- New command cw_get_db_schema added.
- Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
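
For readers who want to inspect a database without the command, a schema listing of this kind can be approximated with the standard `sqlite3` module. This is a sketch under assumptions, not the implementation of cw_get_db_schema: the `get_schema` helper and the `Example` table are hypothetical, and an in-memory database stands in for a local CAZyme database file.

```python
import sqlite3


def get_schema(conn):
    """Return the CREATE statements stored in the SQLite schema table."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# In-memory stand-in; a real database would be opened by its file path
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Example (example_id INTEGER PRIMARY KEY, name TEXT)")

schema = get_schema(conn)
for statement in schema:
    print(statement)
```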
Added option to skip retrieving the latest taxonomic classifications from NCBI
- By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications for proteins listed under multiple taxa
- To reduce scraping time and the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag when the data is not needed (e.g. GTDB taxonomies will be used)
- When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper adds the first taxon retrieved from CAZy for proteins listed under multiple taxa

- Python
Published by HobnobMancer about 3 years ago

cazy-webscraper - v2.2.8

Bugs and improvements

Addresses issue of incomplete retrieval of taxonomy data from NCBI
Process of retrieving taxonomy data is faster
PR #113

What's Changed

add not on cwgetuniprot before cwgetpdb by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/112
Add batch ncbi tax by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/113

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.7...v2.2.8

- Python
Published by HobnobMancer about 3 years ago

cazy-webscraper - v2.2.7

Fixing bugs when downloading seqs from NCBI:

Issue #109
Adding missing args to func calls
Accept UniProt-style accessions and the non-standard accession formats that are used by NCBI
Combine cached seqs with recently downloaded seqs, so multiple caches do not need to be combined manually if the download is interrupted multiple times
Remove unused args from func returns
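
The idea of combining cached and newly downloaded sequences can be sketched as below. This is an illustrative approximation only: the `parse_fasta` and `combine_caches` helpers, the accessions and the sequences are all made up, and the real cache handling in cazy_webscraper may differ.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {accession: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the accession
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def combine_caches(cached, downloaded):
    """Merge two sequence dicts; newly downloaded records take precedence."""
    combined = dict(cached)
    combined.update(downloaded)
    return combined


# Made-up accessions and sequences, standing in for two FASTA cache files
cached = parse_fasta(">ACC_1.1\nMKT\nAAL\n>ACC_2.1\nMSS\n")
downloaded = parse_fasta(">ACC_2.1\nMSSQ\n>ACC_3.1\nMLV\n")

combined = combine_caches(cached, downloaded)
print(sorted(combined))
```

Keying the records on accession means an interrupted download can be resumed repeatedly without producing duplicate entries.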

What's Changed

Issue 109 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/110

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.6...v2.2.7

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.6

Fix cazy_webscraper crashing because of missing arguments in function calls when retrieving the latest taxonomic classifications for proteins and a batch of protein IDs contains an invalid ID:

Traceback (most recent call last):
  File "...bin/cazy_webscraper", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/cazy_scraper.py", line 246, in main
    get_cazy_data(
  File "...//cazy_webscraper/cazy_scraper.py", line 355, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
                                        ^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 135, in replace_multiple_tax
    cazy_data, success = replace_multiple_tax_with_invalid_ids(cazy_data, args)

What's Changed

Fix tax invalid ids by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/108

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.5...v2.2.6

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.5

Fix import error when retrieving protein sequences from NCBI, that was introduced in version 2.2.4

What's Changed

Update imports by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/107

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.4...v2.2.5

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.4

What's Changed

Issue 99 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/102

Fix Issue #99: Improve handling when incurring errors when retrieving data from NCBI

Separate invalid IDs from IDs that failed because of a failed connection
Parse batches containing invalid IDs separately from, and before, batches that failed because of a failed connection

Downloaded protein sequences are cached to a FASTA file.

Updated information in the docs on caching.

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.3...v2.2.4

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.3

Address issue #100 with failing to retrieve data from UniProt, owing to changes to the UniProt API.

Also alters the methods for mapping UniProt accessions to GenBank accessions, including a more robust method for assigning data from UniProt to the correct protein in the local CAZyme database.

What's Changed

Issue 100 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/103

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.2...v2.2.3

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.2

Update closing message information
Update documentation
Update third party citations
Increases unit test coverage

What's Changed

Update closing message by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/98

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.1...v2.2.2

- Python
Published by HobnobMancer over 3 years ago

cazy-webscraper - v2.2.1

Fix inability to parse Entrez.NotXMLError when retrieving protein sequences from NCBI.

What's Changed

Fix inability to parse Entrez.NotXMLError by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/96

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.0...v2.2.1

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.2.0

What's Changed

Add getting GTDB taxonomic classifications by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/94

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13.1...v2.2.0

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.3.1

Remove unused imports. Fix import error bug.

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.3...v2.0.13.1

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.3

What's New

- Retrieve the latest taxonomic information from NCBI
- Retrieve complete taxonomic lineages from NCBI
- Removed unused imports
- Retrieve genomic assembly data from NCBI:
  - Assembly name
  - GenBank Assembly ID
  - GenBank Assembly version accession
  - RefSeq Assembly ID
  - RefSeq Assembly version accession

What's Changed

Issue 92 ncbi taxs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/93
Get genomes by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/91

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.1...v2.1.3

- Python
Published by HobnobMancer almost 4 years ago

cazy-webscraper - v2.1.1

What's Changed

New features

Ability to retrieve the latest taxonomic classifications from NCBI

Updates

Update program architecture: the expand module contains a sub module per external database, grouping modules by the external database from which data is sourced
Increase unit test coverage
Simplify use of get_db_connection, which takes 2 positional args: a pathlib.Path object and a bool
Updated documentation
Get ncbi tax lineages by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/90

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13...v2.1.1

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.13

Remove unnecessary datafiles
Reduce package size

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.12

What's Changed

Fix logger by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/89
- Fixed logger inheritance bug, the --verbose flag should be fully functional
- Fix error when using CAZy class abbreviations
- Increased unit test coverage
- Updated documentation to match the latest CLI

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.10...v2.0.11

- Python
Published by HobnobMancer about 4 years ago

cazy-webscraper - v2.0.10

Add missing entry points, and update entry point names.
- Add extracting seqs from the db entry point
- Add api entry point

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.8

Update entry point import path and name to cw_extract_db_seqs

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.6

Summary

Fixed incomplete retrieval of proteins from the local CAZyme database that match the specified criteria
Improved clarity of data included in the Logs table
Changed Pdbs 1-1 Genbanks relationship to many-to-many: Pdbs 1-* Genbanks_Pdbs *-1 Genbanks
Made retrieval of proteins from the local CAZyme database that match the specified criteria significantly faster
Updated documentation
Finished API
Fixed failed JSON serialisation of data retrieved by the API
Add option to add prefix to filenames generated by the API
Use the saintBioutils package to handle logging and some file_io operations
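
The change from a one-to-one to a many-to-many Pdbs-Genbanks relationship can be pictured with a small linking table. The sketch below is an assumption-laden illustration, not the actual database definition: the table names `Pdbs`, `Genbanks` and `Genbanks_Pdbs` come from the summary above, while every column name and value is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Genbanks (
        genbank_id INTEGER PRIMARY KEY,
        genbank_accession TEXT  -- assumed column name
    );
    CREATE TABLE Pdbs (
        pdb_id INTEGER PRIMARY KEY,
        pdb_accession TEXT  -- assumed column name
    );
    -- Linking table: Pdbs 1-* Genbanks_Pdbs *-1 Genbanks
    CREATE TABLE Genbanks_Pdbs (
        genbank_id INTEGER REFERENCES Genbanks (genbank_id),
        pdb_id INTEGER REFERENCES Pdbs (pdb_id),
        PRIMARY KEY (genbank_id, pdb_id)
    );
    """
)

# Made-up records: protein A has two structures, structure X has two proteins
conn.executemany(
    "INSERT INTO Genbanks VALUES (?, ?)",
    [(1, "GENBANK_A.1"), (2, "GENBANK_B.1")],
)
conn.executemany("INSERT INTO Pdbs VALUES (?, ?)", [(1, "PDB_X"), (2, "PDB_Y")])
conn.executemany("INSERT INTO Genbanks_Pdbs VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

pdbs_for_protein_a = [
    row[0]
    for row in conn.execute(
        """
        SELECT p.pdb_accession
        FROM Pdbs AS p
        JOIN Genbanks_Pdbs AS gp ON p.pdb_id = gp.pdb_id
        WHERE gp.genbank_id = 1
        ORDER BY p.pdb_accession
        """
    )
]
proteins_for_pdb_x = [
    row[0]
    for row in conn.execute(
        """
        SELECT g.genbank_accession
        FROM Genbanks AS g
        JOIN Genbanks_Pdbs AS gp ON g.genbank_id = gp.genbank_id
        WHERE gp.pdb_id = 1
        ORDER BY g.genbank_accession
        """
    )
]
print(pdbs_for_protein_a, proteins_for_pdb_x)
```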

What's Changed

Trouble shoot extract seqs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/75
Fix blank pdb accs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/79
Fix log table contents by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/81
Fix parsing config and selecting candidates of interest by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/83
Tidy and update docs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/84

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.5...v2.0.6

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.5

What's Changed

Pull Requests

Trouble shoot uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/73
Trouble shoot getting data from NCBI by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/74

Details

UniProt: cazy_webscraper can now be used successfully for retrieving data from UniProt and adding the data to the local CAZyme database. This includes retrieving:
UniProt accessions
Protein names
Protein sequences
EC number annotations
PDB accessions
GenBank: cazy_webscraper can now be used to automate the retrieval of protein sequences from GenBank for proteins in a local CAZyme database matching the user's specified criteria. These protein sequences are stored in the local CAZyme database, and can be extracted to a FASTA file using cazy_webscraper
Caching:
More data is cached
Cached data can be used to continue data retrievals from UniProt and GenBank, when a previous retrieval and/or addition of the data to the database fails
Improved default name of cache dirs and subdirs
Unit tests: Started rewrite of unit tests to match the new program architecture
Documentation: Updating the documentation to include the new flags/options, and adding new tutorials for automating the retrieval of data from UniProt, GenBank and PDB

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.3...v2.0.5

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.3

v2.0.3

Beta release of version 2.

Bug fixes

Fixes making only the parent dirs of an output database path
Fixes not finding the cazy_webscraper module

What's Changed

Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/68
Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/70
update v number by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/71
Fix output dir making by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/72

Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.0...v2.0.3

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - v2.0.0

Beta release of version 2.

- Python
Published by HobnobMancer over 4 years ago

cazy-webscraper - pre_version_1_release

This release:

changes the call command for cazy_webscraper from cazy_webscraper.py to cazy_webscraper
fixes typos in README and docs
cazy_webscraper can now build all parent directories for a user specified output directory

- Python
Published by HobnobMancer about 5 years ago

Producing dataframes of the protein data in CAZy
Retrieving protein sequences of scraped CAZymes from GenBank
Retrieving protein structures of CAZymes from PDB

- Python
Published by HobnobMancer over 5 years ago

Recent Releases of cazy-webscraper

- v2.3.0.4
- v2.3.0.3
- v2.3.0.2
- v2.3.0
- v2.2.8
- v2.2.7
- v2.2.6
- v2.2.5
- v2.2.4
- v2.2.3
- v2.2.2
- v2.2.1
- v2.2.0
- v2.1.3.1
- v2.1.3
- v2.1.1
- v2.0.13
- v2.0.12
- v2.0.10
- v2.0.8
- v2.0.6
- v2.0.5
- v2.0.3
- v2.0.0
- pre_version_1_release
- Installation integration
- Bioconda integration
- Zenodo citation
- First release