Recent Releases of cazy-webscraper
cazy-webscraper - v2.3.0.4
Update URL and fix retrieval of sequences from NCBI-GenBank
- Python
Published by HobnobMancer over 1 year ago
cazy-webscraper - v2.3.0.3
Patch for incomplete NCBI reads
As flagged in issue #120, if the connection to NCBI is interrupted or terminated early, an incomplete or corrupted read error is raised. The try/except blocks were updated to catch these incomplete read errors, and cazy_webscraper will now retry the connection until either a successful connection is made or the maximum number of reattempts is reached (whichever comes first).
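The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the function name `fetch_with_retries` and its parameters are hypothetical, and it assumes the incomplete read surfaces as `http.client.IncompleteRead`.

```python
import http.client
import time


def fetch_with_retries(fetch, max_retries=3, wait_seconds=0.0):
    """Call fetch() until it succeeds or max_retries attempts have been made.

    `fetch` is any zero-argument callable (hypothetical stand-in for a call to NCBI).
    """
    last_error = None
    for _attempt in range(max_retries):
        try:
            return fetch()
        except http.client.IncompleteRead as err:
            # The read was cut short: remember the error, wait, then try again
            last_error = err
            time.sleep(wait_seconds)
    # Every attempt failed, so report the last incomplete read
    raise RuntimeError(f"No complete read after {max_retries} attempts") from last_error
```

Capping the number of attempts keeps an unreachable server from stalling the scrape indefinitely.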
What's Changed
- Issue 120 ncbi by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/126
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0.2...v2.3.0.3
- Python
Published by HobnobMancer over 2 years ago
cazy-webscraper - v2.3.0.2
Minor patch
Bug fix: Fixes crashing when retrieving the latest taxonomy data from NCBI for CAZyme records that are associated with multiple taxa in CAZy.
- Catches and handles RunTime, NotXML and IncompleteRead errors
What's Changed
- Doc update by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/116
- Update config.yml by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/117
- Catch incomplete read error by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/121
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.3.0...v2.3.0.2
- Python
Published by HobnobMancer over 2 years ago
cazy-webscraper - v2.3.0
What's Changed
- Issue 111 + 112 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/115
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.8...v2.3.0
New in version 2.3.0
* Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues when using older versions of bioservices
- Uses bioservices mapping to map directly from NCBI protein version accession to UniProt
- cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
* Updated database schema: Changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table
* Retrieve taxonomic classifications from UniProt
- Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
- Adds downloaded taxonomic information to the UniprotsTaxs table
* Improved clarification of deleting old records when using cw_get_uniprot_data
- Separate arguments to delete Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt, for those proteins in the local CAZyme database whose data is downloaded from UniProt
- New args:
  - --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
  - --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
  - --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
  - --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
* Retrieve the local db schema
- New command cw_get_db_schema added
- Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
* Added option to skip retrieving the latest taxonomic classifications from NCBI
- By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications for proteins listed under multiple taxa
- To reduce scraping time, and to reduce the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag if this data is not needed (e.g. GTDB taxonomies will be used)
- When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper will add the first taxon retrieved from CAZy for those proteins listed under multiple taxa
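For readers curious how a SQLite schema can be read back from a database file, the following is a minimal sketch using only the standard library. It is not the implementation of cw_get_db_schema: the function name `get_db_schema` is hypothetical, and the toy tables only mirror the Genbanks *--1 Uniprots relationship described above, with illustrative columns rather than the real schema.

```python
import sqlite3


def get_db_schema(connection):
    """Return the CREATE statements that SQLite stores in its sqlite_master table."""
    rows = connection.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# Toy in-memory database; the columns are assumptions made for illustration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Uniprots (uniprot_id INTEGER PRIMARY KEY, uniprot_accession TEXT)"
)
conn.execute(
    "CREATE TABLE Genbanks ("
    "genbank_id INTEGER PRIMARY KEY, "
    "genbank_accession TEXT, "
    "uniprot_id INTEGER REFERENCES Uniprots (uniprot_id))"
)

# Print each stored CREATE statement to the terminal
for statement in get_db_schema(conn):
    print(statement)
```

Placing `uniprot_id` in the Genbanks table is what lets many Genbanks rows point at a single Uniprots row.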
- Python
Published by HobnobMancer about 3 years ago
cazy-webscraper - v2.2.8
Bugs and improvements
- Addresses issue of incomplete retrieval of taxonomy data from NCBI
- Process of retrieving taxonomy data is faster
- PR #113
What's Changed
- add not on cwgetuniprot before cwgetpdb by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/112
- Add batch ncbi tax by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/113
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.7...v2.2.8
- Python
Published by HobnobMancer about 3 years ago
cazy-webscraper - v2.2.7
Fixing bugs when downloading seqs from NCBI:
- Issue #109
- Adding missing args to func calls
- Accept UniProt-style accessions and non-standard NCBI accession formats that are used by NCBI
- Combine cached seqs with recently downloaded seqs, so multiple caches do not need to be combined manually if the download is interrupted multiple times
- Remove unused args from func returns
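Combining cached sequences with freshly downloaded ones, as mentioned in the list above, might look something like the sketch below. The helper names `parse_fasta` and `merge_sequences` are hypothetical, and the rule that a new download replaces a cached record with the same ID is an assumption, not something stated in these notes.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record ID: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header
            fields = line[1:].split()
            current_id = fields[0] if fields else None
            if current_id is not None:
                records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated onto the current record
            records[current_id] += line
    return records


def merge_sequences(cached, downloaded):
    """Combine cached and newly downloaded records; new downloads win on clashes."""
    merged = dict(cached)
    merged.update(downloaded)
    return merged
```

Keying records by ID means an interrupted and resumed download never produces duplicate entries.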
What's Changed
- Issue 109 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/110
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.6...v2.2.7
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.6
Fix cazy_webscraper crashing, owing to missing arguments in function calls, when retrieving the latest taxonomic classifications for proteins and a batch of protein IDs contains an invalid ID:
Traceback (most recent call last):
  File "...bin/cazy_webscraper", line 33, in <module>
    sys.exit(load_entry_point('cazy-webscraper', 'console_scripts', 'cazy_webscraper')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/cazy_scraper.py", line 246, in main
    get_cazy_data(
  File "...//cazy_webscraper/cazy_scraper.py", line 355, in get_cazy_data
    cazy_data, successful_replacement = replace_multiple_tax(
                                        ^^^^^^^^^^^^^^^^^^^^^
  File ".../cazy_webscraper/ncbi/taxonomy/multiple_taxa.py", line 135, in replace_multiple_tax
    cazy_data, success = replace_multiple_tax_with_invalid_ids(cazy_data, args)
What's Changed
- Fix tax invalid ids by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/108
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.5...v2.2.6
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.5
Fix import error, introduced in version 2.2.4, when retrieving protein sequences from NCBI
What's Changed
- Update imports by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/107
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.4...v2.2.5
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.4
What's Changed
- Issue 99 by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/102
Fix Issue #99: Improve handling of errors incurred when retrieving data from NCBI
- Separate invalid IDs from IDs that suffered failed connections
- Parse batches containing invalid IDs separately from, and before, batches with failed connections
Downloaded protein sequences are cached to a FASTA file.
Updated information in the docs on caching.
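One way to picture the separation of invalid IDs from failed connections is the sketch below. It is an illustration under assumptions, not the code from PR #102: `InvalidIdError`, `sort_batches` and `isolate_invalid_ids` are hypothetical names, and querying IDs one at a time to find the invalid ones is only one possible strategy.

```python
class InvalidIdError(Exception):
    """Hypothetical error: the remote service rejected a batch containing a bad ID."""


def sort_batches(batches, query):
    """Run query(batch) on each batch and sort the failures by their cause."""
    results, invalid_batches, failed_batches = [], [], []
    for batch in batches:
        try:
            results.extend(query(batch))
        except InvalidIdError:
            # At least one ID in the batch is invalid; handle these first
            invalid_batches.append(batch)
        except ConnectionError:
            # The connection failed; the batch itself may be fine, so retry later
            failed_batches.append(batch)
    return results, invalid_batches, failed_batches


def isolate_invalid_ids(batch, query):
    """Query the IDs of a rejected batch one at a time to find the invalid ones."""
    results, invalid_ids = [], []
    for protein_id in batch:
        try:
            results.extend(query([protein_id]))
        except InvalidIdError:
            invalid_ids.append(protein_id)
    return results, invalid_ids
```

Handling the invalid-ID batches before the failed-connection batches means known-bad IDs are removed before any retry is attempted.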
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.3...v2.2.4
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.3
Address issue #100: failing to retrieve data from UniProt, owing to changes to the UniProt API.
Also alters the methods for mapping UniProt accessions to GenBank accessions, including a more robust method for assigning data from UniProt to the correct protein in the local CAZyme database.
What's Changed
- Issue 100 uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/103
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.2...v2.2.3
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.2
- Update closing message information
- Update documentation
- Update third party citations
- Increases unit test coverage
What's Changed
- Update closing message by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/98
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.1...v2.2.2
- Python
Published by HobnobMancer over 3 years ago
cazy-webscraper - v2.2.1
Fix inability to parse Entrez.NotXMLError when retrieving protein sequences from NCBI.
What's Changed
- Fix inability to parse Entrez.NotXMLError by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/96
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.2.0...v2.2.1
- Python
Published by HobnobMancer almost 4 years ago
cazy-webscraper - v2.2.0
What's Changed
- Add getting GTDB taxonomic classifications by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/94
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13.1...v2.2.0
- Python
Published by HobnobMancer almost 4 years ago
cazy-webscraper - v2.1.3.1
Remove unused imports. Fix import error bug.
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.3...v2.0.13.1
- Python
Published by HobnobMancer almost 4 years ago
cazy-webscraper - v2.1.3
What's New
- Retrieve the latest taxonomic information from NCBI
- Retrieve complete taxonomic lineages from NCBI
- Removed unused imports
- Retrieve genomic assembly data from NCBI:
  - Assembly name
  - GenBank Assembly ID
  - GenBank Assembly version accession
  - RefSeq Assembly ID
  - RefSeq Assembly version accession
What's Changed
- Issue 92 ncbi taxs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/93
- Get genomes by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/91
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.1.1...v2.1.3
- Python
Published by HobnobMancer almost 4 years ago
cazy-webscraper - v2.1.1
What's Changed
New features
- Ability to retrieve the latest taxonomic classifications from NCBI
Updates
- Update program architecture: the expand module now contains a submodule per external database, grouping modules by the external database from which data is sourced
- Increase unit test coverage
- Simplify using get_db_connection, which takes 2 positional args: a pathlib.Path object and a bool
- Updated documentation
Get ncbi tax lineages by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/90
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.13...v2.1.1
- Python
Published by HobnobMancer about 4 years ago
cazy-webscraper - v2.0.13
- Remove unnecessary datafiles
- Reduce package size
- Python
Published by HobnobMancer about 4 years ago
cazy-webscraper - v2.0.12
What's Changed
- Fix logger by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/89
- Fixed logger inheritance bug, the --verbose flag should be fully functional
- Fix error when using CAZy class abbreviations
- Increased unit test coverage
- Updated documentation to match the latest CLI
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.10...v2.0.11
- Python
Published by HobnobMancer about 4 years ago
cazy-webscraper - v2.0.10
Add missing entry points, and update entry point names.
- Add extracting seqs from the db entry point
- Add api entry point
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - v2.0.8
Update entry point import path and name to cw_extract_db_seqs
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - v2.0.6
Summary
- Fixed incomplete retrieval of proteins from the local CAZyme database that match the specified criteria
- Improved clarity of data included in the Logs table
- Changed Pdbs 1-1 Genbanks relationship to many-to-many: Pdbs 1-* Genbanks_Pdbs *-1 Genbanks
- Made retrieval of proteins from the local CAZyme database that match the specified criteria significantly faster
- Updated documentation
- Finished API
- Fixed failed JSON serialisation of data retrieved by the API
- Add option to add prefix to filenames generated by the API
- Use the saintBioutils package to handle logging and some file_io operations
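The many-to-many relationship noted in the summary above (Pdbs 1-* Genbanks_Pdbs *-1 Genbanks) can be illustrated with a small link-table sketch. Only the three table names come from these notes; the column names, the sample accessions and the `pdbs_for_protein` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Genbanks (genbank_id INTEGER PRIMARY KEY, genbank_accession TEXT UNIQUE);
    CREATE TABLE Pdbs (pdb_id INTEGER PRIMARY KEY, pdb_accession TEXT UNIQUE);
    -- Link table: one row per (protein, structure) pair
    CREATE TABLE Genbanks_Pdbs (
        genbank_id INTEGER REFERENCES Genbanks (genbank_id),
        pdb_id INTEGER REFERENCES Pdbs (pdb_id),
        PRIMARY KEY (genbank_id, pdb_id)
    );
    INSERT INTO Genbanks VALUES (1, 'PROT_A'), (2, 'PROT_B');
    INSERT INTO Pdbs VALUES (1, 'PDB_X'), (2, 'PDB_Y');
    INSERT INTO Genbanks_Pdbs VALUES (1, 1), (1, 2), (2, 1);
    """
)


def pdbs_for_protein(connection, genbank_accession):
    """Return the PDB accessions linked to one protein via the link table."""
    rows = connection.execute(
        "SELECT p.pdb_accession FROM Pdbs AS p "
        "JOIN Genbanks_Pdbs AS gp ON gp.pdb_id = p.pdb_id "
        "JOIN Genbanks AS g ON g.genbank_id = gp.genbank_id "
        "WHERE g.genbank_accession = ? ORDER BY p.pdb_accession",
        (genbank_accession,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the link table, one protein can have several structures and one structure can be shared by several proteins, which a 1-1 relationship could not represent.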
What's Changed
- Trouble shoot extract seqs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/75
- Fix blank pdb accs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/79
- Fix log table contents by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/81
- Fix parsing config and selecting candidates of interest by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/83
- Tidy and update docs by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/84
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.5...v2.0.6
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - v2.0.5
What's Changed
Pull Requests
- Trouble shoot uniprot by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/73
- Trouble shoot getting data from NCBI by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/74
Details
- UniProt: cazy_webscraper can now be used successfully for retrieving data from UniProt and adding the data to the local CAZyme database. This includes retrieving:
  - UniProt accessions
  - Protein names
  - Protein sequences
  - EC number annotations
  - PDB accessions
- GenBank: cazy_webscraper can now be used to automate the retrieval of protein sequences from GenBank for proteins in a local CAZyme database matching the user's specified criteria. These protein sequences are stored in the local CAZyme database, and can be extracted to a FASTA file using cazy_webscraper
- Caching:
  - More data is cached
  - Cached data can be used to continue data retrievals from UniProt and GenBank, when a previous retrieval and/or addition of the data to the database fails
  - Improved default name of cache dirs and subdirs
- Unit tests: Started rewrite of unit tests to match the new program architecture
- Documentation: Updating the documentation to include the new flags/options, and adding new tutorials for automating the retrieval of data from UniProt, GenBank and PDB
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.3...v2.0.5
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - v2.0.3
v2.0.3
Beta release of version 2.
Bug fixes
- Fixes making only the parent dirs of an output database path
- Fixes not finding the cazy_webscraper module
What's Changed
- Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/68
- Update unit tests by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/70
- update v number by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/71
- Fix output dir making by @HobnobMancer in https://github.com/HobnobMancer/cazy_webscraper/pull/72
Full Changelog: https://github.com/HobnobMancer/cazy_webscraper/compare/v2.0.0...v2.0.3
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - v2.0.0
Beta release of version 2.
- Python
Published by HobnobMancer over 4 years ago
cazy-webscraper - pre_version_1_release
This release:
- changes the call command for cazy_webscraper from cazy_webscraper.py to cazy_webscraper
- fixes typos in README and docs
- cazy_webscraper can now build all parent directories for a user specified output directory
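Building every missing parent directory of an output path is a one-liner with the standard library. The sketch below shows the general technique only; `make_output_dir` is a hypothetical helper, not a function from cazy_webscraper.

```python
import tempfile
from pathlib import Path


def make_output_dir(output_dir):
    """Create output_dir along with any missing parent directories."""
    path = Path(output_dir)
    # parents=True builds intermediate directories; exist_ok=True tolerates reruns
    path.mkdir(parents=True, exist_ok=True)
    return path


# Usage: nested directories are created in one call inside a throwaway temp dir
with tempfile.TemporaryDirectory() as tmp:
    created = make_output_dir(Path(tmp) / "results" / "cazy" / "run_1")
    print(created.is_dir())
```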
- Python
Published by HobnobMancer about 5 years ago
cazy-webscraper - Installation integration
This release was created for integration into Bioconda and PyPI, with the main section of cazy_webscraper complete (for the retrieval of protein data from CAZy) and prior to the overhaul of the expand module.
- Python
Published by HobnobMancer about 5 years ago
cazy-webscraper - Zenodo citation
New release of package for Zenodo tagging to facilitate citing the programme.
- Python
Published by HobnobMancer over 5 years ago
cazy-webscraper - First release
This is the first release of cazy_webscraper.
Current features include:
- Producing dataframes of the protein data in CAZy
- Retrieving protein sequences of scraped CAZymes from GenBank
- Retrieving protein structures of CAZymes from PDB
- Python
Published by HobnobMancer over 5 years ago