Recent Releases of wpextract
wpextract -
- WPextract can now be installed with Python 3.13 and no longer specifies a hard upper Python bound.
- Python
Published by github-actions[bot] about 1 year ago
wpextract -
Features & Improvements
- WPextract is now completely type-annotated, and ships a `py.typed` file to indicate this
- Added `--user-agent` argument to `wpextract download` to allow customisation of the user agent string
- HTTP errors raised when downloading now all inherit from a common `HTTPError` class
- If an HTTP error is encountered while downloading, it will no longer end the whole scrape process. A warning will be logged and the scrape will continue, and if some data was obtained for that type, it will be saved as normal. HTTP transit errors (e.g. connection timeouts) will still end the scrape process.
- Improved the resiliency of HTML parsing and extraction by better checking for edge cases like missing attributes
- Translation picker extractors will now raise an exception if elements are missing during the extraction process.
- Simplified the WordPress API library by removing now-unused cache functionality. This will likely improve memory usage of the download process.
- Significantly more tests have been added, particularly for the download process
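
The error-handling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not WPextract's actual code: the `HTTPError404` and `TransitError` classes, the `scrape` function, and the shape of `fetchers` are all hypothetical names chosen for the example. Only the idea of a common `HTTPError` base class comes from the notes.

```python
import logging

logger = logging.getLogger("scrape_sketch")


class HTTPError(Exception):
    """Hypothetical common base class for HTTP status errors."""


class HTTPError404(HTTPError):
    """Hypothetical subclass: the endpoint was not found."""


class TransitError(Exception):
    """Hypothetical transport failure (e.g. a connection timeout)."""


def scrape(fetchers):
    """Run each fetcher; skip a data type on HTTPError, abort on anything else.

    `fetchers` maps a data type name to a generator function yielding items.
    """
    results = {}
    for data_type, fetch in fetchers.items():
        items = []
        try:
            for item in fetch():
                items.append(item)
        except HTTPError as exc:
            # Log and carry on; keep whatever was obtained before the failure.
            logger.warning("HTTP error while fetching %s: %s", data_type, exc)
        if items:
            # Partial data for this type is still saved as normal.
            results[data_type] = items
    return results
```

Because `TransitError` does not inherit from `HTTPError` in this sketch, it is not caught and still ends the whole run, mirroring the distinction drawn in the notes.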
Fixes
- Fixed the scrape crawling step crashing if a page didn't have a canonical link or `og:url` meta tag
- Fixed the scrape crawling not correctly recognising when duplicate URLs were encountered. Previously duplicates would be included, but only one would be used. Now, they will be correctly logged. As a result of this change, the `SCRAPE_CRAWL_VERSION` has been incremented, meaning running extraction on a scrape will cause it to be re-crawled.
- Fixed the return type annotation of `LangPicker.get_root()`: it is now annotated as (`bs4.Tag` or `None`) instead of `bs4.PageElement`. This shouldn't be a breaking change, as the expected return type of this function was always a `Tag` object or `None`.
- Type of `TranslationLink.lang` changed to reflect that it can accept a string to resolve or an already resolved `Language` instance
- Fixed downloading throwing an error stating the WordPress v2 API was not supported in other error cases
- Fixed the maximum redirects permitted not being set properly, meaning the effective value was always 30
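
As a rough illustration of the duplicate URL fix, the sketch below keeps the first file seen for each URL and logs any later file that resolves to the same URL. The function name `index_by_url` and the `(path, url)` input shape are assumptions made for the example; the real crawl step may track and resolve duplicates differently.

```python
import logging

logger = logging.getLogger("crawl_sketch")


def index_by_url(pages):
    """Map each page URL to its file path, logging duplicates instead of hiding them.

    `pages` is an iterable of (path, url) pairs; both names are illustrative.
    """
    found = {}
    duplicates = []
    for path, url in pages:
        if url in found:
            # A second file resolved to a URL we already have: record and log it.
            logger.warning(
                "Duplicate URL %s: %s already seen as %s", url, path, found[url]
            )
            duplicates.append((path, url))
            continue
        found[url] = path
    return found, duplicates
```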
Documentation
- Improved guide on translation parsing, correcting some errors and adding information on parse robustness and performance
Published by github-actions[bot] over 1 year ago
wpextract -
Changes
- Added missing `wpextract.__version__` attribute (#36)
- Added `<table>`s to the elements to be ignored when extracting article text (#40)
Fixes
- Fixed incorrect behaviour extracting article text where only the first element to ignore (e.g. `figcaption`) would be ignored (#40)
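
To make the intended behaviour concrete, here is a small standard-library sketch that skips text inside every ignored element, not just the first one encountered. It is not the library's implementation: `IGNORED_TAGS`, `VOID_TAGS`, `TextExtractor` and `extract_text` are illustrative names, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Illustrative ignore set; the release notes name figcaption and table.
IGNORED_TAGS = {"figcaption", "table"}
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextExtractor(HTMLParser):
    """Collect text, skipping everything nested inside any ignored tag."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in IGNORED_TAGS or self._skip_depth:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The key point is that the ignore check runs for every start tag, so a `figcaption` followed later by a `table` are both skipped.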
Documentation
- Added proper references to the documentation of the `langcodes` library (#38)
Published by github-actions[bot] over 1 year ago
wpextract -
- Fixed not explicitly declaring dependency on `urllib3` (#32)
- Improved CLI performance with lazy imports of library functionality (#33)
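
The lazy import technique mentioned above generally looks something like the following sketch, where a subcommand's dependencies are imported inside its handler rather than at module load time. The `tool` program, the `stats` subcommand and the use of `statistics` as a stand-in for a heavy dependency are all made up for illustration and do not reflect WPextract's real CLI.

```python
import argparse


def run_stats(args):
    # Deferred import: `statistics` stands in for a heavy dependency and is only
    # loaded when this subcommand actually runs.
    import statistics

    return statistics.mean(args.values)


def build_parser():
    parser = argparse.ArgumentParser(prog="tool")
    sub = parser.add_subparsers(dest="command", required=True)
    stats = sub.add_parser("stats", help="average some numbers")
    stats.add_argument("values", type=float, nargs="+")
    stats.set_defaults(func=run_stats)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

With this layout, invocations such as `--help` only pay for building the argument parser, since the handler's imports are never executed.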
Published by github-actions[bot] over 1 year ago
wpextract - 1.0.1post1
Post-release to facilitate migrating docs to RtD.
Published by github-actions[bot] over 1 year ago
wpextract -
Bug Fixes
- Fixed an incorrect repository URL in the package metadata and CLI epilog (#29)
Published by github-actions[bot] over 1 year ago
wpextract -
Released: 11th July 2024
This release is a major overhaul of the tool including built-in download functionality.
- Moved the extraction functionality to the `wpextract extract` subcommand
- Integrated a heavily modified version of WPJsonScraper as the `wpextract download` subcommand
- Renamed the main package of this library to `wpextract` to match the CLI tool name
- Support extraction without an HTML scrape if translations aren't needed
- Support extracting only some of the possible data types
- Support sites without Yoast SEO plugin
- Added online documentation
Published by github-actions[bot] over 1 year ago