Recent Releases of hydra-scraper
hydra-scraper - v0.9.5
Additional requirement: Pillow
- New serialisation to NFDIcore/CTO v3
- Revise intermediate data structure for NFDIcore/CTO v3
- Enhanced lookup to accommodate NFDIcore/CTO v3
- Automatic output of N-Triples instead of Turtle for NFDIcore/CTO v3
- Include SCHEMA.associatedMedia and SCHEMA.encoding in schema.org media extraction
- Improve XML and LIDO extraction, especially of related people and organisations
- Add Getty TGN and further normalise Getty AAT identifiers
- Add option to download media files and produce image thumbnails
- Python
Published by jonatansteller 6 months ago
hydra-scraper - v0.9.4
- Remove local file paths from source-file path
- Ignore LIDO files that have no LIDO content
- Various fixes to deal with faulty URIs
- Fix authentication issues when looking for a robots.txt
- Support Basic Auth via --ba_username and --ba_password
Published by jonatansteller 10 months ago
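
The Basic Auth options introduced in v0.9.4 can be illustrated with a minimal sketch. Only the two flag names come from the release notes; the program name, the helper functions, and the way the header is built are assumptions for illustration and do not claim to mirror hydra-scraper's own code.

```python
import argparse
import base64


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the two Basic Auth flags named in the release notes."""
    # "example" is a hypothetical program name, not the real entry point.
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--ba_username", default=None)
    parser.add_argument("--ba_password", default=None)
    return parser


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic Auth header: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Parse a sample command line and derive the header to send with each request.
args = build_parser().parse_args(
    ["--ba_username", "alice", "--ba_password", "secret"]
)
headers = basic_auth_header(args.ba_username, args.ba_password)
```

The same header would have to accompany the robots.txt request as well, which is presumably what the authentication fix in this release addresses.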
hydra-scraper - v0.9.3
- Further LIDO fixes and enhancements
- Better error handling to make sure KeyboardInterrupt always works
Published by jonatansteller 10 months ago
hydra-scraper - v0.9.2
- Retry fetching remote files in case of 5xx responses
- Use file size to calculate RDFLib/pyoxigraph switch
- Enhance LIDO conversion when image sizes are not indicated
- Fix issue in CTO conversion where lists are used instead of literals
Published by jonatansteller about 1 year ago
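
Retrying on 5xx responses, as added in v0.9.2, might look roughly like the sketch below. It is written against a generic fetch callable rather than a specific HTTP client; the function name, the attempt limit, and the delay are hypothetical and not taken from hydra-scraper.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (status code, body)


def fetch_with_retries(
    fetch: Callable[[str], Response],
    url: str,
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Response:
    """Call fetch(url) again for as long as the server answers with 5xx."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        # Anything outside 500-599 is final, including 4xx client errors.
        if not 500 <= status <= 599:
            return status, body
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    # All attempts returned 5xx: hand back the last response.
    return status, body


# Stand-in for a real HTTP call: two server errors, then success.
responses = iter([(503, b""), (502, b""), (200, b"ok")])
calls = []


def fake_fetch(url: str) -> Response:
    calls.append(url)
    return next(responses)


result = fetch_with_retries(fake_fetch, "https://example.org/feed")
```

Client errors such as 404 are returned immediately, since repeating the request is unlikely to change the outcome.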
hydra-scraper - v0.9.1
- Updated nfdicore/cto structure with altered prepare parameter
Published by jonatansteller over 1 year ago
hydra-scraper - v0.9.0
- Full rewrite with a modular architecture
- Any combination of Feed and FeedElement
- Support for RDF (schema.org), XML (CMIF, LIDO), Beacon, ZIP ingest
- Log but accept missing feed elements
- Less memory hoarding with large datasets
- Look-up routine for authority files
- Single template to generate nfdicore/cto triples
- Template adapted to current nfdicore/cto version
- Automatically create ARK IDs for nfdicore/cto
- Prep work for further serialisations such as DCAT
- New command-line interface and argument parsing
- A -quiet option prevents reporting intermediate progress
- Provide optional OCI (Podman/Docker) container set-up
- Observe rules laid out in robots.txt files
- Recognise http and https namespaces in schema.org sources
- Provide log files for scraping runs
- Switch to httpx
Published by jonatansteller over 1 year ago
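
Observing robots.txt rules, as listed for v0.9.0, can be sketched with Python's standard library alone. The sample rules, the user agent string, and the helper name are assumptions for illustration; how hydra-scraper performs this check internally is not described in the release notes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a server might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(ROBOTS_TXT, "example-bot", "https://example.org/feed/1")
blocked = is_allowed(ROBOTS_TXT, "example-bot", "https://example.org/private/1")
```

Parsing the text directly, instead of letting the parser download it, keeps the check independent of the HTTP client, so the file can be fetched with whatever authentication the server requires.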