Recent Releases of hydra-scraper

hydra-scraper - v0.9.5

Additional requirement: Pillow

New serialisation to NFDIcore/CTO v3
Revise intermediate data structure for NFDIcore/CTO v3
Enhanced lookup to accommodate NFDIcore/CTO v3
Automatic output of N-Triples instead of Turtle for NFDIcore/CTO v3
Include SCHEMA.associatedMedia and SCHEMA.encoding in schema.org media extraction
Improve XML and LIDO extraction, especially of related people and organisations
Add Getty TGN and further normalise Getty AAT identifiers
Add option to download media files and produce image thumbnails

- Python
Published by jonatansteller 10 months ago

hydra-scraper - v0.9.4

Remove local file paths from source-file path
Ignore LIDO files that have no LIDO content
Various fixes to deal with faulty URIs
Fix authentication issues when looking for a robots.txt
Support Basic Auth via --ba_username and --ba_password
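
For readers unfamiliar with Basic Auth, the sketch below shows roughly how credentials passed via --ba_username and --ba_password could be turned into an HTTP header. It is an illustration only, not hydra-scraper's actual code: the helper name build_basic_auth_header and the sample credentials are assumptions, and only the two option names come from the release note.

```python
import argparse
import base64


def build_basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Return an Authorization header for HTTP Basic Auth (hypothetical helper)."""
    # Basic Auth is the base64 encoding of "username:password"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Option names are taken from the release note; everything else is assumed
parser = argparse.ArgumentParser()
parser.add_argument("--ba_username")
parser.add_argument("--ba_password")

# Sample credentials, passed explicitly so the sketch runs without a real command line
args = parser.parse_args(["--ba_username", "alice", "--ba_password", "secret"])
headers = build_basic_auth_header(args.ba_username, args.ba_password)
```

Basic Auth only encodes the credentials and does not encrypt them, so a header like this should only be sent over https.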

- Python
Published by jonatansteller about 1 year ago

hydra-scraper - v0.9.3

Further LIDO fixes and enhancements
Better error handling to make sure KeyboardInterrupt always works

- Python
Published by jonatansteller about 1 year ago

hydra-scraper - v0.9.2

Retry fetching remote files in case of 5xx responses
Use file size to calculate RDFLib/pyoxigraph switch
Enhance LIDO conversion when image sizes are not indicated
Fix issue in CTO conversion where lists are used instead of literals
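
The retry behaviour for 5xx responses mentioned above can be sketched as follows. This is a minimal, library-independent illustration and not the project's implementation: fetch_with_retry, its parameters, and the fake fetch function are all hypothetical, and the number of attempts and the delay are assumed values.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], tuple[int, bytes]],
    max_attempts: int = 3,
    delay: float = 0.0,
) -> tuple[int, bytes]:
    """Call fetch() until it returns a non-5xx status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        # Only server errors (500-599) are worth retrying
        if not 500 <= status < 600:
            return status, body
        if attempt < max_attempts:
            time.sleep(delay)
    # All attempts failed: hand back the last server error
    return status, body


# Fake remote file that fails twice before succeeding (assumed data)
responses = iter([(503, b""), (502, b""), (200, b"ok")])
calls = []


def fake_fetch() -> tuple[int, bytes]:
    calls.append(1)
    return next(responses)


result = fetch_with_retry(fake_fetch, max_attempts=3)
```

Client errors such as 404 are returned immediately in this sketch, since repeating the request is unlikely to change the outcome.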

- Python
Published by jonatansteller over 1 year ago

hydra-scraper - v0.9.1

Update nfdicore/cto structure with altered prepare parameter

- Python
Published by jonatansteller over 1 year ago

hydra-scraper - v0.9.0

Full rewrite with a modular architecture
Any combination of Feed and FeedElement
Support for RDF (schema.org), XML (CMIF, LIDO), Beacon, ZIP ingest
Log but accept missing feed elements
Less memory hoarding with large datasets
Look-up routine for authority files
Single template to generate nfdicore/cto triples
Template adapted to current nfdicore/cto version
Automatically create ARK IDs for nfdicore/cto
Prep work for further serialisations such as DCAT
New command-line interface and argument parsing
A -quiet option prevents reporting intermediate progress
Provide optional OCI (Podman/Docker) container set-up
Observe rules laid out in robots.txt files
Recognise http and https namespaces in schema.org sources
Provide log files for scraping runs
Switch to httpx
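
As an illustration of what observing robots.txt rules involves, the sketch below checks two URLs against a small rule set using Python's standard urllib.robotparser module. It is not taken from hydra-scraper: the rules, the user agent name example-bot, and the example.org URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; a real run would download it from the server
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here
parser.parse(robots_txt.splitlines())

# Hypothetical user agent and URLs
allowed = parser.can_fetch("example-bot", "https://example.org/feed/1")
blocked = parser.can_fetch("example-bot", "https://example.org/private/1")
```

A scraper following this approach would skip any URL for which can_fetch returns False.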

- Python
Published by jonatansteller over 1 year ago

hydra-scraper - v0.8.4

- Python
Published by jonatansteller over 2 years ago

hydra-scraper - v0.8.3

- Python
Published by jonatansteller over 2 years ago

hydra-scraper - v0.8.2

- Python
Published by jonatansteller over 2 years ago