Recent Releases of hydra-scraper

hydra-scraper - v0.9.5

Additional requirement: Pillow

  • New serialisation to NFDIcore/CTO v3
  • Revise intermediate data structure for NFDIcore/CTO v3
  • Enhanced lookup to accommodate NFDIcore/CTO v3
  • Automatic output of N-Triples instead of Turtle for NFDIcore/CTO v3
  • Include SCHEMA.associatedMedia and SCHEMA.encoding in schema.org media extraction
  • Improve XML and LIDO extraction, especially of related people and organisations
  • Add Getty TGN and further normalise Getty AAT identifiers
  • Add option to download media files and produce image thumbnails

- Python
Published by jonatansteller 6 months ago

hydra-scraper - v0.9.4

  • Remove local file paths from source-file path
  • Ignore LIDO files that have no LIDO content
  • Various fixes to deal with faulty URIs
  • Fix authentication issues when looking for a robots.txt
  • Support Basic Auth via --ba_username and --ba_password

- Python
Published by jonatansteller 10 months ago
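
The Basic Auth support in v0.9.4 can be pictured with a minimal sketch. Only the flag names --ba_username and --ba_password come from the notes above; the helper names (build_parser, basic_auth_header, headers_from_args) are hypothetical, and hydra-scraper's actual implementation may differ.

```python
import argparse
import base64


def build_parser() -> argparse.ArgumentParser:
    # Only the two flag names are taken from the release notes;
    # everything else in this parser is an illustrative assumption.
    parser = argparse.ArgumentParser(description="Basic Auth sketch")
    parser.add_argument("--ba_username", default=None)
    parser.add_argument("--ba_password", default=None)
    return parser


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    # HTTP Basic Auth: base64-encode "username:password" and prefix "Basic ".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def headers_from_args(argv: list[str]) -> dict[str, str]:
    # Return an Authorization header only when both credentials are given.
    args = build_parser().parse_args(argv)
    if args.ba_username is None or args.ba_password is None:
        return {}
    return basic_auth_header(args.ba_username, args.ba_password)
```

Returning an empty dict when either credential is missing keeps unauthenticated requests unchanged in this sketch.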

hydra-scraper - v0.9.3

  • Further LIDO fixes and enhancements
  • Better error handling to make sure KeyboardInterrupt always works

- Python
Published by jonatansteller 10 months ago

hydra-scraper - v0.9.2

  • Retry fetching remote files in case of 5xx responses
  • Use file size to calculate RDFLib/pyoxigraph switch
  • Enhance LIDO conversion when image sizes are not indicated
  • Fix issue in CTO conversion where lists are used instead of literals

- Python
Published by jonatansteller about 1 year ago
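
The retry behaviour for 5xx responses in v0.9.2 could look roughly like the sketch below. It is not hydra-scraper's code: fetch_with_retry, its parameters, and the linear back-off are assumptions, and the fetch callable stands in for whatever HTTP client is used.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_tries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    # 'fetch' is a hypothetical callable returning (status_code, body).
    status, body = fetch(url)
    for attempt in range(1, max_tries):
        # Only server errors (5xx) are retried; anything else is final.
        if not 500 <= status <= 599:
            break
        # Assumed linear back-off: wait a little longer before each retry.
        sleep(delay * attempt)
        status, body = fetch(url)
    return status, body
```

Passing the sleep function as a parameter is a design choice of this sketch that makes the back-off easy to disable in tests.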

hydra-scraper - v0.9.1

  • Update nfdicore/cto structure with altered prepare parameter

- Python
Published by jonatansteller over 1 year ago

hydra-scraper - v0.9.0

  • Full rewrite with a modular architecture
  • Any combination of Feed and FeedElement
  • Support for RDF (schema.org), XML (CMIF, LIDO), Beacon, ZIP ingest
  • Log but accept missing feed elements
  • Less memory hoarding with large datasets
  • Look-up routine for authority files
  • Single template to generate nfdicore/cto triples
  • Template adapted to current nfdicore/cto version
  • Automatically create ARK IDs for nfdicore/cto
  • Prep work for further serialisations such as DCAT
  • New command-line interface and argument parsing
  • A -quiet option prevents reporting intermediate progress
  • Provide optional OCI (Podman/Docker) container set-up
  • Observe rules laid out in robots.txt files
  • Recognise http and https namespaces in schema.org sources
  • Provide log files for scraping runs
  • Switch to httpx

- Python
Published by jonatansteller over 1 year ago
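
Observing robots.txt rules, as introduced in v0.9.0, can be illustrated with Python's standard urllib.robotparser module. This is a minimal sketch under assumptions: the helper names, the user agent string, and the example rules are hypothetical, and the release notes do not say which mechanism hydra-scraper uses.

```python
from urllib.robotparser import RobotFileParser


def build_robots(lines: list[str]) -> RobotFileParser:
    # Parse robots.txt content that has already been downloaded,
    # so the sketch itself needs no network access.
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    # True if the parsed rules allow this user agent to request the URL.
    return parser.can_fetch(user_agent, url)


# Hypothetical rules and user agent, for illustration only.
rules = build_robots(["User-agent: *", "Disallow: /private/"])
allowed = may_fetch(rules, "example-scraper", "https://example.org/feed")
blocked = may_fetch(rules, "example-scraper", "https://example.org/private/x")
```

A scraper following this pattern would skip any URL for which may_fetch returns False before issuing the request.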

hydra-scraper - v0.8.4

- Python
Published by jonatansteller over 2 years ago

hydra-scraper - v0.8.3

- Python
Published by jonatansteller over 2 years ago

hydra-scraper - v0.8.2

- Python
Published by jonatansteller over 2 years ago