hydra-scraper
Comprehensive scraper for paginated APIs, RDF, XML, file dumps, and Beacon files
Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.2%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 4
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 9
Metadata Files
README.md
Hydra Scraper
Comprehensive scraper for paginated APIs, RDF, XML, file dumps, and Beacon files
This scraper provides a command-line toolset to pull data from various sources, such as Hydra paginated schema.org APIs, Beacon files, or file dumps. The tool differentiates between feeds and their elements in RDF-compatible formats such as JSON-LD or Turtle, but it can also handle XML files using, for example, the LIDO schema. Command-line calls can be combined and adapted to build fully-fledged scraping mechanisms, including the ability to output a set of triples. The script was originally developed as an API testing tool for the Corpus Vitrearum Germany (CVMA) at the Academy of Sciences and Literature Mainz. It was later expanded for use in the Culture Knowledge Graph at NFDI4Culture around the Culture Graph Interchange Format (CGIF) and the NFDIcore ontology with the CTO module.
Licence
Written and maintained by Jonatan Jalle Steller.
This code is covered by the MIT licence.
Installation
There are two options to run this script: as a regular Python command-line tool
or in an OCI container (Podman/Docker). For the regular install, clone this
repository (e.g. `git clone https://github.com/digicademy/hydra-scraper.git` or
the SSH equivalent). Then open a terminal in the resulting folder and run
`pip install -r requirements.txt` to install the dependencies. To use the
script, run the commands listed under "Examples" below.
To run the container instead, which brings along all required Python tooling,
clone this repo and open a terminal in the resulting folder. In the command
examples below, replace `python go.py` with `podman compose run hydra-scraper`.
Usage
The scraper is a command-line tool. Use these main configuration options to indicate what kind of scraping run you desire.
- `-l` or `--location <url or folder or file>`: source URI, folder, or file path
- `-f` or `--feed <value>`: type of feed or starting point for the scraping run:
  - `beacon`: a local or remote text file listing one URI per line (Beacon)
  - `cmif`: a local or remote CMIF file
  - `folder`: a local folder or a local/remote ZIP archive of individual files
  - `schema`: an RDF-based, optionally Hydra-paginated schema.org API or embedded metadata (CGIF)
  - `schema-list`: same as above, but using the triples in individual schema.org files
- `-e` or `--elements <value>`: element markup to extract data from during the scraping run (leave out to not extract data):
  - `lido`: use LIDO files
  - `schema`: use RDF triples in a schema.org format (CGIF)
- `-o` or `--output <value> <value>`: outputs to produce in the scraping run:
  - `beacon`: a text file listing one URI per line
  - `csv`: a CSV table of data
  - `cto`: NFDI4Culture-style triples
  - `cto3`: NFDI4Culture-style triples (CTO v3, to become just `cto` when v2 is removed)
  - `media`: associated media files
  - `files`: the original files
  - `triples`: the original triples

In addition, and depending on the main configuration, you can specify these additional options:

- `-n` or `--name <string>`: name of the subfolder to download data to
- `-d` or `--dialect <string>`: content type to use for requests
- `-i` or `--include <string>`: filter for feed element URIs to include
- `-r` or `--replace <string>`: string to replace in feed element URIs
- `-rw` or `--replace_with <string>`: string to replace the previous one with
- `-a` or `--append <string>`: addition to the end of each feed element URI
- `-af` or `--add_feed <uri>`: URI of a data feed to bind members to
- `-ac` or `--add_catalog <uri>`: URI of a data catalog the data feed belongs to
- `-ap` or `--add_publisher <uri>`: URI of the data publisher
- `-c` or `--clean <string> <string>`: strings to remove from feed element URIs to build their file names
- `-p` or `--prepare <string> <string>`: prepare `cto` output for this NFDI4Culture feed and catalog ID
- `-bu` or `--ba_username <string>`: Basic Auth username for requests
- `-bp` or `--ba_password <string>`: Basic Auth password for requests
- `-q` or `--quiet`: do not display status messages
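The URI-manipulation options can be hard to picture in the abstract. The following standalone Python sketch is not hydra-scraper's own code (both function names and the example URI are made up for illustration); it only shows the kind of string rewriting that `--replace`/`--replace_with`/`--append` and `--clean` describe:

```python
# Illustration only, not hydra-scraper's implementation: the string rewriting
# that the --replace/--replace_with/--append and --clean options describe.

def rewrite_uri(uri: str, replace: str = "", replace_with: str = "", append: str = "") -> str:
    # --replace/--replace_with swap a substring in each feed element URI;
    # --append adds a suffix to the end of it
    if replace:
        uri = uri.replace(replace, replace_with)
    return uri + append

def clean_to_filename(uri: str, clean: list[str]) -> str:
    # --clean removes the given strings from a URI to derive a file name
    for part in clean:
        uri = uri.replace(part, "")
    return uri

element = "https://example.org/id/F12345"  # hypothetical feed element URI
print(rewrite_uri(element, append="/about.json"))
# https://example.org/id/F12345/about.json
print(clean_to_filename(element + "/about.json", ["https://example.org/id/", "/about.json"]))
# F12345
```

This mirrors how the examples below pair `-a /about.json`-style suffixes with `-c` prefixes to turn element URIs into short file names.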
Examples
The commands listed below illustrate possible command-line arguments. They
refer to specific projects that use this scraper, but the commands should work
with any other page using the indicated formats. If you run the container,
replace `python go.py` with `podman compose run hydra-scraper`. You may need
to use `docker` instead of `podman`, or `python3` instead of `python`.
NFDI4Culture
Original triples from the Culture Information Portal:

```bash
python go.py -l https://nfdi4culture.de/resource.ttl -f schema-list -o triples -n n4c-portal
```

NFDIcore/CTO triples from a local or remote CGIF/schema.org feed (embedded):

```bash
python go.py -l https://corpusvitrearum.de/cvma-digital/bildarchiv.html -f schema -e schema -o cto -n n4c-cgif -p E5308 E4229
```

NFDIcore/CTO triples from a local or remote CGIF/schema.org feed (API):

```bash
python go.py -l https://gn.biblhertz.it/fotothek/seo -f schema -e schema -o cto -n n4c-cgif-api -p E6064 E4244
```

NFDIcore/CTO triples from a local or remote Beacon-like feed of CGIF/schema.org files:

```bash
python go.py -l downloads/n4c-cgif/beacon.txt -f beacon -e schema -o cto -n n4c-cgif-beacon -a /about.cgif -p E5308 E4229
```

NFDIcore/CTO triples from a local or remote ZIP file containing CGIF/schema.org files:

```bash
python go.py -l downloads/n4c-cgif.zip -f folder -e schema -o cto -n n4c-cgif-zip -af https://corpusvitrearum.de/cvma-digital/bildarchiv.html -p E5308 E4229
```

NFDIcore/CTO triples from a local folder containing CGIF/schema.org files:

```bash
python go.py -l downloads/n4c-cgif/files -f folder -e schema -o cto -n n4c-cgif-folder -p E5308 E4229
```

NFDIcore/CTO triples from a local or remote Beacon-like feed of LIDO files (feed URI added because it is not in the data):

```bash
python go.py -l downloads/n4c-cgif/beacon.txt -f beacon -e lido -o cto -n n4c-lido -a /about.lido -af https://corpusvitrearum.de/cvma-digital/bildarchiv.html -p E5308 E4229
```

NFDIcore/CTO triples from a local or remote ZIP file containing LIDO files:

```bash
python go.py -l downloads/n4c-bildindex.zip -f folder -e lido -o cto -n n4c-bildindex -af https://www.bildindex.de/ete?action=objectMode -ac https://www.bildindex.de/ -p E6161 E2916
```

NFDIcore/CTO triples from a local folder containing LIDO files:

```bash
python go.py -l downloads/n4c-lido/files -f folder -e lido -o cto -n n4c-lido-folder -af https://corpusvitrearum.de/cvma-digital/bildarchiv.html -p E5308 E4229
```
Corpus Vitrearum Germany
Files and triples from JSON-LD data:

```bash
python go.py -l https://corpusvitrearum.de/id/about.json -f schema-list -o files triples -n cvma-jsonld -i https://corpusvitrearum.de/id/F -c https://corpusvitrearum.de/id/ /about.json
```

Files and triples from RDF/XML data:

```bash
python go.py -l https://corpusvitrearum.de/id/about.rdf -f schema-list -o files triples -n cvma-rdfxml -i https://corpusvitrearum.de/id/F -c https://corpusvitrearum.de/id/ /about.rdf
```

Files and triples from Turtle data:

```bash
python go.py -l https://corpusvitrearum.de/id/about.ttl -f schema-list -o files triples -n cvma-turtle -i https://corpusvitrearum.de/id/F -c https://corpusvitrearum.de/id/ /about.ttl
```

Beacon, CSV table, NFDIcore/CTO, files, and triples from CGIF/schema.org (embedded) data:

```bash
python go.py -l https://corpusvitrearum.de/cvma-digital/bildarchiv.html -f schema -e schema -o beacon csv cto files triples -n cvma-cgif -p E5308 E4229 -c https://corpusvitrearum.de/id/
```

Beacon, CSV table, NFDIcore/CTO, files, and triples from CGIF/schema.org (API) data:

```bash
python go.py -l https://corpusvitrearum.de/id/about.cgif -f schema -e schema -o beacon csv cto files triples -n cvma-cgif-api -p E5308 E4229
```

Beacon, CSV table, NFDIcore/CTO, and files from LIDO data:

```bash
python go.py -l https://corpusvitrearum.de/cvma-digital/bildarchiv.html -f schema-list -e lido -o beacon csv cto files -n cvma-lido -a /about.lido -c https://corpusvitrearum.de/id/ /about.lido
```
Contributing
The file `go.py` executes a regular scraping run via several base modules that can also be used independently:

- `organise` provides the `Organise` object to collect and clean configuration info. It also creates the required folders, sets up logging, and uses an additional `Progress` object to show progress messages to users.
- `job` provides the `Job` object to orchestrate a single scraping run. It contains the feed pagination and data collation logic.
- `file` provides the `File` object to retrieve a remote or local file. It also contains logic to identify file types and parse RDF or XML.
- `data` provides the data storage objects `Uri`, `UriList`, `Label`, `LabelList`, `UriLabel`, `UriLabelList`, `Date`, `DateList`, and `Incipit`. They include data serialisation logic and namespace normalisation.
- `extract` is a special module that provides `ExtractFeedInterface` and `ExtractFeedElementInterface`. These include generic functions to extract XML or RDF data.
- `map` is another special module that provides `MapFeedInterface` and `MapFeedElementInterface`. These include generic functions to generate text content or RDF triples.
- `lookup` provides a `Lookup` object as well as type lists for authority files. These can be used to identify whether a URI refers to a person, an organisation, a location, an event, or something else.
Two additional sets of classes use the extract and map interfaces to provide extraction and mapping routines for particular formats. These routines provide a Feed and/or a FeedElement object depending on what the format provides. These format-specific objects are called from the Job object listed above.
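As a rough sketch of that pattern: a format-specific class implements a generic extract interface and is invoked per feed. Only the interface name `ExtractFeedInterface` comes from the module list above; the method shown here is hypothetical and merely illustrates the split between generic interface and format-specific routine.

```python
# Hedged sketch of the interface/implementation split described above.
# ExtractFeedInterface is named in the module list; the element_uris method
# is made up for this illustration.

class ExtractFeedInterface:
    """Generic feed extraction: subclasses handle one concrete format."""
    def element_uris(self, raw: str) -> list[str]:
        raise NotImplementedError

class BeaconFeed(ExtractFeedInterface):
    """A Beacon-style feed simply lists one element URI per line."""
    def element_uris(self, raw: str) -> list[str]:
        return [line.strip() for line in raw.splitlines() if line.strip()]

feed = BeaconFeed()
print(feed.element_uris("https://example.org/1\nhttps://example.org/2\n"))
# ['https://example.org/1', 'https://example.org/2']
```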
If you change the code, please remember to document each object/function and walk other users or maintainers through significant steps. This package is governed by the Contributor Covenant code of conduct. Please keep this in mind in all interactions.
Releasing
Before you make a new release, make sure the following files are up to date:
- `base/file.py`: version number in user agent
- `CHANGELOG.md`: version number and changes
- `CITATION.cff`: version number, authors, and release date
- `LICENCE.txt`: release date
- `requirements.txt`: list of required libraries
- `setup.py`: version number and authors
Use GitHub to make the release. Use semantic versioning.
Roadmap
- Remove CTO2 along with two deprecated intermediate properties
- Merge retrieval functions of `File` and `MediaFile`
- Remove `rdfa` paths as the format is no longer supported by RDFLib
- Implement job presets/collections
- Convert `test.py` to something more sophisticated
- Automatically build OCI container via CI instead of on demand
- Add TEI ingest support based on Gregorovius
- Add OAI-PMH ingest support
- Add MEI ingest support
- Add DCAT serialisation
Further ideas
- Add support for ingesting CSV/JSON data
- Use the system's download folder instead of `downloads` and `unpack` to be able to distribute the package
- Fix setup file and release the package on PyPI
- Find a lightweight way to periodically update the RDF class lists
Owner
- Name: Digital Academy
- Login: digicademy
- Kind: organization
- Location: Mainz
- Website: https://www.adwmainz.de/digitalitaet/digitale-akademie.html
- Twitter: digicademy
- Repositories: 111
- Profile: https://github.com/digicademy
Digital Humanities at the Academy of Sciences and Literature Mainz
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
title: Hydra Scraper
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jonatan Jalle
    family-names: Steller
    email: jonatan.steller@adwmainz.de
    affiliation: Academy of Sciences and Literature Mainz
    orcid: 'https://orcid.org/0000-0002-5101-5275'
repository-code: >-
  https://github.com/digicademy/hydra-scraper
abstract: >-
  This scraper provides a command-line toolset to pull data
  from various sources, such as Hydra paginated schema.org
  APIs, Beacon files, or file dumps. The tool differentiates
  between feeds and their elements in RDF-compatible formats
  such as JSON-LD or Turtle, but it can also handle XML
  files using, for example, the LIDO schema. Command-line
  calls can be combined and adapted to build fully-fledged
  scraping mechanisms, including the ability to output a set
  of triples. The script was originally developed as an API
  testing tool for the Corpus Vitrearum Germany (CVMA) at
  the Academy of Sciences and Literature Mainz. It was later
  expanded for use in the Culture Knowledge Graph at
  NFDI4Culture around the Culture Graph Interchange Format
  (CGIF) and the NFDIcore ontology with the CTO module.
keywords:
  - scraping
  - Hydra Core Vocabulary
  - API
  - Python
  - Culture Graph Interchange Format
  - Culture Knowledge Graph
  - NFDI
  - NFDI4Culture
  - XML
  - LIDO
  - RDF
license: MIT
version: 0.9.6
date-released: '2025-08-27'
```
GitHub Events
Total
- Create event: 5
- Issues event: 2
- Release event: 5
- Watch event: 4
- Delete event: 2
- Issue comment event: 4
- Push event: 52
- Pull request review event: 1
- Pull request event: 4
Last Year
- Create event: 5
- Issues event: 2
- Release event: 5
- Watch event: 4
- Delete event: 2
- Issue comment event: 4
- Push event: 52
- Pull request review event: 1
- Pull request event: 4