dataset-register
Components (API and crawler) for the NDE Dataset Register
https://github.com/netwerk-digitaal-erfgoed/dataset-register
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: netwerk-digitaal-erfgoed
- License: eupl-1.2
- Language: TypeScript
- Default Branch: main
- Homepage: https://datasetregister.netwerkdigitaalerfgoed.nl/api/
- Size: 4.49 MB
Statistics
- Stars: 5
- Watchers: 2
- Forks: 3
- Open Issues: 116
- Releases: 0
Metadata Files
README.md
Dataset Register
This is the NDE Dataset Register, a service that helps users find and discover datasets.
Institutions (such as cultural heritage organizations) register their dataset descriptions with the NDE Dataset Register using its HTTP API. The Dataset Register builds an index by fetching, validating and periodically crawling dataset descriptions.
The HTTP API is documented at https://datasetregister.netwerkdigitaalerfgoed.nl/api.
See the Dataset Register Demonstrator, a client application for this repository’s HTTP API, for more background information (in Dutch).
Design principles
- The application follows modern standards and best practices.
- The application uses Linked Data Platform (LDP) for HTTP operations.
- The application prefers JSON-LD as the data exchange format.
- The application uses established Linked Data vocabularies, including Schema.org and DCAT.
Getting started
Validate dataset descriptions
Dataset descriptions must adhere to the Requirements for Datasets. You can check validity using the validate API call.
Submit dataset descriptions
To submit your dataset descriptions to the Dataset Register, use the datasets API call. URLs must be allowed before they can be added to the Register.
Search dataset descriptions
You can retrieve dataset descriptions registered by yourself and others from the SPARQL endpoint at https://datasetregister.netwerkdigitaalerfgoed.nl/sparql.
For example using Comunica:
comunica-sparql sparql@https://datasetregister.netwerkdigitaalerfgoed.nl/sparql 'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100'
Or curl:
curl -H Accept:application/sparql-results+json --data-urlencode 'query=select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100' https://datasetregister.netwerkdigitaalerfgoed.nl/sparql
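The same request can also be built programmatically. The sketch below, which assumes Node 18+ (global fetch), constructs the request body that the curl example URL-encodes; the network call itself is left commented out so the snippet stays self-contained:

```typescript
// Build the same SPARQL request body as the curl example above;
// URLSearchParams performs the --data-urlencode step.
const endpoint = 'https://datasetregister.netwerkdigitaalerfgoed.nl/sparql';
const query =
  'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100';
const body = new URLSearchParams({query});
console.log(body.toString().slice(0, 12)); // query=select

// Executing it requires network access:
// const response = await fetch(endpoint, {
//   method: 'POST',
//   headers: {Accept: 'application/sparql-results+json'},
//   body,
// });
// const results = await response.json();
```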
Automate registrations
If you want to automate dataset description registrations by connecting your (collection management) application to the Dataset Register, please see the HTTP API documentation.
Run the application
To run the application yourself (for instance if you’d like to contribute, which you’re very welcome to do), follow these steps. (As mentioned above, find the hosted version at https://datasetregister.netwerkdigitaalerfgoed.nl/api).
This application stores data in a QLever SPARQL store, so you need to have that running locally:
docker compose up
When it runs, you can start the application in development mode. Clone this repository and run:
npm install
npx nx serve api
You can open a local QLever UI at http://localhost:7002/default.
Run in production
To run the application in production, first compile and then run it. You may want to disable logging, which is enabled by default:
npx nx build api --configuration=production
LOG=false npm start
Configuration
You can configure the application through environment variables:
- SPARQL_URL: URL of the SPARQL store.
- SPARQL_ACCESS_TOKEN: access token for write operations on the SPARQL store.
- LOG: enable/disable logging (default: true).
- CRAWLER_SCHEDULE: a schedule in Cron format; for example 0 * * * * to crawl every hour (default: crawling disabled).
- REGISTRATION_URL_TTL: if crawling is enabled, a registered URL’s maximum age (in seconds) before it is fetched again (default: 86400, so one day).
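For illustration, this is how those variables might be read at startup. The `config` object and its field names are hypothetical, not the application's actual code; the defaults follow the documentation above:

```typescript
// Sketch: reading the documented configuration variables with their defaults.
const config = {
  sparqlUrl: process.env.SPARQL_URL, // URL of the SPARQL store
  sparqlAccessToken: process.env.SPARQL_ACCESS_TOKEN, // write access token
  log: process.env.LOG !== 'false', // logging is enabled by default
  crawlerSchedule: process.env.CRAWLER_SCHEDULE, // unset = crawling disabled
  registrationUrlTtl: Number(process.env.REGISTRATION_URL_TTL ?? 86400), // seconds
};
console.log(config.log, config.registrationUrlTtl);
```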
Run the tests
The tests are run automatically on CI.
To run the tests locally, clone this repository, then:
npm install
npm test
Components
Crawler
The crawler will periodically fetch registration URLs (schema:EntryPoint) to update the dataset descriptions stored in the Dataset Register.
To enable the crawler, set the CRAWLER_SCHEDULE configuration variable.
The crawler will then check all registration URLs according to that schedule to see if any of the URLs have become outdated.
A registration URL is considered outdated if it was last read longer than REGISTRATION_URL_TTL seconds ago (that is, its schema:dateRead is older than that).
Any outdated registration URLs are fetched again and their dataset descriptions are updated in the SPARQL store.
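The staleness check described above can be sketched as follows; `isOutdated` is an illustrative name, not the crawler's actual API:

```typescript
// A registration URL is outdated when its schema:dateRead is more than
// REGISTRATION_URL_TTL seconds in the past.
function isOutdated(dateRead: Date, ttlSeconds: number, now = new Date()): boolean {
  const ageSeconds = (now.getTime() - dateRead.getTime()) / 1000;
  return ageSeconds > ttlSeconds;
}

// With the default TTL of 86400 seconds (one day), a URL read two days ago
// is outdated; one read an hour ago is not.
const twoDaysAgo = new Date(Date.now() - 2 * 86400 * 1000);
const oneHourAgo = new Date(Date.now() - 3600 * 1000);
console.log(isOutdated(twoDaysAgo, 86400), isOutdated(oneHourAgo, 86400)); // true false
```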
Data model
schema:EntryPoint
Any URL registered by clients is added as a schema:EntryPoint to the
Registrations graph.
Datasets are fetched from this URL when it is registered and each time it is crawled.
| Property | Description |
| ------------- |----------------------------------------------------------------------------------------------------------------------------------------------|
| schema:datePosted | UTC datetime when the URL was registered. |
| schema:dateRead | UTC datetime when the URL was last read by the application. The crawler updates this value when fetching descriptions. |
| schema:status | The HTTP status code last encountered when fetching the URL. |
| schema:validUntil | If the URL has become invalid, the UTC datetime at which it did so. |
| schema:about | The set of schema:Datasets that the URL contains. The crawler updates this value when fetching descriptions. |
schema:Dataset
Each dataset that is found at the schema:EntryPoint registration URL gets added as a
schema:Dataset to the
Registrations graph.
| Property | Description |
|----------------------------------------------------|-----------------------------------------------------------------|
| schema:dateRead | UTC datetime when the dataset was last read by the application. |
| schema:subjectOf | From which registration URL the dataset was read. |
dcat:Dataset
When a dataset’s RDF description is fetched and validated, it is added as a dcat:Dataset to its own graph. The URL
of the graph corresponds to the dataset’s IRI.
If the dataset’s description is provided in Schema.org rather than DCAT, the description is first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.
| Property | Description | Based on |
| -------- | ----------- | -------- |
| dct:title | Dataset title. | schema:name |
| dct:alternative | Dataset alternate title. | schema:alternateName |
| dct:identifier | Dataset identifier. | schema:identifier |
| dct:description | Dataset description. | schema:description |
| dct:license | Dataset license. | schema:license |
| dct:language | Language(s) in which the dataset is available. | schema:inLanguage |
| dcat:keyword | Keywords or tags that describe the dataset. | schema:keywords |
| dcat:landingPage | URL of a webpage where the dataset is described. | schema:mainEntityOfPage |
| dct:source | URL(s) of datasets the dataset is based on. | schema:isBasedOn |
| dct:created | Dataset creation date. | schema:dateCreated |
| dct:issued | Dataset publication date. | schema:datePublished |
| dct:modified | Dataset last modification date. | schema:dateModified |
| owl:versionInfo | Dataset version. | schema:version |
| dct:creator | Dataset creator. | schema:creator |
| dct:publisher | Dataset publisher. | schema:publisher |
| dcat:distribution | Dataset distributions. | schema:distribution |
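The correspondence in the table above can be transcribed as a plain lookup table. This is only an illustration; the application's actual conversion operates on RDF, not on a JavaScript object:

```typescript
// Schema.org → DCAT property mapping, transcribed from the table above.
const schemaToDcat: Record<string, string> = {
  'schema:name': 'dct:title',
  'schema:alternateName': 'dct:alternative',
  'schema:identifier': 'dct:identifier',
  'schema:description': 'dct:description',
  'schema:license': 'dct:license',
  'schema:inLanguage': 'dct:language',
  'schema:keywords': 'dcat:keyword',
  'schema:mainEntityOfPage': 'dcat:landingPage',
  'schema:isBasedOn': 'dct:source',
  'schema:dateCreated': 'dct:created',
  'schema:datePublished': 'dct:issued',
  'schema:dateModified': 'dct:modified',
  'schema:version': 'owl:versionInfo',
  'schema:creator': 'dct:creator',
  'schema:publisher': 'dct:publisher',
  'schema:distribution': 'dcat:distribution',
};

console.log(schemaToDcat['schema:keywords']); // dcat:keyword
```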
foaf:Organization
The objects of both the dct:creator and dct:publisher dataset properties have type foaf:Organization.
If the dataset’s organizations are provided in Schema.org rather than DCAT, the organizations are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.
| Property | Description | Based on |
| -------- | ----------- | -------- |
| foaf:name | Organization name. | schema:name |
dcat:Distribution
The objects of dcat:distribution dataset properties have type dcat:Distribution.
If the dataset’s distributions are provided in Schema.org rather than DCAT, the distributions are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.
| Property | Description | Based on |
|-----------------------------------------------------------|--------------------------------------------------------------------| -------- |
| dcat:accessURL | Distribution URL. | schema:contentUrl |
| dcat:mediaType | Distribution’s IANA media type. | schema:fileFormat |
| dct:conformsTo | <https://www.w3.org/TR/sparql11-protocol/> for SPARQL endpoints. | schema:encodingFormat |
| dct:issued | Distribution publication date. | schema:datePublished |
| dct:modified | Distribution last modification date. | schema:dateModified |
| dct:description | Distribution description. | schema:description |
| dct:language | Distribution language. | schema:inLanguage |
| dct:license | Distribution license. | schema:license |
| dct:title | Distribution title. | schema:name |
| dcat:byteSize | Distribution’s download size in bytes. | schema:contentSize |
Allow list
A registration URL must be on an allowed domain before it can be added to the Register. Allowed domains are administered in the https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names RDF graph.
To add a URL:
INSERT DATA {
  GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names> {
    [] <https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name> "your-domain.com" .
  }
}
Owner
- Name: Netwerk Digitaal Erfgoed
- Login: netwerk-digitaal-erfgoed
- Kind: organization
- Website: https://www.netwerkdigitaalerfgoed.nl
- Repositories: 71
- Profile: https://github.com/netwerk-digitaal-erfgoed
Dutch Digital Heritage Network
CodeMeta (codemeta.json)
{
"@context": [
"https://doi.org/10.5063/schema/codemeta-2.0",
"https://w3id.org/software-iodata",
"https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
"https://schema.org",
"https://w3id.org/software-types"
],
"@type": "SoftwareSourceCode",
"applicationCategory": {
"@id": "https://vocabs.dariah.eu/tadirah/discovering",
"http://www.w3.org/2004/02/skos/core#inScheme": {
"@id": "https://vocabs.dariah.eu/tadirah/"
}
},
"author": [
{
"@type": "Person",
"email": "david@ddeboer.nl",
"familyName": "de Boer",
"givenName": "David"
},
{
"@type": "Person",
"email": "bob.coret+github@gmail.com",
"familyName": "Coret",
"givenName": "Bob"
}
],
"codeRepository": "https://github.com/netwerk-digitaal-erfgoed/dataset-register.git",
"contIntegration": "https://github.com/netwerk-digitaal-erfgoed/dataset-register/actions",
"dateCreated": "2020-12-14T21:59:30Z+0100",
"description": "Live index of heritage datasets",
"developmentStatus": [
"https://www.repostatus.org/#active",
{
"@id": "https://w3id.org/research-technology-readiness-levels#Level9Proven",
"http://www.w3.org/2004/02/skos/core#inScheme": {
"@id": "https://w3id.org/research-technology-readiness-levels"
}
}
],
"issueTracker": "https://github.com/netwerk-digitaal-erfgoed/dataset-register/issues",
"funding": {
"@type": "Grant",
"funder": {
"@id": "https://identifier.overheid.nl/tooi/id/ministerie/mnre1109",
"@type": "Organization",
"name": "Ministerie van Onderwijs, Cultuur en Wetenschap",
"url": "https://www.rijksoverheid.nl/ministeries/ministerie-van-onderwijs-cultuur-en-wetenschap"
}
},
"keywords": [
"nde",
"datasets"
],
"license": "https://spdx.org/licenses/EUPL-1.2",
"name": "Dataset Register",
"producer": {
"@type": "Organization",
"name": "Netwerk Digitaal Erfgoed",
"url": "https://www.netwerkdigitaalerfgoed.nl"
},
"programmingLanguage": "TypeScript",
"readme": "https://github.com/netwerk-digitaal-erfgoed/dataset-register#readme",
"runtimePlatform": "node >=16",
"softwareHelp": {
"@id": "https://datasetregister.netwerkdigitaalerfgoed.nl/?lang=en",
"@type": "WebSite",
"url": "https://datasetregister.netwerkdigitaalerfgoed.nl/?lang=en"
},
"targetProduct": [
{
"@id": "https://datasetregister.netwerkdigitaalerfgoed.nl/api",
"@type": "WebApplication",
"name": "Dataset Register OpenAPI",
"url": "https://datasetregister.netwerkdigitaalerfgoed.nl/api"
}
],
"url": "https://github.com/netwerk-digitaal-erfgoed/dataset-register.git"
}
GitHub Events
Total
- Issues event: 67
- Watch event: 1
- Delete event: 98
- Issue comment event: 49
- Push event: 293
- Pull request review comment event: 1
- Pull request review event: 2
- Pull request event: 194
- Create event: 102
Last Year
- Issues event: 67
- Watch event: 1
- Delete event: 98
- Issue comment event: 49
- Push event: 293
- Pull request review comment event: 1
- Pull request review event: 2
- Pull request event: 194
- Create event: 102
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 141
- Total pull requests: 328
- Average time to close issues: 3 months
- Average time to close pull requests: 9 days
- Total issue authors: 8
- Total pull request authors: 3
- Average comments per issue: 1.38
- Average comments per pull request: 0.17
- Merged pull requests: 255
- Bot issues: 4
- Bot pull requests: 254
Past Year
- Issues: 38
- Pull requests: 163
- Average time to close issues: 21 days
- Average time to close pull requests: 3 days
- Issue authors: 6
- Pull request authors: 3
- Average comments per issue: 0.39
- Average comments per pull request: 0.09
- Merged pull requests: 131
- Bot issues: 3
- Bot pull requests: 144
Top Authors
Issue Authors
- coret (80)
- ddeboer (49)
- dependabot[bot] (4)
- mcoonen (2)
- faina007 (2)
- ivozandhuis (2)
- EnnoMeijers (1)
- AlexHaan-i (1)
Pull Request Authors
- dependabot[bot] (254)
- ddeboer (66)
- coret (8)
Dependencies
- 997 dependencies
- @rdfjs/types ^1.0.1 development
- @types/jest ^27.0.2 development
- @types/node ^18.0.0 development
- @types/node-schedule ^2.1.0 development
- @types/psl ^1.1.0 development
- @types/rdf-dataset-ext ^1.0.1 development
- @types/rdf-ext ^1.3.9 development
- @types/rdfjs__dataset ^2.0.0 development
- chokidar ^3.5.2 development
- gts ^3.0.3 development
- jest ^27.3.1 development
- jest-coverage-thresholds-bumper ^1.0.0 development
- jsonld-streaming-parser ^3.0.0 development
- microdata-rdf-streaming-parser ^1.2.0 development
- nock ^13.2.6 development
- rdfa-streaming-parser ^1.5.0 development
- ts-jest ^27.0.7 development
- ts-node ^10.4.0 development
- tsc-watch ^5.0.2 development
- typescript ^4.3.2 development
- @comunica/data-factory ^2.0.1
- @comunica/query-sparql ^2.0.5
- @fastify/accepts-serializer ^5.0.0
- @fastify/cors ^8.0.0
- @fastify/swagger ^7.3.0
- @rdfjs/dataset ^2.0.0
- @types/rdf-validate-shacl ^0.4.0
- asynciterator ^3.2.0
- fastify ^4.0.1
- graphdb ^2.0.0
- n3 ^1.10.0
- node-fetch ^3.1.0
- node-schedule ^2.0.0
- pino ^8.0.0
- psl ^1.8.0
- rdf-dataset-ext ^1.0.0
- rdf-dereference ^2.0.0
- rdf-ext ^2.0.1
- rdf-js ^4.0.2
- rdf-serialize ^2.0.0
- rdf-validate-shacl ^0.4.0
- dependabot/fetch-metadata v1.3.5 composite
- actions/checkout v3 composite
- digitalocean/action-doctl v2 composite
- docker/build-push-action v3 composite
- docker/login-action v2 composite
- docker/setup-buildx-action v2 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- node lts-alpine build