dataset-register

Components (API and crawler) for the NDE Dataset Register

https://github.com/netwerk-digitaal-erfgoed/dataset-register

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary

Keywords

dataset-register open-data rdf
Last synced: 4 months ago

Repository

Components (API and crawler) for the NDE Dataset Register

Basic Info
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 3
  • Open Issues: 116
  • Releases: 0
Topics
dataset-register open-data rdf
Created about 5 years ago · Last pushed 4 months ago
Metadata Files
Readme License Codemeta

README.md

Dataset Register

This is the NDE Dataset Register, a service that helps users find and discover datasets.

Institutions (such as cultural heritage organizations) register their dataset descriptions with the NDE Dataset Register using its HTTP API. The Dataset Register builds an index by fetching, validating and periodically crawling dataset descriptions.

The HTTP API is documented at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

See the Dataset Register Demonstrator, a client application for this repository’s HTTP API, for more background information (in Dutch).

Design principles

  1. The application follows modern standards and best practices.
  2. The application uses Linked Data Platform (LDP) for HTTP operations.
  3. The application prefers JSON-LD as the data exchange format.
  4. The application uses established Linked Data vocabularies, including Schema.org and DCAT.

Getting started

Validate dataset descriptions

Dataset descriptions must adhere to the Requirements for Datasets. You can check validity using the validate API call.

Submit dataset descriptions

To submit your dataset descriptions to the Dataset Register, use the datasets API call. URLs must be allowed before they can be added to the Register.
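As a rough illustration, a submission could be built as an HTTP request like the sketch below. The endpoint path and payload shape here are assumptions for illustration only; the authoritative contract is the API documentation at https://datasetregister.netwerkdigitaalerfgoed.nl/api.

```typescript
// Hypothetical sketch of building the request for the `datasets` API call.
// The endpoint path and body shape are ASSUMPTIONS, not the documented API.
interface RegistrationRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildRegistrationRequest(datasetDescriptionUrl: string): RegistrationRequest {
  return {
    method: 'POST',
    // Assumed path; check the OpenAPI documentation for the real one.
    url: 'https://datasetregister.netwerkdigitaalerfgoed.nl/datasets',
    // The Register prefers JSON-LD as its exchange format.
    headers: {'Content-Type': 'application/ld+json'},
    // Minimal illustrative payload referencing the dataset description URL.
    body: JSON.stringify({'@id': datasetDescriptionUrl}),
  };
}

const request = buildRegistrationRequest('https://example.org/datasets.ttl');
console.log(request.body); // {"@id":"https://example.org/datasets.ttl"}
```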

Search dataset descriptions

You can retrieve dataset descriptions registered by yourself and others from the SPARQL endpoint at https://datasetregister.netwerkdigitaalerfgoed.nl/sparql.

For example using Comunica:

comunica-sparql sparql@https://datasetregister.netwerkdigitaalerfgoed.nl/sparql 'select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100'

Or curl:

curl -H Accept:application/sparql-results+json --data-urlencode 'query=select * {?s a <http://www.w3.org/ns/dcat#Dataset> . ?s ?p ?o . } limit 100' https://datasetregister.netwerkdigitaalerfgoed.nl/sparql 
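The same query can also be sent from TypeScript using the SPARQL Protocol's form-encoded POST. This is a minimal sketch assuming Node 18+ (built-in fetch); `runQuery` is an illustrative helper name.

```typescript
// Minimal sketch: query the public SPARQL endpoint over the SPARQL Protocol.
const endpoint = 'https://datasetregister.netwerkdigitaalerfgoed.nl/sparql';
const query = `
  SELECT * {
    ?s a <http://www.w3.org/ns/dcat#Dataset> .
    ?s ?p ?o .
  } LIMIT 100`;

async function runQuery(): Promise<void> {
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      // Form-encoded query per the SPARQL 1.1 Protocol.
      'Content-Type': 'application/x-www-form-urlencoded',
      Accept: 'application/sparql-results+json',
    },
    body: new URLSearchParams({query}).toString(),
  });
  const results = await response.json();
  console.log(results.results.bindings.length);
}
```

Calling `runQuery()` performs the actual network request against the live endpoint.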

Automate registrations

If you want to automate dataset description registrations by connecting your (collection management) application to the Dataset Register, please see the HTTP API documentation.

Run the application

To run the application yourself (for instance if you’d like to contribute, which you’re very welcome to do), follow these steps. (As mentioned above, find the hosted version at https://datasetregister.netwerkdigitaalerfgoed.nl/api).

This application stores data in a QLever SPARQL store, so you need to have that running locally:

docker compose up

When it runs, you can start the application in development mode. Clone this repository and run:

npm install
npx nx serve api

You can open a local QLever UI at http://localhost:7002/default.

Run in production

To run the application in production, first compile and then run it. You may want to disable logging, which is enabled by default:

npx nx build api --configuration=production
LOG=false npm start

Configuration

You can configure the application through environment variables:

  • SPARQL_URL: URL to the SPARQL store.
  • SPARQL_ACCESS_TOKEN: access token for write operations on SPARQL Store.
  • LOG: enable/disable logging (default: true).
  • CRAWLER_SCHEDULE: a schedule in Cron format; for example 0 * * * * to crawl every hour (default: crawling disabled).
  • REGISTRATION_URL_TTL: if crawling is enabled, a registered URL’s maximum age (in seconds) before it is fetched again (default: 86400, so one day).
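The variables and defaults above can be read as in this sketch. The variable names and default values come from the list; the `Config` shape and `readConfig` helper are illustrative, not the application's actual code.

```typescript
// Illustrative sketch of reading the documented environment variables.
interface Config {
  sparqlUrl?: string;
  sparqlAccessToken?: string;
  log: boolean;
  crawlerSchedule?: string;   // Cron expression; crawling disabled if unset
  registrationUrlTtl: number; // seconds
}

function readConfig(env: Record<string, string | undefined>): Config {
  return {
    sparqlUrl: env.SPARQL_URL,
    sparqlAccessToken: env.SPARQL_ACCESS_TOKEN,
    log: env.LOG !== 'false',                                      // default: true
    crawlerSchedule: env.CRAWLER_SCHEDULE,                         // default: disabled
    registrationUrlTtl: Number(env.REGISTRATION_URL_TTL ?? 86400), // default: one day
  };
}
```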

Run the tests

The tests are run automatically on CI.

To run the tests locally, clone this repository, then:

npm install
npm test

Components

Crawler

The crawler will periodically fetch registration URLs (schema:EntryPoint) to update the dataset descriptions stored in the Dataset Register.

To enable the crawler, set the CRAWLER_SCHEDULE configuration variable. The crawler will then check all registration URLs according to that schedule to see if any of the URLs have become outdated. A registration URL is considered outdated if it has been last read longer than REGISTRATION_URL_TTL ago (its schema:dateRead is older).

If any outdated registration URLs are found, they are fetched and updated in the SPARQL store.
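The staleness check described above can be sketched as follows; the function name and signature are illustrative, not the crawler's actual code.

```typescript
// Sketch of the outdated-URL check: a registration URL is outdated when its
// schema:dateRead is more than REGISTRATION_URL_TTL seconds in the past.
function isOutdated(dateRead: Date, ttlSeconds: number, now: Date = new Date()): boolean {
  const ageSeconds = (now.getTime() - dateRead.getTime()) / 1000;
  return ageSeconds > ttlSeconds;
}

// With the default TTL of 86400 s (one day), a URL last read two days ago
// is outdated and will be fetched again:
isOutdated(new Date('2024-01-01T00:00:00Z'), 86400, new Date('2024-01-03T00:00:00Z')); // true
```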

Data model

schema:EntryPoint

Any URL registered by clients is added as a schema:EntryPoint to the Registrations graph.

Datasets are fetched from this URL on registration and on each subsequent crawl.

| Property | Description |
|----------|-------------|
| schema:datePosted | UTC datetime when the URL was registered. |
| schema:dateRead | UTC datetime when the URL was last read by the application. The crawler updates this value when fetching descriptions. |
| schema:status | The HTTP status code last encountered when fetching the URL. |
| schema:validUntil | If the URL has become invalid, the UTC datetime at which it did so. |
| schema:about | The set of schema:Datasets that the URL contains. The crawler updates this value when fetching descriptions. |

schema:Dataset

Each dataset that is found at the schema:EntryPoint registration URL gets added as a schema:Dataset to the Registrations graph.

| Property | Description |
|----------|-------------|
| schema:dateRead | UTC datetime when the dataset was last read by the application. |
| schema:subjectOf | The registration URL from which the dataset was read. |

dcat:Dataset

When a dataset’s RDF description is fetched and validated, it is added as a dcat:Dataset to its own graph. The URL of the graph corresponds to the dataset’s IRI.

If the dataset’s description is provided in Schema.org rather than DCAT, the description is first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|----------|-------------|----------|
| dct:title | Dataset title. | schema:name |
| dct:alternative | Dataset alternate title. | schema:alternateName |
| dct:identifier | Dataset identifier. | schema:identifier |
| dct:description | Dataset description. | schema:description |
| dct:license | Dataset license. | schema:license |
| dct:language | Language(s) in which the dataset is available. | schema:inLanguage |
| dcat:keyword | Keywords or tags that describe the dataset. | schema:keywords |
| dcat:landingPage | URL of a webpage where the dataset is described. | schema:mainEntityOfPage |
| dct:source | URL(s) of datasets the dataset is based on. | schema:isBasedOn |
| dct:created | Dataset creation date. | schema:dateCreated |
| dct:issued | Dataset publication date. | schema:datePublished |
| dct:modified | Dataset last modification date. | schema:dateModified |
| owl:versionInfo | Dataset version. | schema:version |
| dct:creator | Dataset creator. | schema:creator |
| dct:publisher | Dataset publisher. | schema:publisher |
| dcat:distribution | Dataset distributions. | schema:distribution |
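The property mapping in the table above can be expressed as a lookup table. This is purely an illustration of the correspondence; it is not the application's actual conversion code, which operates on RDF descriptions.

```typescript
// Illustrative Schema.org-to-DCAT property correspondence from the table above.
const schemaToDcat: Record<string, string> = {
  'schema:name': 'dct:title',
  'schema:alternateName': 'dct:alternative',
  'schema:identifier': 'dct:identifier',
  'schema:description': 'dct:description',
  'schema:license': 'dct:license',
  'schema:inLanguage': 'dct:language',
  'schema:keywords': 'dcat:keyword',
  'schema:mainEntityOfPage': 'dcat:landingPage',
  'schema:isBasedOn': 'dct:source',
  'schema:dateCreated': 'dct:created',
  'schema:datePublished': 'dct:issued',
  'schema:dateModified': 'dct:modified',
  'schema:version': 'owl:versionInfo',
  'schema:creator': 'dct:creator',
  'schema:publisher': 'dct:publisher',
  'schema:distribution': 'dcat:distribution',
};

console.log(schemaToDcat['schema:name']); // dct:title
```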

foaf:Organization

The objects of a dataset's dct:creator and dct:publisher properties have type foaf:Organization.

If the dataset’s organizations are provided in Schema.org rather than DCAT, the organizations are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|----------|-------------|----------|
| foaf:name | Organization name. | schema:name |

dcat:Distribution

The objects of a dataset's dcat:distribution properties have type dcat:Distribution.

If the dataset’s distributions are provided in Schema.org rather than DCAT, the distributions are first converted to DCAT. The ‘Based on’ column shows the corresponding Schema.org property. See the Requirements for Datasets for more details.

| Property | Description | Based on |
|----------|-------------|----------|
| dcat:accessURL | Distribution URL. | schema:contentUrl |
| dcat:mediaType | Distribution's IANA media type. | schema:fileFormat |
| dct:conformsTo | <https://www.w3.org/TR/sparql11-protocol/> for SPARQL endpoints. | schema:encodingFormat |
| dct:issued | Distribution publication date. | schema:datePublished |
| dct:modified | Distribution last modification date. | schema:dateModified |
| dct:description | Distribution description. | schema:description |
| dct:language | Distribution language. | schema:inLanguage |
| dct:license | Distribution license. | schema:license |
| dct:title | Distribution title. | schema:name |
| dcat:byteSize | Distribution's download size in bytes. | schema:contentSize |

Allow list

A registration URL must be on a domain that is allowed before it can be added to the Register. Allowed domains are administered in the https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names RDF graph.

To add a URL:

INSERT DATA {
  GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names> {
    [] <https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name> "your-domain.com" .
  }
}
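For scripting, the update above can be generated for a given domain name as in this sketch. The helper name is hypothetical; note that the domain is interpolated into the query string, so real code should validate or escape it first.

```typescript
// Sketch: generate the INSERT DATA update shown above for one domain name.
const ALLOW_LIST_GRAPH =
  'https://data.netwerkdigitaalerfgoed.nl/registry/allowed_domain_names';
const DOMAIN_NAME_PREDICATE =
  'https://data.netwerkdigitaalerfgoed.nl/allowed_domain_names/def/domain_name';

function allowDomainUpdate(domainName: string): string {
  // Caution: domainName is interpolated unescaped; validate it in real use.
  return `INSERT DATA {
  GRAPH <${ALLOW_LIST_GRAPH}> {
    [] <${DOMAIN_NAME_PREDICATE}> "${domainName}" .
  }
}`;
}

console.log(allowDomainUpdate('example.org'));
```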

Owner

  • Name: Netwerk Digitaal Erfgoed
  • Login: netwerk-digitaal-erfgoed
  • Kind: organization

Dutch Digital Heritage Network

CodeMeta (codemeta.json)

{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "https://w3id.org/software-iodata",
    "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
    "https://schema.org",
    "https://w3id.org/software-types"
  ],
  "@type": "SoftwareSourceCode",
  "applicationCategory": {
    "@id": "https://vocabs.dariah.eu/tadirah/discovering",
    "http://www.w3.org/2004/02/skos/core#inScheme": {
      "@id": "https://vocabs.dariah.eu/tadirah/"
    }
  },
  "author": [
    {
      "@type": "Person",
      "email": "david@ddeboer.nl",
      "familyName": "de Boer",
      "givenName": "David"
    },
    {
      "@type": "Person",
      "email": "bob.coret+github@gmail.com",
      "familyName": "Coret",
      "givenName": "Bob"
    }
  ],
  "codeRepository": "https://github.com/netwerk-digitaal-erfgoed/dataset-register.git",
  "contIntegration": "https://github.com/netwerk-digitaal-erfgoed/dataset-register/actions",
  "dateCreated": "2020-12-14T21:59:30Z+0100",
  "description": "Live index of heritage datasets",
  "developmentStatus": [
    "https://www.repostatus.org/#active",
    {
      "@id": "https://w3id.org/research-technology-readiness-levels#Level9Proven",
      "http://www.w3.org/2004/02/skos/core#inScheme": {
        "@id": "https://w3id.org/research-technology-readiness-levels"
      }
    }
  ],
  "issueTracker": "https://github.com/netwerk-digitaal-erfgoed/dataset-register/issues",
  "funding": {
    "@type": "Grant",
    "funder": {
      "@id": "https://identifier.overheid.nl/tooi/id/ministerie/mnre1109",
      "@type": "Organization",
      "name": "Ministerie van Onderwijs, Cultuur en Wetenschap",
      "url": "https://www.rijksoverheid.nl/ministeries/ministerie-van-onderwijs-cultuur-en-wetenschap"
    }
  },
  "keywords": [
    "nde",
    "datasets"
  ],
  "license": "https://spdx.org/licenses/EUPL-1.2",
  "name": "Dataset Register",
  "producer": {
    "@type": "Organization",
    "name": "Netwerk Digitaal Erfgoed",
    "url": "https://www.netwerkdigitaalerfgoed.nl"
  },
  "programmingLanguage": "TypeScript",
  "readme": "https://github.com/netwerk-digitaal-erfgoed/dataset-register#readme",
  "runtimePlatform": "node >=16",
  "softwareHelp": {
    "@id": "https://datasetregister.netwerkdigitaalerfgoed.nl/?lang=en",
    "@type": "WebSite",
    "url": "https://datasetregister.netwerkdigitaalerfgoed.nl/?lang=en"
  },
  "targetProduct": [
    {
      "@id": "https://datasetregister.netwerkdigitaalerfgoed.nl/api",
      "@type": "WebApplication",
      "name": "Dataset Register OpenAPI",
      "url": "https://datasetregister.netwerkdigitaalerfgoed.nl/api"
    }
  ],
  "url": "https://github.com/netwerk-digitaal-erfgoed/dataset-register.git"
}

GitHub Events

Total
  • Issues event: 67
  • Watch event: 1
  • Delete event: 98
  • Issue comment event: 49
  • Push event: 293
  • Pull request review comment event: 1
  • Pull request review event: 2
  • Pull request event: 194
  • Create event: 102
Last Year
  • Issues event: 67
  • Watch event: 1
  • Delete event: 98
  • Issue comment event: 49
  • Push event: 293
  • Pull request review comment event: 1
  • Pull request review event: 2
  • Pull request event: 194
  • Create event: 102

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 141
  • Total pull requests: 328
  • Average time to close issues: 3 months
  • Average time to close pull requests: 9 days
  • Total issue authors: 8
  • Total pull request authors: 3
  • Average comments per issue: 1.38
  • Average comments per pull request: 0.17
  • Merged pull requests: 255
  • Bot issues: 4
  • Bot pull requests: 254
Past Year
  • Issues: 38
  • Pull requests: 163
  • Average time to close issues: 21 days
  • Average time to close pull requests: 3 days
  • Issue authors: 6
  • Pull request authors: 3
  • Average comments per issue: 0.39
  • Average comments per pull request: 0.09
  • Merged pull requests: 131
  • Bot issues: 3
  • Bot pull requests: 144
Top Authors
Issue Authors
  • coret (80)
  • ddeboer (49)
  • dependabot[bot] (4)
  • mcoonen (2)
  • faina007 (2)
  • ivozandhuis (2)
  • EnnoMeijers (1)
  • AlexHaan-i (1)
Pull Request Authors
  • dependabot[bot] (254)
  • ddeboer (66)
  • coret (8)
Top Labels
Issue Labels
enhancement (9) question (6) dependencies (4) javascript (4) discuss (2) bug (1) wontfix (1) requirements (1) SHACL (1)
Pull Request Labels
dependencies (254) javascript (245) github_actions (8)

Dependencies

package-lock.json npm
  • 997 dependencies
package.json npm
  • @rdfjs/types ^1.0.1 development
  • @types/jest ^27.0.2 development
  • @types/node ^18.0.0 development
  • @types/node-schedule ^2.1.0 development
  • @types/psl ^1.1.0 development
  • @types/rdf-dataset-ext ^1.0.1 development
  • @types/rdf-ext ^1.3.9 development
  • @types/rdfjs__dataset ^2.0.0 development
  • chokidar ^3.5.2 development
  • gts ^3.0.3 development
  • jest ^27.3.1 development
  • jest-coverage-thresholds-bumper ^1.0.0 development
  • jsonld-streaming-parser ^3.0.0 development
  • microdata-rdf-streaming-parser ^1.2.0 development
  • nock ^13.2.6 development
  • rdfa-streaming-parser ^1.5.0 development
  • ts-jest ^27.0.7 development
  • ts-node ^10.4.0 development
  • tsc-watch ^5.0.2 development
  • typescript ^4.3.2 development
  • @comunica/data-factory ^2.0.1
  • @comunica/query-sparql ^2.0.5
  • @fastify/accepts-serializer ^5.0.0
  • @fastify/cors ^8.0.0
  • @fastify/swagger ^7.3.0
  • @rdfjs/dataset ^2.0.0
  • @types/rdf-validate-shacl ^0.4.0
  • asynciterator ^3.2.0
  • fastify ^4.0.1
  • graphdb ^2.0.0
  • n3 ^1.10.0
  • node-fetch ^3.1.0
  • node-schedule ^2.0.0
  • pino ^8.0.0
  • psl ^1.8.0
  • rdf-dataset-ext ^1.0.0
  • rdf-dereference ^2.0.0
  • rdf-ext ^2.0.1
  • rdf-js ^4.0.2
  • rdf-serialize ^2.0.0
  • rdf-validate-shacl ^0.4.0
.github/workflows/dependabot-auto-merge.yml actions
  • dependabot/fetch-metadata v1.3.5 composite
.github/workflows/deploy.yml actions
  • actions/checkout v3 composite
  • digitalocean/action-doctl v2 composite
  • docker/build-push-action v3 composite
  • docker/login-action v2 composite
  • docker/setup-buildx-action v2 composite
.github/workflows/qa.yml actions
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
Dockerfile docker
  • node lts-alpine build