https://github.com/acdh-oeaw/urinormalizer

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

arche
Last synced: 6 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: acdh-oeaw
  • License: mit
  • Language: PHP
  • Default Branch: master
  • Homepage:
  • Size: 48.8 KB
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 0
  • Open Issues: 0
  • Releases: 10
Topics
arche
Created over 6 years ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md

URI Normalizer

A class for normalizing named entity URIs from services like Geonames, GND, VIAF, ORCID, etc. and retrieving RDF metadata from them.

By default the rules from the arche-assets library are used, but you can supply your own.

Any PSR-16 compatible cache can be used to speed up normalization/retrieval of recurring URIs. A combined in-memory and persistent SQLite-based cache implementation is provided as well.
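To illustrate why combining an in-memory tier with a persistent tier helps, here is a minimal sketch. It is not the package's UriNormalizerCache and does not implement PSR-16: the class name `TwoTierCacheSketch` is hypothetical, and a JSON file stands in for the SQLite database purely to keep the example dependency-free.

```php
<?php
// Illustration only: NOT the package's cache implementation.
// Assumption: a JSON file replaces SQLite as the persistent tier.
final class TwoTierCacheSketch
{
    /** @var array<string, string> fast tier, lives only as long as this object */
    private $memory = [];

    /** @var string path of the JSON file acting as the persistent tier */
    private $file;

    public function __construct(string $file)
    {
        $this->file = $file;
    }

    /**
     * Returns the cached value for $key, calling $compute only when
     * neither tier already holds it.
     */
    public function get(string $key, callable $compute): string
    {
        // 1. cheapest lookup: the in-memory tier
        if (isset($this->memory[$key])) {
            return $this->memory[$key];
        }
        // 2. persistent tier, shared between instances and processes
        $stored = $this->readFile();
        if (!isset($stored[$key])) {
            // 3. the slow path, e.g. an HTTP request
            $stored[$key] = $compute($key);
            file_put_contents($this->file, json_encode($stored));
        }
        return $this->memory[$key] = $stored[$key];
    }

    private function readFile(): array
    {
        if (!is_file($this->file)) {
            return [];
        }
        return json_decode((string) file_get_contents($this->file), true) ?: [];
    }
}
```

A second `TwoTierCacheSketch` instance pointing at the same file skips the slow path for keys the first instance already stored, which is the benefit a persistent tier adds over a purely in-memory cache.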

Context

When working with named entity database services, it is often difficult to tell which URL is the canonical URI for a given named entity.

Take a quick look at some (there are definitely more) of the Geonames URLs describing exactly the same Geonames named entity with id 2761369:

  • http://geonames.org/2761369
  • https://geonames.org/2761369
  • http://www.geonames.org/2761369
  • https://www.geonames.org/2761369
  • http://geonames.org/2761369/vienna
  • https://geonames.org/2761369/vienna
  • http://www.geonames.org/2761369/vienna
  • https://www.geonames.org/2761369/vienna
  • https://www.geonames.org/2761369/vienna/about.rdf
  • https://www.geonames.org/2761369/vienna.html

Which one of them is the right one? The answer is quite simple: the one used as the subject of the RDF triples in the RDF metadata returned by the service. So the first aim of this package is to provide a tool for transforming any URL coming from a given service into the canonical URI used by that service in the RDF metadata it returns.
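The idea can be sketched with a single regular-expression rule. The rule below is hypothetical: it is written for this illustration and is not copied from arche-assets, and `normalizeUrl()` is an assumed helper name, not part of the package. It only shows how all the URL variants listed above can collapse into one canonical form.

```php
<?php
// Hypothetical rule set, for illustration only (not the arche-assets rules).
const GEONAMES_RULES = [
    [
        // optional subdomain, numeric id, then any trailing path
        'match'   => '#^https?://(?:[a-z]+\.)?geonames\.org/([0-9]+)(?:/.*)?$#',
        'replace' => 'https://sws.geonames.org/\1/',
    ],
];

/**
 * Applies the first matching rule and returns the rewritten URL.
 * URLs matched by no rule are returned unchanged.
 */
function normalizeUrl(string $url, array $rules): string
{
    foreach ($rules as $rule) {
        $count = 0;
        $normalized = preg_replace($rule['match'], $rule['replace'], $url, 1, $count);
        if ($count > 0) {
            return $normalized;
        }
    }
    return $url;
}

// every variant from the list above maps to the same URI
echo normalizeUrl('https://www.geonames.org/2761369/vienna/about.rdf', GEONAMES_RULES) . "\n";
echo normalizeUrl('http://geonames.org/2761369', GEONAMES_RULES) . "\n";
```

Because the canonical URI itself also matches the rule and maps to itself, applying the function twice gives the same result as applying it once.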

But here we come to another issue - how to fetch the RDF metadata for a given named entity knowing its URI?

For some services (like ORCID or VIAF) it can be done with plain HTTP content negotiation, by requesting the response in one of the supported RDF formats. For others, though, you need to know a service-specific retrieval method, e.g. in Geonames you need to append /about.rdf to the canonical URI. The second aim of this package is to allow you to retrieve RDF metadata from named entity URIs/URLs without being bothered by all those service-specific peculiarities. And as such retrieval takes quite some time, a caching option is also provided.
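The difference between the two retrieval styles can be sketched as follows. `describeRdfRequest()` is a hypothetical helper written for this illustration, not a package method. It performs no network I/O and only reports which URL and headers a request would use. The `application/rdf+xml` Accept value and the `example.org` URI are assumptions chosen for the example.

```php
<?php
/**
 * Describes the HTTP request needed to get RDF for a canonical URI.
 * Illustration only: returns the URL and headers, sends nothing.
 */
function describeRdfRequest(string $canonicalUri): array
{
    // Geonames: service-specific rule, append about.rdf to the canonical URI
    if (strpos($canonicalUri, 'https://sws.geonames.org/') === 0) {
        return [
            'url'     => rtrim($canonicalUri, '/') . '/about.rdf',
            'headers' => [],
        ];
    }
    // Default: plain content negotiation via the Accept header
    // (assumed format; a real service may prefer another RDF serialization)
    return [
        'url'     => $canonicalUri,
        'headers' => ['Accept' => 'application/rdf+xml'],
    ];
}

print_r(describeRdfRequest('https://sws.geonames.org/2761369/'));
print_r(describeRdfRequest('https://example.org/entity/1'));
```

Keeping such per-service knowledge in one place is what lets calling code treat every named entity URI the same way.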

Automatically generated documentation

https://acdh-oeaw.github.io/arche-docs/devdocs/classes/acdhOeaw-UriNormalizer.html

Installation

composer require acdh-oeaw/uri-normalizer

Usage

Initialization

```php
$normalizer = new \acdhOeaw\UriNormalizer();
```

String URL normalization

```php
// returns 'https://sws.geonames.org/2761369/'
echo $normalizer->normalize('http://geonames.org/2761369/vienna.html');
```

EasyRdf resource property normalization

```php
$property = 'https://some.id/property';
$graph = new EasyRdf\Graph();
$resource = $graph->resource('.');
$resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
$normalizer->normalizeMeta($resource, $property);
// returns 'https://sws.geonames.org/276136/'
echo (string) $resource->getResource($property);
```

Retrieve parsed/raw RDF metadata from URI/URL

```php
// print parsed RDF metadata retrieved from Geonames
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
echo $metadata->dump('text') . "\n";

// get a PSR-7 request fetching the RDF metadata for a given Geonames URL
$request = $normalizer->resolve('http://geonames.org/2761369/vienna.html');
echo $request->getUri() . "\n";
```

Use your own normalization rules

You can also supply a custom Guzzle HTTP client (or any other PSR-18 client), for example one that handles authentication.

```php
$rules = [
    [
        "match"   => "^https://(?:my.)own.namespace/([0-9]+)(?:/.*)?$",
        // the backslash is doubled so PHP does not read \1 as an octal escape
        "replace" => "https://own.namespace/\\1",
        "resolve" => "https://own.namespace/\\1",
        "format"  => "application/n-triples",
    ],
];
$client = new \GuzzleHttp\Client(['auth' => ['login', 'password']]);
$cache = false;
$normalizer = new \acdhOeaw\UriNormalizer($rules, '', $client, $cache);
// returns 'https://own.namespace/123'
echo $normalizer->normalize('https://my.own.namespace/123/foo');
// obviously won't work, but if https://own.namespace existed,
// it would be queried with the HTTP BASIC auth as set up above
$normalizer->fetch('https://my.own.namespace/123/foo');
```

Use cache

```php
$cache = new \acdhOeaw\UriNormalizerCache('db.sqlite');
$normalizer = new \acdhOeaw\UriNormalizer(cache: $cache);

// first retrieval should take 0.1-1 second depending on your connection speed
$t = microtime(true);
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";

// second retrieval should be very quick thanks to the in-memory cache
$t = microtime(true);
$metadata = $normalizer->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";

// a completely separate UriNormalizer instance still benefits
// from the persistent sqlite cache
$cache2 = new \acdhOeaw\UriNormalizerCache('db.sqlite');
$normalizer2 = new \acdhOeaw\UriNormalizer(cache: $cache2);
$t = microtime(true);
$metadata = $normalizer2->fetch('http://geonames.org/2761369/vienna.html');
$t = (microtime(true) - $t);
echo $metadata->dump('text') . "\ntime: $t s\n";
```

As a global singleton

```php
// initialization is done with init() instead of a constructor
// init() takes the same parameters as the constructor
\acdhOeaw\UriNormalizer::init();

// all other methods (gNormalize(), gFetch() and gResolve()) work in the same
// way and take the same parameters as their non-static counterparts

// returns 'https://sws.geonames.org/2761369/'
echo \acdhOeaw\UriNormalizer::gNormalize('http://geonames.org/2761369/vienna.html');

// fetch and cache parsed RDF metadata
echo \acdhOeaw\UriNormalizer::gFetch('http://geonames.org/2761369/vienna.html')->dump('text');

// fetch and cache raw RDF metadata
echo \acdhOeaw\UriNormalizer::gResolve('http://geonames.org/2761369/vienna.html')->getBody();

// normalize EasyRdf Resource property
$property = 'https://some.id/property';
$graph = new EasyRdf\Graph();
$resource = $graph->resource('.');
$resource->addResource($property, 'http://aaa.geonames.org/276136/borj-ej-jaaiyat.html');
\acdhOeaw\UriNormalizer::gNormalizeMeta($resource, $property);
// returns 'https://sws.geonames.org/276136/'
echo (string) $resource->getResource($property);
```

Owner

  • Name: Austrian Centre for Digital Humanities & Cultural Heritage
  • Login: acdh-oeaw
  • Kind: organization
  • Email: acdh@oeaw.ac.at
  • Location: Vienna, Austria

GitHub Events

Total
  • Create event: 3
  • Release event: 2
  • Issues event: 1
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 1
Last Year
  • Create event: 3
  • Release event: 2
  • Issues event: 1
  • Delete event: 2
  • Issue comment event: 1
  • Push event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 46
  • Total Committers: 1
  • Avg Commits per committer: 46.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 8
  • Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Mateusz Żółtak z****k@z****g 46
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zozlak (3)
Pull Request Authors
Top Labels
Issue Labels
enhancement (2) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • packagist 6,097 total
  • Total dependent packages: 5
  • Total dependent repositories: 3
  • Total versions: 10
  • Total maintainers: 1
packagist.org: acdh-oeaw/uri-normalizer

A simple class for normalizing external entity reference sources' URIs (Geonames, GND, etc. URIs).

  • Versions: 10
  • Dependent Packages: 5
  • Dependent Repositories: 3
  • Downloads: 6,097 Total
Rankings
Dependent packages count: 3.7%
Dependent repos count: 16.9%
Downloads: 17.6%
Average: 20.5%
Forks count: 27.0%
Stargazers count: 37.1%
Maintainers (1)
Funding
Last synced: 6 months ago

Dependencies

composer.json packagist
  • phpunit/phpunit ^9 development
  • acdh-oeaw/arche-assets *
  • acdh-oeaw/easyrdf *
  • php >= 7.1
.github/workflows/test.yml actions
  • actions/checkout v2 composite