https://github.com/dissco/dissco-core-digital-media-object-processor

Repository for the code of the digital media object processing service

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.1%) to scientific vocabulary

Last synced: 8 months ago

Repository


Basic Info

Host: GitHub
Owner: DiSSCo
License: apache-2.0
Language: Java
Default Branch: main
Size: 365 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 1

Archived

Created almost 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

This Repository Has Been Archived

17/06/2025 - The digital media processor has been archived. Its functionality has been merged with the specimen processor.

digital-media-object-processor

The digital media object processor can receive digital media objects from two different sources:
- Through the API, as a request to register a digital media object
- Through a Kafka queue, to register an annotation

Preparation

The digital media object processor handles the received objects as a batch. To avoid conflicts, we first ensure that the batch contains only unique objects. Objects which are updated twice in one batch are republished to the queue.

After this, we collect the digital specimen PID from the digital specimen database. The received data contains only the physical specimen id, so we need to look up the digital specimen id. If no digital specimen id is available in the database, we requeue the item because the specimen has not been processed yet.

Next, the system evaluates whether the received objects are new, updated or equal to the current objects. To evaluate this, the system takes the following steps:
- Check if the object is already in the system, based on the properties:
  - digital specimen id
  - media url

We assume that these two properties make the digital media object unique and can be used as keys. A single specimen cannot have two digital media objects with the same media url. However, the same media url can be used with a different specimen, for example when a single image shows multiple specimens.
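The batch uniqueness check described at the start of this section could look roughly like the sketch below. It is not taken from the repository: the `MediaEvent`, `MediaKey` and `SplitResult` types are hypothetical, and keying on the physical specimen id plus media url is an assumption, made because the digital specimen id is only looked up after this step.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BatchDeduplicator {

  // Hypothetical event shape; the real event carries more fields.
  record MediaEvent(String physicalSpecimenId, String mediaUrl, String payload) {}

  // Assumed key: the two properties that identify a media object within a batch.
  record MediaKey(String physicalSpecimenId, String mediaUrl) {}

  // Events to process now, and events to send back to the queue.
  record SplitResult(List<MediaEvent> unique, List<MediaEvent> republish) {}

  static SplitResult split(List<MediaEvent> batch) {
    Set<MediaKey> seen = new HashSet<>();
    List<MediaEvent> unique = new ArrayList<>();
    List<MediaEvent> republish = new ArrayList<>();
    for (MediaEvent event : batch) {
      MediaKey key = new MediaKey(event.physicalSpecimenId(), event.mediaUrl());
      // Set.add returns false when this key already occurred in the batch.
      if (seen.add(key)) {
        unique.add(event);
      } else {
        republish.add(event);
      }
    }
    return new SplitResult(unique, republish);
  }
}
```

With this sketch, the first event for a given key stays in the batch and every later event with the same key ends up in `republish`, so it is handled in a following batch.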

Evaluation with existing objects

We will now create three lists:
- A list with new items, when the digital media object cannot be found in the database
- A list with updated items, when a digital media object is found but differs from the newly received object
- A list with equal items, when the digital media object is found and is equal to the newly received object

These three lists are returned and processed in order.
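As an illustration of the three lists, the following minimal sketch classifies received objects against the stored ones. The `MediaObject`, `MediaKey` and `Classified` types and the `attributes` field are hypothetical simplifications, and the stored objects are passed in as a map instead of being read from the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MediaObjectClassifier {

  // Key built from the two identifying properties: digital specimen id and media url.
  record MediaKey(String digitalSpecimenId, String mediaUrl) {}

  // Hypothetical, simplified object; "attributes" stands in for all other content.
  record MediaObject(String digitalSpecimenId, String mediaUrl, String attributes) {
    MediaKey key() {
      return new MediaKey(digitalSpecimenId, mediaUrl);
    }
  }

  record Classified(
      List<MediaObject> newItems, List<MediaObject> updatedItems, List<MediaObject> equalItems) {}

  static Classified classify(List<MediaObject> received, Map<MediaKey, MediaObject> stored) {
    List<MediaObject> newItems = new ArrayList<>();
    List<MediaObject> updatedItems = new ArrayList<>();
    List<MediaObject> equalItems = new ArrayList<>();
    for (MediaObject candidate : received) {
      MediaObject current = stored.get(candidate.key());
      if (current == null) {
        newItems.add(candidate); // not found in the stored objects
      } else if (current.equals(candidate)) {
        equalItems.add(candidate); // found and identical
      } else {
        updatedItems.add(candidate); // found but different
      }
    }
    return new Classified(newItems, updatedItems, equalItems);
  }
}
```

The record `equals` method compares all components, which is enough for this sketch; a real comparison would probably need to ignore fields such as version and timestamps.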

New digital media objects

For new digital media objects, we create a new Handle and transfer the object to a record (adding version and timestamp). We then push the newly created records to the database to persist them. After the insertion in the database, we bulk index them in Elasticsearch. After successful indexing, we publish a CreateUpdateDelete message to Kafka. The last step is to publish an event to the requested automated annotation services. If everything is successful, we return the created objects; these are used as the response object for the web version.

Exception handling

When the digital media object creation fails, we will roll back on several points. If the indexing in Elasticsearch fails, we will roll back the database insert and the handle creation. If the publishing of the CreateUpdateDelete message fails, we will also roll back the indexing.
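One common way to structure this kind of rollback is to pair every step with a compensating action and undo the completed steps in reverse order when a later step fails. The sketch below shows that pattern under stated assumptions: the `Step` type and `run` method are hypothetical, and the actual service may organise its rollback differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class CreatePipeline {

  // Hypothetical step: an action plus the compensation that undoes it.
  record Step(String name, Runnable action, Runnable compensation) {}

  // Runs the steps in order; returns false if a step failed and earlier steps were rolled back.
  static boolean run(List<Step> steps) {
    Deque<Step> completed = new ArrayDeque<>();
    for (Step step : steps) {
      try {
        step.action().run();
        completed.push(step);
      } catch (RuntimeException e) {
        // Undo the steps that already succeeded, most recent first.
        while (!completed.isEmpty()) {
          completed.pop().compensation().run();
        }
        return false;
      }
    }
    return true;
  }
}
```

With the steps handle creation, database insert, indexing and publishing in that order, a failure during indexing undoes the database insert and then the handle creation, and a failure during publishing additionally undoes the indexing.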

Updated digital media objects

For updated digital media objects, we check whether the handle record needs to be updated and, if so, update it and increment the version. Next, we create the digital media object records, where we increment the version and create a new timestamp for the version. We persist the new digital media record to the database, where we overwrite the old data. After a successful database insert, we bulk index the digital media object, overwriting the old data. We publish a CreateUpdateDelete event to Kafka. If everything was successful, we return the updated records.

Exception handling

When an update on the digital media object fails, we roll back on several points. If the indexing fails, we roll back to the previous version, which means we reinsert the old version to the handle and database. If the publishing of the CreateUpdateDelete message fails, we will also roll back the indexing and index the previous version.
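The idea of rolling back to the previous version can be sketched with an in-memory map standing in for the database. Everything here is hypothetical (the `MediaRecord` type, the `indexer` hook and the method names); the sketch only illustrates keeping the old record at hand so it can be reinserted when indexing fails.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class UpdateWithRollback {

  // Hypothetical, simplified record of a digital media object version.
  record MediaRecord(String handle, int version, Instant created, String attributes) {}

  // In-memory stand-in for the database table, keyed by handle.
  private final Map<String, MediaRecord> database = new HashMap<>();

  void insert(MediaRecord mediaRecord) {
    database.put(mediaRecord.handle(), mediaRecord);
  }

  MediaRecord get(String handle) {
    return database.get(handle);
  }

  // The indexer is a hypothetical hook that throws when indexing fails.
  boolean update(String handle, String newAttributes, Instant now, Consumer<MediaRecord> indexer) {
    MediaRecord previous = database.get(handle);
    MediaRecord next = new MediaRecord(handle, previous.version() + 1, now, newAttributes);
    database.put(handle, next); // overwrite the old data
    try {
      indexer.accept(next);
      return true;
    } catch (RuntimeException e) {
      database.put(handle, previous); // reinsert the previous version
      return false;
    }
  }
}
```

The same approach would extend to the handle record and to the index itself when publishing fails, by keeping the previous state of each store until the last step has succeeded.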

Equal digital media objects

When the stored digital media objects and the received digital media objects are equal, we only update the last_checked timestamp. We do a batch update of this field with the current timestamp to indicate that the objects were checked and found equal at this moment. We do not return the equal objects, as we didn't change the data.

Run locally

To run the system locally, it can be run from an IDE. Clone the code and fill in the application properties (see below). The application needs to store data in a database and an Elasticsearch instance. In Kafka mode it needs a Kafka cluster to connect to and receive the messages from.

Run as Container

The application can also be run as a container. It will require the environment variables described below. The container can be built with the Dockerfile in the root of the project.

Profiles

There are two profiles with which the application can be run:

Web

spring.profiles.active=web
This listens to an API which has one endpoint:
- POST /

This endpoint can be used to post a digital media object event to the processing service. After this, it follows the process described above. It returns the newly created or updated objects.

If an exception occurs during processing, it will be published to the Kafka Dead Letter Queue. We can then later evaluate why the exception was thrown and, if needed, retry the object.

Kafka

spring.profiles.active=kafka
This will make the application listen to a specified queue and process the digital media object events from the queue. We collect the objects in batches of 300 to 500 (depending on the amount in the queue). If any exception occurs, we publish the event to a Dead Letter Queue, where we can evaluate the failure and, if needed, retry the messages.
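The dead letter handling can be illustrated with a small, framework-free sketch. It assumes that failures are routed per event, which the description above does not specify, and the `DeadLetterRouting` class and its `process` method are hypothetical. A plain list stands in for the Dead Letter Queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class DeadLetterRouting {

  // Handles every event in the batch; events whose handler throws are
  // collected and returned so they can be published to a dead letter queue.
  static <T> List<T> process(List<T> batch, Consumer<T> handler) {
    List<T> deadLetters = new ArrayList<>();
    for (T event : batch) {
      try {
        handler.accept(event);
      } catch (RuntimeException e) {
        deadLetters.add(event); // keep the failed event for later evaluation and retry
      }
    }
    return deadLetters;
  }
}
```

One failing event then does not block the rest of the batch, and the returned list contains exactly the events that need to be evaluated and possibly retried.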

Environmental variables

The following backend specific properties can be configured:

```
# Database properties
spring.datasource.url=# The JDBC url to the PostgreSQL database to connect with
spring.datasource.username=# The login username to use for connecting with the database
spring.datasource.password=# The login password to use for connecting with the database

# Elasticsearch properties
elasticsearch.hostname=# The hostname of the Elasticsearch cluster
elasticsearch.port=# The port of the Elasticsearch cluster
elasticsearch.index-name=# The name of the index for Elasticsearch

# Kafka properties (only necessary when the kafka profile is active)
kafka.publisher.host=# The host address of the kafka instance to which the application will publish the CreateUpdateDelete events
kafka.consumer.host=# The host address of the kafka instance from which the application will consume the Annotation events
kafka.consumer.group=# The group name of the kafka group from which the application will consume the Annotation events
kafka.consumer.topic=# The topic name of the kafka topic from which the application will consume the Annotation events

# Keycloak properties (only necessary when the web profile is active)
keycloak.auth-server-url=# Server url of the auth endpoint of Keycloak
keycloak.realm=# Keycloak realm
keycloak.resource=# Resource name of the Keycloak resource
keycloak.ssl-required=# SSL required for the communication with Keycloak
keycloak.use-resource-role-mappings=# Resource Role Mapping true/false
keycloak.principal-attribute=# Principal Attribute in JWT to check
keycloak.confidential-port=# Confidential Keycloak port
keycloak.always-refresh-token=# Always refresh token true/false
keycloak.bearer-only=# Only use bearer token for authentication true/false
```

Owner

Name: DiSSCo
Login: DiSSCo
Kind: organization
Email: info@dissco.eu
Location: Europe

Website: dissco.eu
Twitter: disscoeu
Repositories: 35
Profile: https://github.com/DiSSCo

Distributed System of Scientific Collections - pan-European Research Infrastructure. Updates on DiSSCo and natural science collections

GitHub Events

Total

Release event: 1
Delete event: 8
Issue comment event: 22
Push event: 30
Pull request review event: 10
Pull request review comment event: 2
Pull request event: 32
Create event: 34

Last Year

Release event: 1
Delete event: 8
Issue comment event: 22
Push event: 30
Pull request review event: 10
Pull request review comment event: 2
Pull request event: 32
Create event: 34

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 10
Average time to close issues: N/A
Average time to close pull requests: 4 minutes
Total issue authors: 0
Total pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.4
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 10
Average time to close issues: N/A
Average time to close pull requests: 4 minutes
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.4
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

southeo (23)
samleeflang (10)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

pom.xml maven

org.keycloak.bom:keycloak-adapter-bom 16.1.1 import
co.elastic.clients:elasticsearch-java 8.4.1
com.fasterxml.jackson.core:jackson-databind
com.fasterxml.jackson.dataformat:jackson-dataformat-xml
com.fasterxml.jackson.datatype:jackson-datatype-jsr310
com.github.java-json-tools:json-patch 1.13
jakarta.json:jakarta.json-api 2.1.1
org.keycloak:keycloak-spring-boot-starter
org.postgresql:postgresql
org.projectlombok:lombok
org.springframework.boot:spring-boot-configuration-processor
org.springframework.boot:spring-boot-starter-jooq
org.springframework.boot:spring-boot-starter-security
org.springframework.boot:spring-boot-starter-validation
org.springframework.boot:spring-boot-starter-web
org.springframework.kafka:spring-kafka
org.springframework.boot:spring-boot-starter-test test
org.springframework.kafka:spring-kafka-test test

.github/workflows/build.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/setup-java v1 composite
anothrNick/github-tag-action 1.36.0 composite
docker/build-push-action v3 composite
docker/login-action v1 composite
docker/metadata-action v4 composite

Dockerfile docker

eclipse-temurin 17-alpine build
