https://github.com/dissco/dissco-export-job

Scheduled job to export DiSSCo Data

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.7%) to scientific vocabulary

Last synced: 8 months ago

Repository

Basic Info

Host: GitHub
Owner: DiSSCo
License: apache-2.0
Language: Java
Default Branch: main
Size: 263 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 1

Created over 1 year ago · Last pushed 8 months ago

Metadata Files

Readme License

Export Job

The export job repository contains several jobs, each of which generates an export product. It is closely associated with the exporter-backend, which schedules the jobs. The repository holds multiple jobs; the profiles show which ones are available. All jobs share the AbstractExportJobService class and are triggered through the ProjectRunner. The exports are stored in an S3 bucket and can be downloaded from there (see the S3Repository).
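As an illustration of how jobs can share one base class, the following is a minimal template-method sketch. It is not the repository's AbstractExportJobService: the class names AbstractJob and IdListJob, the run method, and all signatures are assumptions made for this example. Only the step names processSearchResults and postProcessResults are taken from the DwC-DP description in this README.

```java
import java.util.ArrayList;
import java.util.List;

final class ExportJobTemplateSketch {

  /** Hypothetical base class: the shared flow lives here, each job fills in the steps. */
  abstract static class AbstractJob {

    /** Runs the shared flow: handle every page of results, then finish the export. */
    final String run(List<List<String>> pages) {
      for (List<String> page : pages) {
        processSearchResults(page);
      }
      return postProcessResults();
    }

    /** Called once per page of search results (signature is an assumption). */
    abstract void processSearchResults(List<String> page);

    /** Called once after the last page; returns the finished export (assumption). */
    abstract String postProcessResults();
  }

  /** Toy job that collects identifiers and joins them into one export string. */
  static final class IdListJob extends AbstractJob {
    private final List<String> ids = new ArrayList<>();

    @Override
    void processSearchResults(List<String> page) {
      ids.addAll(page);
    }

    @Override
    String postProcessResults() {
      return String.join("\n", ids);
    }
  }

  public static void main(String[] args) {
    // Two pages of made-up identifiers stand in for paginated search results.
    String export = new IdListJob().run(List.of(List.of("a", "b"), List.of("c")));
    System.out.println(export);
  }
}
```

Keeping the paging loop in the base class means a new export format only has to describe what to do with a page and how to finish, which is one plausible reason for sharing a single abstract service.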

Source System Jobs

Some jobs are specific to a single source system. Their results are available through the source-system endpoint and can be used, for example, by GBIF. Unlike the other jobs, these exports include the original eml.xml of the source system.

Jobs

DoiList

This job generates a zipped CSV file containing just two columns: the DOI of the specimen and the physicalSpecimenID. It is straightforward: it paginates over Elasticsearch, retrieves only these two fields, and writes the result to a file.
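The flow described above can be sketched with the standard library alone. This is a hedged illustration, not the job's actual code: the class DoiListSketch, the entry name doi-list.csv, the nextPage function (a stand-in for the Elasticsearch query), and the sample DOIs are all assumptions. The sketch uses records, so it assumes Java 16 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DoiListSketch {

  /** One export row: the specimen DOI and its physicalSpecimenID. */
  record Row(String doi, String physicalSpecimenId) {}

  /** Quotes a CSV field and doubles embedded quotes. */
  static String quote(String value) {
    return "\"" + value.replace("\"", "\"\"") + "\"";
  }

  /**
   * Streams every page into one CSV entry of a zip and returns the row count.
   * nextPage stands in for the search query: it receives the last DOI already
   * written (null on the first call) and returns the following page, or an
   * empty list once there is nothing left.
   */
  static int export(Function<String, List<Row>> nextPage, OutputStream target) {
    int written = 0;
    try (ZipOutputStream zip = new ZipOutputStream(target)) {
      zip.putNextEntry(new ZipEntry("doi-list.csv")); // hypothetical entry name
      Writer csv = new OutputStreamWriter(zip, StandardCharsets.UTF_8);
      csv.write("doi,physicalSpecimenID\n");
      List<Row> page = nextPage.apply(null);
      while (!page.isEmpty()) {
        for (Row row : page) {
          csv.write(quote(row.doi()) + "," + quote(row.physicalSpecimenId()) + "\n");
          written++;
        }
        page = nextPage.apply(page.get(page.size() - 1).doi());
      }
      csv.flush(); // push buffered characters into the zip entry before closing it
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return written;
  }

  /** In-memory replacement for the search index, used only for the demo. */
  static Function<String, List<Row>> inMemoryPages(List<Row> all, int pageSize) {
    return lastDoi -> {
      int start = 0;
      for (int i = 0; lastDoi != null && i < all.size(); i++) {
        if (all.get(i).doi().equals(lastDoi)) {
          start = i + 1;
        }
      }
      return all.subList(start, Math.min(start + pageSize, all.size()));
    };
  }

  /** Reads the first zip entry back as text so the result can be inspected. */
  static String unzip(byte[] zipped) {
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
      zip.getNextEntry();
      return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Placeholder values, not real DOIs.
    List<Row> rows = List.of(
        new Row("https://doi.org/TEST/AAA", "specimen-1"),
        new Row("https://doi.org/TEST/BBB", "specimen-2"),
        new Row("https://doi.org/TEST/CCC", "specimen \"3\""));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    System.out.println(export(inMemoryPages(rows, 2), out) + " rows written");
    System.out.print(unzip(out.toByteArray()));
  }
}
```

Because each page is written straight into the zip stream, memory use depends on the page size rather than on the size of the whole export.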

DWCA

This job generates a Darwin Core Archive (DwC-A) file containing all the specimen data. It paginates over Elasticsearch and retrieves all data, then retrieves all media associated with each specimen. To build the DwC-A we use the GBIF dwca-io library. However, we implemented our own DwC-A writer (DwcaZipWriter) because the GBIF implementation did not support streaming to a zipped file. The DwcaZipWriter is based on the GBIF stream writer. When building the DwC-A, it is important that all fields are included in the correct order. The order of the fields is determined from the first records, and this order in turn determines the meta.xml file. This is why fields that are mapped to an ods:Location are always included, even if the location is not present in a record.
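The column-ordering idea can be sketched as follows. This is an assumption-laden illustration rather than the DwcaZipWriter itself: the class DwcaColumnOrderSketch, the ALWAYS_INCLUDED list, and the choice of dwc:country and dwc:locality as location-derived terms are hypothetical, and records are simplified to string maps.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

final class DwcaColumnOrderSketch {

  // Hypothetical list of location-derived terms that always get a column.
  static final List<String> ALWAYS_INCLUDED = List.of("dwc:country", "dwc:locality");

  /** Fixes the column order from the first records plus the always-included terms. */
  static List<String> columnOrder(List<Map<String, String>> firstRecords) {
    Set<String> columns = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
    for (Map<String, String> record : firstRecords) {
      columns.addAll(record.keySet());
    }
    columns.addAll(ALWAYS_INCLUDED);
    return List.copyOf(columns);
  }

  /** Renders one record as a tab-separated row; missing fields become empty cells. */
  static String toRow(List<String> columns, Map<String, String> record) {
    StringJoiner row = new StringJoiner("\t");
    for (String column : columns) {
      row.add(record.getOrDefault(column, ""));
    }
    return row.toString();
  }

  /** Builds an insertion-ordered record from alternating key and value arguments. */
  static Map<String, String> record(String... keysAndValues) {
    Map<String, String> map = new LinkedHashMap<>();
    for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
      map.put(keysAndValues[i], keysAndValues[i + 1]);
    }
    return map;
  }

  public static void main(String[] args) {
    // The first record has no location, the later one does.
    Map<String, String> first =
        record("dwc:catalogNumber", "A1", "dwc:scientificName", "Abies alba");
    Map<String, String> later = record("dwc:catalogNumber", "B2", "dwc:country", "NL");

    List<String> columns = columnOrder(List.of(first));
    System.out.println(String.join("\t", columns));
    System.out.println(toRow(columns, first));
    System.out.println(toRow(columns, later));
  }
}
```

If the location columns were not reserved up front, a later record carrying a location would have no column to land in, because the column list is fixed once the first rows have been written.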

DwC-DP

This job contains some additional complexity because we may need to deduplicate records. For this reason, the job creates temporary database tables when it starts. As in the other jobs, we paginate over Elasticsearch and retrieve all data, after which we also collect any media associated with the specimen. We then parse the data into DwC-DP records and generate any identifiers that are not present. These identifiers are essential for creating the links between the different files in the DwC-DP. The generated identifiers are MD5 hashes of the data in the object, which makes them deterministic and therefore usable for deduplication. We then store the records in the temporary database tables, with the identifier as key and the record as a binary array (blob). This step takes care of the deduplication, because any record with an already existing primary key is ignored (on conflict do nothing). This concludes the processSearchResults step. We then move to postProcessResults, which retrieves each record from the temporary database tables and writes it to the DwC-DP files. Finally, we upload the DwC-DP to the S3 bucket.
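The deduplication step can be illustrated without a database. In this sketch an in-memory map plays the role of a temporary table, and putIfAbsent stands in for an insert with "on conflict do nothing". The class DwcDpDedupSketch, its methods, and the pipe-separated sample records are assumptions made for this example. HexFormat requires Java 17 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DwcDpDedupSketch {

  /** Stand-in for a temporary table: identifier (primary key) to serialized record. */
  private final Map<String, byte[]> table = new LinkedHashMap<>();

  /** Derives a deterministic identifier from the record content using MD5. */
  static String contentId(String recordContent) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(recordContent.getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(digest);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 should be available on every Java platform", e);
    }
  }

  /** Mimics "on conflict do nothing": stores the record only if its key is new. */
  boolean insertIfAbsent(String recordContent) {
    byte[] blob = recordContent.getBytes(StandardCharsets.UTF_8);
    return table.putIfAbsent(contentId(recordContent), blob) == null;
  }

  /** Mimics the post-processing read: returns every stored record exactly once. */
  List<String> readAll() {
    List<String> records = new ArrayList<>();
    for (byte[] blob : table.values()) {
      records.add(new String(blob, StandardCharsets.UTF_8));
    }
    return records;
  }

  public static void main(String[] args) {
    DwcDpDedupSketch store = new DwcDpDedupSketch();
    // Illustrative records; the first and third have identical content.
    List<String> incoming = List.of(
        "occurrence|Abies alba|NL",
        "occurrence|Picea abies|SE",
        "occurrence|Abies alba|NL");
    for (String record : incoming) {
      System.out.println(record + " stored=" + store.insertIfAbsent(record));
    }
    System.out.println(store.readAll());
  }
}
```

Deriving the key from the content is what makes the conflict check meaningful: two records with the same data hash to the same identifier, so the second insert is skipped. MD5 serves here as a compact content fingerprint, not as a security measure.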

Owner

Name: DiSSCo
Login: DiSSCo
Kind: organization
Email: info@dissco.eu
Location: Europe

Website: dissco.eu
Twitter: disscoeu
Repositories: 35
Profile: https://github.com/DiSSCo

Distributed System of Scientific Collections - pan-European Research Infrastructure. Updates on DiSSCo and natural science collections

GitHub Events

Total

Delete event: 14
Issue comment event: 58
Push event: 68
Pull request review event: 74
Pull request review comment event: 73
Pull request event: 56
Create event: 23

Last Year

Delete event: 14
Issue comment event: 58
Push event: 68
Pull request review event: 74
Pull request review comment event: 73
Pull request event: 56
Create event: 23

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 0
Total pull requests: 26
Average time to close issues: N/A
Average time to close pull requests: 1 day
Total issue authors: 0
Total pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.65
Merged pull requests: 20
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 26
Average time to close issues: N/A
Average time to close pull requests: 1 day
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.65
Merged pull requests: 20
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

southeo (14)
samleeflang (12)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

.github/workflows/build.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
actions/setup-java v1 composite
anothrNick/github-tag-action 1.36.0 composite
aquasecurity/trivy-action master composite
docker/build-push-action v3 composite
docker/login-action v1 composite
docker/metadata-action v4 composite

.github/workflows/cache-trivy.yaml actions

actions/cache/save v4 composite

Dockerfile docker

eclipse-temurin 21-jdk-alpine build

pom.xml maven

co.elastic.clients:elasticsearch-java 8.15.0
com.opencsv:opencsv 5.9
org.projectlombok:lombok
org.springframework.boot:spring-boot-starter-validation
org.springframework.boot:spring-boot-starter-web
org.springframework.boot:spring-boot-starter-webflux
org.springframework.boot:spring-boot-starter-test test
org.testcontainers:elasticsearch test
