https://github.com/dissco/dissco-export-job

Scheduled job to export DiSSCo Data

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: DiSSCo
  • License: apache-2.0
  • Language: Java
  • Default Branch: main
  • Size: 263 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 1
Created over 1 year ago · Last pushed 5 months ago
Metadata Files
  • Readme
  • License

README.md

Export Job

The export job repository contains several jobs, each of which generates an export product. The repository is closely associated with the exporter-backend, which schedules the jobs. The profiles show which jobs are available. All jobs share the AbstractExportJobService class and are triggered through the ProjectRunner. The exports are stored in an S3 bucket and can be downloaded from there (see the S3Repository).
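The shared structure can be pictured as a template method: a base class drives the pagination and each job fills in the steps. The sketch below is only an illustration of that idea, not the repository's code. The step names processSearchResults and postProcessResults are taken from the DwC-DP description in this README; the class names ExportJobSketch and CollectingJob, the run method, and the page-fetching function are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical stand-in for a shared base class; signatures are assumptions.
abstract class ExportJobSketch {

  // Called once per page of search results.
  protected abstract void processSearchResults(List<String> page);

  // Called once after the last page, for example to finalize files before upload.
  protected abstract void postProcessResults();

  // Template method: fetch pages until an empty page is returned,
  // then run the post-processing step. Returns the number of records seen.
  final int run(IntFunction<List<String>> fetchPage) {
    int pageNumber = 0;
    int total = 0;
    List<String> page = fetchPage.apply(pageNumber);
    while (!page.isEmpty()) {
      processSearchResults(page);
      total += page.size();
      page = fetchPage.apply(++pageNumber);
    }
    postProcessResults();
    return total;
  }
}

// Hypothetical job that simply collects every record it is handed.
class CollectingJob extends ExportJobSketch {
  final List<String> lines = new ArrayList<>();
  boolean finished = false;

  @Override
  protected void processSearchResults(List<String> page) {
    lines.addAll(page);
  }

  @Override
  protected void postProcessResults() {
    finished = true;
  }
}
```

Calling `new CollectingJob().run(fetchPage)` with a function that returns an empty list once the data is exhausted walks every page exactly once. How the real jobs page through the index is not shown here.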

Source System Jobs

Some jobs are specific to a source system. The results of these jobs are available through the source-system endpoint and can be used, for example, by GBIF. The difference from the other jobs is that the export includes the original eml.xml of the source system.

Jobs

DoiList

This job generates a zipped CSV file containing just two columns: the DOI of the specimen and the physicalSpecimenID. It is straightforward: it paginates over Elasticsearch, retrieves only these two fields, and writes the result to a file.
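The writing step could look roughly like the sketch below, which streams rows into a single CSV entry inside a zip archive using only the JDK. It assumes the two fields have already been fetched. The class name DoiListSketch, the Row record, the entry name doi-list.csv, and the header labels are hypothetical, and the real job may well use the opencsv dependency instead of the hand-written escaping shown here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class DoiListSketch {

  // Hypothetical holder for the two exported fields.
  record Row(String doi, String physicalSpecimenId) {}

  // Quotes a field only when it contains a comma, a quote, or a line break.
  static String escape(String value) {
    if (value == null) {
      return "";
    }
    boolean needsQuotes = value.contains(",") || value.contains("\"")
        || value.contains("\n") || value.contains("\r");
    return needsQuotes ? "\"" + value.replace("\"", "\"\"") + "\"" : value;
  }

  static String toCsvLine(Row row) {
    return escape(row.doi()) + "," + escape(row.physicalSpecimenId()) + "\n";
  }

  // Streams the rows into one CSV entry of a zip archive written to target.
  static void writeZippedCsv(OutputStream target, Iterable<Row> rows) {
    try (ZipOutputStream zip = new ZipOutputStream(target);
        Writer writer = new OutputStreamWriter(zip, StandardCharsets.UTF_8)) {
      zip.putNextEntry(new ZipEntry("doi-list.csv")); // hypothetical entry name
      writer.write("doi,physicalSpecimenID\n");
      for (Row row : rows) {
        writer.write(toCsvLine(row));
      }
      writer.flush(); // push buffered characters into the entry before closing it
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because rows are written as they arrive, the archive never has to be held in memory as a whole. The target can be a file stream or any other OutputStream.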

DWCA

This job generates a Darwin Core Archive (DwC-A) file containing all the specimen data. It paginates over Elasticsearch and retrieves all data, and it then retrieves all media associated with the specimen. For building the DwC-A we use the GBIF dwca-io library. However, we implemented our own DwC-A writer (DwcaZipWriter), as the GBIF implementation did not support streaming to a zipped file. The DwcaZipWriter is based on the GBIF stream writer. When building the DwC-A it is important that all fields are included in the correct order. The order of the fields is determined from the first records, and this order in turn determines the meta.xml file. This is why the job always includes the fields that are mapped to an ods:Location, even if the location is not present in the record.
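The field-order rule can be illustrated with a small sketch: the column order is fixed from the first batch of records, and a set of location terms is appended even when none of those records carries them, so later records with a location still fit the columns declared in meta.xml. The class name FieldOrderSketch, the method names, and the two terms in LOCATION_TERMS are assumptions chosen for illustration; the repository's actual term list and data types are not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FieldOrderSketch {

  // Hypothetical subset of terms mapped from an ods:Location.
  static final List<String> LOCATION_TERMS = List.of("dwc:country", "dwc:locality");

  // Fixes the column order from the first batch: terms in first-seen order,
  // followed by any location terms that no record in the batch carried.
  static List<String> determineFieldOrder(List<Map<String, String>> firstRecords) {
    Set<String> ordered = new LinkedHashSet<>();
    for (Map<String, String> rec : firstRecords) {
      ordered.addAll(rec.keySet());
    }
    ordered.addAll(LOCATION_TERMS);
    return new ArrayList<>(ordered);
  }

  // Renders one record against the fixed order; absent terms become empty cells.
  static List<String> toRow(List<String> fieldOrder, Map<String, String> rec) {
    List<String> row = new ArrayList<>(fieldOrder.size());
    for (String field : fieldOrder) {
      row.add(rec.getOrDefault(field, ""));
    }
    return row;
  }
}
```

A LinkedHashSet keeps the first-seen order and drops repeats, so the order stays stable however many records in the batch share a term.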

DwC-DP

This job contains some additional complexity, as we potentially need to deduplicate records. When the job starts, it therefore creates some temporary database tables. Just like the other jobs, it paginates over Elasticsearch and retrieves all data, after which it also collects any media associated with the specimen. We then parse this data into DwC-DP records and generate any identifiers that are not present. These identifiers are essential for creating the linkages between the different files in the DwC-DP. The generated identifiers are an MD5 hash of the data in the object, which is essential for deduplication. We then store the records in the temporary database tables, with the identifier as key and the record as a binary array (blob). This step takes care of the deduplication, as any record with an existing primary key is ignored (on conflict do nothing). This concludes the processSearchResults step. We then move to the postProcessResults step, which retrieves each record from the temporary database tables and writes it to the DwC-DP files. Finally, we upload the DwC-DP to the S3 bucket.
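The deduplication idea can be sketched without a database: derive the identifier from the record content with MD5, then insert only when the identifier is not yet present. In the sketch below an in-memory map stands in for the temporary table, and putIfAbsent plays the role of the on conflict do nothing insert. The class name DedupSketch and its methods are hypothetical, and a record is simplified to a string, whereas the real job serializes DwC-DP objects.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.Map;

class DedupSketch {

  // Stand-in for the temporary table: identifier -> serialized record (blob).
  private final Map<String, byte[]> table = new LinkedHashMap<>();

  // Derives a deterministic identifier from the record content,
  // so identical content always yields the same key.
  static String md5Identifier(String content) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(digest);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Returns true when the record was stored, false when a record with the
  // same identifier already existed and the new one was ignored.
  boolean store(String content) {
    byte[] blob = content.getBytes(StandardCharsets.UTF_8);
    return table.putIfAbsent(md5Identifier(content), blob) == null;
  }

  int size() {
    return table.size();
  }
}
```

Storing the same content twice leaves a single entry, which mirrors how a primary-key conflict silently drops the duplicate row. MD5 serves here as a content fingerprint rather than a security measure.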

Owner

  • Name: DiSSCo
  • Login: DiSSCo
  • Kind: organization
  • Email: info@dissco.eu
  • Location: Europe

Distributed System of Scientific Collections - pan-European Research Infrastructure. Updates on DiSSCo and natural science collections

GitHub Events

Total
  • Delete event: 14
  • Issue comment event: 58
  • Push event: 68
  • Pull request review event: 74
  • Pull request review comment event: 73
  • Pull request event: 56
  • Create event: 23
Last Year
  • Delete event: 14
  • Issue comment event: 58
  • Push event: 68
  • Pull request review event: 74
  • Pull request review comment event: 73
  • Pull request event: 56
  • Create event: 23

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 0
  • Total pull requests: 26
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.65
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 26
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.65
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • southeo (14)
  • samleeflang (12)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/build.yaml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-java v1 composite
  • anothrNick/github-tag-action 1.36.0 composite
  • aquasecurity/trivy-action master composite
  • docker/build-push-action v3 composite
  • docker/login-action v1 composite
  • docker/metadata-action v4 composite
.github/workflows/cache-trivy.yaml actions
  • actions/cache/save v4 composite
Dockerfile docker
  • eclipse-temurin 21-jdk-alpine build
pom.xml maven
  • co.elastic.clients:elasticsearch-java 8.15.0
  • com.opencsv:opencsv 5.9
  • org.projectlombok:lombok
  • org.springframework.boot:spring-boot-starter-validation
  • org.springframework.boot:spring-boot-starter-web
  • org.springframework.boot:spring-boot-starter-webflux
  • org.springframework.boot:spring-boot-starter-test test
  • org.testcontainers:elasticsearch test