https://github.com/dissco/dissco-export-job
Scheduled job to export DiSSCo Data
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.7%) to scientific vocabulary
Repository
Scheduled job to export DiSSCo Data
Basic Info
- Host: GitHub
- Owner: DiSSCo
- License: apache-2.0
- Language: Java
- Default Branch: main
- Size: 263 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
Export Job
The export job repository contains several jobs that generate an export product. It is closely associated with the exporter-backend, which schedules the jobs. Multiple jobs live in this repository; the profiles show which jobs are available. The jobs share the AbstractExportJobService class and are triggered through the ProjectRunner. The exports of the jobs are stored in an S3 bucket and can be downloaded from there (see the S3Repository).
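The shared-service pattern described above can be sketched as a template method: an abstract service defines the common export pipeline, and concrete jobs fill in the per-job steps. This is a minimal illustration only; apart from the names AbstractExportJobService and ProjectRunner mentioned in this README, every class and method name below is an assumption, not the repository's actual API.

```java
import java.util.List;

// Hypothetical sketch of the shared job shape: an abstract service defines
// the export pipeline, concrete jobs supply the per-job steps. All names
// here are illustrative assumptions.
abstract class ExportJobServiceSketch {

    // Template method: fetch, transform, then upload the export product.
    public final String runExport() {
        List<String> records = fetchRecords();   // e.g. paginate over Elasticsearch
        String export = buildExport(records);    // job-specific export format
        return uploadToS3(export);               // exports land in an S3 bucket
    }

    protected abstract List<String> fetchRecords();

    protected abstract String buildExport(List<String> records);

    protected String uploadToS3(String export) {
        // Placeholder for the S3 upload (see S3Repository in the real code).
        return "s3://exports/" + export.length() + "-bytes";
    }
}

class DoiListJobSketch extends ExportJobServiceSketch {
    @Override
    protected List<String> fetchRecords() {
        return List.of("10.1234/abc,SPEC-1");
    }

    @Override
    protected String buildExport(List<String> records) {
        return String.join("\n", records);
    }
}
```

A runner (the ProjectRunner in the real repository) would then pick the concrete job to execute based on the active profile.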
Source System Jobs
Some jobs are specific to a single source system.
The results of these jobs are available through the source-system endpoint and can be used, for example, by GBIF.
The difference from the other jobs is that we include the original eml.xml for the source system.
Jobs
DoiList
This job generates a zipped CSV file containing just two columns: the DOI of the specimen and the physicalSpecimenID. It is straightforward: it paginates over Elasticsearch, retrieves only these two fields, and writes the result to a file.
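The output shape can be sketched with the standard library alone. This is a simplified, hypothetical version: the real job streams pages from Elasticsearch, whereas here the rows come from an in-memory list, and the entry name is invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of the DoiList export: a two-column CSV
// (DOI, physicalSpecimenID) written into a zip archive.
public class DoiListSketch {

    public static byte[] writeZippedCsv(List<Map.Entry<String, String>> rows)
            throws IOException {
        var buffer = new ByteArrayOutputStream();
        try (var zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("doi-list.csv")); // entry name is illustrative
            var csv = new StringBuilder("DOI,physicalSpecimenID\n");
            for (var row : rows) {
                csv.append(row.getKey()).append(',').append(row.getValue()).append('\n');
            }
            zip.write(csv.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }
}
```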
DWCA
This job generates a Darwin Core Archive (DwC-A) file containing all the specimen data.
It paginates over Elasticsearch and retrieves all data; it then retrieves all media associated with the specimens.
For building the DwC-A we use the GBIF dwca-io library.
However, we implemented our own DwC-A writer (DwcaZipWriter), as the GBIF implementation did not support streaming to a zipped file.
The DwcaZipWriter is based on this GBIF stream writer.
We then build the DwC-A, where it is important that all fields are included in the correct order.
The field order is determined from the first records, and it in turn determines the meta.xml file.
This is why we use some magic to always include fields that are mapped to an ods:Location, even if the location is not present in the record.
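The field-ordering idea above can be sketched as follows. This is an assumption-laden illustration: the location field names and the exact merging logic are invented here, and the real mapping to ods:Location will differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the field-ordering step: the column order (and thus
// meta.xml) is fixed from the first record, but location-mapped fields are
// always appended so later records that do carry an ods:Location still fit
// the schema. Field names are illustrative, not the repository's mapping.
public class FieldOrderSketch {

    static final List<String> LOCATION_FIELDS =
        List.of("dwc:decimalLatitude", "dwc:decimalLongitude", "dwc:country");

    public static List<String> determineFieldOrder(Map<String, Object> firstRecord) {
        // LinkedHashSet keeps first-record order and drops duplicates
        // when a location field was already present.
        Set<String> ordered = new LinkedHashSet<>(firstRecord.keySet());
        ordered.addAll(LOCATION_FIELDS); // force-include even if absent
        return new ArrayList<>(ordered);
    }
}
```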
DwC-DP
This job contains some additional complexity, as we may need to deduplicate records. When the job starts, we therefore create temporary database tables. As in the other jobs, we paginate over Elasticsearch and retrieve all data, after which we also collect any media associated with the specimens. We then parse this into DwC-DP records and generate identifiers when they are not present. These identifiers are essential for creating the linkages between the different files in the DwC-DP.

The generated identifiers are based on the data in the object (essential for deduplication) and use an MD5 hash. We then store the records in the temporary database tables, with the identifier as key and the record as a binary array (blob). This step takes care of the deduplication, as we ignore any records that have the same primary key (on conflict do nothing). This concludes the processSearchResults step.

We then move to postProcessResults, which retrieves each record from the temporary database tables and writes it to the DwC-DP files. Finally, we upload the DwC-DP to the S3 bucket.
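The content-based identifier step can be sketched in a few lines. Only the use of MD5 over the object's data comes from this README; the input format and method name below are assumptions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of identifier generation: when a DwC-DP record lacks
// an identifier, derive one as an MD5 hash of the record's content, so that
// identical records collapse onto the same primary key and the database's
// "on conflict do nothing" silently drops the duplicates.
public class RecordIdSketch {

    public static String contentBasedId(String recordContent) {
        try {
            var md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(recordContent.getBytes(StandardCharsets.UTF_8));
            // Render the 16-byte digest as a 32-character lowercase hex string.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the identifier is a pure function of the record's content, two identical records always map to the same key, which is what makes the insert-and-ignore deduplication work.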
Owner
- Name: DiSSCo
- Login: DiSSCo
- Kind: organization
- Email: info@dissco.eu
- Location: Europe
- Website: dissco.eu
- Twitter: disscoeu
- Repositories: 35
- Profile: https://github.com/DiSSCo
Distributed System of Scientific Collections - pan-European Research Infrastructure. Updates on DiSSCo and natural science collections
GitHub Events
Total
- Delete event: 14
- Issue comment event: 58
- Push event: 68
- Pull request review event: 74
- Pull request review comment event: 73
- Pull request event: 56
- Create event: 23
Last Year
- Delete event: 14
- Issue comment event: 58
- Push event: 68
- Pull request review event: 74
- Pull request review comment event: 73
- Pull request event: 56
- Create event: 23
Issues and Pull Requests
Last synced: 5 months ago
All Time
- Total issues: 0
- Total pull requests: 26
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.65
- Merged pull requests: 20
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 26
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.65
- Merged pull requests: 20
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- southeo (14)
- samleeflang (12)
Dependencies
- actions/cache v1 composite
- actions/checkout v2 composite
- actions/setup-java v1 composite
- anothrNick/github-tag-action 1.36.0 composite
- aquasecurity/trivy-action master composite
- docker/build-push-action v3 composite
- docker/login-action v1 composite
- docker/metadata-action v4 composite
- actions/cache/save v4 composite
- eclipse-temurin 21-jdk-alpine build
- co.elastic.clients:elasticsearch-java 8.15.0
- com.opencsv:opencsv 5.9
- org.projectlombok:lombok
- org.springframework.boot:spring-boot-starter-validation
- org.springframework.boot:spring-boot-starter-web
- org.springframework.boot:spring-boot-starter-webflux
- org.springframework.boot:spring-boot-starter-test test
- org.testcontainers:elasticsearch test