eosc-data-transfer
Data transfer service for EOSC
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary
Repository
Data transfer service for EOSC
Basic Info
Statistics
- Stars: 5
- Watchers: 4
- Forks: 5
- Open Issues: 6
- Releases: 14
Metadata Files
README.md
EOSC Data Transfer Service
EOSC Future is a Horizon 2020 project, funded by the European Commission, that implemented the European Open Science Cloud (EOSC) to give European researchers access to a wide web of FAIR data and science-related services. The EOSC Beyond project continues this development to deliver the next generation of EOSC.
This project builds a generic data transfer service that can be used in EOSC to transfer large amounts of data to cloud storage, by just indicating the source and destination. The EOSC Data Transfer Service features a RESTful Application Programming Interface (REST API).
The API covers three sets of functionalities:
- Parsing DOIs
- Creating and managing data transfers
- Managing storage elements
This project uses Quarkus, the Supersonic Subatomic Java Framework. It requires Java 17 and Quarkus tooling.
Authentication and authorization
All three groups of API endpoints mentioned above support authorization.
The generic data transfer service behind the EOSC Data Transfer API aims to be agnostic
with regard to authorization, thus the HTTP header Authorization (if present) will be
forwarded as received.
Note that the frontend using this API might have to supply more than one set of credentials: (1) one for the data repository (determined by the DOI used as the source), (2) one for the transfer service that is automatically selected when a destination is chosen, and (3) one for the destination storage system. Only (2) is mandatory.
The API endpoints that parse DOIs usually call APIs that are open access. Even so, the
HTTP header Authorization (if present) will be forwarded as received. This ensures that
the EOSC Data Transfer API can be extended with new parsers
for data repositories that require authentication.
The API endpoints that create and manage transfers, as well as the ones that manage storage
elements, do require authorization, in the form of an access token passed via the HTTP
header Authorization. This gets passed to the
transfer service registered to handle the destination storage.
The challenge is that some storage systems used as the target of the transfer may need
a different authentication and/or authorization (than the one the transfer service uses).
Thus, an additional set of credentials can be supplied to the endpoints in these groups
via the HTTP header Authorization-Storage.
For example, for transfers to dCache, the configured transfer service that handles the transfers is EGI Data Transfer. Both can use the same EGI Check-in access token, so no additional credentials are needed besides the access token for the transfer service, passed via the Authorization HTTP header.
When used, the HTTP header parameter Authorization-Storage receives a
key-value pair, separated by a colon (:), with no leading or trailing whitespace, which
is Base-64 encoded.
For example, to pass a username and password to the destination storage, you construct a string like
username:password, then Base-64 encode it to dXNlcm5hbWU6cGFzc3dvcmQ=, and finally pass this through the HTTP header Authorization-Storage when calling e.g. the endpoint GET /storage/folder/list.
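The encoding described above can be sketched in plain Java. This is a minimal illustration, not code from the project: the class and method names are hypothetical, the `Bearer` prefix on the access token is an assumption (the text only says the token is passed via the Authorization header), and the base URL and destination key are supplied by the caller.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only
final class StorageAuthExample {

    // Joins the two parts with a colon and Base-64 encodes the result,
    // as described for the Authorization-Storage header
    static String encodeStorageCredentials(String key, String value) {
        String pair = key + ":" + value;
        return Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    // Builds (but does not send) a request to GET /storage/folder/list
    // carrying both sets of credentials.
    // Assumption: the access token is sent with the "Bearer" scheme.
    static HttpRequest listFolderRequest(String baseUrl, String dest,
                                         String accessToken, String storageCredentials) {
        URI uri = URI.create(baseUrl + "/storage/folder/list?dest=" + dest);
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer " + accessToken)
                .header("Authorization-Storage", storageCredentials)
                .GET()
                .build();
    }
}
```

Calling `encodeStorageCredentials("username", "password")` yields the value shown in the example above, `dXNlcm5hbWU6cGFzc3dvcmQ=`. Any further query parameters the endpoint needs are left out of this sketch.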
Parsing DOIs
The API supports parsing digital object identifiers (DOIs) and will return a list of files in the repository indicated by the DOI. It will automatically identify the DOI type and will use the correct parser to retrieve the list of source files.
DOIs are persistent identifiers (PIDs) dedicated to identification of content over digital networks. These are registered by one of the registration agencies of the International DOI Foundation. Although in this documentation we refer to DOIs, the API endpoint that parses DOIs supports any PID registered in the global handle system of the DONA Foundation, provided it points to a data repository for which a parser is configured.
Supported data repositories
The API supports parsing DOIs to the following data repositories:
- Zenodo
- B2SHARE
- European Synchrotron Radiation Facility
- Any data repository that supports Signposting
Integrating new DOI parsers
The API endpoint GET /parser that parses DOIs is extensible. All you have to do is
implement the parser interface for a specific data repository, then register
the Java class implementing the interface in the configuration.
1. Implement the interface for a generic DOI parser
Implement the Java interface ParserService in a class of your choice.
```java
public interface ParserService {
    boolean init(ParserConfig config, PortConfig port);
    String getId();
    String getName();
    String sourceId();
    Uni<Tuple2<Boolean, ParserService>> canParseDOI(String auth, String doi, ParserHelper helper);
    Uni<StorageContent> parseDOI(String auth, String doi, int level);
}
```
Your class must have a constructor that receives a
String id, which must be returned by the method getId().
When the API GET /parser is called to parse a DOI, all configured parsers will be tried,
by calling the method canParseDOI(), until one is identified that can parse the DOI. If no
parser can handle the DOI, the API fails. In case your implementation of the method
canParseDOI() cannot determine if your parser can handle a DOI just from the URN,
you can use the passed in ParserHelper to check if the URN redirects to the data
repository you support.
After a parser is identified, the methods init() and parseDOI() are called in order.
The same ParserHelper is used when trying all parsers for a DOI. This helper caches the redirects, so you should try getRedirectedToUrl() before incurring one or more network calls by calling checkRedirect().
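The selection process described above (try each configured parser in turn, sharing one helper that caches redirects) can be sketched as follows. This is a simplified, synchronous stand-in, not the project's code: `SimpleParser`, `RedirectCache`, `forHost()`, and `select()` are hypothetical names, whereas the real ParserService and ParserHelper are reactive and have the signatures shown earlier.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical, simplified sketch of parser selection
final class ParserSelectionSketch {

    // Synchronous stand-in for ParserService
    interface SimpleParser {
        String getId();
        boolean canParse(String doi, RedirectCache helper);
    }

    // Stand-in for ParserHelper: resolves each DOI at most once
    static final class RedirectCache {
        private final Map<String, String> cache = new HashMap<>();
        private final Function<String, String> resolver;
        int lookups = 0; // number of (simulated) network calls made

        RedirectCache(Function<String, String> resolver) {
            this.resolver = resolver;
        }

        String redirectedUrl(String doi) {
            return cache.computeIfAbsent(doi, d -> {
                lookups++;
                return resolver.apply(d);
            });
        }
    }

    // Creates a parser that accepts DOIs redirecting to the given host
    static SimpleParser forHost(String id, String host) {
        return new SimpleParser() {
            @Override public String getId() { return id; }
            @Override public boolean canParse(String doi, RedirectCache helper) {
                String url = helper.redirectedUrl(doi);
                return url != null && url.contains(host);
            }
        };
    }

    // Tries the parsers in order and returns the first that accepts the DOI
    static Optional<SimpleParser> select(List<SimpleParser> parsers, String doi, RedirectCache helper) {
        for (SimpleParser parser : parsers) {
            if (parser.canParse(doi, helper)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }
}
```

Because the cache is shared, checking a second or third parser against the same DOI does not trigger another redirect lookup, which is the reason for consulting the cached redirect first.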
2. Add configuration for the new DOI parser
Add a new entry in the configuration file under eosc/parser for the
new parser, with the following settings:
- name is the human-readable name of the data repository.
- class is the canonical Java class name that implements the interface ParserService for the data repository.
- url is the base URL for the REST client that will be used to call the API of this data repository (optional).
- timeout is the maximum timeout in milliseconds for calls to the data repository. If not supplied, the default value 5000 (5 seconds) is used.
Creating and managing data transfers
The API supports creation of new data transfers (aka jobs), finding data transfers, querying information about data transfers, and canceling data transfers.
Every API endpoint that performs operations on or queries information about data transfers or storage elements in a destination storage has to be passed the destination storage type. This selects the data transfer service that will be used to perform the data transfer, freeing the clients of the API from having to know which data transfer service to pick for each destination. Each destination storage type is mapped to exactly one data transfer service in the configuration.
Note that the API uses the concept of a storage type, instead of the protocol type, to select the transfer service. This makes the API flexible, by allowing multiple destination storages that use the same protocol to be handled by different transfer services, but at the same time it also allows an entire protocol (e.g. FTP, see below) to be handled by a specific transfer service.
If you do not supply the dest query parameter when making an API call to perform a transfer or a storage element related operation or query, the default value dcache will be supplied instead.
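The mapping from destination storage type to transfer service, including the dcache default, can be pictured with a small lookup. This is only a sketch under stated assumptions: the class name, the map-based registry, and the service keys used by callers are hypothetical, and the real API reads this mapping from its configuration.

```java
import java.util.Map;

// Hypothetical sketch: each destination storage type maps to exactly one transfer service
final class DestinationRoutingSketch {

    // Default used when the dest query parameter is not supplied
    static final String DEFAULT_DESTINATION = "dcache";

    private final Map<String, String> serviceByDestination;

    DestinationRoutingSketch(Map<String, String> serviceByDestination) {
        this.serviceByDestination = Map.copyOf(serviceByDestination);
    }

    // Returns the key of the transfer service that handles the given destination
    String serviceFor(String dest) {
        String key = (dest == null || dest.isBlank()) ? DEFAULT_DESTINATION : dest;
        String service = serviceByDestination.get(key);
        if (service == null) {
            throw new IllegalArgumentException("Unsupported destination: " + key);
        }
        return service;
    }
}
```

Keying the lookup on storage type rather than protocol is what allows two storages that speak the same protocol to be routed to different transfer services.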
Supported transfer destinations
Initially, the EGI Data Transfer is integrated into the EOSC Data Transfer API, supporting the following destination storages:
- dCache
- StoRM
- S3-compatible object storages
- FTP servers
Multiple instances of each supported transfer service can be configured, then you can mix and match what protocol(s) and/or storage type(s) each of them will handle.
Integrating new data transfer services
The API for creating and managing data transfers is extensible. All you have to do is implement the generic data transfer interface to wrap a specific data transfer service, then register your class implementing the interface as the handler for one or more destination storage types.
1. Implement the interface for a generic data transfer service
Implement the Java interface TransferService in a class of your choice.
```java
public interface TransferService {
    boolean initService(TransferServiceConfig config);
    String getServiceName();
    String translateTransferInfoFieldName(String genericFieldName);
    Uni

    // Methods for data transfers
    Uni<TransferInfo> startTransfer(String tsAuth, String storageAuth, Transfer transfer);
    Uni<TransferList> findTransfers(String tsAuth, String fields, int limit,
                                    String timeWindow, String stateIn,
                                    String srcStorageElement, String dstStorageElement,
                                    String delegationId, String voName, String userDN);
    Uni<TransferInfoExtended> getTransferInfo(String tsAuth, String jobId);
    Uni<Response> getTransferInfoField(String tsAuth, String jobId, String fieldName);
    Uni<TransferInfoExtended> cancelTransfer(String tsAuth, String jobId);

    // Methods for storage elements
    Uni<StorageContent> listFolderContent(String tsAuth, String storageAuth, String folderUrl);
    Uni<StorageElement> getStorageElementInfo(String tsAuth, String storageAuth, String seUrl);
    Uni<String> createFolder(String tsAuth, String storageAuth, String folderUrl);
    Uni<String> deleteFolder(String tsAuth, String storageAuth, String folderUrl);
    Uni<String> deleteFile(String tsAuth, String storageAuth, String fileUrl);
    Uni<String> renameStorageElement(String tsAuth, String storageAuth, String seOld, String seNew);
}
```
Your class must have a constructor with no parameters.
The methods can be split into two groups:
- The methods for handling data transfers must be implemented.
- The methods for storage elements should only be implemented for storage types
for which the method canBrowseStorage() returns true.
2. Add configuration for the new data transfer service
Add a new entry in the configuration file under eosc/transfer/service
for the new transfer service, with the following settings:
- name is the human-readable name of this transfer service.
- class is the canonical Java class name that implements the interface TransferService for this transfer service.
- url is the base URL for the REST client that will be used to call the API of this transfer service.
- timeout is the maximum timeout in milliseconds for calls to the transfer service. If not supplied, the default value 5000 (5 seconds) is used.
- trust-store-file is an optional path to a keystore file containing certificates that should be trusted when connecting to the transfer service. Use it when the CA that issued the certificate(s) of the transfer service is not one of the well-known root CAs. The path is relative to the folder src/main/resources.
- trust-store-password is the optional password to the keystore file.
3. Register new destinations serviced by the new data transfer service
Add entries in the configuration file under eosc/transfer/destination
for each destination storage type you want to support, and map it to one of the registered
transfer services.
The configuration of each storage type consists of:
- service is the key of the transfer service that will handle transfers to this storage type.
- description is the human-readable name of this storage type.
- auth is the type of authentication required by the storage system, one of these values:
  - token means the storage uses the same OIDC auth token as the transfer service
  - password means the storage needs a username and a password for authentication
  - keys means the storage needs an access key and a secret key for authentication
- protocol is the scheme to use in URLs pointing to this storage.
- browse signals whether the storage supports browsing (the endpoints to list and manage storage elements are available).
For storage types that are configured with either password or keys as the authentication
type, you will have to supply the HTTP header parameter Authorization-Storage when calling
the API endpoints. See the section Authentication and authorization for details.
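The relationship between the three auth values and the Authorization-Storage header can be summarized in a small enum. This is an illustrative sketch only: the enum, its field, and the fromConfig() helper are hypothetical names and do not come from the project's code.

```java
import java.util.Locale;

// Hypothetical sketch of the three authentication types for destination storages
enum StorageAuthType {
    TOKEN(false),    // storage reuses the OIDC token of the transfer service
    PASSWORD(true),  // storage needs username:password
    KEYS(true);      // storage needs accessKey:secretKey

    private final boolean needsStorageHeader;

    StorageAuthType(boolean needsStorageHeader) {
        this.needsStorageHeader = needsStorageHeader;
    }

    // True when callers must also send the Authorization-Storage header
    boolean needsStorageHeader() {
        return needsStorageHeader;
    }

    // Parses the value of the auth setting (token, password, or keys)
    static StorageAuthType fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

For instance, `StorageAuthType.fromConfig("keys").needsStorageHeader()` evaluates to true, while the same call for token evaluates to false.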
4. Add the new destinations in the enum of possible destinations
In the enum Transfer.Destination add new values for each of the storage types
you added in the previous step. Use the same values as the names of the keys.
This way each entry under the node eosc/transfer/destination in the configuration
file becomes one possible value for the destination storage parameter dest of the
API endpoints.
Managing storage elements
A storage element is where a user's data is stored. It is a generic term meant to hide the complexity of different types of storage technologies. It can mean both an element of the storage system's hierarchy (directory, folder, container, bucket, etc.) and the entity that stores the data (file, object, etc.).
The API supports managing storage elements in a destination storage. Each data transfer
service that gets integrated can optionally implement this functionality. Moreover, data
transfer services that support multiple storage types can selectively implement this
functionality for just a subset of the supported storage types (see the
method TransferService::canBrowseStorage()
above).
Clients can query if this functionality is implemented for a storage type by
using the endpoint GET /storage/info.
This functionality covers:
- listing the content of a storage element
- querying information about a storage element
- renaming a storage element (including its path, which means moving it)
- deleting a storage element
- creating a hierarchical storage element (folder/container/bucket)
Storage elements that store data (files/objects) can only be created by a data transfer, not by the API endpoints in this group.
Configuration
The application configuration file is in src/main/resources/application.yml.
See the section Integrating new DOI parsers for how to configure the data repository parsers used by the API, and the section Integrating new data transfer services for how to extend the API with new transfer services and storage types.
The API automatically generates metrics for the calls to the endpoints. You can also enable
histogram buckets for these metrics, to support quantiles in the telemetry dashboards (e.g.
the 0.95-quantile aka the 95th percentile of a metric). List the quantiles you want to
generate under the setting eosc/qos/quantiles. Similarly, you can also enable buckets for
service level objectives (SLOs) by listing all SLOs for call duration, expressed in milliseconds,
under the setting eosc/qos/slos.
You can review the metrics generated by the API at http://localhost:8081/metrics
The API needs a certificate for an EGI service account to register and configure S3 storage systems with the wrapped EGI Data Transfer service. Such a certificate is included in the file
src/main/resources/fts-keystore.jks, but the password for this certificate is not included. Contact EGI to obtain a service account and a new certificate (together with its password) when deploying this API.
Running the API in dev mode
You can run the application in dev mode, which enables live coding, using:
```shell
./mvnw compile quarkus:dev
```
Then open the Dev UI, which is available in dev mode only, at http://localhost:8081/q/dev/.
Packaging and running the API
The application can be packaged using:
```shell
./mvnw package
```
It produces the quarkus-run.jar file in the target/quarkus-app/ directory.
Be aware that it’s not an über-jar as the dependencies are copied into the
target/quarkus-app/lib/ directory.
The application is now runnable using java -jar target/quarkus-app/quarkus-run.jar.
If you want to build an über-jar, execute the following command:
```shell
./mvnw package -Dquarkus.package.type=uber-jar
```
The application, packaged as an über-jar, is now runnable using java -jar target/*-runner.jar.
Running the API with telemetry using Docker Compose
You can use Docker Compose to easily deploy and run the EOSC Data Transfer API. This will run multiple containers:
- This application that implements the REST API and serves it over HTTP
- SSL terminator - decrypts HTTPS traffic and forwards requests to the API
- OpenTelemetry collector - collects, batch processes, and forwards traces/metrics
- Jaeger - receives traces
- Loki - receives logs
- Prometheus - scrapes metrics
- Grafana - visualization of the telemetry dashboard
The architecture and interaction between these containers is illustrated below:

Steps to run the API in a container:
1. Copy the file src/main/docker/.env.template to src/main/docker/.env, then:
   - Provide the domain name and port where you will deploy the API in the environment variables SERVICE_DOMAIN and SERVICE_PORT, respectively.
   - Provide an email address in the environment variable SERVICE_EMAIL to be used, together with the domain name, to automatically request an SSL certificate for the SSL terminator.
   - In the environment variable FTS_KEY_STORE_FILE provide a path to a Java keystore file containing a new EGI service account certificate, and in the environment variable FTS_KEY_STORE_PASSWORD provide the password for it.
   - In the environment variable TELEMETRY_PORT provide the port on which to publish the Grafana telemetry dashboard. This will be available on the same domain name as the API itself.
2. Run the command build.sh (or build.cmd on Windows) to build and run the containers that implement the EOSC Data Transfer API.
   The SSL terminator will automatically use Let's Encrypt to request an SSL certificate for HTTPS.
3. After the SSL terminator container is deployed and working properly, connect to it and make sure it is requesting an actual HTTPS certificate. By default, it will use a self-signed certificate and will only do dry runs for requesting a certificate to avoid the rate limits of Let's Encrypt. To do this:
   - Run the command sudo docker exec -it data-transfer-ssl /bin/sh
   - In the container, change directory with cd /opt
   - Edit the file request.sh and remove the certbot parameter --dry-run

If you remove the containers of the EOSC Data Transfer API, retain the volume certificates, which contains the SSL certificate. This avoids requesting a new certificate for the same domain when you redeploy the API, and so prevents exceeding the Let's Encrypt rate limit.
Related Guides
- REST server implementation: Writing reactive REST services
- REST client implementation: REST client to easily call APIs
- Configuration reference: Configuration reference guide
- YAML Configuration: Use YAML to configure your application
- Introduction to CDI: Contexts and dependency injection guide
- OpenTelemetry support: Adding observability to your application
- Metrics with Micrometer: Sending API metrics to Prometheus
- Swagger UI: User-friendly UI to document and test your API
- Mutiny Guides: Reactive programming with Mutiny
- Optionals: How to use Optional in Java
Owner
- Name: EGI Federation
- Login: EGI-Federation
- Kind: organization
- Email: contact@egi.eu
- Location: Science Park, Amsterdam, The Netherlands
- Website: https://egi.eu
- Twitter: EGI_eInfra
- Repositories: 137
- Profile: https://github.com/EGI-Federation
Advanced Computing for Research
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Farkas
given-names: Levente
orcid: https://orcid.org/0009-0004-8603-3467
title: "EOSC Data Transfer API"
version: 1.1.69
identifiers:
- type: doi
value: 10.5281/zenodo.7925514
date-released: 2024-11-29
GitHub Events
Total
- Create event: 3
- Issues event: 3
- Release event: 1
- Watch event: 2
- Delete event: 2
- Issue comment event: 1
- Member event: 1
- Push event: 14
- Pull request event: 4
- Fork event: 1
Last Year
- Create event: 3
- Issues event: 3
- Release event: 1
- Watch event: 2
- Delete event: 2
- Issue comment event: 1
- Member event: 1
- Push event: 14
- Pull request event: 4
- Fork event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 2
- Total pull requests: 3
- Average time to close issues: 9 days
- Average time to close pull requests: about 1 month
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 2
- Pull requests: 3
- Average time to close issues: 9 days
- Average time to close pull requests: about 1 month
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
- chbrandt (2)
Pull Request Authors
- dependabot[bot] (2)
- gonimoro (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- io.quarkus.platform:quarkus-bom 2.13.4.Final import
- io.quarkus:quarkus-arc
- io.quarkus:quarkus-config-yaml
- io.quarkus:quarkus-logging-json
- io.quarkus:quarkus-rest-client-reactive-jackson
- io.quarkus:quarkus-resteasy-reactive-jackson
- io.quarkus:quarkus-smallrye-openapi
- io.quarkus:quarkus-vertx
- io.smallrye.reactive:smallrye-mutiny-vertx-web-client
- io.quarkus:quarkus-junit5 test
- io.rest-assured:rest-assured test
- actions/checkout v3 composite
- gaurav-nelson/github-action-markdown-link-check v1 composite
- actions/checkout v3 composite
- docker://ghcr.io/github/super-linter slim-v4 composite
- nginx 1.21-alpine build
- maven 3-eclipse-temurin-17 build
- registry.access.redhat.com/ubi8/openjdk-17 1.11 build