https://github.com/dissco/dissco-nusearch-service
This fork will contain the DiSSCo specific implementation of the name usage search of the col-nusearch-service.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: DiSSCo
- License: apache-2.0
- Language: Java
- Default Branch: main
- Size: 2.13 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
DiSSCo Name Usage Searcher
The DiSSCo implementation of the Name Usage Searcher.
It connects to a RabbitMQ queue where it will consume DigitalSpecimen events.
It will then retrieve the scientific name and classification from the event and query the Name Usage Searcher for the matching records.
If a match is found, it will override the taxonomic information in the DigitalSpecimen event with the information from the Name Usage Searcher.
The original scientificName will be stored in the verbatimIdentification, separated by a pipe (|) if there were multiple taxonIdentifications.
The link to the Catalogue of Life will be added as an entityRelationship to the DigitalSpecimen event.
Finally, the specimenName and the topicDiscipline will be updated according to the taxonomic identification.
The updated event will be sent to a new RabbitMQ topic for further processing.
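The pipe-separated handling of the original names described above can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, not the service's actual code, and the exact delimiter formatting may differ.

```java
import java.util.List;

// Hypothetical sketch of preserving the original identification(s); not the
// service's actual implementation.
public class VerbatimJoiner {

    // Keep the original scientific names in a single verbatimIdentification
    // string, pipe-separated when a specimen had multiple taxonIdentifications.
    public static String toVerbatimIdentification(List<String> originalNames) {
        return String.join("|", originalNames);
    }
}
```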
The recommended approach is to first generate an index and store it on S3 (S3 Indexer profile), and then run the S3 Resolver profile to deploy the API and RabbitMQ consumer.
DiSSCo specific properties
```
RabbitMQ properties
rabbitmq.queue-name=# The name of the RabbitMQ queue from which the nu-search-service will consume
rabbitmq.dlq-exchange-name=# The name of the RabbitMQ dead letter exchange to which the nu-search-service will publish failed messages
rabbitmq.dlq-routing-key-name=# The routing key name for the RabbitMQ dead letter exchange
rabbitmq.exchange-name=# The name of the RabbitMQ exchange to which the nu-search-service will publish processed messages
rabbitmq.routing-key-name=# The routing key name for RabbitMQ to which the nu-search-service will publish processed messages
```
Original README
This is a small application that allows you to match names against taxonomies. It is based on and reuses code developed by the Global Biodiversity Information Facility (GBIF). In addition to the GBIF code, this application uses parts of the code developed by Catalogue of Life. In essence, it combines the search capabilities of the GBIF name parser with the taxonomic data of ChecklistBank. It enables the user to pick any dataset from ChecklistBank and run the GBIF matching API on top of it. This choice was made to reuse as much as possible and avoid conflicting algorithms and data structures. Additional functionality has been added, such as the auto-complete endpoint and the batch endpoint.
Purpose
The intended purpose is to enable infrastructures to run a name matching index within their own infrastructure, without the need to rely on external services or put load on them. This increases the stability, reliability and speed of name matching, which is especially relevant when automatically resolving large numbers of names. This application is designed to run as a service and to be queried by other services. It takes as input a ColDP dataset from ChecklistBank, creates a local index and exposes it through an API. While it is possible to expose the API to the public domain, this is not the intended purpose of this application. For querying the ChecklistBank data directly, we recommend using the ChecklistBank API.
Description
When running with all options available, the application has the following flow:
- Retrieves a ColDP dataset from ChecklistBank based on the datasetKey and stores it locally
- Iterates over the NameUsage.tsv file in the ColDP dataset and loads all records into a map (colId -> NameUsage)
- Iterates over the NameUsage.tsv file a second time, now loading the records into a Lucene index

The first iteration is needed to be able to quickly build the full taxonomic tree of each record. In the second iteration, the application adds all related name usages to the record based on the parentId. It loops over all records until it can no longer find a parentId, indicating that it has reached the root of the tree. All information is then stored in a Lucene index; searches can be done on canonical name and colId.
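The parent walk in the second indexing pass can be sketched like this. The record shape below is a simplification for illustration; the real NameUsage records carry more fields, and the actual tree is stored in a Lucene index rather than returned as a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of following parentId links up to the root of the
// taxonomic tree, using the colId -> NameUsage map built in the first pass.
public class TaxonomicTreeBuilder {

    // Hypothetical, reduced record shape for illustration only.
    record NameUsage(String colId, String name, String parentId) {}

    // Walk from a record to the root; stops when no parentId can be resolved.
    public static List<String> classificationOf(String colId, Map<String, NameUsage> byId) {
        List<String> lineage = new ArrayList<>();
        NameUsage current = byId.get(colId);
        while (current != null) {
            lineage.add(current.name());
            current = current.parentId() == null ? null : byId.get(current.parentId());
        }
        return lineage; // record first, root last
    }
}
```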
This concludes the indexing part of the application. This part can be run separately or be disabled when there is an existing index that can be used, see env variables.
The application will then start a public web server that exposes the index through an API.
The API will allow you to search for a name and return the matching records.
If the request contains a colId, it will return the full taxonomic tree of the record.
This endpoint has both an option for single requests (match), as well as batch requests (batch).
In addition to the name resolution service there is also an auto-complete endpoint (auto-complete).
This does a prefix search on the index and searches for records which start with the provided name.
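The auto-complete semantics described above can be illustrated with a plain-Java prefix filter. Note this is only a conceptual sketch: the real service runs the prefix search as a Lucene query against the index, and the method and class names here are hypothetical.

```java
import java.util.List;
import java.util.Locale;

// Conceptual sketch of auto-complete behaviour: return the canonical names
// that start with the supplied prefix. The actual implementation uses a
// Lucene index, not an in-memory list.
public class AutoComplete {

    public static List<String> suggest(String prefix, List<String> canonicalNames) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return canonicalNames.stream()
                .filter(name -> name.toLowerCase(Locale.ROOT).startsWith(p))
                .toList();
    }
}
```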
Profiles
This application uses Spring profiles as feature toggles. Certain functionality can be enabled or disabled by setting the correct profile.
Standalone
In this mode the application is completely standalone and does not require any external services. It downloads the COL dataset and indexes it locally. After indexing, it exposes the index through an API.
S3 Indexer
The S3 Indexer will run only the downloading and indexing part of the application.
After it has created a lucene index it will upload the index to an S3 bucket (col-indices).
The Lucene files are prefixed with the COL dataset identifier.
S3 Resolver
This S3 Resolver profile will download an existing index from an S3 bucket (col-indices) and expose it through an API.
OpenAPI documentation
The API is documented using OpenAPI.
When the application is running, the OpenAPI documentation can be found at localhost:8080/v3/api-docs.
Swagger UI is included as well, at localhost:8080/swagger-ui/index.html.
Environmental variables
The following backend specific properties can be configured:
```
Indexing properties
The following properties are required for the indexing of the dataset.
indexing.col-dataset=# The identifier of the COL dataset. The identifier can be retrieved from the address bar when viewing a dataset in ChecklistBank. For example, https://www.checklistbank.org/dataset/2014/about is the dataset with identifier 2014.
indexing.col-username=# The username to use for authenticating with ChecklistBank. You can create an account for ChecklistBank at https://www.gbif.org/user/profile
indexing.col-password=# The password to use for authenticating with ChecklistBank. You can create an account for ChecklistBank at https://www.gbif.org/user/profile

The following properties are optional and have a default value.
indexing.index-location=# The location where the index is stored. Default is src/main/resources/index
indexing.temp-coldp-location=# The location where the ColDP dataset is stored. Default is src/main/resources/sample.zip

Col properties
These properties are used when downloading the COL Data Package from ChecklistBank. All properties have a sensible default which can be overwritten.
col.synonyms=# Whether to include synonyms in the download. Default is true
col.extended=# Whether to include extended data in the download. Default is true
col.extinct=# Whether to include extinct species in the download. Default is true
col.export-status-retry-count=# The number of times to retry the export status check. Default is 10
col.export-status-retry-interval=# The interval between retries in milliseconds. Default is 500 ms (0.5 sec)

AWS properties
These properties are required for making the connection to the S3 bucket on AWS.
aws.accessKeyId=# The access key id for the AWS account
aws.secretAccessKey=# The secret access key for the AWS account
```
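The retry behaviour implied by col.export-status-retry-count and col.export-status-retry-interval can be sketched as a simple polling loop. The class name and the BooleanSupplier stand-in for the real ChecklistBank export-status call are assumptions for illustration.

```java
import java.util.function.BooleanSupplier;

// Sketch of polling an export-status check up to retryCount times, waiting
// retryIntervalMs between attempts (defaults per the README: 10 and 500 ms).
public class ExportStatusPoller {

    public static boolean awaitExport(BooleanSupplier statusCheck,
                                      int retryCount, long retryIntervalMs) {
        for (int attempt = 0; attempt < retryCount; attempt++) {
            if (statusCheck.getAsBoolean()) {
                return true; // export finished
            }
            try {
                Thread.sleep(retryIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // stop polling if interrupted
            }
        }
        return false; // gave up after retryCount attempts
    }
}
```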
Install and run
The preferred way to run the application is through container images.
DiSSCo will provide container images for the application through our public image repository.
The application can be run with the following command:
```
docker run -p 8080:8080 public.ecr.aws/dissco/name-usage-searcher:latest
```
Docker-Compose
In addition to running the application through a container image, it is also possible to run the application through docker-compose.
An example docker-compose file has been added to the repository.
The command docker-compose up will start the application and expose it on port 8080.
IDE
The application can be run through a Java Integrated Development Environment (IDE) such as IntelliJ. The environment variables can be set in the application.properties file. Running this way is especially helpful when developing or testing the application.
Other
Other ways of running are possible, as a .jar file is generated in the target folder after running Maven.
Extending the application
This project is meant as a base for further development. It provides general functionality which may be sufficient for some use cases. However, when specific implementations surrounding data models, additional APIs or security are needed, you might need to fork this project. Forking can be done easily through, for example, the GitHub interface. Functionality from which you believe everyone might benefit can be contributed back to the original project through a pull request.
License
This project is licensed under Apache License 2.0 - see the LICENSE file for details. This license is in line with the code license for both the GBIF code and the Catalogue of Life code.
Attribution
Each class which was in part or in whole based on the GBIF code or the Catalogue of Life code contains a reference to the original code at the top of the class. This is to ensure that the original authors are attributed for their work.
Owner
- Name: DiSSCo
- Login: DiSSCo
- Kind: organization
- Email: info@dissco.eu
- Location: Europe
- Website: dissco.eu
- Twitter: disscoeu
- Repositories: 35
- Profile: https://github.com/DiSSCo
Distributed System of Scientific Collections - pan-European Research Infrastructure. Updates on DiSSCo and natural science collections
GitHub Events
Total
- Release event: 3
- Watch event: 1
- Delete event: 4
- Push event: 17
- Public event: 1
- Pull request review comment event: 18
- Pull request review event: 22
- Pull request event: 12
- Create event: 12
Last Year
- Release event: 3
- Watch event: 1
- Delete event: 4
- Push event: 17
- Public event: 1
- Pull request review comment event: 18
- Pull request review event: 22
- Pull request event: 12
- Create event: 12
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- southeo (3)
- samleeflang (3)
Dependencies
- actions/cache v1 composite
- actions/checkout v2 composite
- actions/setup-java v1 composite
- anothrNick/github-tag-action 1.36.0 composite
- docker/build-push-action v3 composite
- docker/login-action v1 composite
- docker/metadata-action v4 composite
- actions/cache/save v4 composite
- eclipse-temurin 21-jdk-alpine build
- eclipse-temurin 21-jre-jammy build
- com.univocity:univocity-parsers 2.9.1
- commons-io:commons-io 2.17.0
- io.swagger.core.v3:swagger-annotations 2.2.25
- org.apache.commons:commons-compress 1.27.1
- org.apache.lucene:lucene-analysis-common 9.9.1
- org.apache.lucene:lucene-core 9.9.1
- org.apache.lucene:lucene-queryparser 9.9.1
- org.gbif:gbif-api 1.17.0-SNAPSHOT
- org.gbif:gbif-common 0.61-SNAPSHOT
- org.gbif:gbif-common-ws 1.28-SNAPSHOT
- org.gbif:gbif-parsers 0.67
- org.gbif:name-parser 3.10.2-SNAPSHOT
- org.gbif:name-parser-v1 3.10.2-SNAPSHOT
- org.jacoco:jacoco-maven-plugin 0.8.11
- org.projectlombok:lombok
- org.springdoc:springdoc-openapi-starter-webmvc-ui 2.6.0
- org.springframework.boot:spring-boot-configuration-processor
- org.springframework.boot:spring-boot-starter
- org.springframework.boot:spring-boot-starter-actuator
- org.springframework.boot:spring-boot-starter-oauth2-resource-server
- org.springframework.boot:spring-boot-starter-security
- org.springframework.boot:spring-boot-starter-validation
- org.springframework.boot:spring-boot-starter-web
- org.springframework.boot:spring-boot-starter-webflux
- org.springframework.kafka:spring-kafka
- software.amazon.awssdk:aws-crt-client 2.29.6
- software.amazon.awssdk:s3 2.29.6
- software.amazon.awssdk:s3-transfer-manager 2.29.6
- com.squareup.okhttp3:mockwebserver 4.12.0 test
- org.junit.jupiter:junit-jupiter-api test
- org.junit.jupiter:junit-jupiter-engine test
- org.mockito:mockito-inline 5.2.0 test
- org.springframework.boot:spring-boot-starter-test test