sparql-llm
Chat system and reusable components to improve LLM capabilities when generating SPARQL queries
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✅ CITATION.cff file: found CITATION.cff file
- ✅ codemeta.json file: found codemeta.json file
- ✅ .zenodo.json file: found .zenodo.json file
- ❌ DOI references
- ✅ Academic publication links: links to arxiv.org
- ❌ Committers with academic emails
- ❌ Institutional organization owner
- ❌ JOSS paper metadata
- ❌ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary
Keywords
Repository
Chat system and reusable components to improve LLM capabilities when generating SPARQL queries
Basic Info
- Host: GitHub
- Owner: sib-swiss
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://chat.expasy.org
- Size: 11.3 MB
Statistics
- Stars: 56
- Watchers: 4
- Forks: 9
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
This project provides tools to enhance the capabilities of Large Language Models (LLMs) in generating SPARQL queries for specific endpoints:
- reusable components in `src/sparql-llm`, published as the `sparql-llm` pip package
- a complete chat web service in `src/expasy-agent`
- an experimental MCP server to generate and execute SPARQL queries on SIB resources in `src/expasy-mcp`
The system integrates Retrieval-Augmented Generation (RAG) and SPARQL query validation against endpoint schemas to ensure more accurate and relevant query generation on large-scale knowledge graphs.
The components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It requires endpoints to include metadata such as SPARQL query examples and endpoint descriptions using the Vocabulary of Interlinked Datasets (VoID), which can be automatically generated using the void-generator.
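The retrieval step of the RAG approach can be illustrated with a minimal sketch in plain Python. This is not the project's implementation (the deployed system embeds endpoint metadata into a vector database); it only shows the idea: rank stored example questions by overlap with the user question and keep the top matches to include in the LLM prompt.

```python
# Sketch of the RAG retrieval step for SPARQL generation (illustrative
# only): rank stored (question, query) examples by token overlap with
# the user question. The real system uses embeddings in a vector store.
def tokenize(text: str) -> set[str]:
    # Lowercase, split on whitespace, strip trailing punctuation,
    # and drop very short tokens.
    return {t.strip("?.,!") for t in text.lower().split() if len(t.strip("?.,!")) > 2}

def retrieve_examples(question: str, examples: list[dict], k: int = 2) -> list[dict]:
    q_tokens = tokenize(question)
    scored = sorted(
        examples,
        key=lambda ex: len(q_tokens & tokenize(ex["question"])),
        reverse=True,
    )
    return scored[:k]

examples = [
    {"question": "List all proteins from human", "query": "SELECT ?p WHERE { ... }"},
    {"question": "Count genes per species", "query": "SELECT ... "},
    {"question": "Find enzymes catalyzing a reaction", "query": "SELECT ..."},
]
top = retrieve_examples("Which proteins are found in human?", examples, k=1)
```

The retrieved example queries are then injected into the prompt so the LLM can imitate known-good queries for the target endpoint.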
Features
- Metadata Extraction: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with LangChain but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.
- SPARQL Query Validation: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.
- Deployable Chat System: A reusable and containerized system for deploying an LLM-based chat service with a web UI, API, and vector database. This system helps users write SPARQL queries by leveraging endpoint metadata (WIP).
- Live Example: Configuration for chat.expasy.org, an LLM-powered chat system supporting SPARQL query generation for endpoints maintained by the SIB.
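The validation idea can be sketched in plain Python. This is deliberately naive (a regex instead of a proper SPARQL parser, and a hand-written predicate set instead of a real VoID description); the actual component parses the query and checks it against the VoID descriptions of the target endpoints.

```python
import re

# Naive sketch of schema-based query validation (illustrative only):
# extract full IRIs from a generated query and flag any that the
# endpoint's VoID-derived schema does not declare.
def check_predicates(query: str, known_predicates: set[str]) -> list[str]:
    used = re.findall(r"<([^>]+)>", query)
    return [iri for iri in used if iri not in known_predicates]

# Hypothetical schema extracted from a VoID description
void_predicates = {
    "http://purl.uniprot.org/core/organism",
    "http://purl.uniprot.org/core/mnemonic",
}
query = """SELECT ?protein WHERE {
  ?protein <http://purl.uniprot.org/core/organism> ?taxon .
  ?protein <http://purl.uniprot.org/core/recommendedName> ?name .
}"""
unknown = check_predicates(query, void_predicates)
# unknown lists the IRI(s) missing from the schema
```

When validation fails, the offending IRIs can be fed back to the LLM to repair the query before execution.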
[!TIP]
You can quickly check if an endpoint contains the expected metadata at sib-swiss.github.io/sparql-editor/check
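Such a check can also be done programmatically against any SPARQL endpoint. The sketch below builds a SPARQL 1.1 Protocol GET request asking whether the endpoint serves any VoID statistics (here via the standard `void:triples` property); the endpoint URL is just an example, and actually sending the request is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build a SPARQL 1.1 Protocol GET request that asks whether an endpoint
# publishes VoID statistics. Sending the request (urlopen) is left out.
def build_ask_request(endpoint: str) -> Request:
    ask = "ASK { ?dataset <http://rdfs.org/ns/void#triples> ?count }"
    url = f"{endpoint}?{urlencode({'query': ask})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_ask_request("https://sparql.uniprot.org/sparql")
```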
Reusable components
Check out the src/sparql-llm/README.md for more details on how to use the reusable components.
Tutorial
There is a step-by-step tutorial showing how an LLM-based chat system for generating SPARQL queries can be built: https://sib-swiss.github.io/sparql-llm
Complete chat system
[!WARNING]
To deploy the complete chat system you currently need to fork/clone this repository, change the configuration in `src/expasy-agent/src/expasy_agent/config.py` and `compose.yml`, then deploy with docker/podman compose. It can easily be adapted to use any LLM served through an OpenAI-compatible API. We plan to make configuration and deployment of a complete SPARQL LLM chat system easier in the future; let us know in the GitHub issues if you are interested!
Requirements: Docker, Node.js (to build the frontend), and optionally uv if you want to run scripts outside of Docker.
- Explore and change the system configuration in `src/expasy-agent/src/expasy_agent/config.py`
- Create a `.env` file at the root of the repository to provide secrets and API keys:
```sh
CHAT_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=SECRET_PASSWORD_TO_EASILY_ACCESS_LOGS_THROUGH_THE_API

OPENAI_API_KEY=sk-proj-YYY
GROQ_API_KEY=gsk_YYY
HUGGINGFACEHUB_API_TOKEN=
TOGETHER_API_KEY=
AZURE_INFERENCE_CREDENTIAL=
AZURE_INFERENCE_ENDPOINT=https://project-id.services.ai.azure.com/models

LANGFUSE_HOST=https://cloud.langfuse.com
LANGFUSE_PUBLIC_KEY=
LANGFUSE_SECRET_KEY=
```
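Docker compose and `uv run --env-file .env` load this file for you. If you run a script directly and need the variables in the environment, a minimal stdlib parser might look like the sketch below (it only handles simple `KEY=VALUE` lines; use something like python-dotenv for quoting, exports, or multiline values).

```python
import os

# Minimal .env parser (sketch): skips blank lines and comments,
# splits on the first "=" only, and strips surrounding whitespace.
def load_env(text: str) -> dict[str, str]:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

env = load_env("OPENAI_API_KEY=sk-proj-YYY\n# comment\nLANGFUSE_HOST=https://cloud.langfuse.com")
os.environ.update(env)
```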
- Optionally, if you made changes to it, build the chat UI webpage:
```sh
cd chat-with-context
npm i
npm run build:demo
cd ..
```
You can change the UI around the chat in `chat-with-context/demo/index.html`
- Start the vector database and web server locally for development, with code from the `src` folder mounted in the container and automatic API reload on code changes:

```bash
docker compose -f compose.dev.yml up
```
- Chat web UI available at http://localhost:8000
- OpenAPI Swagger UI available at http://localhost:8000/docs
- Vector database dashboard UI available at http://localhost:6333/dashboard
In production, you will need to make some changes to the `compose.yml` file to adapt it to your server/proxy setup:

```bash
docker compose up
```
All data from the containers is stored persistently in the `data` folder (e.g. vector database indexes).
- When the stack is up, run the script to index the SPARQL endpoints from within the container (this only needs to be done once):

```sh
docker compose exec api uv run src/expasy_agent/indexing/index_resources.py
```
[!WARNING]
Experimental entity indexing: generating embeddings for millions of entities can take a long time, so we recommend running the embedding script on a machine with a GPU (it does not need to be a powerful one; check out the fastembed GPU docs to install the GPU drivers and dependencies).

```sh
docker compose -f compose.dev.yml up vectordb -d
cd src/expasy-agent
VECTORDB_URL=http://localhost:6334 nohup uv run --extra gpu src/expasy_agent/indexing/index_entities.py --gpu &
```

Then move the entities collection containing the embeddings into `data/qdrant/collections/entities` before starting the stack.
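Embedding millions of entities is normally done in fixed-size batches so the embedding model (and GPU memory) only ever sees a bounded chunk at a time. A generic batching sketch, independent of the actual indexing script's internals:

```python
from itertools import islice

# Generic batching helper (sketch): yield fixed-size chunks of an
# iterable so each embedding call processes a bounded batch.
def batched(items, size: int):
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

entities = [f"entity_{i}" for i in range(10)]
batches = list(batched(entities, 4))
# each batch would be passed to the embedding model in turn
```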
There is a benchmarking script for the system that runs a list of questions and compares the results to reference SPARQL queries, with and without query validation, across a list of LLM providers. You will need to change the list of queries to use it with different endpoints, and to start the stack in development mode to run it:
```sh
uv run --env-file .env src/expasy-agent/tests/benchmark.py
```
It takes time to run and will log the output and results in `data/benchmarks`
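At its core, such a benchmark compares the result bindings of a generated query to those of the reference query. A minimal order-insensitive comparison sketch (the actual benchmark script may score differently, e.g. with partial credit):

```python
# Sketch of result comparison for benchmarking: treat each query's
# result rows as a set of frozen (variable, value) bindings so that
# row order does not affect the comparison.
def same_results(rows_a: list[dict], rows_b: list[dict]) -> bool:
    to_set = lambda rows: {frozenset(r.items()) for r in rows}
    return to_set(rows_a) == to_set(rows_b)

reference = [{"protein": "P12345"}, {"protein": "P67890"}]
generated = [{"protein": "P67890"}, {"protein": "P12345"}]
ok = same_results(reference, generated)
```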
Follow these instructions to run the Text2SPARQL Benchmark.
How to cite this work
If you reuse any part of this work, please cite the arXiv paper:
```bibtex
@misc{emonet2024llmbasedsparqlquerygeneration,
  title={LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs},
  author={Vincent Emonet and Jerven Bolleman and Severine Duvaud and Tarcisio Mendes de Farias and Ana Claudia Sima},
  year={2024},
  eprint={2410.06062},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2410.06062},
}
```
Owner
- Name: SIB Swiss Institute of Bioinformatics
- Login: sib-swiss
- Kind: organization
- Location: Switzerland
- Website: https://www.sib.swiss
- Repositories: 102
- Profile: https://github.com/sib-swiss
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs"
repository-code: https://github.com/sib-swiss/sparql-llm
date-released: 2024-10-08
doi: 10.48550/arXiv.2410.06062
license: MIT
authors:
- given-names: Vincent
family-names: Emonet
orcid: https://orcid.org/0000-0002-1501-1082
affiliation: SIB Swiss Institute of Bioinformatics
- given-names: Jerven
family-names: Bolleman
orcid: https://orcid.org/0000-0002-7449-1266
email: Jerven.Bolleman@sib.swiss
affiliation: SIB Swiss Institute of Bioinformatics
- given-names: Severine
family-names: Duvaud
orcid: https://orcid.org/0000-0001-7892-9678
affiliation: SIB Swiss Institute of Bioinformatics
- given-names: Tarcisio
family-names: Mendes de Farias
orcid: https://orcid.org/0000-0002-3175-5372
affiliation: SIB Swiss Institute of Bioinformatics
- given-names: Ana Claudia
family-names: Sima
orcid: https://orcid.org/0000-0003-3213-4495
affiliation: SIB Swiss Institute of Bioinformatics
GitHub Events
Total
- Issues event: 4
- Watch event: 39
- Member event: 1
- Issue comment event: 4
- Push event: 110
- Pull request event: 1
- Fork event: 8
- Create event: 5
Last Year
- Issues event: 4
- Watch event: 39
- Member event: 1
- Issue comment event: 4
- Push event: 110
- Pull request event: 1
- Fork event: 8
- Create event: 5
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Vincent Emonet | v****t@g****m | 228 |
| tarcisiotmf | t****f@g****m | 2 |
| tarcisio_adm | t****m@v****l | 2 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 1
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 1.33
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 1
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 1.33
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- adeslatt (2)
- jjkoehorst (1)
Pull Request Authors
- psmeros (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads:
  - pypi: 151 last month
  - npm: 12 last month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 24
- Total maintainers: 2
npmjs.org: @sib-swiss/chat-with-context
A web component to easily deploy a chat with context.
- Homepage: https://github.com/sib-swiss/sparql-llm#readme
- License: MIT
- Latest release: 0.0.17 (published 8 months ago)
Rankings
Maintainers (2)
pypi.org: sparql-llm
Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.
- Homepage: https://github.com/sib-swiss/sparql-llm
- Documentation: https://github.com/sib-swiss/sparql-llm
- License: MIT
- Latest release: 0.0.8 (published about 1 year ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- docker.io/tiangolo/uvicorn-gunicorn-fastapi python3.11 build