sparql-llm

🦜✨ Chat system and reusable components to improve LLM capabilities when generating SPARQL queries

https://github.com/sib-swiss/sparql-llm

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ✓
    Academic publication links
    Links to: arxiv.org
  • ○
    Committers with academic emails
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

expasy llm sparql sparql-query-builder
Last synced: 6 months ago

Repository

🦜✨ Chat system and reusable components to improve LLM capabilities when generating SPARQL queries

Basic Info
  • Host: GitHub
  • Owner: sib-swiss
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://chat.expasy.org
  • Size: 11.3 MB
Statistics
  • Stars: 56
  • Watchers: 4
  • Forks: 9
  • Open Issues: 2
  • Releases: 0
Topics
expasy llm sparql sparql-query-builder
Created almost 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

# ✨ SPARQL query generation with LLMs 🦜 [![PyPI - Version](https://img.shields.io/pypi/v/sparql-llm.svg?logo=pypi&label=PyPI&logoColor=silver)](https://pypi.org/project/sparql-llm/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sparql-llm.svg?logo=python&label=Python&logoColor=silver)](https://pypi.org/project/sparql-llm/) [![Tests](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml/badge.svg)](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml)

This project provides tools to enhance the capabilities of Large Language Models (LLMs) in generating SPARQL queries for specific endpoints:

  • reusable components in src/sparql-llm and published as the sparql-llm pip package
  • a complete chat web service in src/expasy-agent
  • an experimental MCP server to generate and execute SPARQL queries on SIB resources in src/expasy-mcp

The system integrates Retrieval-Augmented Generation (RAG) and SPARQL query validation through endpoint schemas, to ensure more accurate and relevant query generation on large scale knowledge graphs.

The components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It requires endpoints to include metadata such as SPARQL query examples and endpoint descriptions using the Vocabulary of Interlinked Datasets (VoID), which can be automatically generated using the void-generator.
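As a minimal illustration of the kind of metadata the system relies on, the sketch below fetches the SPARQL query examples an endpoint publishes (SIB endpoints store them as shapes with `sh:select` and `rdfs:comment`). The query shape and endpoint URL in the usage comment are assumptions for illustration, not the package's own API:

```python
import json
import urllib.parse
import urllib.request

# Query assuming examples follow the sh:select convention used by SIB endpoints
EXAMPLES_QUERY = """
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?comment ?query WHERE {
    ?example sh:select ?query ;
             rdfs:comment ?comment .
}
"""

def fetch_examples(endpoint_url: str) -> list[dict]:
    """Run the examples query against a SPARQL endpoint, returning JSON bindings."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": EXAMPLES_QUERY})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# e.g. fetch_examples("https://sparql.uniprot.org/sparql")
```

These question/query pairs are what the RAG pipeline indexes into the vector store.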

🌈 Features

  • Metadata Extraction: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with LangChain but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.
  • SPARQL Query Validation: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.
  • Deployable Chat System: A reusable and containerized system for deploying an LLM-based chat service with a web UI, API, and vector database. This system helps users write SPARQL queries by leveraging endpoint metadata (WIP).
  • Live Example: Configuration for chat.expasy.org, an LLM-powered chat system supporting SPARQL query generation for endpoints maintained by the SIB.
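To give an idea of what schema-based validation does, here is a deliberately simplified, stdlib-only sketch: a plain set of known predicates stands in for the endpoint's VoID description, and a crude regex stands in for a real SPARQL parser. It is not the package's implementation:

```python
import re

def extract_prefixed_predicates(query: str) -> set[str]:
    """Crude extraction of prefixed names used in predicate position."""
    # Matches "?var prefix:name" triple patterns (illustrative, not a full parser)
    return set(re.findall(r"\?\w+\s+(\w+:\w+)", query))

def validate_query(query: str, known_predicates: set[str]) -> list[str]:
    """Return predicates used in the query that the schema does not know about."""
    return sorted(extract_prefixed_predicates(query) - known_predicates)

# Hypothetical schema and query for illustration
schema = {"up:mnemonic", "up:organism"}
query = "SELECT ?p WHERE { ?p up:mnemonic ?m . ?p up:taxon ?t . }"
print(validate_query(query, schema))  # ['up:taxon']
```

In the real system, unknown predicates detected this way are fed back to the LLM so it can repair its generated query.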

> [!TIP]
> You can quickly check if an endpoint contains the expected metadata at sib-swiss.github.io/sparql-editor/check

ðŸ“Ķïļ Reusable components

Check out src/sparql-llm/README.md for more details on how to use the reusable components.

🧑‍ðŸŦ Tutorial

There is a step-by-step tutorial showing how an LLM-based chat system for generating SPARQL queries can be built: https://sib-swiss.github.io/sparql-llm
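The core retrieval step the tutorial builds up to can be sketched in a few lines. The real system embeds question/query pairs into a Qdrant vector database; in this stdlib-only toy, plain string similarity stands in for embedding similarity, and the indexed examples are made up for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical indexed examples (natural-language question -> reference SPARQL)
examples = {
    "List all proteins of a given organism":
        "SELECT ?protein WHERE { ?protein up:organism ?org }",
    "Count the genes on human chromosome 21":
        "SELECT (COUNT(?gene) AS ?n) WHERE { ?gene obo:located_in chr:21 }",
}

def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Return the k indexed examples most similar to the user question."""
    ranked = sorted(
        examples.items(),
        key=lambda it: SequenceMatcher(None, question.lower(), it[0].lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]

# The retrieved examples are then injected into the LLM prompt as context
best_question, best_sparql = retrieve("Which proteins belong to organism 9606?")[0]
```

Swapping `SequenceMatcher` for an embedding model and a vector store gives the production setup described in the tutorial.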

🚀 Complete chat system

> [!WARNING]
> To deploy the complete chat system right now you will need to fork/clone this repository, change the configuration in `src/expasy-agent/src/expasy_agent/config.py` and `compose.yml`, then deploy with docker/podman compose.

It can easily be adapted to use any LLM served through an OpenAI-compatible API. We plan to make configuring and deploying a complete SPARQL LLM chat system easier in the future; let us know in the GitHub issues if you are interested!

Requirements: Docker, Node.js (to build the frontend), and optionally uv if you want to run scripts outside of Docker.

  1. Explore and change the system configuration in src/expasy-agent/src/expasy_agent/config.py

  2. Create a .env file at the root of the repository to provide secrets and API keys:

```sh
CHAT_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=SECRET_PASSWORD_TO_EASILY_ACCESS_LOGS_THROUGH_THE_API

OPENAI_API_KEY=sk-proj-YYY
GROQ_API_KEY=gsk_YYY
HUGGINGFACEHUB_API_TOKEN=
TOGETHER_API_KEY=
AZURE_INFERENCE_CREDENTIAL=
AZURE_INFERENCE_ENDPOINT=https://project-id.services.ai.azure.com/models

LANGFUSE_HOST=https://cloud.langfuse.com
LANGFUSE_PUBLIC_KEY=
LANGFUSE_SECRET_KEY=
```

  3. Optionally, if you made changes to it, build the chat UI webpage:

```sh
cd chat-with-context
npm i
npm run build:demo
cd ..
```

You can change the UI around the chat in chat-with-context/demo/index.html

  4. Start the vector database and web server locally for development, with code from the src folder mounted in the container and automatic API reload on changes to the code:

```bash
docker compose -f compose.dev.yml up
```

  • Chat web UI available at http://localhost:8000
  • OpenAPI Swagger UI available at http://localhost:8000/docs
  • Vector database dashboard UI available at http://localhost:6333/dashboard

In production, you will need to make some changes to the compose.yml file to adapt it to your server/proxy setup:

```bash
docker compose up
```

All data from the containers is stored persistently in the data folder (e.g. vector database indexes).

  5. When the stack is up, run the script to index the SPARQL endpoints from within the container (you only need to do this once):

```sh
docker compose exec api uv run src/expasy_agent/indexing/index_resources.py
```

> [!WARNING]
> Experimental entity indexing: generating embeddings for millions of entities can take a long time, so we recommend running the embedding script on a machine with a GPU (it does not need to be a powerful one; check the fastembed GPU docs to install the GPU drivers and dependencies).

```sh
docker compose -f compose.dev.yml up vectordb -d
cd src/expasy-agent
VECTORDB_URL=http://localhost:6334 nohup uv run --extra gpu src/expasy_agent/indexing/index_entities.py --gpu &
```

Then move the entities collection containing the embeddings into data/qdrant/collections/entities before starting the stack.

There is a benchmarking script for the system that runs a list of questions and compares the results to reference SPARQL queries, with and without query validation, across a list of LLM providers. You will need to change the list of queries if you want to use it for different endpoints, and to start the stack in development mode to run it:

```sh
uv run --env-file .env src/expasy-agent/tests/benchmark.py
```

It takes a while to run and logs its output and results in data/benchmarks.

Follow these instructions to run the Text2SPARQL Benchmark.

🪶 How to cite this work

If you reuse any part of this work, please cite the arXiv paper:

```bibtex
@misc{emonet2024llmbasedsparqlquerygeneration,
  title={LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs},
  author={Vincent Emonet and Jerven Bolleman and Severine Duvaud and Tarcisio Mendes de Farias and Ana Claudia Sima},
  year={2024},
  eprint={2410.06062},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2410.06062},
}
```

Owner

  • Name: SIB Swiss Institute of Bioinformatics
  • Login: sib-swiss
  • Kind: organization
  • Location: Switzerland

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs"
repository-code: https://github.com/sib-swiss/sparql-llm
date-released: 2024-10-08
doi: 10.48550/arXiv.2410.06062
license: MIT
authors:
  - given-names: Vincent
    family-names: Emonet
    orcid: https://orcid.org/0000-0002-1501-1082
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Jerven
    family-names: Bolleman
    orcid: https://orcid.org/0000-0002-7449-1266
    email: Jerven.Bolleman@sib.swiss
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Severine
    family-names: Duvaud
    orcid: https://orcid.org/0000-0001-7892-9678
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Tarcisio
    family-names: Mendes de Farias
    orcid: https://orcid.org/0000-0002-3175-5372
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Ana Claudia
    family-names: Sima
    orcid: https://orcid.org/0000-0003-3213-4495
    affiliation: SIB Swiss Institute of Bioinformatics

GitHub Events

Total
  • Issues event: 4
  • Watch event: 39
  • Member event: 1
  • Issue comment event: 4
  • Push event: 110
  • Pull request event: 1
  • Fork event: 8
  • Create event: 5
Last Year
  • Issues event: 4
  • Watch event: 39
  • Member event: 1
  • Issue comment event: 4
  • Push event: 110
  • Pull request event: 1
  • Fork event: 8
  • Create event: 5

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 232
  • Total Committers: 3
  • Avg Commits per committer: 77.333
  • Development Distribution Score (DDS): 0.017
Past Year
  • Commits: 157
  • Committers: 3
  • Avg Commits per committer: 52.333
  • Development Distribution Score (DDS): 0.025
Top Committers
Name Email Commits
Vincent Emonet v****t@g****m 228
tarcisiotmf t****f@g****m 2
tarcisio_adm t****m@v****l 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 1
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • adeslatt (2)
  • jjkoehorst (1)
Pull Request Authors
  • psmeros (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 151 last-month
    • npm 12 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 24
  • Total maintainers: 2
npmjs.org: @sib-swiss/chat-with-context

A web component to easily deploy a chat with context.

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 12 Last month
Rankings
Dependent repos count: 25.4%
Average: 31.0%
Dependent packages count: 36.7%
Maintainers (2)
Last synced: 6 months ago
pypi.org: sparql-llm

Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.

  • Homepage: https://github.com/sib-swiss/sparql-llm
  • Documentation: https://github.com/sib-swiss/sparql-llm
  • License: MIT License Copyright (c) 2024-present SIB Swiss Institute of Bioinformatics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 0.0.8
    published about 1 year ago
  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 151 Last month
Rankings
Dependent packages count: 10.3%
Average: 34.2%
Dependent repos count: 58.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
Dockerfile docker
  • docker.io/tiangolo/uvicorn-gunicorn-fastapi python3.11 build
pyproject.toml pypi