sparql-llm

🦜✨ Chat system and reusable components to improve LLM capabilities when generating SPARQL queries

https://github.com/sib-swiss/sparql-llm

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ✓
    CITATION.cff file
    Found CITATION.cff file
  • ✓
    codemeta.json file
    Found codemeta.json file
  • ✓
    .zenodo.json file
    Found .zenodo.json file
  • ○
    DOI references
  • ✓
    Academic publication links
    Links to: arxiv.org
  • ○
    Committers with academic emails
  • ○
    Institutional organization owner
  • ○
    JOSS paper metadata
  • ○
    Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

expasy llm sparql sparql-query-builder
Last synced: 6 months ago

Repository

🦜✨ Chat system and reusable components to improve LLM capabilities when generating SPARQL queries

Basic Info
  • Host: GitHub
  • Owner: sib-swiss
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://chat.expasy.org
  • Size: 11.3 MB
Statistics
  • Stars: 56
  • Watchers: 4
  • Forks: 9
  • Open Issues: 2
  • Releases: 0
Topics
expasy llm sparql sparql-query-builder
Created almost 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

# ✨ SPARQL query generation with LLMs 🦜 [![PyPI - Version](https://img.shields.io/pypi/v/sparql-llm.svg?logo=pypi&label=PyPI&logoColor=silver)](https://pypi.org/project/sparql-llm/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sparql-llm.svg?logo=python&label=Python&logoColor=silver)](https://pypi.org/project/sparql-llm/) [![Tests](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml/badge.svg)](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml)

This project provides tools to enhance the capabilities of Large Language Models (LLMs) in generating SPARQL queries for specific endpoints:

  • reusable components in src/sparql-llm and published as the sparql-llm pip package
  • a complete chat web service in src/expasy-agent
  • an experimental MCP server to generate and execute SPARQL queries on SIB resources in src/expasy-mcp

The system integrates Retrieval-Augmented Generation (RAG) and SPARQL query validation through endpoint schemas, to ensure more accurate and relevant query generation on large scale knowledge graphs.

The components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It requires endpoints to include metadata such as SPARQL query examples and endpoint descriptions using the Vocabulary of Interlinked Datasets (VoID), which can be automatically generated using the void-generator.
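As a minimal illustration of the kind of metadata the system relies on, the sketch below fetches the SPARQL query examples an endpoint publishes (SIB endpoints store them as shapes with `sh:select` and `rdfs:comment`). The query shape and endpoint URL in the usage comment are assumptions for illustration, not the package's own API:

```python
import json
import urllib.parse
import urllib.request

# Query assuming examples follow the sh:select convention used by SIB endpoints
EXAMPLES_QUERY = """
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?comment ?query WHERE {
    ?example sh:select ?query ;
             rdfs:comment ?comment .
}
"""

def fetch_examples(endpoint_url: str) -> list[dict]:
    """Run the examples query against a SPARQL endpoint, returning JSON bindings."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": EXAMPLES_QUERY})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# e.g. fetch_examples("https://sparql.uniprot.org/sparql")
```

These question/query pairs are what the RAG pipeline indexes into the vector store.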

🌈 Features

  • Metadata Extraction: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with LangChain but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.
  • SPARQL Query Validation: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.
  • Deployable Chat System: A reusable and containerized system for deploying an LLM-based chat service with a web UI, API, and vector database. This system helps users write SPARQL queries by leveraging endpoint metadata (WIP).
  • Live Example: Configuration for chat.expasy.org, an LLM-powered chat system supporting SPARQL query generation for endpoints maintained by the SIB.
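To give an idea of what schema-based validation does, here is a deliberately simplified, stdlib-only sketch: a plain set of known predicates stands in for the endpoint's VoID description, and a crude regex stands in for a real SPARQL parser. It is not the package's implementation:

```python
import re

def extract_prefixed_predicates(query: str) -> set[str]:
    """Crude extraction of prefixed names used in predicate position."""
    # Matches "?var prefix:name" triple patterns (illustrative, not a full parser)
    return set(re.findall(r"\?\w+\s+(\w+:\w+)", query))

def validate_query(query: str, known_predicates: set[str]) -> list[str]:
    """Return predicates used in the query that the schema does not know about."""
    return sorted(extract_prefixed_predicates(query) - known_predicates)

# Hypothetical schema and query for illustration
schema = {"up:mnemonic", "up:organism"}
query = "SELECT ?p WHERE { ?p up:mnemonic ?m . ?p up:taxon ?t . }"
print(validate_query(query, schema))  # ['up:taxon']
```

In the real system, unknown predicates detected this way are fed back to the LLM so it can repair its generated query.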

> [!TIP]
> You can quickly check if an endpoint contains the expected metadata at sib-swiss.github.io/sparql-editor/check

ðŸ“Ķïļ Reusable components

Check out src/sparql-llm/README.md for more details on how to use the reusable components.

🧑‍ðŸŦ Tutorial

There is a step-by-step tutorial showing how an LLM-based chat system for generating SPARQL queries can be built: https://sib-swiss.github.io/sparql-llm
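The core retrieval step the tutorial builds up to can be sketched in a few lines. The real system embeds question/query pairs into a Qdrant vector database; in this stdlib-only toy, plain string similarity stands in for embedding similarity, and the indexed examples are made up for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical indexed examples (natural-language question -> reference SPARQL)
examples = {
    "List all proteins of a given organism":
        "SELECT ?protein WHERE { ?protein up:organism ?org }",
    "Count the genes on human chromosome 21":
        "SELECT (COUNT(?gene) AS ?n) WHERE { ?gene obo:located_in chr:21 }",
}

def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Return the k indexed examples most similar to the user question."""
    ranked = sorted(
        examples.items(),
        key=lambda it: SequenceMatcher(None, question.lower(), it[0].lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]

# The retrieved examples are then injected into the LLM prompt as context
best_question, best_sparql = retrieve("Which proteins belong to organism 9606?")[0]
```

Swapping `SequenceMatcher` for an embedding model and a vector store gives the production setup described in the tutorial.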

🚀 Complete chat system

> [!WARNING]
> To deploy the complete chat system right now you will need to fork/clone this repository, change the configuration in `src/expasy-agent/src/expasy_agent/config.py` and `compose.yml`, then deploy with docker/podman compose.

It can easily be adapted to use any LLM served through an OpenAI-compatible API. We plan to make configuring and deploying a complete SPARQL LLM chat system easier in the future; let us know in the GitHub issues if you are interested!

Requirements: Docker, Node.js (to build the frontend), and optionally uv if you want to run scripts outside of Docker.

  1. Explore and change the system configuration in src/expasy-agent/src/expasy_agent/config.py

  2. Create a .env file at the root of the repository to provide secrets and API keys:

```sh
CHAT_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=SECRET_PASSWORD_TO_EASILY_ACCESS_LOGS_THROUGH_THE_API

OPENAI_API_KEY=sk-proj-YYY
GROQ_API_KEY=gsk_YYY
HUGGINGFACEHUB_API_TOKEN=
TOGETHER_API_KEY=
AZURE_INFERENCE_CREDENTIAL=
AZURE_INFERENCE_ENDPOINT=https://project-id.services.ai.azure.com/models

LANGFUSE_HOST=https://cloud.langfuse.com
LANGFUSE_PUBLIC_KEY=
LANGFUSE_SECRET_KEY=
```

  3. Optionally, if you made changes to it, build the chat UI webpage:

```sh
cd chat-with-context
npm i
npm run build:demo
cd ..
```

You can change the UI around the chat in chat-with-context/demo/index.html

  4. Start the vector database and web server locally for development, with code from the src folder mounted in the container and automatic API reload on changes to the code:

```bash
docker compose -f compose.dev.yml up
```

  • Chat web UI available at http://localhost:8000
  • OpenAPI Swagger UI available at http://localhost:8000/docs
  • Vector database dashboard UI available at http://localhost:6333/dashboard

In production, you will need to make some changes to the compose.yml file to adapt it to your server/proxy setup:

```bash
docker compose up
```

All data from the containers is stored persistently in the data folder (e.g. vector database indexes).

  5. When the stack is up, run the script to index the SPARQL endpoints from within the container (you only need to do this once):

```sh
docker compose exec api uv run src/expasy_agent/indexing/index_resources.py
```

> [!WARNING]
> Experimental entity indexing: generating embeddings for millions of entities can take a long time, so we recommend running the embedding script on a machine with a GPU (it does not need to be a powerful one; check the fastembed GPU docs to install the GPU drivers and dependencies).

```sh
docker compose -f compose.dev.yml up vectordb -d
cd src/expasy-agent
VECTORDB_URL=http://localhost:6334 nohup uv run --extra gpu src/expasy_agent/indexing/index_entities.py --gpu &
```

Then move the entities collection containing the embeddings into data/qdrant/collections/entities before starting the stack.

There is a benchmarking script for the system that runs a list of questions and compares the results to reference SPARQL queries, with and without query validation, across a list of LLM providers. You will need to change the list of queries if you want to use it for different endpoints, and to start the stack in development mode to run it:

```sh
uv run --env-file .env src/expasy-agent/tests/benchmark.py
```

It takes a while to run and logs its output and results in data/benchmarks.

Follow these instructions to run the Text2SPARQL Benchmark.

🪶 How to cite this work

If you reuse any part of this work, please cite the arXiv paper:

```bibtex
@misc{emonet2024llmbasedsparqlquerygeneration,
  title={LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs},
  author={Vincent Emonet and Jerven Bolleman and Severine Duvaud and Tarcisio Mendes de Farias and Ana Claudia Sima},
  year={2024},
  eprint={2410.06062},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2410.06062},
}
```

Owner

  • Name: SIB Swiss Institute of Bioinformatics
  • Login: sib-swiss
  • Kind: organization
  • Location: Switzerland

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs"
repository-code: https://github.com/sib-swiss/sparql-llm
date-released: 2024-10-08
doi: 10.48550/arXiv.2410.06062
license: MIT
authors:
  - given-names: Vincent
    family-names: Emonet
    orcid: https://orcid.org/0000-0002-1501-1082
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Jerven
    family-names: Bolleman
    orcid: https://orcid.org/0000-0002-7449-1266
    email: Jerven.Bolleman@sib.swiss
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Severine
    family-names: Duvaud
    orcid: https://orcid.org/0000-0001-7892-9678
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Tarcisio
    family-names: Mendes de Farias
    orcid: https://orcid.org/0000-0002-3175-5372
    affiliation: SIB Swiss Institute of Bioinformatics
  - given-names: Ana Claudia
    family-names: Sima
    orcid: https://orcid.org/0000-0003-3213-4495
    affiliation: SIB Swiss Institute of Bioinformatics

GitHub Events

Total
  • Issues event: 4
  • Watch event: 39
  • Member event: 1
  • Issue comment event: 4
  • Push event: 110
  • Pull request event: 1
  • Fork event: 8
  • Create event: 5
Last Year
  • Issues event: 4
  • Watch event: 39
  • Member event: 1
  • Issue comment event: 4
  • Push event: 110
  • Pull request event: 1
  • Fork event: 8
  • Create event: 5

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 232
  • Total Committers: 3
  • Avg Commits per committer: 77.333
  • Development Distribution Score (DDS): 0.017
Past Year
  • Commits: 157
  • Committers: 3
  • Avg Commits per committer: 52.333
  • Development Distribution Score (DDS): 0.025
Top Committers
Name Email Commits
Vincent Emonet v****t@g****m 228
tarcisiotmf t****f@g****m 2
tarcisio_adm t****m@v****l 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 1
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 1
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • adeslatt (2)
  • jjkoehorst (1)
Pull Request Authors
  • psmeros (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 151 last-month
    • npm 12 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 24
  • Total maintainers: 2
npmjs.org: @sib-swiss/chat-with-context

A web component to easily deploy a chat with context.

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 12 Last month
Rankings
Dependent repos count: 25.4%
Average: 31.0%
Dependent packages count: 36.7%
Maintainers (2)
Last synced: 6 months ago
pypi.org: sparql-llm

Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.

  • Homepage: https://github.com/sib-swiss/sparql-llm
  • Documentation: https://github.com/sib-swiss/sparql-llm
  • License: MIT License Copyright (c) 2024-present SIB Swiss Institute of Bioinformatics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 0.0.8
    published about 1 year ago
  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 151 Last month
Rankings
Dependent packages count: 10.3%
Average: 34.2%
Dependent repos count: 58.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
Dockerfile docker
  • docker.io/tiangolo/uvicorn-gunicorn-fastapi python3.11 build
pyproject.toml pypi