https://github.com/citiususc/text2shacl
Automatic Extraction of SHACL shapes from Text using LLMs
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Statistics
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
README.md
Automatic Constraint Extraction for Knowledge Graphs Using Large Language Models
This project automatically extracts SHACL constraints from textual guides using Large Language Models (LLMs). Using prompting techniques with language models such as LLaMA or GPT, it aims to generate accurate SHACL shapes for knowledge graphs, which supports ontology validation and helps keep data consistent. The system supports several prompting strategies and provides a multi-agent framework to improve the quality of the extracted constraints.
Project Structure
- chroma/: Stores the vector embeddings generated from the PDF guide fragments (if you have already processed them).
- content/: Contains the source guide in both PDF and HTML format. The final version for use is RINF_Application_guide_V1.6.1.html.
- output/: Contains all generated SHACL constraints in Turtle (.ttl) format. Filenames follow the pattern {filename}_{model}_{temperature}_{prompting_technique}.ttl.
- plots/: Includes the code needed to generate plots and statistics related to the experiments.
- prompts/: Holds all prompt templates used during the project, organized in JSON files by prompting technique.
- validation/: Contains code and data used to validate the generated SHACL constraints.
- auxiliary_ontology_functions.py: Utility functions for ontology processing.
- cloudflare.py: Helper functions for interacting with Cloudflare R2.
- era-shapes.ttl: Gold standard of SHACL shapes that serves as the expected target.
- main.py: The main entry point to run the project.
- multiagent.py: Implements a multi-agent system to generate constraints collaboratively.
- ollama_functions.py: Helper functions to interface with the Ollama server.
- ontology.ttl: Base ontology used as the starting point to generate constraints.
- preprocess_html.py: Script to preprocess the HTML file converted from the PDF. (There is no need to run it again since the final HTML has already been generated.)
- prompts.py: Includes a function to load prompt templates from JSON files.
- rag.py: Implements the RAG (Retrieval-Augmented Generation) technique to build a document retriever.
- requirements.txt: Project dependencies.
- run_experiments.sh: Shell script to execute the entire pipeline, including experiments.
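The naming pattern given for output/ can be illustrated with a small helper. This is only a sketch: the function names `output_path` and `parse_output_name` are hypothetical, and it assumes that {filename} is the input file name without its extension and that the temperature is written as Python prints it. The project's own code may format these fields differently.

```python
from pathlib import Path


def output_path(source, model, temperature, technique, out_dir="output"):
    """Build {out_dir}/{filename}_{model}_{temperature}_{prompting_technique}.ttl."""
    # Assumption: {filename} is the source file name without its last extension.
    stem = Path(source).stem
    return Path(out_dir) / f"{stem}_{model}_{temperature}_{technique}.ttl"


def parse_output_name(path):
    """Split a generated file name back into its four fields."""
    # The guide's own name contains underscores, so split from the right:
    # the last three fields are always model, temperature and technique.
    filename, model, temperature, technique = Path(path).stem.rsplit("_", 3)
    return {
        "filename": filename,
        "model": model,
        "temperature": float(temperature),
        "technique": technique,
    }


if __name__ == "__main__":
    path = output_path("content/RINF_Application_guide_V1.6.1.html", "gpt", 0.5, "few-shot")
    print(path.name)
    print(parse_output_name(path))
```

Splitting from the right works here because none of the documented model or technique names contain an underscore.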
Requirements
To install all dependencies and get the project running:
```bash
pip install -r requirements.txt
```
In addition, the following conditions and configurations are required:
- Running Redis server: An active Redis server is essential for the proper functioning of certain modules in the project.
- .env configuration file: A .env file must be created containing the following environment variables, required for interacting with Cloudflare R2:
  - ACCOUNT_ID: Cloudflare account identifier.
  - R2_ACCESS_KEY: Cloudflare R2 access key.
  - R2_SECRET_KEY: Cloudflare R2 secret key.
  - R2_BUCKET: Name of the R2 bucket, which must be set to public.
  - PUB_URL: Public URL of the R2 bucket.
- Ollama server (for the open-source version): If using the open-source version of the system, an active Ollama server is required, and the llama3:8b model must be downloaded.
- OpenAI API (optional): If you prefer to use the OpenAI API, the API key must be added to your shell environment. This can be done by adding the following line to your ~/.bashrc file:
```bash
export OPENAI_API_KEY=your_openai_api_key_here
```
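The configuration requirements above can be checked before a run with a few lines of Python. The sketch below is not part of the repository: the function name `missing_settings` is hypothetical, and it assumes the values from .env have already been loaded into the process environment (for example with python-dotenv, which appears in the dependency list). It also assumes that OPENAI_API_KEY is only needed when the gpt model is selected.

```python
import os

# Variable names taken from the .env description above.
REQUIRED_R2_VARS = (
    "ACCOUNT_ID",
    "R2_ACCESS_KEY",
    "R2_SECRET_KEY",
    "R2_BUCKET",
    "PUB_URL",
)


def missing_settings(model="llama", environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    required = list(REQUIRED_R2_VARS)
    if model == "gpt":
        # Assumption: the OpenAI key is only required for the gpt model.
        required.append("OPENAI_API_KEY")
    return [name for name in required if not env.get(name)]
```

Calling `missing_settings(model="gpt")` at startup and stopping when the returned list is non-empty gives a clearer error than a failed upload or API call later in the run.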
To validate the constraints against specific RDF data, the file ES.zip_combined-new.nq is used.
This file is not included in the repository for security and privacy reasons.
If you need access to it, please request it by email:
📧 adrian.martinez.balea@rai.usc.es
Scripts Usage
Running the Full Experiment Pipeline
To run the entire experimentation process, use:
```bash
chmod +x run_experiments.sh
./run_experiments.sh
```
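The contents of run_experiments.sh are not reproduced in this README, so the following is only a sketch of what a full sweep could look like, assuming it amounts to one main.py invocation per combination of model, temperature and prompting technique. The function name `build_commands` and the default temperature list are hypothetical. The "all" technique is left out on the assumption that it already covers the individual techniques. The commands are built but not executed.

```python
from itertools import product

# Values documented for main.py; "all" is omitted (see the assumption above).
MODELS = ("llama", "gpt")
TECHNIQUES = ("v1", "basic", "few-shot", "cot", "grounded-citing")


def build_commands(file_path, temperatures=(0.0,), models=MODELS, techniques=TECHNIQUES):
    """Return one argv list per (model, temperature, technique) combination."""
    commands = []
    for model, temperature, technique in product(models, temperatures, techniques):
        commands.append([
            "python3", "main.py", file_path,
            "--model", model,
            "--temperature", str(temperature),
            "--prompting_technique", technique,
        ])
    return commands


if __name__ == "__main__":
    for argv in build_commands("content/RINF_Application_guide_V1.6.1.html"):
        print(" ".join(argv))
```

Each argv list could be passed to `subprocess.run(argv, check=True)` to execute the runs one after another.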
Running a Single Extraction Execution
You can also run a single extraction using the main script with the following command-line arguments:
```bash
python3 main.py <file_path> [options]
```
Arguments
| Argument | Description | Default | Options |
|---|---|---|---|
| file | Path to the text or PDF file to be processed. | N/A | N/A |
| --force_process | Forces reprocessing of the PDF even if it has been processed before. | False | Flag (no value needed) |
| --model | LLM model to use for constraint extraction. | llama | llama, gpt |
| --temperature | Temperature setting for the LLM, controls randomness in generation. | 0 | Any float value |
| --prompting_technique | Prompting technique to use for the LLM query. | basic | v1, basic, few-shot, cot, grounded-citing, all |
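For readers who want to see how the arguments in the table fit together, here is a minimal argparse sketch that mirrors them. It is an illustration based only on the table, not the parser actually used in main.py; the function name `build_parser` is hypothetical.

```python
import argparse

TECHNIQUES = ["v1", "basic", "few-shot", "cot", "grounded-citing", "all"]


def build_parser():
    """Build a parser with the arguments, defaults and options from the table."""
    parser = argparse.ArgumentParser(
        description="Extract SHACL constraints from a guide using an LLM."
    )
    parser.add_argument("file", help="Path to the text or PDF file to be processed.")
    parser.add_argument(
        "--force_process",
        action="store_true",  # flag: False unless given on the command line
        help="Force reprocessing even if the file has been processed before.",
    )
    parser.add_argument("--model", choices=["llama", "gpt"], default="llama")
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--prompting_technique", choices=TECHNIQUES, default="basic")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using `choices` makes argparse reject an unknown model or technique with a usage message before any processing starts.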
Example Usage
```bash
python3 main.py content/RINF_Application_guide_V1.6.1.html --model gpt --temperature 0.5 --prompting_technique few-shot
```
Owner
- Name: CiTIUS
- Login: citiususc
- Kind: organization
- Email: citius@usc.es
- Location: Santiago de Compostela
- Website: https://citius.gal
- Twitter: citiususc
- Repositories: 49
- Profile: https://github.com/citiususc
Centro Singular de Investigación en Tecnoloxías Intelixentes da Universidade de Santiago de Compostela
GitHub Events
Total
- Watch event: 1
- Push event: 1
- Public event: 1
Last Year
- Watch event: 1
- Push event: 1
- Public event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| adrianmbalea | a****a@g****m | 3 |
| David Chaves | d****a@g****m | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- Deprecated ==1.2.18
- Jinja2 ==3.1.6
- Markdown ==3.7
- MarkupSafe ==3.0.2
- PyPika ==0.48.9
- PyYAML ==6.0.2
- Pygments ==2.19.1
- RapidFuzz ==3.12.2
- SQLAlchemy ==2.0.38
- XlsxWriter ==3.2.2
- aiofiles ==24.1.0
- aiohappyeyeballs ==2.5.0
- aiohttp ==3.11.13
- aiosignal ==1.3.2
- annotated-types ==0.7.0
- antlr4-python3-runtime ==4.9.3
- anyio ==4.8.0
- asgiref ==3.8.1
- asttokens ==3.0.0
- attrs ==25.1.0
- backoff ==2.2.1
- bcrypt ==4.3.0
- beautifulsoup4 ==4.13.3
- boto3 ==1.38.36
- botocore ==1.38.36
- build ==1.2.2.post1
- cachetools ==5.5.2
- certifi ==2025.1.31
- cffi ==1.17.1
- chardet ==5.2.0
- charset-normalizer ==3.4.1
- chroma-hnswlib ==0.7.6
- chromadb ==0.6.3
- click ==8.1.8
- coloredlogs ==15.0.1
- comm ==0.2.2
- contourpy ==1.3.1
- cryptography ==44.0.2
- cycler ==0.12.1
- dataclasses-json ==0.6.7
- debugpy ==1.8.14
- decorator ==5.2.1
- distro ==1.9.0
- durationpy ==0.9
- effdet ==0.4.1
- emoji ==2.14.1
- et_xmlfile ==2.0.0
- eval_type_backport ==0.2.2
- executing ==2.2.0
- fastapi ==0.115.12
- filelock ==3.17.0
- filetype ==1.2.0
- flatbuffers ==25.2.10
- fonttools ==4.56.0
- frozenlist ==1.5.0
- fsspec ==2025.3.0
- google-api-core ==2.24.1
- google-auth ==2.38.0
- google-cloud-vision ==3.10.0
- googleapis-common-protos ==1.69.1
- greenlet ==3.1.1
- groq ==0.18.0
- grpcio ==1.71.0rc2
- grpcio-status ==1.71.0rc2
- h11 ==0.14.0
- hf-xet ==1.1.5
- html5lib ==1.1
- html5rdf ==1.2.1
- httpcore ==1.0.7
- httptools ==0.6.4
- httpx ==0.28.1
- httpx-sse ==0.4.0
- huggingface-hub ==0.33.1
- humanfriendly ==10.0
- idna ==3.10
- importlib_metadata ==8.6.1
- importlib_resources ==6.5.2
- ipykernel ==6.29.5
- ipython ==9.2.0
- ipython_pygments_lexers ==1.1.1
- jedi ==0.19.2
- jiter ==0.8.2
- jmespath ==1.0.1
- joblib ==1.4.2
- jsonpatch ==1.33
- jsonpointer ==3.0.0
- jupyter_client ==8.6.3
- jupyter_core ==5.7.2
- kiwisolver ==1.4.8
- kubernetes ==32.0.1
- langchain ==0.3.20
- langchain-chroma ==0.2.2
- langchain-community ==0.3.19
- langchain-core ==0.3.66
- langchain-groq ==0.2.5
- langchain-huggingface ==0.3.0
- langchain-ollama ==0.3.3
- langchain-openai ==0.3.8
- langchain-text-splitters ==0.3.6
- langdetect ==1.0.9
- langgraph ==0.3.31
- langgraph-checkpoint ==2.0.24
- langgraph-prebuilt ==0.1.8
- langgraph-sdk ==0.1.63
- langsmith ==0.4.3
- lxml ==5.3.1
- markdown-it-py ==3.0.0
- marshmallow ==3.26.1
- matplotlib ==3.10.1
- matplotlib-inline ==0.1.7
- mdurl ==0.1.2
- mmh3 ==5.1.0
- monotonic ==1.6
- mpmath ==1.3.0
- multidict ==6.1.0
- mypy-extensions ==1.0.0
- nest-asyncio ==1.6.0
- networkx ==3.4.2
- nltk ==3.9.1
- numpy ==1.26.4
- nvidia-cublas-cu12 ==12.4.5.8
- nvidia-cuda-cupti-cu12 ==12.4.127
- nvidia-cuda-nvrtc-cu12 ==12.4.127
- nvidia-cuda-runtime-cu12 ==12.4.127
- nvidia-cudnn-cu12 ==9.1.0.70
- nvidia-cufft-cu12 ==11.2.1.3
- nvidia-curand-cu12 ==10.3.5.147
- nvidia-cusolver-cu12 ==11.6.1.9
- nvidia-cusparse-cu12 ==12.3.1.170
- nvidia-cusparselt-cu12 ==0.6.2
- nvidia-nccl-cu12 ==2.21.5
- nvidia-nvjitlink-cu12 ==12.4.127
- nvidia-nvtx-cu12 ==12.4.127
- oauthlib ==3.2.2
- olefile ==0.47
- ollama ==0.5.1
- omegaconf ==2.3.0
- onnx ==1.17.0
- onnxruntime ==1.21.0
- openai ==1.65.5
- opencv-python ==4.11.0.86
- openpyxl ==3.1.5
- opentelemetry-api ==1.31.1
- opentelemetry-exporter-otlp-proto-common ==1.31.1
- opentelemetry-exporter-otlp-proto-grpc ==1.31.1
- opentelemetry-instrumentation ==0.52b1
- opentelemetry-instrumentation-asgi ==0.52b1
- opentelemetry-instrumentation-fastapi ==0.52b1
- opentelemetry-proto ==1.31.1
- opentelemetry-sdk ==1.31.1
- opentelemetry-semantic-conventions ==0.52b1
- opentelemetry-util-http ==0.52b1
- orjson ==3.10.15
- ormsgpack ==1.9.1
- overrides ==7.7.0
- owlrl ==7.1.3
- packaging ==24.2
- pandas ==2.2.3
- parso ==0.8.4
- pdf2image ==1.17.0
- pdfminer.six ==20240706
- pexpect ==4.9.0
- pi_heif ==0.21.0
- pikepdf ==9.5.2
- pillow ==11.1.0
- platformdirs ==4.3.7
- posthog ==3.21.0
- prettytable ==3.16.0
- prompt_toolkit ==3.0.51
- propcache ==0.3.0
- proto-plus ==1.26.0
- protobuf ==5.29.3
- psutil ==7.0.0
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- pyasn1 ==0.6.1
- pyasn1_modules ==0.4.1
- pycocotools ==2.0.8
- pycparser ==2.22
- pydantic ==2.10.6
- pydantic-settings ==2.8.1
- pydantic_core ==2.27.2
- pypandoc ==1.15
- pyparsing ==3.2.1
- pypdf ==5.3.1
- pypdfium2 ==4.30.1
- pyproject_hooks ==1.2.0
- pyshacl ==0.30.1
- python-dateutil ==2.9.0.post0
- python-docx ==1.1.2
- python-dotenv ==1.0.1
- python-iso639 ==2025.2.18
- python-magic ==0.4.27
- python-multipart ==0.0.20
- python-oxmsg ==0.0.2
- python-pptx ==1.0.2
- pytz ==2025.1
- pyzmq ==26.4.0
- rdflib ==7.1.4
- redis ==6.0.0
- regex ==2024.11.6
- requests ==2.32.3
- requests-oauthlib ==2.0.0
- requests-toolbelt ==1.0.0
- rich ==13.9.4
- rsa ==4.9
- s3transfer ==0.13.0
- safetensors ==0.5.3
- scipy ==1.15.2
- seaborn ==0.13.2
- setuptools ==76.0.0
- shellingham ==1.5.4
- six ==1.17.0
- sniffio ==1.3.1
- soupsieve ==2.6
- stack-data ==0.6.3
- starlette ==0.46.1
- sympy ==1.13.1
- tenacity ==9.0.0
- tiktoken ==0.9.0
- timm ==1.0.15
- tokenizers ==0.21.0
- torch ==2.6.0
- torchvision ==0.21.0
- tornado ==6.4.2
- tqdm ==4.67.1
- traitlets ==5.14.3
- transformers ==4.49.0
- triton ==3.2.0
- typer ==0.15.2
- typing-inspect ==0.9.0
- typing-inspection ==0.4.0
- typing_extensions ==4.12.2
- tzdata ==2025.1
- unstructured ==0.16.25
- unstructured-client ==0.31.1
- unstructured-inference ==0.8.9
- unstructured.pytesseract ==0.3.15
- urllib3 ==2.3.0
- uvicorn ==0.34.0
- uvloop ==0.21.0
- watchfiles ==1.0.4
- wcwidth ==0.2.13
- webencodings ==0.5.1
- websocket-client ==1.8.0
- websockets ==15.0.1
- wrapt ==1.17.2
- xlrd ==2.0.1
- xxhash ==3.5.0
- yarl ==1.18.3
- zipp ==3.21.0
- zstandard ==0.23.0