https://github.com/aksw/aruqula

ARUQULA: An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities - based on SPINACH

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.2%) to scientific vocabulary

Keywords

kgqa rdf react sparql spinach text2sparql

Last synced: 5 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: AKSW
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 13.7 MB

Statistics

Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Fork of stanford-oval/spinach

Topics

kgqa rdf react sparql spinach text2sparql

Created 8 months ago · Last pushed 7 months ago

https://github.com/AKSW/ARUQULA/blob/main/

SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions

Online Chatbot: https://spinach.genie.stanford.edu

# About

**The SPINACH dataset**: Current KBQA datasets lack real-world complexity. The SPINACH KBQA dataset, collected from Wikidata's Request a Query forum, is the first to cover both natural questions and complex SPARQL queries.

**The SPINACH agent**: The SPINACH agent is a new KBQA approach that mimics how human experts write SPARQL, achieving SOTA on many KBQA datasets. You can try it at https://spinach.genie.stanford.edu.

For more details, check out this blog post on [Wikimedia Research Newsletter](https://meta.m.wikimedia.org/wiki/Research:Newsletter/2024/November).

# Folder Structure

`datasets/` contains all prior dataset files. Predictions for the SPINACH agent used in the paper can be found at:

- `datasets/qald_7_task4/spinach_output_test.json` for QALD-7
- `datasets/qald_9_plus/en/spinach_output_test.json` for QALD-9-plus
- `datasets/qald_10/en/spinach_output_test.json` for the QALD-10 full set (the prediction for the ToG subset can be retrieved by uncommenting the portion using `get_tog_baseline_questions` in `evaluate_file.py`)
- `datasets/wikiwebquestions/spinach_output_dev.json` and `datasets/wikiwebquestions/spinach_output_test.json` for WikiWebQuestions

`spinach_dataset/` contains the dev and test sets of the SPINACH dataset. The SPINACH agent's outputs are also stored in this directory.

`spinach_agent/` contains the implementation of the SPINACH agent.

`notebooks/` stores various Jupyter notebooks used to crawl the initial conversations and compute dataset complexity metrics.

`tasks/` stores the files declaring how to use the `invoke` command.

`tests/` contains all tests, which use `pytest`. You can run all tests by running `invoke tests`. `test_eval.py`, which stores test cases for the row-major F1 implementation, can be run via `python tests/test_eval.py`.

# Running the SPINACH agent and evaluating results

## Set up environment

Run `conda env create -f conda_env.yaml`.

Create a file called `API_KEYS` and write your API keys inside. The format is one key per line, for example `OPENAI_API_KEY=sk-...`

## Run SPINACH parser and evaluate

```
inv evaluate-parser --parser-type part_to_whole --subsample=-1 --engine=gpt-4o --dataset=datasets/qald_10/en/test.json --output-file=datasets/qald_10/en/spinach_output_test.json --regex-use-select-distinct-and-id-not-label --llm-extract-prediction-if-null
```

The two flags at the end do the following:

- `--llm-extract-prediction-if-null`: If a reasoning chain ends without any predicted SPARQL, asks an LLM to return a SPARQL query. This part is implemented inside `extract_sparql.ainvoke`. This is helpful because, for simple queries, LLMs could just use `get_wikidata_entry` to get results instead of ever writing a SPARQL query. We enabled this flag for all datasets we evaluated on.
- `--regex-use-select-distinct-and-id-not-label`: Attempts to use regex to force `SELECT DISTINCT` instead of `SELECT`, and tries to always include the variable QID instead of the label (i.e., use `x` instead of `xLabel`). We enabled this for all datasets we evaluated on except the new SPINACH dataset. (The SPINACH dataset involves more complex predicted SPARQLs. The regex is not sophisticated enough to handle these cases.)

The script will also write a `.log` file containing SPINACH's chain of reasoning and actions, with the same file name as the `.json` output.

You can re-evaluate the output simply from the `.json` file:

```
python spinach_agent/evaluate_file.py --input datasets/qald_10/en/spinach_output_test.json
```

If you'd like to simply run the parser on a list of questions, use the following code from `evaluate_parser.py`:

```python
from spinach_agent.part_to_whole_parser import PartToWholeParser

semantic_parser_class = PartToWholeParser
semantic_parser_class.initialize(engine=args.engine)  # e.g. "gpt-4o"
chain_output = semantic_parser_class.run_batch(
    # this should be a dict of {"question": "...", "conversation_history": [...]};
    # conversation_history can be an empty list if running on single-turn questions
    questions,
)
```

# License

The code in this repo is released under Apache License, version 2.0. The SPINACH dataset, derived from the Wikidata Request a Query forum, is released under the CC BY-SA 4.0 license, the same license that covers the forum.

# Citation

```
@misc{liu2024spinachsparqlbasedinformationnavigation,
      title={SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions},
      author={Shicheng Liu and Sina J. Semnani and Harold Triedman and Jialiang Xu and Isaac Dan Zhao and Monica S. Lam},
      year={2024},
      eprint={2407.11417},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.11417},
}
```

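The README describes the `--regex-use-select-distinct-and-id-not-label` flag only in prose. The following is a minimal sketch of what such a regex-based rewrite might look like. It is not the repository's implementation: the function names `force_select_distinct`, `prefer_id_over_label`, and `normalize_query` are hypothetical, and the sketch assumes a single flat `SELECT ... WHERE` query whose projected label variables follow the `?xLabel` naming convention.

```python
import re


def force_select_distinct(sparql: str) -> str:
    """Turn every SELECT that is not already followed by DISTINCT into SELECT DISTINCT."""
    return re.sub(
        r"\bSELECT\b(?!\s+DISTINCT\b)",
        "SELECT DISTINCT",
        sparql,
        flags=re.IGNORECASE,
    )


def prefer_id_over_label(sparql: str) -> str:
    """Replace projected ?xLabel variables with ?x in the first SELECT clause only.

    The query body after WHERE is left untouched. Nested subqueries and
    expressions in the projection are not handled (hypothetical simplification).
    """
    match = re.search(r"\bSELECT\b(.*?)\bWHERE\b", sparql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return sparql
    projection = re.sub(r"\?(\w+?)Label\b", r"?\1", match.group(1))
    return sparql[: match.start(1)] + projection + sparql[match.end(1):]


def normalize_query(sparql: str) -> str:
    """Apply both rewrites in sequence."""
    return prefer_id_over_label(force_select_distinct(sparql))


if __name__ == "__main__":
    query = "SELECT ?countryLabel WHERE { ?country wdt:P31 wd:Q6256 . }"
    print(normalize_query(query))
```

A rewrite this simple breaks down on subqueries, aggregates, and projections that already list both `?x` and `?xLabel`, which is consistent with the README's note that the regex was disabled for the more complex SPINACH dataset.
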
Owner

Name: AKSW Research Group @ University of Leipzig
Login: AKSW
Kind: organization
Location: Leipzig

Website: http://aksw.org
Repositories: 358
Profile: https://github.com/AKSW

GitHub Events

Total

Push event: 2
Pull request review event: 1
Pull request event: 1
Fork event: 1
Create event: 1

Last Year

Push event: 2
Pull request review event: 1
Pull request event: 1
Fork event: 1
Create event: 1
