LLMs4OM: Matching Ontologies with Large Language Models
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary
Keywords
Repository
LLMs4OM: Matching Ontologies with Large Language Models
Basic Info
Statistics
- Stars: 30
- Watchers: 3
- Forks: 3
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
What is the LLMs4OM?
The LLMs4OM framework is a novel approach for effective Ontology Matching (OM) using LLMs. The framework uses two modules, one for retrieval and one for matching, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. It supports comprehensive evaluation on 20 OM datasets (and is not limited to them) from various domains. The LLMs4OM framework can match and even surpass the performance of traditional OM systems, particularly in complex matching scenarios.
The following diagram represents the LLMs4OM framework.
The LLMs4OM framework offers a retrieval-augmented generation (RAG) approach within LLMs for OM. LLMs4OM uses $O_{source}$ as a query $Q(O_{source})$ to retrieve possible matches for any $C_s \in C_{source}$ from $C_{target} \in O_{target}$, where $C_{target}$ is stored in the knowledge base $KB(O_{target})$. Later, $C_s$ and each retrieved $C_t \in C_{target}$ are used to query the LLM to check whether the $(C_s, C_t)$ pair is a match. As shown in the diagram above, the framework comprises four main steps: 1) concept representation, 2) retriever model, 3) LLM, and 4) post-processing.
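The four steps above can be sketched in miniature. This is an illustrative toy, not the library's API: the function names, the word-overlap retriever, and the string-equality stand-in for the LLM matching prompt are all assumptions made for the example.

```python
# Hypothetical sketch of the four LLMs4OM steps:
# 1) represent concepts as text, 2) retrieve candidate targets,
# 3) ask an "LLM" whether a pair matches, 4) collect/filter mappings.

def concept_representation(concept):
    # Step 1: render a concept as text (the "C" representation).
    return concept["label"]

def retrieve_candidates(query, target_concepts, top_k=2):
    # Step 2: toy retriever ranking targets by word overlap with the query.
    def score(c):
        return len(set(query.lower().split()) & set(c["label"].lower().split()))
    return sorted(target_concepts, key=score, reverse=True)[:top_k]

def llm_is_match(source_label, target_label):
    # Step 3: a trivial string check plays the role of the LLM prompt
    # "do these two labels denote the same concept?"
    return source_label.lower() == target_label.lower()

def match(source_concepts, target_concepts):
    # Step 4: keep only pairs the "LLM" accepts.
    mappings = []
    for cs in source_concepts:
        query = concept_representation(cs)
        for ct in retrieve_candidates(query, target_concepts):
            if llm_is_match(cs["label"], ct["label"]):
                mappings.append((cs["iri"], ct["iri"]))
    return mappings

source = [{"iri": "s:1", "label": "Heart"}]
target = [{"iri": "t:1", "label": "heart"}, {"iri": "t:2", "label": "lung"}]
print(match(source, target))  # [('s:1', 't:1')]
```

The real framework replaces the toy retriever with a dense retriever over $KB(O_{target})$ and the string check with a zero-shot LLM prompt.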
Installation
You can install and use LLMs4OM with the following commands:
```shell
git clone https://github.com/HamedBabaei/LLMs4OM.git
cd LLMs4OM
pip install -r requirements.txt
mv .env-example .env
```
Next, update your tokens in `.env`; if you don't want to use the `LLaMA-2` or `GPT-3.5` LLMs, just put dummy tokens there.
Once you have installed the requirements and prepared the `.env` file, you can move forward with experimentation.
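For orientation, a `.env` file is a list of `KEY=VALUE` lines that get loaded into the process environment (the project lists `python-dotenv` as a dependency for this). The sketch below shows the mechanism with stdlib code only; the key names are hypothetical placeholders, not the ones in `.env-example`.

```python
# Illustrative only: how KEY=VALUE lines in a .env-style file end up as
# environment variables. Key names below are hypothetical placeholders.
import os

env_text = "EXAMPLE_OPENAI_KEY=dummy-token\nEXAMPLE_HF_TOKEN=dummy-token\n"

for line in env_text.strip().splitlines():
    key, _, value = line.partition("=")
    os.environ[key] = value

print(os.environ["EXAMPLE_OPENAI_KEY"])  # dummy-token
```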
Quick Tour
A RAG-specific quick tour with Mistral-7B and the BERT retriever using the C representation.
```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder import IRILabelInRAGEncoder
from ontomap.ontology_matchers import MistralLLMBertRAG
from ontomap.postprocess import process

# Setting configurations for experimenting with 'rag' on GPU with a batch size of 16
config = BaseConfig(approach='rag').get_args(device='cuda', batch_size=16)
# set dataset directory
config.root_dir = "datasets"
# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)
# init encoder (concept representation)
encoded_inputs = IRILabelInRAGEncoder()(ontology)
# init Mistral-7B + BERT
model = MistralLLMBertRAG(config.MistralBertRAG)
# generate results
predicts = model.generate(input_data=encoded_inputs)
# post-processing
predicts, _ = process.postprocess_hybrid(predicts=predicts,
                                         llm_confidence_th=0.7,
                                         ir_score_threshold=0.9)
# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
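The hybrid post-processing step combines two signals: the retriever's similarity score and the LLM's confidence that a pair matches. The sketch below illustrates that filtering idea only; it is not the library's internals, and the dictionary field names are assumptions.

```python
# Illustrative sketch of hybrid filtering: keep a candidate pair only when
# both the IR score and the LLM confidence clear their thresholds.
# Field names ("ir_score", "llm_confidence") are assumed for the example.

def hybrid_filter(candidates, llm_confidence_th=0.7, ir_score_threshold=0.9):
    return [
        c for c in candidates
        if c["ir_score"] >= ir_score_threshold
        and c["llm_confidence"] >= llm_confidence_th
    ]

candidates = [
    {"pair": ("s:1", "t:1"), "ir_score": 0.95, "llm_confidence": 0.90},  # kept
    {"pair": ("s:2", "t:2"), "ir_score": 0.95, "llm_confidence": 0.40},  # LLM too unsure
    {"pair": ("s:3", "t:3"), "ir_score": 0.50, "llm_confidence": 0.99},  # weak retrieval
]
kept = hybrid_filter(candidates)
print([c["pair"] for c in kept])  # [('s:1', 't:1')]
```

Requiring both signals trades recall for precision: a pair survives only when retrieval and the LLM agree.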
A retrieval-specific quick tour with the BERT retriever using the C representation.
```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.postprocess import process

# Setting configurations for experimenting with 'retrieval' on CPU
config = BaseConfig(approach='retrieval').get_args(device='cpu')
# set dataset directory
config.root_dir = "datasets"
# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)
# init encoder (concept representation)
encoded_inputs = IRILabelInLightweightEncoder()(ontology)
# init BERTRetrieval
model = BERTRetrieval(config.BERTRetrieval)
# generate results
predicts = model.generate(input_data=encoded_inputs)
# post-processing
predicts = process.eval_preprocess_ir_outputs(predicts=predicts)
# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
RAG Models Imports
```python
from ontomap.ontology_matchers.rag.models import ChatGPTOpenAIAdaRAG
from ontomap.ontology_matchers.rag.models import FalconLLMAdaRAG, FalconLLMBertRAG
from ontomap.ontology_matchers.rag.models import LLaMA7BLLMAdaRAG, LLaMA7BLLMBertRAG
from ontomap.ontology_matchers.rag.models import MistralLLMAdaRAG, MistralLLMBertRAG
from ontomap.ontology_matchers.rag.models import MPTLLMAdaRAG, MPTLLMBertRAG
from ontomap.ontology_matchers.rag.models import VicunaLLMAdaRAG, VicunaLLMBertRAG
from ontomap.ontology_matchers.rag.models import MambaLLMAdaRAG, MambaLLMBertRAG
```
Retrieval Models Imports
```python
from ontomap.ontology_matchers.retrieval.models import AdaRetrieval
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.ontology_matchers.retrieval.models import SpecterBERTRetrieval
from ontomap.ontology_matchers.retrieval.models import TFIDFRetrieval
```
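All of these retrieval classes do the same job at heart: rank target-ontology concepts against a source concept's text. A pure-Python toy stand-in (not any of the classes above, which wrap BERT, Ada, SPECTER, or TF-IDF embeddings) using bag-of-words cosine similarity:

```python
# Toy illustration of the retrieval step: rank target concept labels against
# a source label by bag-of-words cosine similarity. Illustrative only.
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two token lists via term counts.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_targets(source_label, target_labels):
    # Score every target label against the source label, best first.
    q = source_label.lower().split()
    scored = [(t, cosine(q, t.lower().split())) for t in target_labels]
    return sorted(scored, key=lambda x: x[1], reverse=True)

targets = ["cardiac muscle tissue", "lung epithelium", "muscle tissue"]
ranking = rank_targets("muscle tissue", targets)
print(ranking[0][0])  # 'muscle tissue'
```

The dense retrievers replace token counts with embedding vectors, but the rank-by-similarity contract is the same.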
Track Tasks Imports - Parser
```python
# CommonKG track
from ontomap.ontology.commonkg import NellDbpediaOMDataset, YagoWikidataOMDataset
# MSE track
from ontomap.ontology.mse import MaterialInformationEMMOOMDataset, MaterialInformationMatOntoMDataset
# Phenotype track
from ontomap.ontology.phenotype import DoidOrdoOMDataset, HpMpOMDataset
# Anatomy track
from ontomap.ontology.anatomy import MouseHumanOMDataset
# Biodiv track
from ontomap.ontology.biodiv import EnvoSweetOMDataset, FishZooplanktonOMDataset, \
    MacroalgaeMacrozoobenthosOMDataset, TaxrefldBacteriaNcbitaxonBacteriaOMDataset, \
    TaxrefldChromistaNcbitaxonChromistaOMDataset, TaxrefldFungiNcbitaxonFungiOMDataset, \
    TaxrefldPlantaeNcbitaxonPlantaeOMDataset, TaxrefldProtozoaNcbitaxonProtozoaOMDataset
# Bio-ML track
from ontomap.ontology.bioml import NCITDOIDDiseaseOMDataset, OMIMORDODiseaseOMDataset, \
    SNOMEDFMABodyOMDataset, SNOMEDNCITNeoplasOMDataset, SNOMEDNCITPharmOMDataset
```
Concept-Representations - C, CC, and CP
```python
# Retriever models concept representations
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder           # C
from ontomap.encoder.lightweight import IRILabelChildrensInLightweightEncoder  # CC
from ontomap.encoder.lightweight import IRILabelParentsInLightweightEncoder    # CP

# RAG models concept representations
from ontomap.encoder.rag import IRILabelInRAGEncoder           # C
from ontomap.encoder.rag import IRILabelChildrensInRAGEncoder  # CC
from ontomap.encoder.rag import IRILabelParentsInRAGEncoder    # CP
```
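Conceptually, the three representations differ in how much of the hierarchy they pack into a concept's text: C uses the label alone, CP adds the parents, and CC adds the children. The sketch below illustrates that idea; the textual templates and field names are assumptions for the example, not the encoders' actual output.

```python
# Hedged sketch of the three concept representations: C (label only),
# CP (label plus parents), CC (label plus children). Templates are assumed.

concept = {
    "iri": "http://example.org/heart",  # hypothetical IRI
    "label": "heart",
    "parents": ["organ"],
    "children": ["left ventricle", "right ventricle"],
}

def represent(concept, mode="C"):
    text = concept["label"]
    if mode == "CP":
        text += ", parents: " + ", ".join(concept["parents"])
    elif mode == "CC":
        text += ", children: " + ", ".join(concept["children"])
    return text

print(represent(concept, "C"))   # heart
print(represent(concept, "CP"))  # heart, parents: organ
print(represent(concept, "CC"))  # heart, children: left ventricle, right ventricle
```

Richer representations give the retriever and LLM more context to disambiguate concepts whose labels alone are ambiguous.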
OMPipeline usage
To use the LLMs4OM pipeline, follow the example below. It runs one model at a time over the 20 OM tasks.
```python
from ontomap import OMPipelines

# setting hyperparameters
approach = "rag"
encoder = "rag"
use_all_encoders = False
approach_encoders_to_consider = ['label']  # C representation
use_all_models = False
load_from_json = False
device = "cuda"
do_evaluation = False
batch_size = 16
llm_confidence_th = 0.7
ir_score_threshold = 0.9
model = "['MistralBertRAG']"
outputs = 'output-rag-mistral'

# arguments
args = {
    'approach': approach,
    'encoder': encoder,
    'use-all-encoders': use_all_encoders,
    'approach-encoders-to-consider': approach_encoders_to_consider,
    'use-all-models': use_all_models,
    'models-to-consider': model,
    'load-from-json': load_from_json,
    'device': device,
    'do-evaluation': do_evaluation,
    'outputs-dir': outputs,
    'batch-size': batch_size,
    'llm_confidence_th': llm_confidence_th,
    'ir_score_threshold': ir_score_threshold,
}

# Running OMPipelines
runner = OMPipelines(**args)
runner()
```
Citation
If you found this project useful in your work or research, please cite it using the following BibTeX entry:
Pre-print:
```bibtex
@misc{giglou2024llms4om,
      title={LLMs4OM: Matching Ontologies with Large Language Models},
      author={Hamed Babaei Giglou and Jennifer D'Souza and Felix Engel and Sören Auer},
      year={2024},
      eprint={2404.10317},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
Owner
- Name: Hamed Babaei Giglou
- Login: HamedBabaei
- Kind: user
- Location: Germany
- Repositories: 1
- Profile: https://github.com/HamedBabaei
Ph.D. Student in Computer Science
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Babaei Giglou"
    given-names: "Hamed"
    orcid: "https://orcid.org/0000-0003-3758-1454"
  - family-names: "D'Souza"
    given-names: "Jennifer"
    orcid: "https://orcid.org/0000-0002-6616-9509"
  - family-names: "Engel"
    given-names: "Felix"
    orcid: "https://orcid.org/0000-0002-3060-7052"
  - family-names: "Auer"
    given-names: "Sören"
    orcid: "https://orcid.org/0000-0002-0698-2864"
title: "LLMs4OM: Matching Ontologies with Large Language Models"
version: 1.0.0
date-released: 2023-09-15
url: "https://github.com/HamedBabaei/LLMs4OM"
```
GitHub Events
Total
- Issues event: 1
- Watch event: 15
- Fork event: 4
Last Year
- Issues event: 1
- Watch event: 15
- Fork event: 4
Dependencies
- jupyter/minimal-notebook latest build
- accelerate *
- bitsandbytes *
- deeponto *
- einops *
- ontospy *
- openai *
- owlready2 *
- pivottablejs *
- protobuf *
- python-dotenv *
- rank_bm25 *
- rapidfuzz *
- rdflib *
- scipy *
- sentence-transformers *
- sentencepiece *
- torch *
- transformers *