LLMs4OM: Matching Ontologies with Large Language Models
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary
Keywords
Repository
LLMs4OM: Matching Ontologies with Large Language Models
Basic Info
Statistics
- Stars: 30
- Watchers: 3
- Forks: 3
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
What is the LLMs4OM?
The LLMs4OM framework is a novel approach for effective Ontology Matching (OM) using LLMs. The framework uses two modules, one for retrieval and one for matching, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. It supports comprehensive evaluation on 20 OM datasets (and is not limited to them) from various domains. The LLMs4OM framework can match and even surpass the performance of traditional OM systems, particularly in complex matching scenarios.
The following diagram represents the LLMs4OM framework.
The LLMs4OM framework offers a retrieval-augmented generation (RAG) approach within LLMs for OM. LLMs4OM uses $O_{source}$ as a query $Q(O_{source})$ to retrieve possible matches for any $C_s \in C_{source}$ from $C_{target} \in O_{target}$, where $C_{target}$ is stored in the knowledge base $KB(O_{target})$. Later, $C_s$ and each retrieved $C_t \in C_{target}$ are used to query the LLM to check whether the $(C_s, C_t)$ pair is a match. As shown in the diagram above, the framework comprises four main steps: 1) concept representation, 2) retriever model, 3) LLM, and 4) post-processing.
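The four steps above can be sketched in miniature. This is an illustrative toy, not the library's API: the function names, the word-overlap retriever, and the string-equality stand-in for the LLM matching prompt are all assumptions made for the example.

```python
# Hypothetical sketch of the four LLMs4OM steps:
# 1) represent concepts as text, 2) retrieve candidate targets,
# 3) ask an "LLM" whether a pair matches, 4) collect/filter mappings.

def concept_representation(concept):
    # Step 1: render a concept as text (the "C" representation).
    return concept["label"]

def retrieve_candidates(query, target_concepts, top_k=2):
    # Step 2: toy retriever ranking targets by word overlap with the query.
    def score(c):
        return len(set(query.lower().split()) & set(c["label"].lower().split()))
    return sorted(target_concepts, key=score, reverse=True)[:top_k]

def llm_is_match(source_label, target_label):
    # Step 3: a trivial string check plays the role of the LLM prompt
    # "do these two labels denote the same concept?"
    return source_label.lower() == target_label.lower()

def match(source_concepts, target_concepts):
    # Step 4: keep only pairs the "LLM" accepts.
    mappings = []
    for cs in source_concepts:
        query = concept_representation(cs)
        for ct in retrieve_candidates(query, target_concepts):
            if llm_is_match(cs["label"], ct["label"]):
                mappings.append((cs["iri"], ct["iri"]))
    return mappings

source = [{"iri": "s:1", "label": "Heart"}]
target = [{"iri": "t:1", "label": "heart"}, {"iri": "t:2", "label": "lung"}]
print(match(source, target))  # [('s:1', 't:1')]
```

The real framework replaces the toy retriever with a dense retriever over $KB(O_{target})$ and the string check with a zero-shot LLM prompt.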
Installation
You can install and use LLMs4OM with the following commands:
```shell
git clone https://github.com/HamedBabaei/LLMs4OM.git
cd LLMs4OM
pip install -r requirements.txt
mv .env-example .env
```
Next, update your tokens in `.env`; if you don't want to use the `LLaMA-2` or `GPT-3.5` LLMs, just put dummy tokens there.
Once you have installed the requirements and prepared the `.env` file, you can move forward with experimentation.
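For orientation, a `.env` file is a list of `KEY=VALUE` lines that get loaded into the process environment (the project lists `python-dotenv` as a dependency for this). The sketch below shows the mechanism with stdlib code only; the key names are hypothetical placeholders, not the ones in `.env-example`.

```python
# Illustrative only: how KEY=VALUE lines in a .env-style file end up as
# environment variables. Key names below are hypothetical placeholders.
import os

env_text = "EXAMPLE_OPENAI_KEY=dummy-token\nEXAMPLE_HF_TOKEN=dummy-token\n"

for line in env_text.strip().splitlines():
    key, _, value = line.partition("=")
    os.environ[key] = value

print(os.environ["EXAMPLE_OPENAI_KEY"])  # dummy-token
```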
Quick Tour
A RAG-specific quick tour with Mistral-7B and the BERT retriever using the C representation.
```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder import IRILabelInRAGEncoder
from ontomap.ontology_matchers import MistralLLMBertRAG
from ontomap.postprocess import process

# Setting configurations for experimenting with 'rag' on GPU with a batch size of 16
config = BaseConfig(approach='rag').get_args(device='cuda', batch_size=16)
# set dataset directory
config.root_dir = "datasets"
# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)
# init encoder (concept representation)
encoded_inputs = IRILabelInRAGEncoder()(ontology)
# init Mistral-7B + BERT
model = MistralLLMBertRAG(config.MistralBertRAG)
# generate results
predicts = model.generate(input_data=encoded_inputs)
# post-processing
predicts, _ = process.postprocess_hybrid(predicts=predicts,
                                         llm_confidence_th=0.7,
                                         ir_score_threshold=0.9)
# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
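The hybrid post-processing step combines two signals: the retriever's similarity score and the LLM's confidence that a pair matches. The sketch below illustrates that filtering idea only; it is not the library's internals, and the dictionary field names are assumptions.

```python
# Illustrative sketch of hybrid filtering: keep a candidate pair only when
# both the IR score and the LLM confidence clear their thresholds.
# Field names ("ir_score", "llm_confidence") are assumed for the example.

def hybrid_filter(candidates, llm_confidence_th=0.7, ir_score_threshold=0.9):
    return [
        c for c in candidates
        if c["ir_score"] >= ir_score_threshold
        and c["llm_confidence"] >= llm_confidence_th
    ]

candidates = [
    {"pair": ("s:1", "t:1"), "ir_score": 0.95, "llm_confidence": 0.90},  # kept
    {"pair": ("s:2", "t:2"), "ir_score": 0.95, "llm_confidence": 0.40},  # LLM too unsure
    {"pair": ("s:3", "t:3"), "ir_score": 0.50, "llm_confidence": 0.99},  # weak retrieval
]
kept = hybrid_filter(candidates)
print([c["pair"] for c in kept])  # [('s:1', 't:1')]
```

Requiring both signals trades recall for precision: a pair survives only when retrieval and the LLM agree.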
A retrieval-specific quick tour with the BERT retriever using the C representation.
```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.postprocess import process

# Setting configurations for experimenting with 'retrieval' on CPU
config = BaseConfig(approach='retrieval').get_args(device='cpu')
# set dataset directory
config.root_dir = "datasets"
# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)
# init encoder (concept representation)
encoded_inputs = IRILabelInLightweightEncoder()(ontology)
# init BERTRetrieval
model = BERTRetrieval(config.BERTRetrieval)
# generate results
predicts = model.generate(input_data=encoded_inputs)
# post-processing
predicts = process.eval_preprocess_ir_outputs(predicts=predicts)
# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
RAG Models Imports
```python
from ontomap.ontology_matchers.rag.models import ChatGPTOpenAIAdaRAG
from ontomap.ontology_matchers.rag.models import FalconLLMAdaRAG, FalconLLMBertRAG
from ontomap.ontology_matchers.rag.models import LLaMA7BLLMAdaRAG, LLaMA7BLLMBertRAG
from ontomap.ontology_matchers.rag.models import MistralLLMAdaRAG, MistralLLMBertRAG
from ontomap.ontology_matchers.rag.models import MPTLLMAdaRAG, MPTLLMBertRAG
from ontomap.ontology_matchers.rag.models import VicunaLLMAdaRAG, VicunaLLMBertRAG
from ontomap.ontology_matchers.rag.models import MambaLLMAdaRAG, MambaLLMBertRAG
```
Retrieval Models Imports
```python
from ontomap.ontology_matchers.retrieval.models import AdaRetrieval
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.ontology_matchers.retrieval.models import SpecterBERTRetrieval
from ontomap.ontology_matchers.retrieval.models import TFIDFRetrieval
```
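All of these retrieval classes do the same job at heart: rank target-ontology concepts against a source concept's text. A pure-Python toy stand-in (not any of the classes above, which wrap BERT, Ada, SPECTER, or TF-IDF embeddings) using bag-of-words cosine similarity:

```python
# Toy illustration of the retrieval step: rank target concept labels against
# a source label by bag-of-words cosine similarity. Illustrative only.
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two token lists via term counts.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_targets(source_label, target_labels):
    # Score every target label against the source label, best first.
    q = source_label.lower().split()
    scored = [(t, cosine(q, t.lower().split())) for t in target_labels]
    return sorted(scored, key=lambda x: x[1], reverse=True)

targets = ["cardiac muscle tissue", "lung epithelium", "muscle tissue"]
ranking = rank_targets("muscle tissue", targets)
print(ranking[0][0])  # 'muscle tissue'
```

The dense retrievers replace token counts with embedding vectors, but the rank-by-similarity contract is the same.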
Track Tasks Imports - Parser
```python
# CommonKG track
from ontomap.ontology.commonkg import NellDbpediaOMDataset, YagoWikidataOMDataset
# MSE track
from ontomap.ontology.mse import MaterialInformationEMMOOMDataset, MaterialInformationMatOntoMDataset
# Phenotype track
from ontomap.ontology.phenotype import DoidOrdoOMDataset, HpMpOMDataset
# Anatomy track
from ontomap.ontology.anatomy import MouseHumanOMDataset
# Biodiv track
from ontomap.ontology.biodiv import EnvoSweetOMDataset, FishZooplanktonOMDataset, \
    MacroalgaeMacrozoobenthosOMDataset, TaxrefldBacteriaNcbitaxonBacteriaOMDataset, \
    TaxrefldChromistaNcbitaxonChromistaOMDataset, TaxrefldFungiNcbitaxonFungiOMDataset, \
    TaxrefldPlantaeNcbitaxonPlantaeOMDataset, TaxrefldProtozoaNcbitaxonProtozoaOMDataset
# Bio-ML track
from ontomap.ontology.bioml import NCITDOIDDiseaseOMDataset, OMIMORDODiseaseOMDataset, \
    SNOMEDFMABodyOMDataset, SNOMEDNCITNeoplasOMDataset, SNOMEDNCITPharmOMDataset
```
Concept-Representations - C, CC, and CP
```python
# Retriever models concept representations
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder           # C
from ontomap.encoder.lightweight import IRILabelChildrensInLightweightEncoder  # CC
from ontomap.encoder.lightweight import IRILabelParentsInLightweightEncoder    # CP

# RAG models concept representations
from ontomap.encoder.rag import IRILabelInRAGEncoder           # C
from ontomap.encoder.rag import IRILabelChildrensInRAGEncoder  # CC
from ontomap.encoder.rag import IRILabelParentsInRAGEncoder    # CP
```
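Conceptually, the three representations differ in how much of the hierarchy they pack into a concept's text: C uses the label alone, CP adds the parents, and CC adds the children. The sketch below illustrates that idea; the textual templates and field names are assumptions for the example, not the encoders' actual output.

```python
# Hedged sketch of the three concept representations: C (label only),
# CP (label plus parents), CC (label plus children). Templates are assumed.

concept = {
    "iri": "http://example.org/heart",  # hypothetical IRI
    "label": "heart",
    "parents": ["organ"],
    "children": ["left ventricle", "right ventricle"],
}

def represent(concept, mode="C"):
    text = concept["label"]
    if mode == "CP":
        text += ", parents: " + ", ".join(concept["parents"])
    elif mode == "CC":
        text += ", children: " + ", ".join(concept["children"])
    return text

print(represent(concept, "C"))   # heart
print(represent(concept, "CP"))  # heart, parents: organ
print(represent(concept, "CC"))  # heart, children: left ventricle, right ventricle
```

Richer representations give the retriever and LLM more context to disambiguate concepts whose labels alone are ambiguous.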
OMPipeline usage
To use the LLMs4OM pipeline, follow the example below. It runs one model at a time over the 20 OM tasks.
```python
from ontomap import OMPipelines

# setting hyperparameters
approach = "rag"
encoder = "rag"
use_all_encoders = False
approach_encoders_to_consider = ['label']  # C representation
use_all_models = False
load_from_json = False
device = "cuda"
do_evaluation = False
batch_size = 16
llm_confidence_th = 0.7
ir_score_threshold = 0.9
model = "['MistralBertRAG']"
outputs = 'output-rag-mistral'

# arguments
args = {
    'approach': approach,
    'encoder': encoder,
    'use-all-encoders': use_all_encoders,
    'approach-encoders-to-consider': approach_encoders_to_consider,
    'use-all-models': use_all_models,
    'models-to-consider': model,
    'load-from-json': load_from_json,
    'device': device,
    'do-evaluation': do_evaluation,
    'outputs-dir': outputs,
    'batch-size': batch_size,
    'llm_confidence_th': llm_confidence_th,
    'ir_score_threshold': ir_score_threshold,
}

# Running OMPipelines
runner = OMPipelines(**args)
runner()
```
Citation
If you found this project useful in your work or research, please cite it using the following BibTeX entry:
Pre-print:
```bibtex
@misc{giglou2024llms4om,
      title={LLMs4OM: Matching Ontologies with Large Language Models},
      author={Hamed Babaei Giglou and Jennifer D'Souza and Felix Engel and Sören Auer},
      year={2024},
      eprint={2404.10317},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
Owner
- Name: Hamed Babaei Giglou
- Login: HamedBabaei
- Kind: user
- Location: Germany
- Repositories: 1
- Profile: https://github.com/HamedBabaei
Ph.D. Student in Computer Science
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Babaei Giglou"
    given-names: "Hamed"
    orcid: "https://orcid.org/0000-0003-3758-1454"
  - family-names: "D'Souza"
    given-names: "Jennifer"
    orcid: "https://orcid.org/0000-0002-6616-9509"
  - family-names: "Engel"
    given-names: "Felix"
    orcid: "https://orcid.org/0000-0002-3060-7052"
  - family-names: "Auer"
    given-names: "Sören"
    orcid: "https://orcid.org/0000-0002-0698-2864"
title: "LLMs4OM: Matching Ontologies with Large Language Models"
version: 1.0.0
date-released: 2023-09-15
url: "https://github.com/HamedBabaei/LLMs4OM"
```
GitHub Events
Total
- Issues event: 1
- Watch event: 15
- Fork event: 4
Last Year
- Issues event: 1
- Watch event: 15
- Fork event: 4
Dependencies
- jupyter/minimal-notebook latest build
- accelerate *
- bitsandbytes *
- deeponto *
- einops *
- ontospy *
- openai *
- owlready2 *
- pivottablejs *
- protobuf *
- python-dotenv *
- rank_bm25 *
- rapidfuzz *
- rdflib *
- scipy *
- sentence-transformers *
- sentencepiece *
- torch *
- transformers *