
LLMs4OM: Matching Ontologies with Large Language Models

https://github.com/hamedbabaei/llms4om

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Keywords

llms ontology-mapping ontology-matching transformers
Last synced: 6 months ago

Repository

LLMs4OM: Matching Ontologies with Large Language Models

Basic Info
  • Host: GitHub
  • Owner: HamedBabaei
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 5.38 MB
Statistics
  • Stars: 30
  • Watchers: 3
  • Forks: 3
  • Open Issues: 2
  • Releases: 0
Topics
llms ontology-mapping ontology-matching transformers
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Paper](https://img.shields.io/badge/Paper-EF3939?style=badge&logo=adobeacrobatreader&logoColor=white&color=black&labelColor=ec1c24)](https://arxiv.org/abs/2404.10317) [![The supplementary material](https://img.shields.io/badge/Supplementary%20Material-EF3939?style=badge&logo=adobeacrobatreader&logoColor=white&color=black&labelColor=ec1c24)](docs/LLMs4OM_Supplementary_Material.pdf)

What is LLMs4OM?

The LLMs4OM framework is a novel approach to effective Ontology Matching (OM) using LLMs. The framework consists of two modules, one for retrieval and one for matching, enhanced by zero-shot prompting across three ontology representations: concept, concept-parent, and concept-children. It supports comprehensive evaluation on 20 OM datasets (but is not limited to them) from various domains. LLMs4OM can match, and even surpass, the performance of traditional OM systems, particularly in complex matching scenarios.

The following diagram represents the LLMs4OM framework.

The LLMs4OM framework offers a retrieval-augmented generation (RAG) approach within LLMs for OM. LLMs4OM uses $O_{source}$ as a query $Q(O_{source})$ to retrieve possible matches for any $C_s \in C_{source}$ from $C_{target} \in O_{target}$, where $C_{target}$ is stored in the knowledge base $KB(O_{target})$. Later, $C_s$ and the obtained $C_t \in C_{target}$ are used to query the LLM to check whether the $(C_s, C_t)$ pair is a match. As shown in the diagram above, the framework comprises four main steps: 1) concept representation, 2) retriever model, 3) LLM, and 4) post-processing.
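As a rough illustration of the retrieve-then-match loop described above, the sketch below retrieves top-k target candidates for each source concept and lets an LLM verify each pair. The function names, prompt wording, and toy retriever/LLM stand-ins are hypothetical, not the framework's actual API:

```python
def rag_match(source_concepts, target_concepts, retrieve, ask_llm, top_k=3):
    """For each source concept, retrieve top-k candidates from KB(O_target),
    then let the LLM decide whether each (C_s, C_t) pair is a match."""
    matches = []
    for c_s in source_concepts:
        for c_t in retrieve(c_s, target_concepts, top_k):
            if ask_llm(f"Do '{c_s}' and '{c_t}' refer to the same concept? yes/no"):
                matches.append((c_s, c_t))
    return matches

# toy components standing in for a real retriever and LLM
retrieve = lambda q, pool, k: [t for t in pool if t.split(":")[-1] == q.split(":")[-1]][:k]
ask_llm = lambda prompt: "heart" in prompt  # pretend the LLM only confirms 'heart'
print(rag_match(["mouse:heart", "mouse:tail"],
                ["human:heart", "human:coccyx"], retrieve, ask_llm))
```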

Installation

You can install and use LLMs4OM with the following commands.

```
git clone https://github.com/HamedBabaei/LLMs4OM.git
cd LLMs4OM
pip install -r requirements.txt
mv .env-example .env
```

Next, update your tokens in `.env`; if you don't want to use the LLaMA-2 or GPT-3.5 LLMs, just put dummy tokens there. Once you have installed the requirements and prepared the `.env` file, you can move forward with experimentation.
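For illustration only, a filled-in `.env` might look like the snippet below; the variable names here are assumptions, so check `.env-example` for the exact keys the project expects:

```
OPENAI_KEY=dummy                  # dummy value is fine if GPT-3.5 is not used
HUGGINGFACE_ACCESS_TOKEN=dummy    # dummy value is fine if LLaMA-2 is not used
```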

Quick Tour

A RAG-specific quick tour with Mistral-7B and the BERT retriever using the C representation:

```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder import IRILabelInRAGEncoder
from ontomap.ontology_matchers import MistralLLMBertRAG
from ontomap.postprocess import process

# setting configurations for experimenting 'rag' on GPU with batch size of 16
config = BaseConfig(approach='rag').get_args(device='cuda', batch_size=16)

# set dataset directory
config.root_dir = "datasets"

# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)

# init encoder (concept representation)
encoded_inputs = IRILabelInRAGEncoder()(ontology)

# init Mistral-7B + BERT
model = MistralLLMBertRAG(config.MistralBertRAG)

# generate results
predicts = model.generate(input_data=encoded_inputs)

# post-processing
predicts, _ = process.postprocess_hybrid(predicts=predicts,
                                         llm_confidence_th=0.7,
                                         ir_score_threshold=0.9)

# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
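The hybrid post-processing step combines two signals: a candidate pair survives only if the retriever's similarity clears `ir_score_threshold` and the LLM's confidence clears `llm_confidence_th`. A minimal, self-contained sketch of that filtering idea (the dict keys here are illustrative, not the library's actual output schema):

```python
def filter_hybrid(predicts, llm_confidence_th=0.7, ir_score_threshold=0.9):
    """Keep (source, target) pairs that pass both the retriever and LLM thresholds."""
    kept = []
    for p in predicts:
        if (p["ir_score"] >= ir_score_threshold
                and p["llm_confidence"] >= llm_confidence_th):
            kept.append((p["source"], p["target"]))
    return kept

preds = [
    {"source": "mouse:limb", "target": "human:limb",
     "ir_score": 0.95, "llm_confidence": 0.90},
    {"source": "mouse:tail", "target": "human:coccyx",
     "ir_score": 0.85, "llm_confidence": 0.95},  # retriever score too low
]
print(filter_hybrid(preds))  # only the first pair passes both thresholds
```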

A retrieval-specific quick tour with the BERT retriever using the C representation:

```python
from ontomap.ontology import MouseHumanOMDataset
from ontomap.base import BaseConfig
from ontomap.evaluation.evaluator import evaluator
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.postprocess import process

# setting configurations for experimenting 'retrieval' on CPU
config = BaseConfig(approach='retrieval').get_args(device='cpu')

# set dataset directory
config.root_dir = "datasets"

# parse task source, target, and reference ontology
ontology = MouseHumanOMDataset().collect(root_dir=config.root_dir)

# init encoder (concept representation)
encoded_inputs = IRILabelInLightweightEncoder()(ontology)

# init BERTRetrieval
model = BERTRetrieval(config.BERTRetrieval)

# generate results
predicts = model.generate(input_data=encoded_inputs)

# post-processing
predicts = process.eval_preprocess_ir_outputs(predicts=predicts)

# evaluation
results = evaluator(track='anatomy', predicts=predicts, references=ontology["reference"])
print(results)
```
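Evaluation against a reference alignment boils down to set-based precision, recall, and F1 over predicted vs. reference concept pairs. A generic sketch of that computation (not the library's `evaluator`, whose exact output format may differ):

```python
def alignment_metrics(predicts, references):
    """Set-based precision/recall/F1 between predicted and reference pairs."""
    predicted, reference = set(predicts), set(references)
    tp = len(predicted & reference)  # true positives: pairs in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

refs = [("a", "x"), ("b", "y"), ("c", "z")]
preds = [("a", "x"), ("b", "y"), ("d", "w")]
print(alignment_metrics(preds, refs))  # 2 of 3 predictions are correct
```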

RAG Models Imports

```python
from ontomap.ontology_matchers.rag.models import ChatGPTOpenAIAdaRAG
from ontomap.ontology_matchers.rag.models import FalconLLMAdaRAG, FalconLLMBertRAG
from ontomap.ontology_matchers.rag.models import LLaMA7BLLMAdaRAG, LLaMA7BLLMBertRAG
from ontomap.ontology_matchers.rag.models import MistralLLMAdaRAG, MistralLLMBertRAG
from ontomap.ontology_matchers.rag.models import MPTLLMAdaRAG, MPTLLMBertRAG
from ontomap.ontology_matchers.rag.models import VicunaLLMAdaRAG, VicunaLLMBertRAG
from ontomap.ontology_matchers.rag.models import MambaLLMAdaRAG, MambaLLMBertRAG
```

Retrieval Models Imports

```python
from ontomap.ontology_matchers.retrieval.models import AdaRetrieval
from ontomap.ontology_matchers.retrieval.models import BERTRetrieval
from ontomap.ontology_matchers.retrieval.models import SpecterBERTRetrieval
from ontomap.ontology_matchers.retrieval.models import TFIDFRetrieval
```

Track Tasks Imports - Parser

```python
# CommonKG track
from ontomap.ontology.commonkg import NellDbpediaOMDataset, YagoWikidataOMDataset

# MSE track
from ontomap.ontology.mse import MaterialInformationEMMOOMDataset, MaterialInformationMatOntoMDataset

# Phenotype track
from ontomap.ontology.phenotype import DoidOrdoOMDataset, HpMpOMDataset

# Anatomy track
from ontomap.ontology.anatomy import MouseHumanOMDataset

# Biodiv track
from ontomap.ontology.biodiv import EnvoSweetOMDataset, FishZooplanktonOMDataset, \
    MacroalgaeMacrozoobenthosOMDataset, TaxrefldBacteriaNcbitaxonBacteriaOMDataset, \
    TaxrefldChromistaNcbitaxonChromistaOMDataset, TaxrefldFungiNcbitaxonFungiOMDataset, \
    TaxrefldPlantaeNcbitaxonPlantaeOMDataset, TaxrefldProtozoaNcbitaxonProtozoaOMDataset

# Bio-ML track
from ontomap.ontology.bioml import NCITDOIDDiseaseOMDataset, OMIMORDODiseaseOMDataset, \
    SNOMEDFMABodyOMDataset, SNOMEDNCITNeoplasOMDataset, SNOMEDNCITPharmOMDataset
```

Concept-Representations - C, CC, and CP

```python
# Retriever models concept representations
from ontomap.encoder.lightweight import IRILabelInLightweightEncoder           # C
from ontomap.encoder.lightweight import IRILabelChildrensInLightweightEncoder  # CC
from ontomap.encoder.lightweight import IRILabelParentsInLightweightEncoder    # CP

# RAG models concept representations
from ontomap.encoder.rag import IRILabelInRAGEncoder           # C
from ontomap.encoder.rag import IRILabelChildrensInRAGEncoder  # CC
from ontomap.encoder.rag import IRILabelParentsInRAGEncoder    # CP
```
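The three representations differ only in what context is verbalized for each concept: C uses the label alone, CP adds the parents, and CC adds the children. A schematic illustration of that idea (the string templates are assumptions, not the encoders' actual prompt text):

```python
def verbalize(label, parents=(), children=()):
    """Illustrative verbalization of the C / CP / CC concept representations."""
    if parents:    # CP: concept plus its parent concepts
        return f"{label} (parents: {', '.join(parents)})"
    if children:   # CC: concept plus its child concepts
        return f"{label} (children: {', '.join(children)})"
    return label   # C: concept label alone

print(verbalize("endocrine gland"))                                   # C
print(verbalize("endocrine gland", parents=("gland",)))               # CP
print(verbalize("gland", children=("endocrine gland", "exocrine gland")))  # CC
```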

OMPipeline usage

To use the LLMs4OM pipeline, follow the example below. It will run one model at a time over the 20 OM tasks.

```python
from ontomap import OMPipelines

# setting hyperparameters
approach = "rag"
encoder = "rag"
use_all_encoders = False
approach_encoders_to_consider = ['label']  # C representation
use_all_models = False
load_from_json = False
device = "cuda"
do_evaluation = False
batch_size = 16
llm_confidence_th = 0.7
ir_score_threshold = 0.9
model = "['MistralBertRAG']"
outputs = 'output-rag-mistral'

# arguments
args = {
    'approach': approach,
    'encoder': encoder,
    'use-all-encoders': use_all_encoders,
    'approach-encoders-to-consider': approach_encoders_to_consider,
    'use-all-models': use_all_models,
    'models-to-consider': model,
    'load-from-json': load_from_json,
    'device': device,
    'do-evaluation': do_evaluation,
    'outputs-dir': outputs,
    'batch-size': batch_size,
    'llm_confidence_th': llm_confidence_th,
    'ir_score_threshold': ir_score_threshold,
}

# running OMPipelines
runner = OMPipelines(**args)
runner()
```

Citation

If you find this project useful in your work or research, please cite it using the following BibTeX entry:

Pre-print:

```bibtex
@misc{giglou2024llms4om,
  title={LLMs4OM: Matching Ontologies with Large Language Models},
  author={Hamed Babaei Giglou and Jennifer D'Souza and Felix Engel and Sören Auer},
  year={2024},
  eprint={2404.10317},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
```

Owner

  • Name: Hamed Babaei Giglou
  • Login: HamedBabaei
  • Kind: user
  • Location: Germany

Ph.D. Student in Computer Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Babaei Giglou"
  given-names: "Hamed"
  orcid: "https://orcid.org/0000-0003-3758-1454"
- family-names: "D'Souza"
  given-names: "Jennifer"
  orcid: "https://orcid.org/0000-0002-6616-9509"
- family-names: "Engel"
  given-names: "Felix"
  orcid: "https://orcid.org/0000-0002-3060-7052"
- family-names: "Auer"
  given-names: "Sören"
  orcid: "https://orcid.org/0000-0002-0698-2864"
title: "LLMs4OM: Matching Ontologies with Large Language Models"
version: 1.0.0
date-released: 2023-09-15
url: "https://github.com/HamedBabaei/LLMs4OM"

GitHub Events

Total
  • Issues event: 1
  • Watch event: 15
  • Fork event: 4
Last Year
  • Issues event: 1
  • Watch event: 15
  • Fork event: 4

Dependencies

Dockerfile docker
  • jupyter/minimal-notebook latest build
requirements.txt pypi
  • accelerate *
  • bitsandbytes *
  • deeponto *
  • einops *
  • ontospy *
  • openai *
  • owlready2 *
  • pivottablejs *
  • protobuf *
  • python-dotenv *
  • rank_bm25 *
  • rapidfuzz *
  • rdflib *
  • scipy *
  • sentence-transformers *
  • sentencepiece *
  • torch *
  • transformers *