https://github.com/fanavarro/ocalm

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.5%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: fanavarro
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 187 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed almost 2 years ago

README.md

OCALM: Ontology Coverage Analysis through Language Models

OCALM is a Python script for measuring the domain coverage of an OWL ontology. It takes an ontology (RDF/XML, OWL/XML, or N-Triples) and free text as input, identifies the noun phrases in the text, and tries to match them with ontology classes.
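To make the matching idea concrete, here is a minimal sketch of the lexical side only. It is not OCALM's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the function names and class labels are hypothetical. The noun phrases are assumed to be already extracted.

```python
from difflib import SequenceMatcher


def lexical_score(phrase, label):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, phrase.lower(), label.lower()).ratio()


def best_class_match(phrase, class_labels):
    """Return (label, score) for the class label most similar to the phrase."""
    scored = ((label, lexical_score(phrase, label)) for label in class_labels)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical ontology class labels and already-extracted noun phrases.
class_labels = ["Heart Disease", "Blood Pressure", "Medication"]
for phrase in ["heart disease", "blood pressures", "surgery"]:
    label, score = best_class_match(phrase, class_labels)
    print(f"{phrase!r} -> {label!r} ({score:.2f})")
```

A purely lexical measure cannot link synonyms that share no characters, which is presumably why the outputs described later also report a semantic score.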

Installation

The script requires Python 3.8 and a UNIX system. It was tested on Ubuntu and CentOS using Python 3.8.19. We recommend running OCALM in a virtual environment.

Python virtual environment

Installing the required libraries has caused problems under Python 3.10. If Python 3.8 is not available on your system, use the conda installation instead.

Move to the project folder and create a virtual environment:

python -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Install the required libraries:

python -m pip install -r requirements.txt

Conda virtual environment

This installation method requires conda. See the conda documentation for installation instructions.

Move to the project folder and create the conda environment:

conda env create -f conda-environment.yaml

This creates an environment called ocalm. To activate it, use the following command:

conda activate ocalm

Test the application

The following command runs the application with a test ontology and a test text corpus, both included in the repository:

python ocalm.py --text_folder resources/test_text/folder1 --ontology resources/test_ontologies/ontology.owl --output_prefix testCoverageMetricResults/ --threads 8

If it runs successfully, you will see a folder named testCoverageMetricResults containing the results.

Usage

```
usage: python ocalm.py [-h] --text_folder TEXT_FOLDER --ontology ONTOLOGY
                       [--term_freq_threshold TERM_FREQ_THRESHOLD]
                       --output_prefix OUTPUT_PREFIX [--threads THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --text_folder TEXT_FOLDER
                        Folder containing natural language text files.
  --ontology ONTOLOGY   Ontology file in RDF/XML, OWL/XML or NTriples.
  --term_freq_threshold TERM_FREQ_THRESHOLD
                        Threshold to filter the detected noun phrases in the
                        free text based on the noun phrase frequency in the
                        corpus. This threshold is based on the normalized
                        frequency of each term (term frequency / max frequency
                        found), that is from 0 to 1.
  --output_prefix OUTPUT_PREFIX
                        Output prefix to store the results
  --threads THREADS     Threads to use
```
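The term-frequency threshold option is described above as a cutoff on normalized frequency (term frequency divided by the maximum frequency found). The following sketch illustrates that calculation under stated assumptions: the function name is hypothetical, phrases are assumed to be already extracted, and whether OCALM keeps phrases exactly at the threshold is not stated in this README (the sketch keeps them).

```python
from collections import Counter


def filter_by_normalized_frequency(phrases, threshold):
    """Keep phrases whose count / max count is at least `threshold`.

    Returns a dict mapping each kept phrase to its normalized frequency.
    """
    counts = Counter(phrases)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {
        phrase: count / max_count
        for phrase, count in counts.items()
        if count / max_count >= threshold  # assumption: inclusive cutoff
    }


# Hypothetical corpus: "gene" occurs 4 times, "protein" twice, "cell" once.
phrases = ["gene"] * 4 + ["protein"] * 2 + ["cell"]
print(filter_by_normalized_frequency(phrases, 0.5))
```

With a threshold of 0.5, "cell" (normalized frequency 0.25) is dropped, while "gene" (1.0) and "protein" (0.5) are kept. A threshold of 0 keeps every phrase.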

Outputs

The output consists of 2 files:

  • text2class.tsv: shows the best ontology class match for each noun phrase identified in the text files.
  • class2text.tsv: shows the best noun phrase match for each ontology class.

Both files show the lexical, semantic, and general score for each match, together with the 10 nearest neighbors found for both the noun phrase extracted from the text files and the ontology class.
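As a rough sketch of how the result files might be inspected, the snippet below loads a tab-separated file with the standard `csv` module and prints its column names. It assumes the files have a header row and that the test command above wrote them to `testCoverageMetricResults/`. Neither detail is confirmed here, and the column names are not listed in this README, so the code reads them from the file rather than hard-coding them.

```python
import csv
from pathlib import Path


def read_matches(tsv_path):
    """Load a tab-separated results file as a list of dicts keyed by header."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Assumed location: output prefix of the test command plus the file name.
results = Path("testCoverageMetricResults") / "text2class.tsv"
if results.exists():
    rows = read_matches(results)
    print(f"{len(rows)} matches")
    if rows:
        print("columns:", ", ".join(rows[0].keys()))
```

The same function works for class2text.tsv, since both files are described as sharing the same kinds of columns.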

Owner

  • Name: Francisco Abad
  • Login: fanavarro
  • Kind: user

Dependencies

requirements.txt pypi
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.3
  • annotated-types ==0.5.0
  • blis ==0.7.10
  • catalogue ==2.0.9
  • certifi ==2023.7.22
  • charset-normalizer ==3.2.0
  • click ==8.1.7
  • confection ==0.1.3
  • cymem ==2.0.8
  • fuzzywuzzy ==0.18.0
  • gensim ==4.3.2
  • idna ==3.4
  • isodate ==0.6.1
  • joblib ==1.3.2
  • krippendorff ==0.4.0
  • langcodes ==3.3.0
  • levenshtein ==0.25.0
  • murmurhash ==1.0.10
  • nltk ==3.8.1
  • numpy ==1.24.4
  • owlready2 ==0.44
  • packaging ==23.1
  • pandas ==2.0.3
  • pathy ==0.10.2
  • preshed ==3.0.9
  • pydantic ==2.3.0
  • pydantic_core ==2.6.3
  • pyparsing ==3.1.1
  • python-dateutil ==2.8.2
  • pytorch-lightning ==1.3.0
  • pytz ==2023.3.post1
  • rdflib ==7.0.0
  • regex ==2023.8.8
  • requests ==2.31.0
  • scikit-learn ==1.3.0
  • scipy ==1.10.1
  • sentence-transformers ==0.3.3
  • six ==1.16.0
  • smart-open ==6.4.0
  • spacy ==3.6.1
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • srsly ==2.4.7
  • thinc ==8.1.12
  • threadpoolctl ==3.2.0
  • tqdm ==4.66.1
  • transformers ==3.0.2
  • typer ==0.9.0
  • typing_extensions ==4.8.0
  • tzdata ==2023.3
  • urllib3 ==2.0.4
  • wasabi ==1.1.2