https://github.com/fanavarro/ocalm
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: fanavarro
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 187 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
OCALM: Ontology Coverage Analysis through Language Models
OCALM is a Python script for measuring the domain coverage of an OWL ontology. It takes an ontology (RDF/XML, OWL/XML or N-Triples) and free text as input, identifies the noun phrases in the free text, and tries to match them with ontology classes.
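To make the matching idea concrete, here is a minimal sketch of pairing a noun phrase with its closest class label. It covers only a lexical comparison and uses the standard-library `difflib` as a stand-in; the function names `lexical_score` and `best_class_match` and the sample labels are hypothetical, and this is not OCALM's actual scoring code.

```python
from difflib import SequenceMatcher


def lexical_score(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_class_match(noun_phrase, class_labels):
    """Return the (label, score) pair with the highest lexical score."""
    scored = ((label, lexical_score(noun_phrase, label)) for label in class_labels)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical class labels, for illustration only.
labels = ["Heart disease", "Blood vessel", "Lung"]
match_label, match_score = best_class_match("heart diseases", labels)
print(match_label, round(match_score, 3))
```

A real run would take the noun phrases from the text corpus and the labels from the ontology; the sketch only shows the "best match per phrase" selection step.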
Installation
The script requires Python 3.8 and a UNIX system. It was tested on Ubuntu and CentOS using Python 3.8.19. We recommend using a virtual environment to run OCALM.
Python virtual environment
Installing the required libraries with Python 3.10 is known to cause problems. If you do not have Python 3.8 available, please use the conda installation instead.
Move to the project folder and create a virtual environment:
python -m venv .venv
Activate the virtual environment:
source .venv/bin/activate
Install the required libraries:
python -m pip install -r requirements.txt
Conda virtual environment
This installation method requires conda. See the conda documentation for installation instructions.
Move to the project folder and create the conda environment:
conda env create -f conda-environment.yaml
This creates an environment called ocalm. To activate it, use the following command:
conda activate ocalm
Test the application
The following command runs the application using a test ontology and a test text corpus, both included in the repository:
python ocalm.py --text_folder resources/test_text/folder1 --ontology resources/test_ontologies/ontology.owl --output_prefix testCoverageMetricResults/ --threads 8
If it runs successfully, you will see a folder named testCoverageMetricResults containing the results.
Usage
```
usage: python ocalm.py [-h] --text_folder TEXT_FOLDER --ontology ONTOLOGY
                       [--termfreqthreshold TERMFREQTHRESHOLD]
                       --output_prefix OUTPUT_PREFIX [--threads THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --text_folder TEXT_FOLDER
                        Folder containing natural language text files.
  --ontology ONTOLOGY   Ontology file in RDF/XML, OWL/XML or NTriples.
  --termfreqthreshold TERMFREQTHRESHOLD
                        Threshold to filter the detected noun phrases in the
                        free text based on the noun phrase frequency in the
                        corpus. This threshold is based on the normalized
                        frequency of each term (term frequency / max
                        frequency found), that is from 0 to 1.
  --output_prefix OUTPUT_PREFIX
                        Output prefix to store the results
  --threads THREADS     Threads to use
```
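The normalized frequency used by the term frequency threshold can be illustrated with a short sketch. It follows the formula given in the help text (term frequency divided by the maximum frequency found); the function name `filter_by_normalized_frequency`, the sample phrases, and the choice to keep terms whose value is greater than or equal to the threshold are assumptions, not taken from the OCALM source.

```python
from collections import Counter


def filter_by_normalized_frequency(noun_phrases, threshold):
    """Keep phrases whose normalized frequency reaches the threshold.

    Normalized frequency = phrase count / highest count in the corpus,
    so the most frequent phrase always scores 1.0.
    """
    counts = Counter(noun_phrases)
    if not counts:
        return {}
    max_count = max(counts.values())
    normalized = {term: count / max_count for term, count in counts.items()}
    # Assumption: the comparison is inclusive (>=).
    return {term: value for term, value in normalized.items() if value >= threshold}


# Hypothetical corpus: "gene" appears 4 times, "protein" twice, "cell" once.
phrases = ["gene"] * 4 + ["protein"] * 2 + ["cell"]
kept = filter_by_normalized_frequency(phrases, 0.5)
print(kept)
```

With a threshold of 0, every detected noun phrase is kept; with a threshold of 1, only the most frequent phrase (or phrases, in case of a tie) remains.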
Outputs
The output consists of 2 files:
- text2class.tsv: shows the best ontology class match for each noun phrase identified in the text files.
- class2text.tsv: shows the best noun phrase match for each ontology class.
Both files show the lexical, semantic, and general score for each match, together with the 10 nearest neighbors found for both the noun phrase extracted from the text files and the ontology class.
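As a rough sketch, the result files can be loaded with the standard `csv` module. The example assumes the files are tab-separated with a header row; the column names in the sample (`noun_phrase`, `best_class`, `general_score`) are hypothetical placeholders, since the real header names are not listed here.

```python
import csv
import io


def parse_matches(handle):
    """Parse a tab-separated results file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical sample content; replace with an opened results file.
sample = io.StringIO(
    "noun_phrase\tbest_class\tgeneral_score\n"
    "heart disease\tHeartDisease\t0.93\n"
)
rows = parse_matches(sample)
print(rows[0]["best_class"])
```

For a real run, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample, where `path` points to text2class.tsv or class2text.tsv in the output location you chose.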
Owner
- Name: Francisco Abad
- Login: fanavarro
- Kind: user
- Repositories: 5
- Profile: https://github.com/fanavarro
Dependencies
- Jinja2 ==3.1.2
- MarkupSafe ==2.1.3
- annotated-types ==0.5.0
- blis ==0.7.10
- catalogue ==2.0.9
- certifi ==2023.7.22
- charset-normalizer ==3.2.0
- click ==8.1.7
- confection ==0.1.3
- cymem ==2.0.8
- fuzzywuzzy ==0.18.0
- gensim ==4.3.2
- idna ==3.4
- isodate ==0.6.1
- joblib ==1.3.2
- krippendorff ==0.4.0
- langcodes ==3.3.0
- levenshtein ==0.25.0
- murmurhash ==1.0.10
- nltk ==3.8.1
- numpy ==1.24.4
- owlready2 ==0.44
- packaging ==23.1
- pandas ==2.0.3
- pathy ==0.10.2
- preshed ==3.0.9
- pydantic ==2.3.0
- pydantic_core ==2.6.3
- pyparsing ==3.1.1
- python-dateutil ==2.8.2
- pytorch-lightning ==1.3.0
- pytz ==2023.3.post1
- rdflib ==7.0.0
- regex ==2023.8.8
- requests ==2.31.0
- scikit-learn ==1.3.0
- scipy ==1.10.1
- sentence-transformers ==0.3.3
- six ==1.16.0
- smart-open ==6.4.0
- spacy ==3.6.1
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.4.7
- thinc ==8.1.12
- threadpoolctl ==3.2.0
- tqdm ==4.66.1
- transformers ==3.0.2
- typer ==0.9.0
- typing_extensions ==4.8.0
- tzdata ==2023.3
- urllib3 ==2.0.4
- wasabi ==1.1.2