csvw-ontomap

🗺️ ️Generate CSVW metadata for tabular data files, and map columns to terms in a given OWL ontology using semantic search

https://github.com/vemonet/csvw-ontomap

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to science.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (17.1%) to scientific vocabulary

Keywords

csvw data-extraction linked-data ontology-mapping owl-ontology

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: vemonet
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 119 KB

Statistics

Stars: 2
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Topics

csvw data-extraction linked-data ontology-mapping owl-ontology

Created about 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

# 🔎 CSVW OntoMap 🗺️ [![Test package](https://github.com/vemonet/csvw-ontomap/actions/workflows/test.yml/badge.svg)](https://github.com/vemonet/csvw-ontomap/actions/workflows/test.yml) [![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch) [![linting - Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![code style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)

Automatically generate descriptive CSVW (CSV on the Web) metadata for tabular data files:

- Extract column datatypes: detect whether columns are categorical, and which values are accepted, using ydata-profiling.
- Ontology mappings: when provided with the URL of an OWL ontology, text embeddings are generated for all its classes and properties and stored in a local Qdrant vector database. Similarity search is then used to match each data column to the most relevant ontology terms.
- Currently supports CSV, Excel, and SPSS files. Any format that can be loaded into a Pandas DataFrame could easily be added; create an issue on GitHub to request a new format.
  - Processed files need to contain 1 sheet. If a file contains multiple sheets, only the first one is processed.
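To make the idea of "categorical columns with accepted values" concrete, here is a minimal standard-library sketch. It is not how ydata-profiling decides this: the `profile_columns` helper and its `max_categories` rule are assumptions chosen purely for illustration.

```python
import csv
import io


def profile_columns(csv_text, max_categories=10):
    """Map each column to its sorted accepted values, or None if not categorical."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {}
    for name in rows[0].keys() if rows else []:
        values = {row[name] for row in rows if row[name] != ""}
        # Hypothetical rule: a small set of values that repeat across rows
        is_categorical = 0 < len(values) <= max_categories and len(values) < len(rows)
        profile[name] = sorted(values) if is_categorical else None
    return profile


# Hand-written sample data, for illustration only
sample = "id,sex,age\n1,F,34\n2,M,51\n3,F,47\n4,M,29\n"
print(profile_columns(sample))
```

In this sample only `sex` is reported as categorical, because `id` and `age` never repeat a value.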

[!WARNING]

The lib does not yet check whether the VectorDB has been fully loaded. It skips loading if there are at least 2 vectors in the DB. So if you stop the loading process halfway through, you will need to delete the VectorDB folder to make sure the lib runs the ontology loading again.
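Deleting the folder can be done by hand or from a script. The helper below is a small sketch, not part of the package: `reset_vectordb` is a hypothetical name, and the default path assumes the vectors are stored in `data/vectordb`.

```python
import shutil
from pathlib import Path


def reset_vectordb(path="data/vectordb"):
    """Delete the local vector database folder; return True if it existed."""
    folder = Path(path)
    if not folder.is_dir():
        return False
    # Remove the folder and everything in it so the next run reloads the ontology
    shutil.rmtree(folder)
    return True
```

Calling `reset_vectordb()` before the next run should force the ontology to be loaded from scratch, assuming the folder is the only place where vectors are kept.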

📦️ Installation

This package requires Python >=3.8. Install it with:

```bash
pip install git+https://github.com/vemonet/csvw-ontomap.git
```

🪄 Usage

⌨️ Use as a command-line interface

After installing csvw-ontomap with pip, you can use the package from your terminal:

```bash
csvw-ontomap tests/resources/*.csv
```

Store the CSVW metadata report as a JSON-LD file:

```bash
csvw-ontomap tests/resources/*.csv -o csvw-report.json
```
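Once the report is saved, it can be inspected with a few lines of Python. The sketch below assumes the report follows the standard W3C CSVW layout (`tables`, `tableSchema`, `columns`); the exact keys produced by csvw-ontomap are not confirmed here, and `list_columns` is a hypothetical helper.

```python
import json


def list_columns(report):
    """Yield (table url, column name, datatype) from a CSVW-style dict."""
    # A table group has "tables"; a single-table document is the table itself
    tables = report.get("tables", [report])
    for table in tables:
        columns = table.get("tableSchema", {}).get("columns", [])
        for column in columns:
            yield table.get("url"), column.get("name"), column.get("datatype")


# Hand-written sample in the W3C CSVW shape, not actual tool output
sample = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "patients.csv",
    "tableSchema": {"columns": [
      {"name": "age", "datatype": "integer"},
      {"name": "sex", "datatype": "string"}
    ]}
  }]
}
""")
for row in list_columns(sample):
    print(row)
```

To apply this to a saved report, load the file with `json.load` and pass the resulting dict to `list_columns`.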

Store the CSVW metadata report as a CSV file:

```bash
csvw-ontomap tests/resources/*.csv -o csvw-report.csv
```

Provide the URL to an OWL ontology that will be used to map the column names:

```bash
csvw-ontomap tests/resources/*.csv -m https://semanticscience.org/ontology/sio.owl
```
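The mapping step boils down to embedding text and ranking candidates by similarity. The sketch below illustrates that idea only: it swaps the real embedding model and the Qdrant database for a toy character-trigram embedding and an in-memory cosine similarity, and the ontology labels are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: counts of character trigrams (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(column, labels):
    """Return (label, score) of the label most similar to the column name."""
    query = embed(column)
    scored = ((label, cosine(query, embed(label))) for label in labels)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical ontology labels, chosen for illustration only
labels = ["birth date", "body weight", "patient identifier"]
print(best_match("date_of_birth", labels))
```

A real embedding model captures meaning rather than spelling, so it can also match column names that share no characters with the ontology label.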

Specify the path to store the vectors (default is data/vectordb):

```bash
csvw-ontomap tests/resources/*.csv -m https://semanticscience.org/ontology/sio.owl -d data/vectordb
```

🐍 Use with python

Use this package in Python scripts:

```python
from csvw_ontomap import CsvwProfiler, OntomapConfig
import json

profiler = CsvwProfiler(
    ontologies=["https://semanticscience.org/ontology/sio.owl"],
    vectordb_path="data/vectordb",
    config=OntomapConfig(  # Optional
        comment_best_matches=3,  # Add the ontology matches as comment
        search_threshold=0,  # Between 0 and 1
    ),
)
csvw_report = profiler.profile_files([
    "tests/resources/*.csv",
    "tests/resources/*.xlsx",
    "tests/resources/*.spss",
])
print(json.dumps(csvw_report, indent=2))
```
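The two `OntomapConfig` options in the example (a number of best matches and a search threshold between 0 and 1) suggest a filter-then-rank step over similarity scores. The following sketch shows one plausible reading of that behaviour; `filter_matches` and the scores are hypothetical, and the package may apply these options differently.

```python
def filter_matches(scored, threshold=0.0, best=3):
    """Keep the `best` highest-scoring (term, score) pairs at or above `threshold`."""
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:best]


# Hypothetical similarity scores for one column, for illustration only
scored = [("age", 0.82), ("name", 0.12), ("time", 0.41), ("mass", 0.05)]
print(filter_matches(scored, threshold=0.1, best=3))
```

Under this reading, a threshold of 0 keeps every candidate and leaves the best-matches count as the only limit.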

🧑‍💻 Development setup

This final section of the README is for those who want to run the package in development and contribute code.

📥️ Clone

Clone the repository:

```bash
git clone https://github.com/vemonet/csvw-ontomap
cd csvw-ontomap
```

🐣 Install dependencies

Install Hatch, which automatically handles virtual environments and makes sure all dependencies are installed when you run a script in the project:

```bash
pipx install hatch
```

☑️ Run tests

Make sure the existing tests still pass by running the test suite and linting checks. Note that any pull request to this repository on GitHub will automatically trigger the test suite:

```bash
hatch run test
```

To display all logs when debugging:

```bash
hatch run test -s
```

♻️ Reset the environment

If dependencies are not updating properly, you can reset the virtual environment with:

```bash
hatch env prune
```

Manually trigger installing the dependencies in a local virtual environment:

```bash
hatch -v env create
```

🏷️ New release process

The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:

1. Make sure the PYPI_TOKEN secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from PyPI at pypi.org/manage/account.
2. Increment the version number in the pyproject.toml file in the root folder of the repository:

   ```bash
   hatch version fix
   ```

3. Create a new release on GitHub, which will automatically trigger the publish workflow and publish the new release to PyPI.

You can also do it locally:

```bash
hatch build
hatch publish
```

Owner

Name: Vincent Emonet
Login: vemonet
Kind: user
Location: Maastricht, Netherlands
Company: @MaastrichtU-IDS

Website: https://vemonet.github.io
Repositories: 203
Profile: https://github.com/vemonet

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - orcid: https://orcid.org/0000-0002-1501-1082
    email: vincent.emonet@gmail.com
    given-names: Vincent Emonet
    # affiliation: Institute of Data Science, Maastricht University
title: "CSVW profiling"
repository-code: https://github.com/vemonet/csvw-ontomap
date-released: 2023-12-04
url: https://pypi.org/project/csvw-ontomap
# doi: 10.48550/arXiv.2206.13787

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Committers

Last synced: about 2 years ago

All Time

Total Commits: 9
Total Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 9
Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Vincent Emonet	v**t@g**m	9

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

.github/workflows/publish.yml actions

actions/checkout v4 composite
actions/setup-python v4 composite
pypa/gh-action-pypi-publish release/v1 composite

.github/workflows/test.yml actions

actions/checkout v4 composite
actions/setup-python v4 composite
github/codeql-action/analyze v2 composite
github/codeql-action/autobuild v2 composite
github/codeql-action/init v2 composite

pyproject.toml pypi

csvw *
openpyxl *
owlready2 *
pandas *
qdrant-client [fastembed]
typer >=0.6.0
ydata-profiling *
