kitsune

Kitsune is a next-generation data steward and harmonization tool.

https://github.com/scai-bio/kitsune

Science Score: 85.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
    Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

data-harmonization data-stewardship embeddings large-language-models semantic-mapping

Keywords from Contributors

interactive mesh interpretability profiles sequences generic projection optim hacking network-simulation
Last synced: 6 months ago

Repository

Kitsune is a next-generation data steward and harmonization tool.

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 1
  • Open Issues: 11
  • Releases: 35
Topics
data-harmonization data-stewardship embeddings large-language-models semantic-mapping
Created about 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md


Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.

(Formerly: INDEX – the Intelligent Data Steward Toolbox)

Features

  • LLM Embeddings: Uses state-of-the-art language models to capture semantic similarity.
  • Intelligent Mapping: Improves over traditional string matching with context-aware comparisons.
  • Extensible: Designed for integration into modern data harmonization pipelines.

Installation

Run the frontend client, API, vector database and local embedding model using the local docker-compose file:

```bash
docker-compose -f docker-compose.local.yaml up
```

Once running, you can access the frontend at localhost:4200.

Ontology Import via API

The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and requirements, you can choose from the following options:

  1. Importing from OLS (Pre-integrated):

The API is integrated with the Ontology Lookup Service (OLS), allowing you to import any ontology from their catalog.

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
  -H 'accept: application/json'
```

  • terminology_id (required): The ID of the ontology you want to import (e.g., hp, efo, chebi).
  • model (optional): The vectorizer model to use for generating embeddings.

Example:

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id=hp' \
  -H 'accept: application/json'
```
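If you are scripting imports, the request URL from the curl example above can be assembled programmatically. A minimal Python sketch, assuming the same endpoint and query parameters; `api_url` (here `http://localhost:8000`) is a hypothetical base URL you must replace with your own deployment's API address:

```python
from typing import Optional
from urllib.parse import urlencode


def build_import_url(api_url: str, terminology_id: str,
                     model: Optional[str] = None) -> str:
    """Build the terminology-import URL matching the curl example.

    `model` is the optional vectorizer model; it is omitted from the
    query string when not given.
    """
    params = {"terminology_id": terminology_id}
    if model is not None:
        params["model"] = model
    return f"{api_url.rstrip('/')}/imports/terminology?{urlencode(params)}"


# Example: import the Human Phenotype Ontology (hp) from OLS.
# Send this URL with an HTTP PUT, as in the curl example above.
url = build_import_url("http://localhost:8000", "hp")
print(url)
```

The same helper covers the optional `model` parameter by passing it explicitly, e.g. `build_import_url(api, "efo", model="my-vectorizer")`.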

  2. Importing SNOMED CT:

SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with terminology_id=snomed, but provides a cleaner interface.

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
  -H 'accept: application/json'
```

Parameters:

  • model (optional): The vectorizer model to be used for generating embeddings.
  3. Importing Your Own Ontology (JSONL Files):

For full flexibility, you can upload your own ontology using .jsonl (JSON Lines) files. This allows you to import:

  • Terminologies (namespaces)
  • Concepts (terms within the terminology)
  • Mappings (links between embeddings and existing concepts)

⚠️ The objects should be imported in the following order:

  1. "Terminology"
  2. "Concepts"
  3. "Mappings"

```bash
curl -X 'PUT' \
  '{api_url}/imports/jsonl?object_type={object_type}' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@{your_file}.jsonl'
```

  • object_type (required): One of terminology, concept, or mapping.
  • file (required): The .jsonl file to be uploaded (multipart/form-data).
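As a sketch, the three .jsonl payloads could be generated in the required order like this. The example records mirror the Terminology, Concept, and Mapping structures documented in this README; the file names are arbitrary choices for illustration:

```python
import json

# Example records matching the documented JSONL structures.
terminology = {"id": "OHDSI",
               "name": "Observational Health Data Sciences and Informatics"}
concepts = [{"concept_identifier": "OHDSI:45756805",
             "pref_label": "Pediatric Cardiology",
             "terminology_id": "OHDSI"}]
mappings = [{"text": "Pediatric Cardiology",
             "concept_identifier": "OHDSI:45756805"}]

# Write one file per object type, in the import order the API requires:
# terminology first, then concepts, then mappings.
for object_type, rows in [("terminology", [terminology]),
                          ("concept", concepts),
                          ("mapping", mappings)]:
    with open(f"{object_type}.jsonl", "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")  # one JSON object per line
```

Each resulting file would then be uploaded with the curl command above, passing the matching `object_type` query parameter.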

JSONL File Structure

Each line in your .jsonl file must represent a single object. The structures for Terminology, Concept, and Mapping are described below.

Terminology

Represents an ontology namespace.

Attributes:

  • id: Abbreviation of the terminology.
  • name: Full name of the terminology.

```json
{
  "id": "OHDSI",
  "name": "Observational Health Data Sciences and Informatics"
}
```

Concept

Represents an individual entry within a terminology.

Attributes:

  • concept_identifier: Concept entry ID within the terminology.
  • pref_label: Preferred label for the entry.
  • terminology_id: Reference to the terminology it belongs to.

```json
{
  "concept_identifier": "OHDSI:45756805",
  "pref_label": "Pediatric Cardiology",
  "terminology_id": "OHDSI"
}
```

Mapping

Links a textual description to a concept.

Attributes:

  • text: Description of the associated concept, or pref_label if the description is missing.
  • concept_identifier: Reference to the associated concept.

```json
{
  "text": "Pediatric Cardiology",
  "concept_identifier": "OHDSI:45756805"
}
```
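Before uploading, it can help to sanity-check each line of a .jsonl file against the structures above. A minimal validation sketch (not part of Kitsune; the required-key sets are taken directly from the attribute lists documented in this README):

```python
import json

# Required keys per object type, as documented above.
REQUIRED_KEYS = {
    "terminology": {"id", "name"},
    "concept": {"concept_identifier", "pref_label", "terminology_id"},
    "mapping": {"text", "concept_identifier"},
}


def validate_jsonl_line(line: str, object_type: str) -> dict:
    """Parse one JSONL line and verify it has the documented keys."""
    obj = json.loads(line)
    missing = REQUIRED_KEYS[object_type] - obj.keys()
    if missing:
        raise ValueError(f"{object_type} line missing keys: {sorted(missing)}")
    return obj


# Example: validate the Mapping record shown above.
validate_jsonl_line(
    '{"text": "Pediatric Cardiology", "concept_identifier": "OHDSI:45756805"}',
    "mapping",
)
```

Running this over every line of a file before the PUT request surfaces malformed records early, rather than mid-import.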

Owner

  • Name: Fraunhofer SCAI Bioinformatics Department
  • Login: SCAI-BIO
  • Kind: organization

Department of Bioinformatics at Fraunhofer SCAI

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Kitsune: a next-generation data steward and harmonization tool"
type: software
message: If you use this software, please cite it as below.
license: Apache-2.0
language: en
authors:
  - given-names: Mehmet Can
    family-names: Ay
    email: mehmet.ay@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2977-7695
  - given-names: Tim
    family-names: Adams
    email: tim.adams@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2823-0102
repository-code: https://github.com/SCAI-BIO/kitsune
abstract: >
  Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of 
  systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar 
  terms even when their string representations differ substantially. This results in more robust data 
  harmonization and improved performance in real-world scenarios.
keywords:
  - data harmonization
  - data stewardship
  - large language models
  - LLM

preferred-citation:
  type: article
  authors:
    - family-names: Salimi
      given-names: Yasamin
    - family-names: Adams
      given-names: Tim
    - family-names: Ay
      given-names: Mehmet Can
    - family-names: Balabin
      given-names: Helena
    - family-names: Jacobs
      given-names: Marc
    - family-names: Hofmann-Apitius
      given-names: Martin
  title: Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema
  journal: Scientific Reports
  year: 2025
  doi: 10.1038/s41598-025-06447-2

references:
  - type: conference-paper
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.24406/publica-4577
  - type: poster
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.4126/FRL01-006472846

GitHub Events

Total
  • Create event: 53
  • Release event: 8
  • Issues event: 27
  • Delete event: 43
  • Issue comment event: 18
  • Push event: 103
  • Pull request review event: 9
  • Pull request review comment event: 6
  • Pull request event: 91
Last Year
  • Create event: 53
  • Release event: 8
  • Issues event: 27
  • Delete event: 43
  • Issue comment event: 18
  • Push event: 103
  • Pull request review event: 9
  • Pull request review comment event: 6
  • Pull request event: 91

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 266
  • Total Committers: 5
  • Avg Commits per committer: 53.2
  • Development Distribution Score (DDS): 0.508
Past Year
  • Commits: 207
  • Committers: 5
  • Avg Commits per committer: 41.4
  • Development Distribution Score (DDS): 0.401
Top Committers
Name Email Commits
Mehmet Can Ay m****y@g****m 131
TimAdams84 t****s@g****t 120
dependabot[bot] 4****] 7
Christian Ebeling c****g@s****e 5
Raimondo Lazzara l****a@w****e 3
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 18
  • Total pull requests: 79
  • Average time to close issues: 25 days
  • Average time to close pull requests: 3 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.08
  • Merged pull requests: 59
  • Bot issues: 0
  • Bot pull requests: 25
Past Year
  • Issues: 18
  • Pull requests: 79
  • Average time to close issues: 25 days
  • Average time to close pull requests: 3 days
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.08
  • Merged pull requests: 59
  • Bot issues: 0
  • Bot pull requests: 25
Top Authors
Issue Authors
  • mehmetcanay (11)
  • tiadams (7)
Pull Request Authors
  • mehmetcanay (45)
  • dependabot[bot] (25)
  • tiadams (9)
Top Labels
Issue Labels
enhancement (5) bug (4) typescript (4) chore (3) python (1) frontend (1) documentation (1)
Pull Request Labels
dependencies (25) javascript (21) typescript (13) chore (11) python (9) enhancement (6) bug (5) documentation (4) frontend (1)

Dependencies

.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
requirements.txt pypi
  • matplotlib *
  • numpy ==1.25.2
  • openai *
  • openpyxl *
  • pandas ==2.1.0
  • pip ==21.3.1
  • plotly *
  • python-dateutil ==2.8.2
  • python-dotenv *
  • pytz ==2023.3
  • scikit-learn *
  • seaborn *
  • setuptools ==60.2.0
  • six ==1.16.0
  • thefuzz *
  • tzdata ==2023.3
  • wheel ==0.37.1
.github/workflows/docker-package.yml actions
  • actions/checkout v2 composite
  • docker/build-push-action v2 composite
  • docker/login-action v1 composite
Dockerfile docker
  • python 3.9 build
.github/workflows/python-publish.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
poetry.lock pypi
  • aiofiles 0.7.0
  • aiohttp 3.9.3
  • aiosignal 1.3.1
  • anyio 4.3.0
  • async-timeout 4.0.3
  • attrs 23.2.0
  • certifi 2024.2.2
  • charset-normalizer 3.3.2
  • click 8.1.7
  • colorama 0.4.6
  • contourpy 1.2.0
  • cycler 0.12.1
  • et-xmlfile 1.1.0
  • exceptiongroup 1.2.0
  • fastapi 0.87.0
  • filelock 3.13.1
  • fonttools 4.49.0
  • frozenlist 1.4.1
  • fsspec 2024.2.0
  • greenlet 3.0.3
  • h11 0.14.0
  • huggingface-hub 0.20.3
  • idna 3.6
  • importlib-resources 6.1.1
  • jinja2 3.1.3
  • joblib 1.3.2
  • kiwisolver 1.4.5
  • markupsafe 2.1.5
  • matplotlib 3.8.3
  • mpmath 1.3.0
  • multidict 6.0.5
  • networkx 3.2.1
  • nltk 3.8.1
  • numpy 1.25.2
  • nvidia-cublas-cu12 12.1.3.1
  • nvidia-cuda-cupti-cu12 12.1.105
  • nvidia-cuda-nvrtc-cu12 12.1.105
  • nvidia-cuda-runtime-cu12 12.1.105
  • nvidia-cudnn-cu12 8.9.2.26
  • nvidia-cufft-cu12 11.0.2.54
  • nvidia-curand-cu12 10.3.2.106
  • nvidia-cusolver-cu12 11.4.5.107
  • nvidia-cusparse-cu12 12.1.0.106
  • nvidia-nccl-cu12 2.19.3
  • nvidia-nvjitlink-cu12 12.3.101
  • nvidia-nvtx-cu12 12.1.105
  • openai 0.28.1
  • openpyxl 3.1.2
  • packaging 23.2
  • pandas 2.1.0
  • pillow 10.2.0
  • pip 21.3.1
  • plotly 5.17.0
  • pydantic 1.10.14
  • pyparsing 3.1.1
  • python-dateutil 2.8.2
  • python-dotenv 1.0.1
  • python-multipart 0.0.9
  • pytz 2023.3
  • pyyaml 6.0.1
  • rapidfuzz 3.6.1
  • regex 2023.12.25
  • requests 2.31.0
  • safetensors 0.4.2
  • scikit-learn 1.3.2
  • scipy 1.11.4
  • seaborn 0.13.2
  • sentence-transformers 2.3.1
  • sentencepiece 0.2.0
  • setuptools 60.2.0
  • six 1.16.0
  • sniffio 1.3.0
  • sqlalchemy 2.0.27
  • starlette 0.21.0
  • sympy 1.12
  • tenacity 8.2.3
  • thefuzz 0.20.0
  • threadpoolctl 3.3.0
  • tokenizers 0.15.2
  • torch 2.2.1
  • tqdm 4.66.2
  • transformers 4.38.1
  • triton 2.2.0
  • typing-extensions 4.9.0
  • tzdata 2023.3
  • urllib3 2.2.1
  • uvicorn 0.27.1
  • wheel 0.37.1
  • yarl 1.9.4
  • zipp 3.17.0
pyproject.toml pypi
  • aiofiles >=0.7.0,<0.8.0
  • fastapi >=0.87.0,<0.88.0
  • matplotlib >=3.8.1,<3.9.0
  • numpy 1.25.2
  • openai >=0.28.0,<0.29.0
  • openpyxl ^3.1.2
  • pandas 2.1.0
  • pip 21.3.1
  • plotly >=5.17.0,<5.18.0
  • python ^3.9
  • python-dateutil 2.8.2
  • python-dotenv >=1.0.0,<1.1.0
  • python-multipart ^0.0.9
  • pytz 2023.3
  • scikit-learn 1.3.2
  • scipy >=1.11.4,<1.12.0
  • seaborn >=0.13.0,<0.14.0
  • sentence-transformers 2.3.1
  • setuptools 60.2.0
  • six 1.16.0
  • sqlalchemy >=2.0.27,<2.1.0
  • starlette >=0.21.0,<0.22.0
  • thefuzz >=0.20.0,<0.21.0
  • tzdata 2023.3
  • uvicorn >=0.15.0
  • wheel 0.37.1
setup.py pypi