kitsune

Kitsune is a next-generation data steward and harmonization tool.

https://github.com/scai-bio/kitsune

Science Score: 85.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
    Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

data-harmonization data-stewardship embeddings large-language-models semantic-mapping

Keywords from Contributors

interactive mesh interpretability profiles sequences generic projection optim hacking network-simulation
Last synced: 6 months ago

Repository

Kitsune is a next-generation data steward and harmonization tool.

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 1
  • Open Issues: 11
  • Releases: 35
Topics
data-harmonization data-stewardship embeddings large-language-models semantic-mapping
Created about 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md


Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.

(Formerly: INDEX – the Intelligent Data Steward Toolbox)

Features

  • LLM Embeddings: Uses state-of-the-art language models to capture semantic similarity.
  • Intelligent Mapping: Improves over traditional string matching with context-aware comparisons.
  • Extensible: Designed for integration into modern data harmonization pipelines.

Installation

Run the frontend client, API, vector database and local embedding model using the local docker-compose file:

```bash
docker-compose -f docker-compose.local.yaml up
```

Once running, you can access the frontend at localhost:4200.

Ontology Import via API

The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and requirements, you can choose from the following options:

  1. Importing from OLS (Pre-integrated):

The API is integrated with the Ontology Lookup Service (OLS), allowing you to import any ontology from their catalog.

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
  -H 'accept: application/json'
```

  • terminology_id (required): The ID of the ontology you want to import (e.g., hp, efo, chebi).
  • model (optional): The vectorizer model to use for generating embeddings.

Example:

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology?terminology_id=hp' \
  -H 'accept: application/json'
```
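If you are scripting imports, the request URL from the curl example above can be assembled programmatically. A minimal Python sketch, assuming the same endpoint and query parameters; `api_url` (here `http://localhost:8000`) is a hypothetical base URL you must replace with your own deployment's API address:

```python
from typing import Optional
from urllib.parse import urlencode


def build_import_url(api_url: str, terminology_id: str,
                     model: Optional[str] = None) -> str:
    """Build the terminology-import URL matching the curl example.

    `model` is the optional vectorizer model; it is omitted from the
    query string when not given.
    """
    params = {"terminology_id": terminology_id}
    if model is not None:
        params["model"] = model
    return f"{api_url.rstrip('/')}/imports/terminology?{urlencode(params)}"


# Example: import the Human Phenotype Ontology (hp) from OLS.
# Send this URL with an HTTP PUT, as in the curl example above.
url = build_import_url("http://localhost:8000", "hp")
print(url)
```

The same helper covers the optional `model` parameter by passing it explicitly, e.g. `build_import_url(api, "efo", model="my-vectorizer")`.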

  2. Importing SNOMED CT:

SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with terminology_id=snomed, but provides a cleaner interface.

```bash
curl -X 'PUT' \
  '{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
  -H 'accept: application/json'
```

Parameters:

  • model (optional): The vectorizer model to be used for generating embeddings.
  3. Importing Your Own Ontology (JSONL Files):

For full flexibility, you can upload your own ontology using .jsonl (JSON Lines) files. This allows you to import:

  • Terminologies (namespaces)
  • Concepts (terms within the terminology)
  • Mappings (links between embeddings and existing concepts)

⚠️ The objects should be imported in the following order:

  1. "Terminology"
  2. "Concepts"
  3. "Mappings"

```bash
curl -X 'PUT' \
  '{api_url}/imports/jsonl?object_type={object_type}' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@{your_file}.jsonl'
```

  • object_type (required): One of terminology, concept, or mapping.
  • file (required): The .jsonl file to be uploaded (multipart/form-data).
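As a sketch, the three .jsonl payloads could be generated in the required order like this. The example records mirror the Terminology, Concept, and Mapping structures documented in this README; the file names are arbitrary choices for illustration:

```python
import json

# Example records matching the documented JSONL structures.
terminology = {"id": "OHDSI",
               "name": "Observational Health Data Sciences and Informatics"}
concepts = [{"concept_identifier": "OHDSI:45756805",
             "pref_label": "Pediatric Cardiology",
             "terminology_id": "OHDSI"}]
mappings = [{"text": "Pediatric Cardiology",
             "concept_identifier": "OHDSI:45756805"}]

# Write one file per object type, in the import order the API requires:
# terminology first, then concepts, then mappings.
for object_type, rows in [("terminology", [terminology]),
                          ("concept", concepts),
                          ("mapping", mappings)]:
    with open(f"{object_type}.jsonl", "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")  # one JSON object per line
```

Each resulting file would then be uploaded with the curl command above, passing the matching `object_type` query parameter.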

JSONL File Structure

Each line in your .jsonl file must represent a single object. The structures for Terminology, Concept, and Mapping are described below.

Terminology

Represents an ontology namespace.

Attributes:

  • id: Abbreviation of the terminology.
  • name: Full name of the terminology.

```json
{
  "id": "OHDSI",
  "name": "Observational Health Data Sciences and Informatics"
}
```

Concept

Represents an individual entry within a terminology.

Attributes:

  • concept_identifier: Concept entry ID within the terminology.
  • pref_label: Preferred label for the entry.
  • terminology_id: Reference to the terminology it belongs to.

```json
{
  "concept_identifier": "OHDSI:45756805",
  "pref_label": "Pediatric Cardiology",
  "terminology_id": "OHDSI"
}
```

Mapping

Links a textual description to a concept.

Attributes:

  • text: Description of the associated concept, or pref_label if the description is missing.
  • concept_identifier: Reference to the associated concept.

```json
{
  "text": "Pediatric Cardiology",
  "concept_identifier": "OHDSI:45756805"
}
```
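Before uploading, it can help to sanity-check each line of a .jsonl file against the structures above. A minimal validation sketch (not part of Kitsune; the required-key sets are taken directly from the attribute lists documented in this README):

```python
import json

# Required keys per object type, as documented above.
REQUIRED_KEYS = {
    "terminology": {"id", "name"},
    "concept": {"concept_identifier", "pref_label", "terminology_id"},
    "mapping": {"text", "concept_identifier"},
}


def validate_jsonl_line(line: str, object_type: str) -> dict:
    """Parse one JSONL line and verify it has the documented keys."""
    obj = json.loads(line)
    missing = REQUIRED_KEYS[object_type] - obj.keys()
    if missing:
        raise ValueError(f"{object_type} line missing keys: {sorted(missing)}")
    return obj


# Example: validate the Mapping record shown above.
validate_jsonl_line(
    '{"text": "Pediatric Cardiology", "concept_identifier": "OHDSI:45756805"}',
    "mapping",
)
```

Running this over every line of a file before the PUT request surfaces malformed records early, rather than mid-import.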

Owner

  • Name: Fraunhofer SCAI Bioinformatics Department
  • Login: SCAI-BIO
  • Kind: organization

Department of Bioinformatics at Fraunhofer SCAI

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Kitsune: a next-generation data steward and harmonization tool"
type: software
message: If you use this software, please cite it as below.
license: Apache-2.0
language: en
authors:
  - given-names: Mehmet Can
    family-names: Ay
    email: mehmet.ay@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2977-7695
  - given-names: Tim
    family-names: Adams
    email: tim.adams@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2823-0102
repository-code: https://github.com/SCAI-BIO/kitsune
abstract: >
  Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of 
  systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar 
  terms even when their string representations differ substantially. This results in more robust data 
  harmonization and improved performance in real-world scenarios.
keywords:
  - data harmonization
  - data stewardship
  - large language models
  - LLM

preferred-citation:
  type: article
  authors:
    - family-names: Salimi
      given-names: Yasamin
    - family-names: Adams
      given-names: Tim
    - family-names: Ay
      given-names: Mehmet Can
    - family-names: Balabin
      given-names: Helena
    - family-names: Jacobs
      given-names: Marc
    - family-names: Hofmann-Apitius
      given-names: Martin
  title: Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema
  journal: Scientific Reports
  year: 2025
  doi: 10.1038/s41598-025-06447-2

references:
  - type: conference-paper
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.24406/publica-4577
  - type: poster
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.4126/FRL01-006472846

GitHub Events

Total
  • Create event: 53
  • Release event: 8
  • Issues event: 27
  • Delete event: 43
  • Issue comment event: 18
  • Push event: 103
  • Pull request review event: 9
  • Pull request review comment event: 6
  • Pull request event: 91
Last Year
  • Create event: 53
  • Release event: 8
  • Issues event: 27
  • Delete event: 43
  • Issue comment event: 18
  • Push event: 103
  • Pull request review event: 9
  • Pull request review comment event: 6
  • Pull request event: 91

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 266
  • Total Committers: 5
  • Avg Commits per committer: 53.2
  • Development Distribution Score (DDS): 0.508
Past Year
  • Commits: 207
  • Committers: 5
  • Avg Commits per committer: 41.4
  • Development Distribution Score (DDS): 0.401
Top Committers
Name Email Commits
Mehmet Can Ay m****y@g****m 131
TimAdams84 t****s@g****t 120
dependabot[bot] 4****] 7
Christian Ebeling c****g@s****e 5
Raimondo Lazzara l****a@w****e 3
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 18
  • Total pull requests: 79
  • Average time to close issues: 25 days
  • Average time to close pull requests: 3 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.08
  • Merged pull requests: 59
  • Bot issues: 0
  • Bot pull requests: 25
Past Year
  • Issues: 18
  • Pull requests: 79
  • Average time to close issues: 25 days
  • Average time to close pull requests: 3 days
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.08
  • Merged pull requests: 59
  • Bot issues: 0
  • Bot pull requests: 25
Top Authors
Issue Authors
  • mehmetcanay (11)
  • tiadams (7)
Pull Request Authors
  • mehmetcanay (45)
  • dependabot[bot] (25)
  • tiadams (9)
Top Labels
Issue Labels
enhancement (5) bug (4) typescript (4) chore (3) python (1) frontend (1) documentation (1)
Pull Request Labels
dependencies (25) javascript (21) typescript (13) chore (11) python (9) enhancement (6) bug (5) documentation (4) frontend (1)

Dependencies

.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
requirements.txt pypi
  • matplotlib *
  • numpy ==1.25.2
  • openai *
  • openpyxl *
  • pandas ==2.1.0
  • pip ==21.3.1
  • plotly *
  • python-dateutil ==2.8.2
  • python-dotenv *
  • pytz ==2023.3
  • scikit-learn *
  • seaborn *
  • setuptools ==60.2.0
  • six ==1.16.0
  • thefuzz *
  • tzdata ==2023.3
  • wheel ==0.37.1
.github/workflows/docker-package.yml actions
  • actions/checkout v2 composite
  • docker/build-push-action v2 composite
  • docker/login-action v1 composite
Dockerfile docker
  • python 3.9 build
.github/workflows/python-publish.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
poetry.lock pypi
  • aiofiles 0.7.0
  • aiohttp 3.9.3
  • aiosignal 1.3.1
  • anyio 4.3.0
  • async-timeout 4.0.3
  • attrs 23.2.0
  • certifi 2024.2.2
  • charset-normalizer 3.3.2
  • click 8.1.7
  • colorama 0.4.6
  • contourpy 1.2.0
  • cycler 0.12.1
  • et-xmlfile 1.1.0
  • exceptiongroup 1.2.0
  • fastapi 0.87.0
  • filelock 3.13.1
  • fonttools 4.49.0
  • frozenlist 1.4.1
  • fsspec 2024.2.0
  • greenlet 3.0.3
  • h11 0.14.0
  • huggingface-hub 0.20.3
  • idna 3.6
  • importlib-resources 6.1.1
  • jinja2 3.1.3
  • joblib 1.3.2
  • kiwisolver 1.4.5
  • markupsafe 2.1.5
  • matplotlib 3.8.3
  • mpmath 1.3.0
  • multidict 6.0.5
  • networkx 3.2.1
  • nltk 3.8.1
  • numpy 1.25.2
  • nvidia-cublas-cu12 12.1.3.1
  • nvidia-cuda-cupti-cu12 12.1.105
  • nvidia-cuda-nvrtc-cu12 12.1.105
  • nvidia-cuda-runtime-cu12 12.1.105
  • nvidia-cudnn-cu12 8.9.2.26
  • nvidia-cufft-cu12 11.0.2.54
  • nvidia-curand-cu12 10.3.2.106
  • nvidia-cusolver-cu12 11.4.5.107
  • nvidia-cusparse-cu12 12.1.0.106
  • nvidia-nccl-cu12 2.19.3
  • nvidia-nvjitlink-cu12 12.3.101
  • nvidia-nvtx-cu12 12.1.105
  • openai 0.28.1
  • openpyxl 3.1.2
  • packaging 23.2
  • pandas 2.1.0
  • pillow 10.2.0
  • pip 21.3.1
  • plotly 5.17.0
  • pydantic 1.10.14
  • pyparsing 3.1.1
  • python-dateutil 2.8.2
  • python-dotenv 1.0.1
  • python-multipart 0.0.9
  • pytz 2023.3
  • pyyaml 6.0.1
  • rapidfuzz 3.6.1
  • regex 2023.12.25
  • requests 2.31.0
  • safetensors 0.4.2
  • scikit-learn 1.3.2
  • scipy 1.11.4
  • seaborn 0.13.2
  • sentence-transformers 2.3.1
  • sentencepiece 0.2.0
  • setuptools 60.2.0
  • six 1.16.0
  • sniffio 1.3.0
  • sqlalchemy 2.0.27
  • starlette 0.21.0
  • sympy 1.12
  • tenacity 8.2.3
  • thefuzz 0.20.0
  • threadpoolctl 3.3.0
  • tokenizers 0.15.2
  • torch 2.2.1
  • tqdm 4.66.2
  • transformers 4.38.1
  • triton 2.2.0
  • typing-extensions 4.9.0
  • tzdata 2023.3
  • urllib3 2.2.1
  • uvicorn 0.27.1
  • wheel 0.37.1
  • yarl 1.9.4
  • zipp 3.17.0
pyproject.toml pypi
  • aiofiles >=0.7.0,<0.8.0
  • fastapi >=0.87.0,<0.88.0
  • matplotlib >=3.8.1,<3.9.0
  • numpy 1.25.2
  • openai >=0.28.0,<0.29.0
  • openpyxl ^3.1.2
  • pandas 2.1.0
  • pip 21.3.1
  • plotly >=5.17.0,<5.18.0
  • python ^3.9
  • python-dateutil 2.8.2
  • python-dotenv >=1.0.0,<1.1.0
  • python-multipart ^0.0.9
  • pytz 2023.3
  • scikit-learn 1.3.2
  • scipy >=1.11.4,<1.12.0
  • seaborn >=0.13.0,<0.14.0
  • sentence-transformers 2.3.1
  • setuptools 60.2.0
  • six 1.16.0
  • sqlalchemy >=2.0.27,<2.1.0
  • starlette >=0.21.0,<0.22.0
  • thefuzz >=0.20.0,<0.21.0
  • tzdata 2023.3
  • uvicorn >=0.15.0
  • wheel 0.37.1
setup.py pypi