kitsune
Kitsune is a next-generation data steward and harmonization tool.
Science Score: 85.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ✓ Institutional organization owner: Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: SCAI-BIO
- License: apache-2.0
- Language: TypeScript
- Default Branch: main
- Homepage: https://kitsune.scai.fraunhofer.de
- Size: 12 MB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 11
- Releases: 35
Metadata Files
README.md
Kitsune
Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar terms even when their string representations differ substantially. This results in more robust data harmonization and improved performance in real-world scenarios.
(Formerly: INDEX – the Intelligent Data Steward Toolbox)
Features
- LLM Embeddings: Uses state-of-the-art language models to capture semantic similarity.
- Intelligent Mapping: Improves over traditional string matching with context-aware comparisons.
- Extensible: Designed for integration into modern data harmonization pipelines.
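The idea behind embedding-based mapping can be illustrated with a minimal sketch. It assumes each term has already been turned into a vector, and it uses hand-made 3-dimensional toy vectors rather than output from any real language model. The function names `cosine_similarity` and `best_match` are hypothetical and are not part of Kitsune's API.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def best_match(query_vector, candidates):
    """Return the (label, score) pair whose vector is most similar to the query."""
    scored = ((label, cosine_similarity(query_vector, vector))
              for label, vector in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors invented for illustration only; a real system would obtain
# them from an embedding model.
concept_vectors = {
    "Myocardial infarction": [0.9, 0.1, 0.0],
    "Hearing loss": [0.0, 0.2, 0.9],
}
query_label = "Heart attack"
query_vector = [0.8, 0.2, 0.1]

label, score = best_match(query_vector, concept_vectors)
# Character-level similarity of the two labels, for comparison.
string_score = SequenceMatcher(None, query_label.lower(), label.lower()).ratio()
print(f"best match: {label}, cosine={score:.2f}, string ratio={string_score:.2f}")
```

With these toy vectors the query is matched by vector direction, not by shared characters, which is the property the feature list describes.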
Installation
Run the frontend client, API, vector database and local embedding model using the local docker-compose file:
bash
docker-compose -f docker-compose.local.yaml up
Once the containers are running, the frontend is available at localhost:4200.
Ontology Import via API
The API supports multiple methods for importing ontology (terminology) data into the system. Depending on your source and requirements, you can choose from the following options:
- Importing from OLS (Pre-integrated):
The API is integrated with the Ontology Lookup Service (OLS), allowing you to import any ontology from their catalog.
bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id={terminology_id}&model={vectorizer_model}' \
-H 'accept: application/json'
Parameters:
- terminology_id (required): The ID of the ontology you want to import (e.g., hp, efo, chebi).
- vectorizer_model (optional): The vectorizer model to use for generating embeddings.
Example:
bash
curl -X 'PUT' \
'{api_url}/imports/terminology?terminology_id=hp' \
-H 'accept: application/json'
- Importing SNOMED CT:
SNOMED CT can be imported using a shortcut endpoint. This is equivalent to using the OLS integration with terminology_id=snomed, but provides a cleaner interface.
bash
curl -X 'PUT' \
'{api_url}/imports/terminology/snomed?model={vectorizer_model}' \
-H 'accept: application/json'
Parameters:
- vectorizer_model (optional): The vectorizer model to use for generating embeddings.
- Importing Your Own Ontology (JSONL Files):
For full flexibility, you can upload your own ontology using .jsonl (JSON Lines) files. This allows you to import:
- Terminologies (namespaces)
- Concepts (terms within the terminology)
- Mappings (links between embeddings and existing concepts)
⚠️ The objects should be imported in the following order:
- "Terminology"
- "Concepts"
- "Mappings"
bash
curl -X 'PUT' \
'{api_url}/imports/jsonl?object_type={object_type}' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@{your_file}.jsonl'
Parameters:
- object_type (required): One of terminology, concept, or mapping.
- file (required): The .jsonl file to upload (multipart/form-data).
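The three import calls above can also be issued from a script. The following is a minimal sketch using only the Python standard library; it mirrors the paths, query parameters, and the `file` form field from the curl commands. The base URL `http://localhost:8000` is an assumption standing in for `{api_url}`, and the helper names `build_import_url`, `encode_multipart`, and `put` are hypothetical.

```python
import urllib.parse
import urllib.request
import uuid


def build_import_url(api_url, path, **params):
    """Join the base URL, an import path, and any query parameters that are set."""
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{api_url.rstrip('/')}/imports/{path}"
    return f"{url}?{query}" if query else url


def encode_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def put(url, body=None, content_type=None):
    """Send a PUT request and return the HTTP status code."""
    headers = {"accept": "application/json"}
    if content_type:
        headers["Content-Type"] = content_type
    request = urllib.request.Request(url, data=body, headers=headers, method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.status


API_URL = "http://localhost:8000"  # assumption: replace with your {api_url}

# Optional parameters left as None are omitted from the query string.
ols_url = build_import_url(API_URL, "terminology", terminology_id="hp", model=None)
snomed_url = build_import_url(API_URL, "terminology/snomed", model=None)
jsonl_url = build_import_url(API_URL, "jsonl", object_type="terminology")

for url in (ols_url, snomed_url, jsonl_url):
    print(url)
```

To run an import, call `put(ols_url)` against a running API. For a JSONL upload, read the file as bytes, pass it through `encode_multipart("file", "terminology.jsonl", data)`, and hand the resulting body and content type to `put(jsonl_url, body, content_type)`.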
JSONL File Structure
Each line in your .jsonl file must represent a single object. The structures for Terminology, Concept, and Mapping are described below.
Terminology
Represents an ontology namespace.
Attributes:
- id: Abbreviation of the terminology.
- name: Full name of the terminology.
json
{
"id": "OHDSI",
"name": "Observational Health Data Sciences and Informatics"
}
Concept
Represents an individual entry within a terminology.
Attributes:
- concept_identifier: Concept entry ID within the terminology.
- pref_label: Preferred label for the entry.
- terminology_id: Reference to the terminology it belongs to.
json
{
"concept_identifier": "OHDSI:45756805",
"pref_label": "Pediatric Cardiology",
"terminology_id": "OHDSI"
}
Mapping
Links a textual description to a concept.
Attributes:
- text: Description of the associated concept, or pref_label if the description is missing.
- concept_identifier: Reference to the associated concept.
json
{
"text": "Pediatric Cardiology",
"concept_identifier": "OHDSI:45756805"
}
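Because each line of a .jsonl file must hold one complete object, the pretty-printed examples above need to be written compactly, one per line. The sketch below shows one way to produce the three files in the stated import order. It assumes every attribute listed above is mandatory, and the file names and the helpers `to_jsonl` and `write_import_files` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Attributes listed for each object type in the section above.
# Assumption: every listed attribute must be present on every record.
EXPECTED_KEYS = {
    "terminology": {"id", "name"},
    "concept": {"concept_identifier", "pref_label", "terminology_id"},
    "mapping": {"text", "concept_identifier"},
}

# Import order stated above: terminology, then concepts, then mappings.
IMPORT_ORDER = ["terminology", "concept", "mapping"]


def to_jsonl(object_type, records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    lines = []
    for record in records:
        missing = EXPECTED_KEYS[object_type] - set(record)
        if missing:
            raise ValueError(f"{object_type} record is missing {sorted(missing)}")
        # json.dumps without indent emits no newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def write_import_files(directory, records_by_type):
    """Write one .jsonl file per object type and return the paths in import order."""
    paths = []
    for object_type in IMPORT_ORDER:
        path = Path(directory) / f"{object_type}.jsonl"  # hypothetical file names
        path.write_text(to_jsonl(object_type, records_by_type[object_type]), encoding="utf-8")
        paths.append(path)
    return paths


records = {
    "terminology": [{"id": "OHDSI", "name": "Observational Health Data Sciences and Informatics"}],
    "concept": [{
        "concept_identifier": "OHDSI:45756805",
        "pref_label": "Pediatric Cardiology",
        "terminology_id": "OHDSI",
    }],
    "mapping": [{"text": "Pediatric Cardiology", "concept_identifier": "OHDSI:45756805"}],
}

with tempfile.TemporaryDirectory() as tmp:
    for path in write_import_files(tmp, records):
        print(path.name, path.read_text(encoding="utf-8").strip())
```

Each resulting file can then be uploaded with the matching object_type value, starting with the terminology file.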
Owner
- Name: Fraunhofer SCAI Bioinformatics Department
- Login: SCAI-BIO
- Kind: organization
- Website: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/
- Repositories: 29
- Profile: https://github.com/SCAI-BIO
Department of Bioinformatics at Fraunhofer SCAI
Citation (CITATION.cff)
cff-version: 1.2.0
title: Kitsune: a next-generation data steward and harmonization tool
type: software
message: If you use this software, please cite it as below.
license: Apache-2.0
language: en
authors:
  - given-names: Mehmet Can
    family-names: Ay
    email: mehmet.ay@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2977-7695
  - given-names: Tim
    family-names: Adams
    email: tim.adams@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2823-0102
repository-code: https://github.com/SCAI-BIO/kitsune
abstract: >
  Kitsune is a next-generation data steward and harmonization tool. Building on the legacy of
  systems like Usagi, Kitsune leverages LLM embeddings to intelligently map semantically similar
  terms even when their string representations differ substantially. This results in more robust data
  harmonization and improved performance in real-world scenarios.
keywords:
  - data harmonization
  - data stewardship
  - large language models
  - LLM
preferred-citation:
  type: article
  authors:
    - family-names: Salimi
      given-names: Yasamin
    - family-names: Adams
      given-names: Tim
    - family-names: Ay
      given-names: Mehmet Can
    - family-names: Balabin
      given-names: Helena
    - family-names: Jacobs
      given-names: Marc
    - family-names: Hofmann-Apitius
      given-names: Martin
  title: Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema
  journal: Scientific Reports
  year: 2025
  doi: 10.1038/s41598-025-06447-2
references:
  - type: conference-paper
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.24406/publica-4577
  - type: poster
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.4126/FRL01-006472846
GitHub Events
Total
- Create event: 53
- Release event: 8
- Issues event: 27
- Delete event: 43
- Issue comment event: 18
- Push event: 103
- Pull request review event: 9
- Pull request review comment event: 6
- Pull request event: 91
Last Year
- Create event: 53
- Release event: 8
- Issues event: 27
- Delete event: 43
- Issue comment event: 18
- Push event: 103
- Pull request review event: 9
- Pull request review comment event: 6
- Pull request event: 91
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mehmet Can Ay | m****y@g****m | 131 |
| TimAdams84 | t****s@g****t | 120 |
| dependabot[bot] | 4****] | 7 |
| Christian Ebeling | c****g@s****e | 5 |
| Raimondo Lazzara | l****a@w****e | 3 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 18
- Total pull requests: 79
- Average time to close issues: 25 days
- Average time to close pull requests: 3 days
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.61
- Average comments per pull request: 0.08
- Merged pull requests: 59
- Bot issues: 0
- Bot pull requests: 25
Past Year
- Issues: 18
- Pull requests: 79
- Average time to close issues: 25 days
- Average time to close pull requests: 3 days
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 0.61
- Average comments per pull request: 0.08
- Merged pull requests: 59
- Bot issues: 0
- Bot pull requests: 25
Top Authors
Issue Authors
- mehmetcanay (11)
- tiadams (7)
Pull Request Authors
- mehmetcanay (45)
- dependabot[bot] (25)
- tiadams (9)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- matplotlib *
- numpy ==1.25.2
- openai *
- openpyxl *
- pandas ==2.1.0
- pip ==21.3.1
- plotly *
- python-dateutil ==2.8.2
- python-dotenv *
- pytz ==2023.3
- scikit-learn *
- seaborn *
- setuptools ==60.2.0
- six ==1.16.0
- thefuzz *
- tzdata ==2023.3
- wheel ==0.37.1
- actions/checkout v2 composite
- docker/build-push-action v2 composite
- docker/login-action v1 composite
- python 3.9 build
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- aiofiles 0.7.0
- aiohttp 3.9.3
- aiosignal 1.3.1
- anyio 4.3.0
- async-timeout 4.0.3
- attrs 23.2.0
- certifi 2024.2.2
- charset-normalizer 3.3.2
- click 8.1.7
- colorama 0.4.6
- contourpy 1.2.0
- cycler 0.12.1
- et-xmlfile 1.1.0
- exceptiongroup 1.2.0
- fastapi 0.87.0
- filelock 3.13.1
- fonttools 4.49.0
- frozenlist 1.4.1
- fsspec 2024.2.0
- greenlet 3.0.3
- h11 0.14.0
- huggingface-hub 0.20.3
- idna 3.6
- importlib-resources 6.1.1
- jinja2 3.1.3
- joblib 1.3.2
- kiwisolver 1.4.5
- markupsafe 2.1.5
- matplotlib 3.8.3
- mpmath 1.3.0
- multidict 6.0.5
- networkx 3.2.1
- nltk 3.8.1
- numpy 1.25.2
- nvidia-cublas-cu12 12.1.3.1
- nvidia-cuda-cupti-cu12 12.1.105
- nvidia-cuda-nvrtc-cu12 12.1.105
- nvidia-cuda-runtime-cu12 12.1.105
- nvidia-cudnn-cu12 8.9.2.26
- nvidia-cufft-cu12 11.0.2.54
- nvidia-curand-cu12 10.3.2.106
- nvidia-cusolver-cu12 11.4.5.107
- nvidia-cusparse-cu12 12.1.0.106
- nvidia-nccl-cu12 2.19.3
- nvidia-nvjitlink-cu12 12.3.101
- nvidia-nvtx-cu12 12.1.105
- openai 0.28.1
- openpyxl 3.1.2
- packaging 23.2
- pandas 2.1.0
- pillow 10.2.0
- pip 21.3.1
- plotly 5.17.0
- pydantic 1.10.14
- pyparsing 3.1.1
- python-dateutil 2.8.2
- python-dotenv 1.0.1
- python-multipart 0.0.9
- pytz 2023.3
- pyyaml 6.0.1
- rapidfuzz 3.6.1
- regex 2023.12.25
- requests 2.31.0
- safetensors 0.4.2
- scikit-learn 1.3.2
- scipy 1.11.4
- seaborn 0.13.2
- sentence-transformers 2.3.1
- sentencepiece 0.2.0
- setuptools 60.2.0
- six 1.16.0
- sniffio 1.3.0
- sqlalchemy 2.0.27
- starlette 0.21.0
- sympy 1.12
- tenacity 8.2.3
- thefuzz 0.20.0
- threadpoolctl 3.3.0
- tokenizers 0.15.2
- torch 2.2.1
- tqdm 4.66.2
- transformers 4.38.1
- triton 2.2.0
- typing-extensions 4.9.0
- tzdata 2023.3
- urllib3 2.2.1
- uvicorn 0.27.1
- wheel 0.37.1
- yarl 1.9.4
- zipp 3.17.0
- aiofiles >=0.7.0,<0.8.0
- fastapi >=0.87.0,<0.88.0
- matplotlib >=3.8.1,<3.9.0
- numpy 1.25.2
- openai >=0.28.0,<0.29.0
- openpyxl ^3.1.2
- pandas 2.1.0
- pip 21.3.1
- plotly >=5.17.0,<5.18.0
- python ^3.9
- python-dateutil 2.8.2
- python-dotenv >=1.0.0,<1.1.0
- python-multipart ^0.0.9
- pytz 2023.3
- scikit-learn 1.3.2
- scipy >=1.11.4,<1.12.0
- seaborn >=0.13.0,<0.14.0
- sentence-transformers 2.3.1
- setuptools 60.2.0
- six 1.16.0
- sqlalchemy >=2.0.27,<2.1.0
- starlette >=0.21.0,<0.22.0
- thefuzz >=0.20.0,<0.21.0
- tzdata 2023.3
- uvicorn >=0.15.0
- wheel 0.37.1