datastew

Python library for intelligent data stewardship using Large Language Model (LLM) embeddings

https://github.com/scai-bio/datastew

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
✓ Institutional organization owner: Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary

Keywords

data-harmonization data-stewardship large-language-models

Keywords from Contributors

embedded interactive mesh interpretability profiles sequences generic projection standardization optim

Last synced: 10 months ago

Repository

Python library for intelligent data stewardship using Large Language Model (LLM) embeddings

Basic Info

Host: GitHub
Owner: SCAI-BIO
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://pypi.org/project/datastew/
Size: 1.93 MB

Statistics

Stars: 5
Watchers: 2
Forks: 0
Open Issues: 11
Releases: 22

Topics

data-harmonization data-stewardship large-language-models

Created about 2 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

datastew

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Installation

```bash
pip install datastew
```

Usage

Harmonizing excel/csv resources

You can import common data models, terminology sources, or data dictionaries for harmonization directly from a CSV, TSV, or Excel file. An example of how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:

```python
from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# Variable and description refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

# Compute the pairwise mapping and write it to an Excel file
df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")
```

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.
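To make "closest similarity" concrete, here is a minimal sketch of pairwise matching, assuming cosine similarity over description embeddings. It is not datastew's implementation: the three-dimensional vectors are made-up stand-ins for real model embeddings, and `cosine_similarity` and `closest_matches` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, return (source_var, best_target_var, similarity)."""
    rows = []
    for s_name, s_vec in source.items():
        # Score every target variable and keep the most similar one
        best_name, best_sim = max(
            ((t_name, cosine_similarity(s_vec, t_vec)) for t_name, t_vec in target.items()),
            key=lambda pair: pair[1],
        )
        rows.append((s_name, best_name, best_sim))
    return rows


# Toy embeddings (assumed values, not produced by any real model)
source = {"age": [1.0, 0.0, 0.0], "sex": [0.0, 1.0, 0.0]}
target = {"gender": [0.1, 0.9, 0.0], "age_years": [0.9, 0.1, 0.0]}

for row in closest_matches(source, target):
    print(row)
```

Each printed row pairs a source variable with its best target candidate and the similarity that justified the choice, which mirrors the per-row similarity measure described above.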

By default, this uses the local MiniLM model, which may not yield optimal performance. If you have an OpenAI API key, you can use the OpenAI embedding API instead. To use your key, create a Vectorizer and pass it to the function:

```python
from datastew.embedding import Vectorizer
from datastew.process.mapping import map_dictionary_to_dictionary

# source and target are the DataDictionarySource objects created above
vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)
```

Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in datastew/scripts/mapping_db_example.py:

1) Initialize the repository and embedding model:

```python
from datastew.embedding import Vectorizer
from datastew.repository import WeaviateRepository
from datastew.repository.model import Terminology, Concept, Mapping

repository = WeaviateRepository(mode='remote', path='localhost', port=8080)
vectorizer = Vectorizer()
# vectorizer = Vectorizer("text-embedding-ada-002", key="your_key") # Use this line for higher accuracy if you have an OpenAI API key
```

2) Create a baseline of data to map to in the initialized repository. Text is attached to any unique concept of an existing or custom vocabulary or terminology namespace in the form of a Mapping object containing the text, its embedding, and the name of the sentence embedder used. Multiple Mapping objects with textually different but semantically equal descriptions can point to the same Concept.

```python
terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, vectorizer.get_embedding(text1), vectorizer.model_name)

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, vectorizer.get_embedding(text2), vectorizer.model_name)

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])
```
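The relationship described in step 2 can be pictured with plain dataclasses. This is a simplified stand-in and not datastew's actual model classes: the `Sketch*` names, their fields, and the `"example-embedder"` label are hypothetical, and embeddings are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchTerminology:
    name: str
    identifier: str


@dataclass(frozen=True)
class SketchConcept:
    terminology: SketchTerminology
    pref_label: str
    concept_identifier: str


@dataclass(frozen=True)
class SketchMapping:
    concept: SketchConcept
    text: str
    sentence_embedder: str


snomed = SketchTerminology("snomed CT", "SNOMED")
diabetes = SketchConcept(snomed, "Diabetes mellitus (disorder)", "Concept ID: 11893007")

# Two differently worded descriptions that point to the same concept
mappings = [
    SketchMapping(diabetes, "Diabetes mellitus (disorder)", "example-embedder"),
    SketchMapping(diabetes, "Sugar sickness", "example-embedder"),
]

# Collapsing the mappings by concept shows the many-to-one relationship
concepts = {m.concept for m in mappings}
print(len(mappings), "mappings ->", len(concepts), "concept")
```

Storing several descriptions per concept gives a query more chances to land near a wording that resembles it.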

3) Retrieve the closest mappings and their similarities for a given text:

```python
text_to_map = "Sugar sickness"  # Semantically similar to "Diabetes mellitus (disorder)"
embedding = vectorizer.get_embedding(text_to_map)

results = repository.get_closest_mappings(embedding, similarities=True, limit=2)

for result in results:
    print(result)
```

Output:

```
snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045
```
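Conceptually, retrieving the closest mappings means ranking stored embeddings by similarity to the query and keeping the top `limit` entries. The following in-memory sketch assumes cosine similarity and does not reproduce the search performed by the Weaviate-backed repository used above. The vectors, the extra "Asthma (disorder)" entry, and the `closest` helper are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def closest(query, stored, limit=2):
    """Return the `limit` (text, similarity) pairs most similar to `query`."""
    scored = ((text, cosine_similarity(query, vec)) for text, vec in stored.items())
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


# Toy stored embeddings (assumed values, not from a real model)
stored = {
    "Diabetes mellitus (disorder)": [0.8, 0.2, 0.1],
    "Hypertension (disorder)": [0.1, 0.9, 0.2],
    "Asthma (disorder)": [0.0, 0.1, 0.9],
}
query = [0.7, 0.3, 0.1]

for text, similarity in closest(query, stored, limit=2):
    print(f"{text} | Similarity: {similarity:.3f}")
```

With `limit=2` only the two best candidates are returned, in descending order of similarity, as in the output shown above.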

You can also import data from file sources (CSV, TSV, XLSX) or from a public API such as OLS. An example script that downloads SNOMED from the EBI OLS and computes embeddings for it can be found in datastew/scripts/ols_snomed_retrieval.py.

Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots using different language models. An example of how to generate a t-SNE plot is shown in datastew/scripts/tsne_visualization.py:

```python
from datastew.embedding import Vectorizer
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your Excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)
```

t-SNE plot

Owner

Name: Fraunhofer SCAI Bioinformatics Department
Login: SCAI-BIO
Kind: organization

Website: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/
Repositories: 29
Profile: https://github.com/SCAI-BIO

Department of Bioinformatics at Fraunhofer SCAI

Citation (CITATION.cff)

cff-version: 1.2.0
title: datastew
type: software
message: If you use this software, please cite it as below.
license: Apache-2.0
language: en
authors:
  - given-names: Tim
    family-names: Adams
    email: tim.adams@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2823-0102
  - given-names: Mehmet Can
    family-names: Ay
    email: mehmet.ay@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2977-7695
repository-code: https://github.com/SCAI-BIO/datastew
repository-artifact: https://pypi.org/project/datastew
abstract: >
  Datastew is a Python library for intelligent data harmonization using
  Large Language Model (LLM) vector embeddings.
keywords:
  - data harmonization
  - data stewardship
  - large language models
  - LLM

preferred-citation:
  type: article
  authors:
    - family-names: Salimi
      given-names: Yasamin
    - family-names: Adams
      given-names: Tim
    - family-names: Ay
      given-names: Mehmet Can
    - family-names: Balabin
      given-names: Helena
    - family-names: Jacobs
      given-names: Marc
    - family-names: Hofmann-Apitius
      given-names: Martin
  title: Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema
  journal: Scientific Reports
  year: 2025
  doi: 10.1038/s41598-025-06447-2

references:
  - type: conference-paper
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.24406/publica-4577
  - type: poster
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.4126/FRL01-006472846

GitHub Events

Total

Create event: 76
Commit comment event: 3
Release event: 17
Issues event: 58
Watch event: 2
Delete event: 64
Member event: 1
Issue comment event: 50
Push event: 193
Pull request review event: 62
Pull request review comment event: 60
Pull request event: 116

Last Year

Create event: 76
Commit comment event: 3
Release event: 17
Issues event: 58
Watch event: 2
Delete event: 64
Member event: 1
Issue comment event: 50
Push event: 193
Pull request review event: 62
Pull request review comment event: 60
Pull request event: 116

Committers

Last synced: 11 months ago

All Time

Total Commits: 417
Total Committers: 3
Avg Commits per committer: 139.0
Development Distribution Score (DDS): 0.456

Past Year

Commits: 292
Committers: 3
Avg Commits per committer: 97.333
Development Distribution Score (DDS): 0.247

Top Committers

Name	Email	Commits
Mehmet Can Ay	m**y@g**m	227
TimAdams84	t**s@g**t	172
dependabot[bot]	4****]	18
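
The all-time figures above are consistent with a commonly used definition of the Development Distribution Score, `DDS = 1 - (commits by top committer / total commits)`. This page does not state the formula, so that definition is an assumption. A short check against the Top Committers table:

```python
# All-time commit counts from the Top Committers table
commits = {"Mehmet Can Ay": 227, "TimAdams84": 172, "dependabot[bot]": 18}

total = sum(commits.values())
average = total / len(commits)

# Assumed definition: share of commits NOT made by the top committer
dds = 1 - max(commits.values()) / total

print(total, round(average, 1), round(dds, 3))  # 417 139.0 0.456
```

A DDS near 0 means one committer authored almost everything, while higher values indicate that work is spread across more people.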

Committer Domains (Top 20 + Academic)

gmx.net: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 49
Total pull requests: 127
Average time to close issues: 2 months
Average time to close pull requests: 4 days
Total issue authors: 3
Total pull request authors: 3
Average comments per issue: 0.61
Average comments per pull request: 0.29
Merged pull requests: 96
Bot issues: 0
Bot pull requests: 47

Past Year

Issues: 41
Pull requests: 113
Average time to close issues: 2 months
Average time to close pull requests: 3 days
Issue authors: 2
Pull request authors: 3
Average comments per issue: 0.54
Average comments per pull request: 0.32
Merged pull requests: 82
Bot issues: 0
Bot pull requests: 47

Top Authors

Issue Authors

tiadams (38)
mehmetcanay (13)
shammimore (1)

Pull Request Authors

mehmetcanay (55)
dependabot[bot] (50)
tiadams (43)

Top Labels

Issue Labels

enhancement (21) refactoring (7) bug (5) testing (3) documentation (2) ci/cd (2) wontfix (1)

Pull Request Labels

dependencies (50) python (26) enhancement (23) bug (10) refactoring (9) documentation (6) ci/cd (3)

Packages

Total packages: 1
Total downloads:
- pypi 153 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 33
Total maintainers: 1

pypi.org: datastew

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

Homepage: https://github.com/SCAI-BIO/datastew
Documentation: https://github.com/SCAI-BIO/datastew#readme
License: Apache-2.0
Latest release: 0.6.0
published 11 months ago

Versions: 33
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 153 Last month

Rankings

Dependent packages count: 10.9%

Average: 36.1%

Dependent repos count: 61.3%

Maintainers (1)

tiadams

Last synced: 11 months ago

Dependencies

.github/workflows/python-publish.yaml actions

actions/checkout v3 composite
actions/setup-python v3 composite
pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite

.github/workflows/tests.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

requirements.txt pypi

SQLAlchemy *
aiofiles *
matplotlib *
numpy ==1.25.2
openai *
openpyxl *
pandas ==2.1.0
pip ==21.3.1
plotly *
pydantic *
python-dateutil ==2.8.2
python-dotenv *
pytz ==2023.3
scikit-learn ==1.3.2
scipy *
seaborn *
sentence-transformers ==2.3.1
setuptools ==60.2.0
six ==1.16.0
thefuzz *
tzdata ==2023.3
wheel ==0.37.1

setup.py pypi

SQLAlchemy *
aiofiles *
matplotlib *
numpy ==1.25.2
openai *
openpyxl *
pandas ==2.1.0
pip ==21.3.1
plotly *
pydantic *
python-dateutil ==2.8.2
python-dotenv *
python-multipart *
pytz ==2023.3
scikit-learn ==1.3.2
scipy *
seaborn *
sentence-transformers ==2.3.1
setuptools ==60.2.0
six ==1.16.0
thefuzz *
tzdata ==2023.3
wheel ==0.37.1
