datastew
Python library for intelligent data stewardship using Large Language Model (LLM) embeddings
Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ✓ Institutional organization owner: Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: SCAI-BIO
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://pypi.org/project/datastew/
- Size: 1.93 MB
Statistics
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 11
- Releases: 22
Metadata Files
README.md
datastew
Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.
Installation
```bash
pip install datastew
```
Usage
Harmonizing excel/csv resources
You can import common data models, terminology sources, or data dictionaries for harmonization directly from a CSV, TSV, or Excel file. An example of how to match two separate variable descriptions is shown in datastew/scripts/mapping_excel_example.py:
```python
from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# variable_field and description_field refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

# Compute the mapping and write it to an Excel file
df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")
```
The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches as well as a similarity measure per row.
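To make "closest similarity" concrete, here is a minimal sketch of the idea. It is not datastew's implementation: it assumes cosine similarity between embedding vectors, uses small hand-made vectors instead of real model embeddings, and the helper names `cosine_similarity` and `closest_matches` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, pick the most similar target variable.

    Both arguments map a variable name to its (toy) embedding vector.
    Returns one (source_variable, target_variable, similarity) row per
    source variable.
    """
    rows = []
    for s_name, s_vec in source.items():
        best_name, best_sim = max(
            ((t_name, cosine_similarity(s_vec, t_vec)) for t_name, t_vec in target.items()),
            key=lambda pair: pair[1],
        )
        rows.append((s_name, best_name, best_sim))
    return rows


# Toy embeddings standing in for real description embeddings (assumed values).
source = {"age_years": [1.0, 0.1, 0.0], "sex": [0.0, 1.0, 0.2]}
target = {"AGE": [0.9, 0.0, 0.1], "GENDER": [0.1, 0.9, 0.1]}

for row in closest_matches(source, target):
    print(row)
```

With real embeddings the vectors have hundreds of dimensions, but the ranking logic stays the same.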
By default this uses the local MiniLM model, which may not yield optimal results. If you have an OpenAI API key, you can use their embedding API instead. To use your key, create a Vectorizer and pass it to the function:
```python
from datastew.embedding import Vectorizer
from datastew.process.mapping import map_dictionary_to_dictionary

# source and target are the DataDictionarySource objects from the previous example
vectorizer = Vectorizer("text-embedding-ada-002", key="your_api_key")
df = map_dictionary_to_dictionary(source, target, vectorizer=vectorizer)
```
Creating and using stored mappings
A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in datastew/scripts/mapping_db_example.py:
1) Initialize the repository and embedding model:
```python
from datastew.embedding import Vectorizer
from datastew.repository import WeaviateRepository
from datastew.repository.model import Terminology, Concept, Mapping
repository = WeaviateRepository(mode='remote', path='localhost', port=8080)
vectorizer = Vectorizer()
# vectorizer = Vectorizer("text-embedding-ada-002", key="your_key") # Use this line for higher accuracy if you have an OpenAI API key
```
2) Create a baseline of data to map to in the initialized repository. Text is attached to a unique concept of an existing or custom vocabulary or terminology namespace in the form of a Mapping object containing the text, its embedding, and the name of the sentence embedder used. Multiple Mapping objects with textually different but semantically equivalent descriptions can point to the same Concept.
```python
terminology = Terminology("snomed CT", "SNOMED")
text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, vectorizer.get_embedding(text1), vectorizer.model_name)
text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, vectorizer.get_embedding(text2), vectorizer.model_name)
repository.store_all([terminology, concept1, mapping1, concept2, mapping2])
```
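The relationship described in step 2 can be pictured with a small stand-alone sketch. The dataclasses below are simplified, hypothetical stand-ins (prefixed `Toy`), not datastew's actual model classes, and their field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToyTerminology:
    name: str
    id: str


@dataclass(frozen=True)
class ToyConcept:
    terminology: ToyTerminology
    pref_label: str
    concept_identifier: str


@dataclass
class ToyMapping:
    concept: ToyConcept
    text: str
    embedding: List[float]
    sentence_embedder: str


terminology = ToyTerminology("snomed CT", "SNOMED")
diabetes = ToyConcept(terminology, "Diabetes mellitus (disorder)", "Concept ID: 11893007")

# Two textually different descriptions attached to the same concept.
# The embedding values and the embedder name are placeholders, not real model output.
mappings = [
    ToyMapping(diabetes, "Diabetes mellitus (disorder)", [0.1, 0.9], "toy-embedder"),
    ToyMapping(diabetes, "Sugar sickness", [0.2, 0.8], "toy-embedder"),
]

# Collect every description that points at the diabetes concept.
texts_for_concept = [m.text for m in mappings if m.concept == diabetes]
print(texts_for_concept)
```

Keeping the text and embedding on the mapping rather than on the concept is what allows several phrasings to resolve to one concept identifier.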
3) Retrieve the closest mappings and their similarities for a given text:
```python
text_to_map = "Sugar sickness"  # Semantically similar to "Diabetes mellitus (disorder)"
embedding = vectorizer.get_embedding(text_to_map)

results = repository.get_closest_mappings(embedding, similarities=True, limit=2)

for result in results:
    print(result)
```
Output:
```
snomed CT > Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder) | Similarity: 0.4735338091850281
snomed CT > Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder) | Similarity: 0.2003161907196045
```
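The `limit` argument and the ordered similarities above suggest a ranked nearest-neighbour lookup. As a rough, self-contained sketch of that idea (an assumption about the behaviour, not the Weaviate-backed implementation), the hypothetical function `closest_texts` below ranks stored texts by cosine similarity to a query vector and keeps the top `limit` entries. All vectors are made-up toy values.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def closest_texts(query, stored, limit=2):
    """Return up to `limit` (text, similarity) pairs, most similar first."""
    scored = ((text, cosine_similarity(query, vec)) for text, vec in stored.items())
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


# Toy stored embeddings keyed by their text (assumed values).
stored = {
    "Diabetes mellitus (disorder)": [0.9, 0.1, 0.0],
    "Hypertension (disorder)": [0.1, 0.9, 0.0],
    "Asthma (disorder)": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]  # stand-in for the embedding of the text to map

for text, similarity in closest_texts(query, stored, limit=2):
    print(f"{text} | Similarity: {similarity:.3f}")
```

A vector database performs the same ranking with an approximate index, so it does not need to compare the query against every stored vector.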
You can also import data from file sources (csv, tsv, xlsx) or from a public API such as OLS. An example script that downloads SNOMED from the EBI OLS and computes embeddings for it can be found in datastew/scripts/ols_snomed_retrieval.py.
Embedding visualization
You can visualize the embedding space of multiple data dictionary sources with t-SNE plots using different language models. An example of how to generate a t-SNE plot is shown in datastew/scripts/tsne_visualization.py:
```python
from datastew.embedding import Vectorizer
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# variable_field and description_field refer to the corresponding column names in your Excel sheet
data_dictionary_source_1 = DataDictionarySource("source1.xlsx", variable_field="var", description_field="desc")
data_dictionary_source_2 = DataDictionarySource("source2.xlsx", variable_field="var", description_field="desc")

vectorizer = Vectorizer()
plot_embeddings([data_dictionary_source_1, data_dictionary_source_2], vectorizer=vectorizer)
```

Owner
- Name: Fraunhofer SCAI Bioinformatics Department
- Login: SCAI-BIO
- Kind: organization
- Website: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/
- Repositories: 29
- Profile: https://github.com/SCAI-BIO
Department of Bioinformatics at Fraunhofer SCAI
Citation (CITATION.cff)
cff-version: 1.2.0
title: datastew
type: software
message: If you use this software, please cite it as below.
license: Apache-2.0
language: en
authors:
  - given-names: Tim
    family-names: Adams
    email: tim.adams@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2823-0102
  - given-names: Mehmet Can
    family-names: Ay
    email: mehmet.ay@scai.fraunhofer.de
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    orcid: https://orcid.org/0000-0002-2977-7695
repository-code: https://github.com/SCAI-BIO/datastew
repository-artifact: https://pypi.org/project/datastew
abstract: >
  Datastew is a Python library for intelligent data harmonization using
  Large Language Model (LLM) vector embeddings.
keywords:
  - data harmonization
  - data stewardship
  - large language models
  - LLM
preferred-citation:
  type: article
  authors:
    - family-names: Salimi
      given-names: Yasamin
    - family-names: Adams
      given-names: Tim
    - family-names: Ay
      given-names: Mehmet Can
    - family-names: Balabin
      given-names: Helena
    - family-names: Jacobs
      given-names: Marc
    - family-names: Hofmann-Apitius
      given-names: Martin
  title: Evaluating language model embeddings for Parkinson’s disease cohort harmonization using a novel manually curated variable mapping schema
  journal: Scientific Reports
  year: 2025
  doi: 10.1038/s41598-025-06447-2
references:
  - type: conference-paper
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.24406/publica-4577
  - type: poster
    title: INDEX — the Intelligent Data Steward Toolbox
    doi: 10.4126/FRL01-006472846
GitHub Events
Total
- Create event: 76
- Commit comment event: 3
- Release event: 17
- Issues event: 58
- Watch event: 2
- Delete event: 64
- Member event: 1
- Issue comment event: 50
- Push event: 193
- Pull request review event: 62
- Pull request review comment event: 60
- Pull request event: 116
Last Year
- Create event: 76
- Commit comment event: 3
- Release event: 17
- Issues event: 58
- Watch event: 2
- Delete event: 64
- Member event: 1
- Issue comment event: 50
- Push event: 193
- Pull request review event: 62
- Pull request review comment event: 60
- Pull request event: 116
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mehmet Can Ay | m****y@g****m | 227 |
| TimAdams84 | t****s@g****t | 172 |
| dependabot[bot] | 4****] | 18 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 49
- Total pull requests: 127
- Average time to close issues: 2 months
- Average time to close pull requests: 4 days
- Total issue authors: 3
- Total pull request authors: 3
- Average comments per issue: 0.61
- Average comments per pull request: 0.29
- Merged pull requests: 96
- Bot issues: 0
- Bot pull requests: 47
Past Year
- Issues: 41
- Pull requests: 113
- Average time to close issues: 2 months
- Average time to close pull requests: 3 days
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 0.54
- Average comments per pull request: 0.32
- Merged pull requests: 82
- Bot issues: 0
- Bot pull requests: 47
Top Authors
Issue Authors
- tiadams (38)
- mehmetcanay (13)
- shammimore (1)
Pull Request Authors
- mehmetcanay (55)
- dependabot[bot] (50)
- tiadams (43)
Packages
- Total packages: 1
- Total downloads: 153 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 33
- Total maintainers: 1
pypi.org: datastew
Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.
- Homepage: https://github.com/SCAI-BIO/datastew
- Documentation: https://github.com/SCAI-BIO/datastew#readme
- License: Apache-2.0
- Latest release: 0.6.0 (published 6 months ago)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- SQLAlchemy *
- aiofiles *
- matplotlib *
- numpy ==1.25.2
- openai *
- openpyxl *
- pandas ==2.1.0
- pip ==21.3.1
- plotly *
- pydantic *
- python-dateutil ==2.8.2
- python-dotenv *
- pytz ==2023.3
- scikit-learn ==1.3.2
- scipy *
- seaborn *
- sentence-transformers ==2.3.1
- setuptools ==60.2.0
- six ==1.16.0
- thefuzz *
- tzdata ==2023.3
- wheel ==0.37.1
- SQLAlchemy *
- aiofiles *
- matplotlib *
- numpy ==1.25.2
- openai *
- openpyxl *
- pandas ==2.1.0
- pip ==21.3.1
- plotly *
- pydantic *
- python-dateutil ==2.8.2
- python-dotenv *
- python-multipart *
- pytz ==2023.3
- scikit-learn ==1.3.2
- scipy *
- seaborn *
- sentence-transformers ==2.3.1
- setuptools ==60.2.0
- six ==1.16.0
- thefuzz *
- tzdata ==2023.3
- wheel ==0.37.1