https://github.com/fraunhoferportugal/pymdma

pymdma


Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: sciencedirect.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

data-science machine-learning metrics python
Last synced: 6 months ago

Repository

pymdma

Basic Info
  • Host: GitHub
  • Owner: fraunhoferportugal
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 28.6 MB
Statistics
  • Stars: 50
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 10
Topics
data-science machine-learning metrics python
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Security Authors

README.md

pyMDMA - Multimodal Data Metrics for Auditing real and synthetic datasets

(Badges: Python, code style: black, checked with mypy, imports: isort, documentation, pre-commit, security: bandit, pytest, conventional commits, PyPI downloads, Read the Docs)

Data auditing is essential for ensuring the reliability of machine learning models by maintaining the integrity of the datasets upon which these models rely. As synthetic data use increases to address data scarcity and privacy concerns, there is a growing demand for a robust auditing framework.

Existing repositories often lack comprehensive coverage across various modalities or validation types. This work introduces a dedicated library for data auditing, presenting a comprehensive suite of metrics designed for evaluating synthetic data. Additionally, it extends its focus to the quality assessment of input data, whether synthetic or real, across time series, tabular, and image modalities.

This library aims to serve as a unified and accessible resource for researchers, practitioners, and developers, enabling them to assess the quality and utility of their datasets. This initiative encourages collaborative contributions by open-sourcing the associated code, fostering a community-driven approach to advancing data auditing practices. This work is intended for publication in an open-source journal to facilitate widespread dissemination, adoption, and impact tracking within the scientific and technical community.

For more information check out the official documentation here.

Prerequisites

You will need:

  • pip
  • python (see pyproject.toml for the supported versions)
  • anaconda or similar (recommended)
  • Git (developers)
  • Make and poetry (developers)
  • a way to load environment variables from .env (developers)

1. Installing

You can either install this package via pip (if you want access to individual modules) or clone the repository (if you wish to contribute to the project or change the code in any way). Currently, the package supports the following modalities: image, tabular, and time_series.

You should install the package in a virtual environment to avoid conflicts with system packages. Please consult the official documentation for developing with virtual environments.
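For example, a standard venv workflow looks like the following (any environment manager works; the `.venv` directory name is just a common convention):

```bash
# Create an isolated environment and activate it
python3 -m venv .venv
. .venv/bin/activate

# Confirm pip now targets the new environment
python -m pip --version

# Then install pymdma inside it, e.g.:
# pip install pymdma
```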

1.1 Installing via pip (recommended)

Before running any commands, make sure you have the latest versions of pip and setuptools installed. The package can be installed with the following command:

```bash
pip install pymdma
```

Depending on the data modality you want to use, you may need to install additional dependencies. The following commands will install the dependencies for each modality:

```bash
pip install pymdma[image]        # image dependencies
pip install pymdma[tabular]      # tabular dependencies
pip install pymdma[time_series]  # time series dependencies
pip install pymdma[all]          # dependencies for all modalities
```

Choose the one(s) that best suits your needs.

Note: for a minimal installation without CUDA support, you can install the package without the CUDA dependencies. This can be done by forcing pip to install torch from the CPU wheel index with the --find-links https://download.pytorch.org/whl/cpu/torch_stable.html option. You will not have access to the GPU-accelerated features.

1.2 Installing from source

You can install the package from source with the following command:

```bash
pip install "pymdma @ git+https://github.com/fraunhoferportugal/pymdma.git"
```

Depending on the data modality you want to use, you may need to install additional dependencies. The following commands will install the dependencies for each modality:

```bash
pip install "pymdma[image] @ git+https://github.com/fraunhoferportugal/pymdma.git"        # image dependencies
pip install "pymdma[tabular] @ git+https://github.com/fraunhoferportugal/pymdma.git"      # tabular dependencies
pip install "pymdma[time_series] @ git+https://github.com/fraunhoferportugal/pymdma.git"  # time series dependencies
```

For a minimal installation, you can install the package without CUDA support by forcing pip to install torch from the CPU wheel index with the --find-links option, as described above.

2. Execution Examples

The package provides a CLI interface for automatically evaluating folder datasets. You can also import the metrics for a specific modality and use them directly in your code. Before running any commands, make sure the package is correctly installed.

2.1. Importing Modality Metrics

You can import the metrics for a specific modality and use them in your code. The following example shows how to import an image metric and use it to evaluate input images in terms of sharpness. Note that this metric only returns the sharpness value for each image (i.e., the instance-level value); the dataset-level value is None.

```python
import numpy as np

from pymdma.image.measures.input_val import Tenengrad

images = np.random.rand(10, 224, 224, 3)  # 10 random RGB images of size 224x224

tenengrad = Tenengrad()  # sharpness metric
sharpness = tenengrad.compute(images)  # compute on RGB images

# get the instance-level values (dataset level is None)
dataset_level, instance_level = sharpness.value
```
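For intuition, the Tenengrad measure is conventionally the mean squared Sobel gradient magnitude of a grayscale image: sharp edges produce large gradients, flat regions produce none. A minimal NumPy sketch of the idea (illustrative only, not pymdma's implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def tenengrad_sketch(gray: np.ndarray) -> float:
    """Illustrative Tenengrad score: mean squared Sobel gradient magnitude.
    Not pymdma's implementation; shown only to convey the idea."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # Sobel x
    ky = kx.T                                                         # Sobel y
    windows = sliding_window_view(gray, (3, 3))        # all 3x3 patches
    gx = np.einsum("ijkl,kl->ij", windows, kx)         # x-gradient per patch
    gy = np.einsum("ijkl,kl->ij", windows, ky)         # y-gradient per patch
    return float(np.mean(gx**2 + gy**2))

sharp = np.zeros((32, 32)); sharp[:, 16:] = 1.0  # hard vertical edge -> large gradients
blurry = np.full((32, 32), 0.5)                  # flat image -> zero gradients
print(tenengrad_sketch(sharp) > tenengrad_sketch(blurry))  # True
```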

For evaluating synthetic datasets, you also have access to the synthetic metrics. The following example shows the steps necessary to process and evaluate a synthetic dataset in terms of the feature metrics. We load one of the available feature extractors, extract the features from the images and then compute the precision and recall metrics for the synthetic dataset in relation to the reference dataset.

```python
from pathlib import Path

from pymdma.image.models.features import ExtractorFactory

test_images_ref = Path("./data/test/image/synthesis_val/reference")  # real images
test_images_synth = Path("./data/test/image/synthesis_val/dataset")  # synthetic images

# Get image filenames
images_ref = list(test_images_ref.glob("*.jpg"))
images_synth = list(test_images_synth.glob("*.jpg"))

# Extract features from images
extractor = ExtractorFactory.model_from_name(name="vit_b_32")
ref_features = extractor.extract_features_from_files(images_ref)
synth_features = extractor.extract_features_from_files(images_synth)
```

Now you can calculate the Improved Precision and Recall of the synthetic dataset in relation to the reference dataset.

```python
from pymdma.image.measures.synthesis_val import ImprovedPrecision, ImprovedRecall

ip = ImprovedPrecision()  # Improved Precision metric
ir = ImprovedRecall()  # Improved Recall metric

# Compute the metrics
ip_result = ip.compute(ref_features, synth_features)
ir_result = ir.compute(ref_features, synth_features)

# Get the dataset- and instance-level values
precision_dataset, precision_instance = ip_result.value
recall_dataset, recall_instance = ir_result.value

# Print the results
print(f"Precision: {precision_dataset:.2f} | Recall: {recall_dataset:.2f}")
print(f"Precision: {precision_instance} | Recall: {recall_instance}")
```
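Conceptually, improved precision and recall (Kynkäänniemi et al., 2019) test whether each sample falls inside the k-nearest-neighbour hypersphere "manifold" of the other set: precision asks how many synthetic samples land on the real manifold, recall the reverse. A minimal sketch of that idea (illustrative only, not pymdma's implementation):

```python
import numpy as np

def knn_radii(feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero self-distance

def coverage(query: np.ndarray, support: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of query points inside at least one support hypersphere."""
    d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))   # stand-in for reference features
synth = rng.normal(size=(200, 8))  # stand-in for synthetic features

precision = coverage(synth, real, knn_radii(real))  # synth points on the real manifold
recall = coverage(real, synth, knn_radii(synth))    # real points on the synth manifold
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Since both toy sets are drawn from the same distribution here, both scores come out high; with a poor generator, precision or recall would drop.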

You can find more execution examples in the notebooks folder.

2.2. CLI Execution

To evaluate a dataset, you can use the CLI interface. The following command will list the available commands:

```bash
pymdma --help  # list available commands
```

Following is an example of executing the evaluation of a synthetic dataset with regard to a reference dataset:

```bash
pymdma --modality image \
    --validation_domain synth \
    --reference_type dataset \
    --evaluation_level dataset \
    --reference_data data/test/image/synthesis_val/reference \
    --target_data data/test/image/synthesis_val/dataset \
    --batch_size 3 \
    --metric_category feature \
    --output_dir reports/image_metrics/
```

This will evaluate the synthetic dataset in data/test/image/synthesis_val/dataset against the reference dataset in data/test/image/synthesis_val/reference. The evaluation is done at the dataset level, and the report is saved in the reports/image_metrics/ folder in JSON format. Only feature metrics are computed for this evaluation.
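The JSON report can then be consumed programmatically. The snippet below sketches that round trip with placeholder metric names; the actual schema pymdma emits may differ, so treat the keys as illustrative:

```python
import json
from pathlib import Path

# Illustrative only: write and read back a report like those in reports/image_metrics/.
# The key names below are placeholders, not pymdma's actual report schema.
report_path = Path("reports/image_metrics/example_report.json")
report_path.parent.mkdir(parents=True, exist_ok=True)
report_path.write_text(json.dumps({"improved_precision": 0.91, "improved_recall": 0.87}))

report = json.loads(report_path.read_text())
for metric, value in report.items():
    print(f"{metric}: {value:.2f}")
```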

Documentation

Full documentation is available here: docs/.

Contributing

Contributions of any kind are welcome. Please read CONTRIBUTING.md for details and the process for submitting pull requests to us.

If you change the code in any way, please follow the developer guidelines in DEVELOPER.md.

Changelog

See the Changelog for more information.

Security

Thank you for improving the security of the project, please see the Security Policy for more information.

License

This project is licensed under the terms of the LGPL-3.0 license. See LICENSE for more details.

Citation

If you publish work that uses pyMDMA, please cite pyMDMA as follows:

```bibtex
@article{softx2025pymdma,
  title   = {pyMDMA: Multimodal data metrics for auditing real and synthetic datasets},
  author  = {Ivo S. Façoco and Joana Rebelo and Pedro Matias and Nuno Bento and Ana C. Morgado and Ana Sampaio and Luís Rosado and Marília Barandas},
  journal = {SoftwareX},
  volume  = {31},
  pages   = {102256},
  year    = {2025},
  issn    = {2352-7110},
  doi     = {10.1016/j.softx.2025.102256},
  url     = {https://www.sciencedirect.com/science/article/pii/S2352711025002237},
}
```

```bibtex
@misc{pymdma,
  title   = {{pyMDMA}: Multimodal Data Metrics for Auditing real and synthetic datasets},
  author  = {Fraunhofer AICOS},
  url     = {https://github.com/fraunhoferportugal/pymdma},
  license = {LGPL-3.0},
  year    = {2024},
}
```

Acknowledgments

This work was funded by AISym4Med project number 101095387, supported by the European Health and Digital Executive Agency (HADEA), granting authority under the powers delegated by the European Commission. More information on this project can be found here.

This work was supported by European funds through the Recovery and Resilience Plan, project "Center for Responsible AI", project number C645008882-00000055. Learn more about this project here.

Owner

  • Name: Associação Fraunhofer Portugal Research
  • Login: fraunhoferportugal
  • Kind: organization
  • Location: Porto, Portugal

Associação Fraunhofer Portugal Research

GitHub Events

Total
  • Create event: 39
  • Release event: 9
  • Issues event: 10
  • Watch event: 46
  • Delete event: 29
  • Issue comment event: 5
  • Member event: 4
  • Push event: 96
  • Pull request review event: 20
  • Pull request event: 79
  • Fork event: 1
Last Year
  • Create event: 39
  • Release event: 9
  • Issues event: 10
  • Watch event: 46
  • Delete event: 29
  • Issue comment event: 5
  • Member event: 4
  • Push event: 96
  • Pull request review event: 20
  • Pull request event: 79
  • Fork event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 17
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 35 minutes
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 17
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 35 minutes
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • crdsteixeira (2)
  • JoanaSofia (1)
  • waminos (1)
Pull Request Authors
  • ivo-facoco (36)
  • matiaspedro (4)
  • crdsteixeira (1)
Top Labels
Issue Labels
documentation (2) question (2) bug (1) enhancement (1)
Pull Request Labels
enhancement (22) documentation (18) bug (15)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 87 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 9
  • Total maintainers: 3
pypi.org: pymdma

Multimodal Data Metrics for Auditing real and synthetic data

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 87 Last month
Rankings
Dependent packages count: 10.0%
Average: 33.0%
Dependent repos count: 56.1%
Last synced: 6 months ago

Dependencies

Dockerfile docker
  • python 3.9.19-slim build
pyproject.toml pypi
requirements/requirements-prod.txt pypi
  • accelerate ==0.24.1
  • aiohttp ==3.9.5
  • aiosignal ==1.3.1
  • alabaster ==0.7.16
  • annotated-types ==0.7.0
  • anyio ==3.7.1
  • asttokens ==2.4.1
  • async-timeout ==4.0.3
  • attrs ==23.2.0
  • babel ==2.15.0
  • blinker ==1.8.2
  • blis ==0.7.11
  • cachetools ==5.4.0
  • catalogue ==2.0.10
  • category-encoders ==2.6.3
  • certifi ==2024.7.4
  • cffi ==1.16.0
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • cloudpathlib ==0.18.1
  • cloudpickle ==3.0.0
  • colorama ==0.4.6
  • comm ==0.2.2
  • confection ==0.1.5
  • contourpy ==1.2.1
  • cycler ==0.12.1
  • cymem ==2.0.8
  • cython ==3.0.10
  • dash ==2.17.1
  • dash-core-components ==2.0.0
  • dash-html-components ==2.0.0
  • dash-table ==5.0.0
  • datasets ==2.20.0
  • decorator ==5.1.1
  • deprecation ==2.1.0
  • dill ==0.3.8
  • docutils ==0.21.2
  • exceptiongroup ==1.2.2
  • executing ==2.0.1
  • fastapi ==0.104.1
  • fastjsonschema ==2.20.0
  • filelock ==3.15.4
  • flask ==3.0.3
  • fonttools ==4.53.1
  • frozenlist ==1.4.1
  • fsspec ==2024.5.0
  • google-auth ==2.32.0
  • google-auth-oauthlib ==1.2.1
  • gspread ==6.1.2
  • gudhi ==3.10.1
  • h11 ==0.14.0
  • httplib2 ==0.22.0
  • huggingface-hub ==0.24.2
  • idna ==3.7
  • imagesize ==1.4.1
  • imbalanced-learn ==0.12.3
  • importlib-metadata ==8.2.0
  • importlib-resources ==6.4.0
  • intel-openmp ==2021.4.0
  • ipython ==8.18.0
  • ipywidgets ==8.1.3
  • itsdangerous ==2.2.0
  • jedi ==0.19.1
  • jinja2 ==3.1.4
  • joblib ==1.3.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2023.12.1
  • jupyter-core ==5.7.2
  • jupyterlab-widgets ==3.0.11
  • kaleido ==0.2.1
  • kiwisolver ==1.4.5
  • langcodes ==3.4.0
  • language-data ==1.2.0
  • lightgbm ==4.4.0
  • lightning-utilities ==0.11.6
  • llvmlite ==0.43.0
  • loguru ==0.7.2
  • marisa-trie ==1.2.0
  • markdown-it-py ==3.0.0
  • markupsafe ==2.1.5
  • matplotlib ==3.7.5
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • mkl ==2021.4.0
  • mpmath ==1.3.0
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • murmurhash ==1.0.10
  • nbformat ==5.10.4
  • nest-asyncio ==1.6.0
  • networkx ==3.2.1
  • nltk ==3.8.1
  • numba ==0.60.0
  • numpy ==1.26.4
  • oauth2client ==4.1.3
  • oauthlib ==3.2.2
  • opencv-python ==4.10.0.84
  • orjson ==3.10.6
  • packaging ==24.1
  • pandas ==2.1.4
  • parso ==0.8.4
  • patsy ==0.5.6
  • pexpect ==4.9.0
  • pillow ==10.4.0
  • piq ==0.8.0
  • platformdirs ==4.2.2
  • plotly ==5.23.0
  • plotly-resampler ==0.10.0
  • pmdarima ==2.0.4
  • pot ==0.9.4
  • preshed ==3.0.9
  • prompt-toolkit ==3.0.36
  • psutil ==6.0.0
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.3
  • pyarrow ==17.0.0
  • pyarrow-hotfix ==0.6
  • pyasn1 ==0.6.0
  • pyasn1-modules ==0.4.0
  • pycanon ==1.0.1.post2
  • pycaret ==3.3.2
  • pycocotools ==2.0.8
  • pycparser ==2.22
  • pydantic ==2.8.2
  • pydantic-core ==2.20.1
  • pygments ==2.18.0
  • pynndescent ==0.5.13
  • pyod ==2.0.1
  • pyparsing ==3.1.2
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • python-multipart ==0.0.6
  • pytz ==2024.1
  • pywin32 ==306
  • pyyaml ==6.0.1
  • referencing ==0.35.1
  • regex ==2024.7.24
  • requests ==2.32.3
  • requests-oauthlib ==2.0.0
  • retrying ==1.3.4
  • rich ==13.7.1
  • rpds-py ==0.19.1
  • rsa ==4.9
  • safetensors ==0.4.3
  • schemdraw ==0.15
  • scikit-base ==0.7.8
  • scikit-learn ==1.4.2
  • scikit-plot ==0.3.7
  • scipy ==1.11.4
  • sentence-transformers ==2.7.0
  • setuptools ==71.1.0
  • shellingham ==1.5.4
  • six ==1.16.0
  • sktime ==0.26.0
  • smart-open ==7.0.4
  • sniffio ==1.3.1
  • snowballstemmer ==2.2.0
  • soundfile ==0.12.1
  • spacy ==3.7.5
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • sphinx ==7.4.7
  • sphinxcontrib-applehelp ==1.0.8
  • sphinxcontrib-devhelp ==1.0.6
  • sphinxcontrib-htmlhelp ==2.0.6
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.8
  • sphinxcontrib-serializinghtml ==1.1.10
  • srsly ==2.4.8
  • stack-data ==0.6.3
  • starlette ==0.27.0
  • statsmodels ==0.14.2
  • sympy ==1.13.1
  • tabulate ==0.9.0
  • tbats ==1.1.3
  • tbb ==2021.13.0
  • tenacity ==8.5.0
  • thinc ==8.2.5
  • threadpoolctl ==3.5.0
  • tokenizers ==0.19.1
  • tomli ==2.0.1
  • torch ==2.3.1
  • torch-fidelity ==0.3.0
  • torchmetrics ==1.3.2
  • torchvision ==0.18.1
  • tqdm ==4.66.4
  • traitlets ==5.14.3
  • transformers ==4.43.2
  • tsdownsample ==0.1.3
  • tsfel ==0.1.5
  • typer ==0.12.3
  • typing-extensions ==4.12.2
  • tzdata ==2024.1
  • umap-learn ==0.5.6
  • urllib3 ==2.2.2
  • uvicorn ==0.24.0.post1
  • wasabi ==1.1.3
  • wcwidth ==0.2.13
  • weasel ==0.4.1
  • werkzeug ==3.0.3
  • wfdb ==4.1.2
  • widgetsnbextension ==4.0.11
  • win32-setctime ==1.1.0
  • word2number ==1.1
  • wrapt ==1.16.0
  • wurlitzer ==3.1.1
  • xxhash ==3.4.1
  • yarl ==1.9.4
  • yellowbrick ==1.5
  • zipp ==3.19.2
.github/workflows/python-publish.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1.12 composite
environment.yml conda
  • pip 24.0.*
  • poetry 1.8.3.*
  • python 3.9.*
  • virtualenv 20.25.0.*