Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 10 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.8%) to scientific vocabulary
Repository
Fork of pyDVL
Basic Info
- Host: GitHub
- Owner: BastienZim
- License: lgpl-3.0
- Language: Jupyter Notebook
- Default Branch: master
- Size: 83.2 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
A library for data valuation.
pyDVL collects algorithms for Data Valuation and Influence Function computation.
Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model, and a scoring function. We currently implement methods from the following papers:
- Castro, Javier, Daniel Gómez, and Juan Tejada. Polynomial Calculation of the Shapley Value Based on Sampling. Computers & Operations Research, Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1, 2009): 1726–30.
- Ghorbani, Amirata, and James Zou. Data Shapley: Equitable Valuation of Data for Machine Learning. In International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. A Multilinear Sampling Algorithm to Estimate Shapley Values. In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE, 2021.
- Yan, T., & Procaccia, A. D. If You Like Shapley Then You’ll Love the Core. Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards Efficient Data Valuation Based on the Shapley Value. In 22nd International Conference on Artificial Intelligence and Statistics, 1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
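To illustrate the core idea behind several of the methods above, here is a minimal, dependency-free sketch of permutation-sampling Shapley estimation (the approach of Castro et al., 2009). This is not pyDVL's implementation; the `utility` function and its per-point worths are hypothetical stand-ins for the score of a model trained on a subset of the data.

```python
import random

def utility(subset):
    # Hypothetical utility: each point contributes a fixed amount,
    # standing in for "score of a model trained on this subset".
    point_worth = {0: 0.5, 1: 0.3, 2: 0.2}
    return sum(point_worth[i] for i in subset)

def montecarlo_shapley(points, utility, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    values = {i: 0.0 for i in points}
    for _ in range(n_permutations):
        perm = list(points)
        rng.shuffle(perm)
        subset, prev = set(), utility(set())
        for i in perm:
            subset.add(i)
            curr = utility(subset)
            values[i] += curr - prev  # marginal contribution of i
            prev = curr
    # Average the marginal contributions over all sampled permutations
    return {i: v / n_permutations for i, v in values.items()}

values = montecarlo_shapley([0, 1, 2], utility)
# For an additive utility like this one, the Shapley value of each point
# equals its own contribution: roughly 0.5, 0.3 and 0.2.
```

Real utilities require retraining a model per subset, which is why the sampling and truncation strategies in the papers above matter.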
Influence Functions compute the effect that single points have on an estimator / model. We implement methods from the following papers:
- Koh, Pang Wei, and Percy Liang. Understanding Black-Box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, 70:1885–94. Sydney, Australia: PMLR, 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan, Second-Order Stochastic Optimization for Machine Learning in Linear Time, Journal of Machine Learning Research 18 (2017): 1-40.
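The quantity influence functions approximate can be computed exactly for trivial estimators by retraining. The following sketch (not the library's method, which uses Hessian-based approximations to avoid retraining) shows the leave-one-out effect of each data point on the sample mean:

```python
def leave_one_out_effects(xs):
    # Exact leave-one-out effect of each point on the sample mean:
    # how much the estimate changes when the point is included.
    n = len(xs)
    full = sum(xs) / n
    effects = []
    for i in range(n):
        loo = (sum(xs) - xs[i]) / (n - 1)  # estimate without point i
        effects.append(full - loo)  # how much point i pulls the estimate
    return effects

effects = leave_one_out_effects([1.0, 2.0, 3.0, 10.0])
# The outlier 10.0 has by far the largest effect (2.0); the effects
# of all points sum to zero for the mean.
```

For models with many parameters, retraining per point is infeasible, which motivates the approximations in the two papers above.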
Installation
To install the latest release use:
```shell
$ pip install pyDVL
```
You can also install the latest development version from TestPyPI:
```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```
For more instructions and information refer to Installing pyDVL in the documentation.
Usage
Influence Functions
For influence computation, follow these steps:
- Wrap your model and loss in a TorchTwiceDifferentiable object.
- Compute influences by providing training data, test data, and an inversion method.
Using the conjugate gradient algorithm, this would look like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import TorchTwiceDifferentiable, compute_influences, InversionMethod

nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()
model = TorchTwiceDifferentiable(nn_architecture, loss)

input_dim = (5, 5, 5)
output_dim = 3

train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)

influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    progress=True,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
)
```
Shapley Values
The steps required to compute values for your samples are:
- Create a Dataset object with your train and test splits.
- Create an instance of a SupervisedModel (basically any sklearn-compatible predictor).
- Create a Utility object to wrap the Dataset, the model and a scoring function.
- Use one of the methods defined in the library to compute the values.
This is how it looks for Truncated Montecarlo Shapley, an efficient method for Data Shapley values:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.value import *

data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
    truncation=RelativeTruncation(u, rtol=0.01),
)
```
For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.
Caching
pyDVL offers the possibility to cache certain results and speed up computation. It uses Memcached for that.
You can run Memcached either locally or using Docker:
```shell
docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
```
You can read more in the documentation.
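The reason caching pays off here: sampling-based valuation methods evaluate the utility on many overlapping subsets, and each evaluation means retraining a model. The following sketch shows the idea with Python's standard `functools.lru_cache` keyed on a `frozenset`; it is an illustration of the concept, not pyDVL's Memcached integration, and the `utility` function is a hypothetical stand-in.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def utility(subset):  # subset must be hashable, e.g. a frozenset
    # Stand-in for an expensive model training + evaluation.
    global calls
    calls += 1
    return len(subset) * 0.1

# Sampling-based methods revisit the same subsets; with caching, each
# distinct subset is evaluated only once.
utility(frozenset({1, 2}))
utility(frozenset({2, 1}))  # same subset, served from cache
utility(frozenset({1, 2, 3}))
# calls == 2
```

A distributed cache like Memcached extends this across processes and machines, which matters when valuation runs in parallel.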
Contributing
Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.
License
pyDVL is distributed under LGPL-3.0. A complete copy of the license is included in the repository.
All contributions will be distributed under this license.
Owner
- Name: Bastien Zimmermann
- Login: BastienZim
- Kind: user
- Location: France
- Company: Craft AI
- Repositories: 1
- Profile: https://github.com/BastienZim
R&D Engineer at Craft AI. Interested in XAI.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: pyDVL
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: TransferLab team
    email: info+pydvl@appliedai.de
    affiliation: appliedAI Institute gGmbH
repository-code: 'https://github.com/aai-institute/pyDVL'
abstract: >-
  pyDVL is a library of stable implementations of algorithms
  for data valuation and influence function computation
keywords:
- machine learning
- data-centric AI
- data valuation
- influence function
- Shapley value
- data quality
- Least core
- Semi-values
- Banzhaf index
license: LGPL-3.0
commit: 0e929ae121820b0014bf245da1b21032186768cb
version: v0.7.0
doi: 10.5281/zenodo.8311583
date-released: '2023-09-02'