Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 10 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.8%) to scientific vocabulary
Repository
Fork of pyDVL
Basic Info
- Host: GitHub
- Owner: BastienZim
- License: lgpl-3.0
- Language: Jupyter Notebook
- Default Branch: master
- Size: 83.2 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
A library for data valuation.
pyDVL collects algorithms for Data Valuation and Influence Function computation.
Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model, and a scoring function. We currently implement methods from the following papers:
- Castro, Javier, Daniel Gómez, and Juan Tejada. Polynomial Calculation of the Shapley Value Based on Sampling. Computers & Operations Research, Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1, 2009): 1726–30.
- Ghorbani, Amirata, and James Zou. Data Shapley: Equitable Valuation of Data for Machine Learning. In International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia. Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning. arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. A Multilinear Sampling Algorithm to Estimate Shapley Values. In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE, 2021.
- Yan, T., & Procaccia, A. D. If You Like Shapley Then You’ll Love the Core. Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. Towards Efficient Data Valuation Based on the Shapley Value. In 22nd International Conference on Artificial Intelligence and Statistics, 1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
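To illustrate the core idea behind several of the methods above, here is a minimal, dependency-free sketch of permutation-sampling Shapley estimation (the approach of Castro et al., 2009). This is not pyDVL's implementation; the `utility` function and its per-point worths are hypothetical stand-ins for the score of a model trained on a subset of the data.

```python
import random

def utility(subset):
    # Hypothetical utility: each point contributes a fixed amount,
    # standing in for "score of a model trained on this subset".
    point_worth = {0: 0.5, 1: 0.3, 2: 0.2}
    return sum(point_worth[i] for i in subset)

def montecarlo_shapley(points, utility, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    values = {i: 0.0 for i in points}
    for _ in range(n_permutations):
        perm = list(points)
        rng.shuffle(perm)
        subset, prev = set(), utility(set())
        for i in perm:
            subset.add(i)
            curr = utility(subset)
            values[i] += curr - prev  # marginal contribution of i
            prev = curr
    # Average the marginal contributions over all sampled permutations
    return {i: v / n_permutations for i, v in values.items()}

values = montecarlo_shapley([0, 1, 2], utility)
# For an additive utility like this one, the Shapley value of each point
# equals its own contribution: roughly 0.5, 0.3 and 0.2.
```

Real utilities require retraining a model per subset, which is why the sampling and truncation strategies in the papers above matter.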
Influence Functions compute the effect that single points have on an estimator / model. We implement methods from the following papers:
- Koh, Pang Wei, and Percy Liang. Understanding Black-Box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, 70:1885–94. Sydney, Australia: PMLR, 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan, Second-Order Stochastic Optimization for Machine Learning in Linear Time, Journal of Machine Learning Research 18 (2017): 1-40.
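The quantity influence functions approximate can be computed exactly for trivial estimators by retraining. The following sketch (not the library's method, which uses Hessian-based approximations to avoid retraining) shows the leave-one-out effect of each data point on the sample mean:

```python
def leave_one_out_effects(xs):
    # Exact leave-one-out effect of each point on the sample mean:
    # how much the estimate changes when the point is included.
    n = len(xs)
    full = sum(xs) / n
    effects = []
    for i in range(n):
        loo = (sum(xs) - xs[i]) / (n - 1)  # estimate without point i
        effects.append(full - loo)  # how much point i pulls the estimate
    return effects

effects = leave_one_out_effects([1.0, 2.0, 3.0, 10.0])
# The outlier 10.0 has by far the largest effect (2.0); the effects
# of all points sum to zero for the mean.
```

For models with many parameters, retraining per point is infeasible, which motivates the approximations in the two papers above.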
Installation
To install the latest release use:
```shell
$ pip install pyDVL
```
You can also install the latest development version from TestPyPI:
```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```
For more instructions and information refer to Installing pyDVL in the documentation.
Usage
Influence Functions
For influence computation, follow these steps:
- Wrap your model and loss in a TorchTwiceDifferentiable object.
- Compute influences by providing training data, test data, and an inversion method.
Using the conjugate gradient algorithm, this would look like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import TorchTwiceDifferentiable, compute_influences, InversionMethod

nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()
model = TorchTwiceDifferentiable(nn_architecture, loss)

input_dim = (5, 5, 5)
output_dim = 3

train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)

influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    progress=True,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
)
```
Shapley Values
The steps required to compute values for your samples are:
- Create a Dataset object with your train and test splits.
- Create an instance of a SupervisedModel (basically any sklearn-compatible predictor).
- Create a Utility object to wrap the Dataset, the model and a scoring function.
- Use one of the methods defined in the library to compute the values.
This is how it looks for Truncated Montecarlo Shapley, an efficient method for Data Shapley values:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.value import *

data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
    truncation=RelativeTruncation(u, rtol=0.01),
)
```
For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.
Caching
pyDVL offers the possibility to cache certain results and speed up computation. It uses Memcached for that.
You can run Memcached either locally or using Docker:
```shell
docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
```
You can read more in the documentation.
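The reason caching pays off here: sampling-based valuation methods evaluate the utility on many overlapping subsets, and each evaluation means retraining a model. The following sketch shows the idea with Python's standard `functools.lru_cache` keyed on a `frozenset`; it is an illustration of the concept, not pyDVL's Memcached integration, and the `utility` function is a hypothetical stand-in.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def utility(subset):  # subset must be hashable, e.g. a frozenset
    # Stand-in for an expensive model training + evaluation.
    global calls
    calls += 1
    return len(subset) * 0.1

# Sampling-based methods revisit the same subsets; with caching, each
# distinct subset is evaluated only once.
utility(frozenset({1, 2}))
utility(frozenset({2, 1}))  # same subset, served from cache
utility(frozenset({1, 2, 3}))
# calls == 2
```

A distributed cache like Memcached extends this across processes and machines, which matters when valuation runs in parallel.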
Contributing
Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.
License
pyDVL is distributed under LGPL-3.0. A complete copy of the license is included in the repository.
All contributions will be distributed under this license.
Owner
- Name: Bastien Zimmermann
- Login: BastienZim
- Kind: user
- Location: France
- Company: Craft AI
- Repositories: 1
- Profile: https://github.com/BastienZim
R&D Engineer at Craft AI. Interested in XAI.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: pyDVL
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: TransferLab team
    email: info+pydvl@appliedai.de
    affiliation: appliedAI Institute gGmbH
repository-code: 'https://github.com/aai-institute/pyDVL'
abstract: >-
  pyDVL is a library of stable implementations of algorithms
  for data valuation and influence function computation
keywords:
- machine learning
- data-centric AI
- data valuation
- influence function
- Shapley value
- data quality
- Least core
- Semi-values
- Banzhaf index
license: LGPL-3.0
commit: 0e929ae121820b0014bf245da1b21032186768cb
version: v0.7.0
doi: 10.5281/zenodo.8311583
date-released: '2023-09-02'