pydvl_dataoob

fork of pyDVL

https://github.com/bastienzim/pydvl_dataoob

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

fork of pyDVL

Basic Info
  • Host: GitHub
  • Owner: BastienZim
  • License: lgpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 83.2 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog Contributing License Citation

README.md


A library for data valuation.


Documentation

pyDVL collects algorithms for Data Valuation and Influence Function computation.

Data Valuation is the task of estimating the intrinsic value of a data point with respect to the training set, the model and a scoring function. We currently implement methods from the following papers:

Influence Functions compute the effect that single data points have on an estimator or model. We implement methods from the following papers:

Installation

To install the latest release use:

```shell
$ pip install pyDVL
```

You can also install the latest development version from TestPyPI:

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

For more instructions and information refer to Installing pyDVL in the documentation.

Usage

Influence Functions

For influence computation, follow these steps:

  1. Wrap your model and loss in a TorchTwiceDifferentiable object
  2. Compute influence factors by providing training data and inversion method

Using the conjugate gradient algorithm, this would look like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import TorchTwiceDifferentiable, compute_influences, InversionMethod

nn_architecture = nn.Sequential(
    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
    nn.Flatten(),
    nn.Linear(27, 3),
)
loss = nn.MSELoss()
model = TorchTwiceDifferentiable(nn_architecture, loss)

input_dim = (5, 5, 5)
output_dim = 3

train_data_loader = DataLoader(
    TensorDataset(torch.rand((10, *input_dim)), torch.rand((10, output_dim))),
    batch_size=2,
)
test_data_loader = DataLoader(
    TensorDataset(torch.rand((5, *input_dim)), torch.rand((5, output_dim))),
    batch_size=1,
)

influences = compute_influences(
    model,
    training_data=train_data_loader,
    test_data=test_data_loader,
    progress=True,
    inversion_method=InversionMethod.Cg,
    hessian_regularization=1e-1,
    maxiter=200,
)
```
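As a conceptual aside (this is not the pyDVL API, just an illustration), the idea of influence can be seen in closed form for a trivial estimator: removing a point `x_i` from a sample of size `n` shifts the sample mean by exactly `(mean - x_i) / (n - 1)`, so points far from the mean are the most influential. The function name below is made up for the example:

```python
# Illustration only: the influence of a single point on the sample mean.
# Removing x_i shifts the mean by (mean - x_i) / (n - 1), so outliers
# are the most "influential" points.

def mean(xs):
    return sum(xs) / len(xs)

def influence_on_mean(xs, i):
    """Exact change in the mean when xs[i] is removed."""
    rest = xs[:i] + xs[i + 1:]
    return mean(rest) - mean(xs)

data = [1.0, 2.0, 3.0, 10.0]
print(influence_on_mean(data, 3))  # removing the outlier 10.0 lowers the mean by 2.0
```

For models without such a closed form, influence functions approximate this leave-one-out effect without retraining, which is what `compute_influences` does above.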

Shapley Values

The steps required to compute values for your samples are:

  1. Create a Dataset object with your train and test splits.
  2. Create an instance of a SupervisedModel (essentially any scikit-learn compatible predictor)
  3. Create a Utility object to wrap the Dataset, the model and a scoring function.
  4. Use one of the methods defined in the library to compute the values.

This is how it looks for Truncated Monte Carlo Shapley, an efficient method for Data Shapley values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.value import *

data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
model = LogisticRegression()
u = Utility(model, data, Scorer("accuracy", default=0.0))
values = compute_shapley_values(
    u,
    mode=ShapleyMode.TruncatedMontecarlo,
    done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
    truncation=RelativeTruncation(u, rtol=0.01),
)
```
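To see why sampling-based methods like the one above are needed, the Shapley value of a data point can be computed exactly by enumerating all subsets, but this costs exponential time in the number of points. A minimal sketch in plain Python (not the pyDVL API; the function and the toy utility are made up for illustration):

```python
# Illustration only: exact Shapley values for a toy 3-point "dataset"
# by enumerating every subset.  Monte Carlo methods estimate the same
# quantity by sampling, since enumeration is exponential in len(points).
from itertools import combinations
from math import factorial

def shapley_values(points, utility):
    n = len(points)
    values = {}
    for p in points:
        others = [q for q in points if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                # Weight of a coalition of size k in the Shapley formula.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (utility(set(subset) | {p}) - utility(set(subset)))
        values[p] = total
    return values

# Toy utility: the "score" of a subset is simply its size, so every
# point has the same marginal contribution.
vals = shapley_values(["a", "b", "c"], lambda s: len(s))
print(vals)  # {'a': 1.0, 'b': 1.0, 'c': 1.0}
```

In pyDVL the utility role is played by the `Utility` object (model + data + scorer), and `compute_shapley_values` replaces the exhaustive loop with truncated Monte Carlo sampling.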

For more instructions and information refer to Getting Started in the documentation. We provide several examples with details on the algorithms and their applications.

Caching

pyDVL can cache certain intermediate results to speed up computation. It uses Memcached for that.

You can run Memcached either locally or using Docker:

```shell
docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
```

You can read more in the documentation.

Contributing

Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the guide for contributions.

License

pyDVL is distributed under LGPL-3.0. The complete license text is included in the repository.

All contributions will be distributed under this license.

Owner

  • Name: Bastien Zimmermann
  • Login: BastienZim
  • Kind: user
  • Location: France
  • Company: Craft AI

R&D Engineer at Craft AI. Interested in XAI.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: pyDVL
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: TransferLab team
    email: info+pydvl@appliedai.de
    affiliation: appliedAI Institute gGmbH
repository-code: 'https://github.com/aai-institute/pyDVL'
abstract: >-
  pyDVL is a library of stable implementations of algorithms
  for data valuation and influence function computation
keywords:
  - machine learning
  - data-centric AI
  - data valuation
  - influence function
  - Shapley value
  - data quality
  - Least core
  - Semi-values
  - Banzhaf index
license: LGPL-3.0
commit: 0e929ae121820b0014bf245da1b21032186768cb
version: v0.7.0
doi: 10.5281/zenodo.8311583
date-released: '2023-09-02'
