pydvl - v0.10.0

v0.10.0 - 💥📚🐞🆕 New valuation interface, improved docs, new methods, breaking changes and tons of improvements

After lots of work, bug-fixing, bug-introducing, fixing again, and a good measure of bike shedding, we bring a major update putting us closer to the final APIs. The main goals of this release were to improve usability, documentation, and extensibility.

We have added a new module pydvl.valuation. The pydvl.value module is deprecated and will be removed in the next release. The new interface allows for a more consistent and flexible way to define and use valuation methods. It also simplifies experimentation, manipulation of results and data, as well as parallelization.
We have many improvements to the influence module including several new methods and approximations.
The whole documentation has been improved and consolidated, with detailed method descriptions and examples. See pydvl.org.

Added

Simple result serialization to resume computation of values PR #666
Simple memory monitor / reporting PR #663
New stopping criterion MaxSamples PR #661
Introduced UtilityModel and two implementations IndicatorUtilityModel and DeepSetsUtilityModel for data utility learning PR #650
Introduced the concept of ResultUpdater in order to allow samplers to declare the proper strategy to use by valuations PR #641
Added Banzhaf precomputed values to some games. PR #641
Introduced new IndexIterations, for consistent usage across all PowersetSamplers PR #641
Added run_removal_experiment for easy removal experiments PR #636
Refactor Classwise Shapley valuation with the interfaces and sampler architecture PR #616
Refactor KNN Shapley values with the new interface PR #610 PR #645
Refactor MSR Banzhaf semivalues with the new sampler architecture. PR #605 PR #641
Refactor group-testing Shapley values with the new sampler architecture PR #602
Refactor least-core data valuation methods with more supported sampling methods and consistent interface. PR #580
Refactor Owen-Shapley valuation with new sampler architecture. Enable use of OwenSamplers with all semi-values PR #597 PR #641
New method InverseHarmonicMeanInfluence, implementation for the paper DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models PR #582
Add new backend implementations for influence computation to account for block-diagonal approximations PR #582
Extend DirectInfluence with block-diagonal and Gauss-Newton approximation PR #591
Extend LissaInfluence with block-diagonal and Gauss-Newton approximation PR #593
Extend NystroemSketchInfluence with block-diagonal and Gauss-Newton approximation PR #596
Extend ArnoldiInfluence with block-diagonal and Gauss-Newton approximation PR #598
Extend CgInfluence with block-diagonal and Gauss-Newton approximation PR #601

Fixed

Fixed show_warnings=False not being respected in subprocesses. Introduced suppress_warnings decorator for more flexibility PR #647 PR #662
Fixed several bugs in diverse stopping criteria, including: iteration counts, computing completion, resetting, nested composition PR #641 PR #650
Fixed all weights of all samplers to ensure that mix-and-matching samplers and semi-value methods always works, for all possible combinations PR #641
Fixed a bug whereby progress bars would not report the last step and remain incomplete PR #641
Fixed the analysis of the adult dataset in the Data-OOB notebook PR #636
Replace np.float_ with np.float64 and np.alltrue with np.all, as the old aliases are removed in NumPy 2.0 PR #604
Fix a bug in pydvl.utils.numeric.random_subset where 1 - q was used instead of q as the probability of an element being sampled PR #597
Fix a bug in the calculation of variance estimates for MSR Banzhaf PR #605
Fix a bug in KNN Shapley values. See Issue 613 for details.
Backport the KNN Shapley fix to the value module PR #633
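
The random_subset fix listed above is about which probability applies to each element: every element should be kept independently with probability q, not 1 - q. The following is a minimal standard-library sketch of that intended behavior. It is not pyDVL's implementation, and the name sample_subset and its signature are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(items: Sequence[T], q: float, seed: Optional[int] = None) -> List[T]:
    """Keep each element independently with probability q (hypothetical helper)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so `rng.random() < q` is true with
    # probability q. Comparing against 1 - q instead would keep elements with
    # the complementary probability, which is the kind of bug described above.
    return [x for x in items if rng.random() < q]
```

With q = 1 every element is kept and with q = 0 none is, which makes the two edge cases a quick sanity check for the direction of the comparison.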

Changed

Slicing, comparing and setting of ValuationResult behave in a more natural and consistent way PR #660 PR #666
Switched all semi-value coefficients and sampler weights to log-space in order to avoid overflows PR #643
Updated and partially rewrote the MSR Banzhaf notebook PR #641
Updated Least-Core notebook PR #641
Updated Shapley Spotify notebook PR #628
Updated Data Utility notebook PR #650
Restructured and generalized StratifiedSampler to allow using heuristics, thus subsuming Variance-Reduced stratified sampling into a unified framework. Implemented the heuristics proposed in that paper PR #641
Uniformly distribute test points across processes for KNNShapley. Fail for GroupedDataset PR #632
Introduced the concept of logical vs data indices for Dataset, and GroupedDataset, fixing inconsistencies in how the latter operates on indices. Also, both now return objects of the same type when slicing. PR #631 PR #648
Use tighter bounds for the calculation of the minimal sample size that guarantees an epsilon-delta approximation in group testing (Jia et al. 2023) PR #602
Dropped black, isort and pylint from the CI pipeline, in favour of ruff PR #633
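
To see why the log-space switch for semi-value coefficients listed above helps, consider the Shapley coefficient 1 / (n * C(n - 1, k)) for a subset of size k. The binomial coefficient quickly exceeds the range of a 64-bit float, while its logarithm stays small. Below is a minimal sketch using only the standard library. The function names are hypothetical and this is not pyDVL's code.

```python
import math


def log_binomial(n: int, k: int) -> float:
    """Logarithm of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_shapley_coefficient(n: int, k: int) -> float:
    """Logarithm of 1 / (n * C(n - 1, k)).

    This is the Shapley weight of one subset of size k drawn from the
    other n - 1 data points (hypothetical helper for illustration).
    """
    return -math.log(n) - log_binomial(n - 1, k)
```

For n = 5000 and k = 2500, float(math.comb(4999, 2500)) raises OverflowError, whereas log_shapley_coefficient(5000, 2500) is an ordinary finite float. Products of coefficients and weights then become sums of logarithms, and exponentiation is only needed once the combined quantity is back in a representable range.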

Breaking Changes

- Changed DataOOBValuation to only accept bagged models PR #636
- Dropped support for Python 3.8 after EOL PR #633
- Rename parameter hessian_regularization of DirectInfluence to regularization and change the type annotation to allow for block-wise regularization parameters PR #591
- Rename parameter hessian_regularization of LissaInfluence to regularization and change the type annotation to allow for block-wise regularization parameters PR #593
- Remove parameter h0 from init of LissaInfluence PR #593
- Rename parameter hessian_regularization of NystroemSketchInfluence to regularization and change the type annotation to allow for block-wise regularization parameters PR #596
- Renaming of parameters of ArnoldiInfluence, hessian_regularization -> regularization (modify type annotation), rank_estimate -> rank PR #598
- Remove obsolete functions lanczos_low_rank_hessian_approximation and model_hessian_low_rank from influence.torch.functional PR #598
- Renaming of parameters of CgInfluence, hessian_regularization -> regularization (modify type annotation), pre_conditioner -> preconditioner, use_block_cg -> solve_simultaneously PR #601
- Remove parameter x0 from CgInfluence PR #601
- Rename module influence.torch.pre_conditioner -> influence.torch.preconditioner PR #601
- Refactor preconditioner:
  - renaming PreConditioner -> Preconditioner
  - fit to TensorOperator PR #601
- Bumped zarr dependency to v3 PR #668

Full diff: https://github.com/aai-institute/pyDVL/compare/v0.9.2...v0.10.0

- Python
Published by mdbenito about 1 year ago

pydvl - v0.9.2

0.9.2 - 🏗 Bug fixes, logging improvement

Added

Add progress bars to the computation of LazyChunkSequence and NestedLazyChunkSequence PR #567
Add a device fixture for pytest, which depending on the availability and user input (pytest --with-cuda) resolves to cuda device PR #574

Fixed

Fixed logging issue in decorator log_duration PR #567
Fixed missing move of tensors to model device in EkfacInfluence implementation PR #570
Fixed missing move of the preconditioner to the model device in CgInfluence implementation PR #572
Raise a more specific error message when a RuntimeError occurs in torch.linalg.eigh, so that the user can check whether it is related to a known issue PR #578
Fix an edge case (empty train data) in the test test_classwise_scorer_accuracies_manual_derivation, which resulted in undefined behavior (np.nan to int conversion with different results depending on OS) PR #579

Changed

Changed logging behavior of the iterative methods LissaInfluence and CgInfluence to warn when the desired tolerance is not achieved within maxiter. Added parameter warn_on_max_iteration to set the level of this message to logging.DEBUG PR #567

- Python
Published by schroedk about 2 years ago

pydvl - v0.9.1

0.9.1

Fixed

FutureWarning for ParallelConfig constantly raised without actually instantiating the object PR #562
Modify log level for implementations of TorchInfluenceFunctionModel
Add duration logging to output of SequentialCalculator

- Python
Published by schroedk about 2 years ago

pydvl - v0.9.0

🆕 New methods, better docs and bugfixes 📚🐞

Added

New method MSR Banzhaf with accompanying notebook, and new stopping criterion RankCorrelation PR #520
New method: NystroemSketchInfluence PR #504
New preconditioned block variant of conjugate gradient PR #507
Improvements to documentation: fixes, links, text, example gallery, LFS and more PR #532, PR #543
Glossary of data valuation and influence terms in the documentation PR #537
Documentation about writing notes for new features, changes or deprecations PR #557

Fixed

Bug in LissaInfluence, when not using CPU device PR #495
Memory issue with CgInfluence and ArnoldiInfluence PR #498
Raising specific error message with install instruction when trying to load pydvl.utils.cache.memcached without pymemcache installed. If pymemcache is available, all symbols from pydvl.utils.cache.memcached are available through pydvl.utils.cache PR #509

Changed

Add property model_dtype to instances of type TorchInfluenceFunctionModel
Bump versions of CI actions to avoid warnings PR #502
Add Python Version 3.11 to supported versions PR #510
Documentation improvements and cleanup PR #521, PR #522
Simplified parallel backend configuration PR #549

New Contributors

@jakobkruse1 made their first contribution in https://github.com/aai-institute/pyDVL/pull/510

Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.8.1...v0.9.0

- Python
Published by mdbenito about 2 years ago

pydvl - v0.8.1

🆕 New method and notebook, games with exact Shapley values, bug fixes and cleanup 🏗

Added

Implement new method: EkfacInfluence https://github.com/aai-institute/pyDVL/issues/451
New notebook to showcase ekfac for LLMs https://github.com/aai-institute/pyDVL/pull/483
Implemented exact games in Castro et al. 2009 and 2017 https://github.com/appliedAI-Initiative/pyDVL/pull/341

Fixed

Bug in using DaskInfluenceCalculator with TorchNumpyConverter for single dimensional arrays https://github.com/aai-institute/pyDVL/pull/485
Fix the to methods of the TorchInfluenceFunctionModel implementations https://github.com/aai-institute/pyDVL/pull/487
Fixed bug with checking for converged values in semivalues https://github.com/appliedAI-Initiative/pyDVL/pull/341

Docs

Add applications of data valuation section, display examples more prominently, make all sections visible in table of contents, use mkdocs material cards in the home page https://github.com/aai-institute/pyDVL/pull/492

New Contributors

@opcode81 made their first contribution in https://github.com/aai-institute/pyDVL/pull/481
@dependabot made their first contribution in https://github.com/aai-institute/pyDVL/pull/455

Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.8.0...v0.8.1

- Python
Published by AnesBenmerzoug over 2 years ago

pydvl - v0.8.0

0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁

Added

New cache backends: InMemoryCacheBackend and DiskCacheBackend PR #458
New influence function interface InfluenceFunctionModel
Data parallel computation with DaskInfluenceCalculator PR #26
Sequential batch-wise computation and write to disk with SequentialInfluenceCalculator PR #377
Adapt notebooks to new influence abstractions PR #430

Changed

Refactor and simplify caching implementation PR #458
Simplify display of computation progress PR #466
Improve readme and explain better the examples PR #465
Simplify and improve tests, add CodeCov code coverage PR #429

Breaking Changes

- Removed compute_influences and all related code. Replaced by new InfluenceFunctionModel interface. Removed modules:
  - influence.general
  - influence.inversion
  - influence.twice_differentiable
  - influence.torch.torch_differentiable

Fixed

Import bug in README PR #457

Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.7.1...v0.8.0

- Python
Published by schroedk over 2 years ago

pydvl - v0.7.1

0.7.1 - 🆕 New methods, bug fixes and improvements for local tests 🐞🧪

Added

New method: Class-wise Shapley values PR #338
New method: Data-OOB by @BastienZim PR #426, PR #431
Added AntitheticPermutationSampler PR #439
Faster semi-value computation with per-index check of stopping criteria (optional) PR #437

Changed

No longer using docker within tests to start a memcached server PR #444
Using pytest-xdist for faster local tests PR #440
Improvements and fixes to notebooks PR #436
Refactoring of parallel module. Old imports will stop working in v0.9.0 PR #421

Fixed

Fix initialization of data_names in ValuationResult.zeros() PR #443

- Python
Published by mdbenito over 2 years ago

pydvl - v0.7.0

0.7.0 - 📚🆕 Documentation and IF overhaul, new methods and bug fixes 💥🐞

This is our first β release! We have worked hard to deliver improvements across the board, with a focus on documentation and usability. We have also reworked the internals of the influence module, improved parallelism and handling of randomness.

Added

Implemented solving the Hessian equation via spectral low-rank approximation PR #365
Enabled parallel computation for Leave-One-Out values PR #406
Added more abbreviations to documentation PR #415
Added seed to functions from pydvl.utils.numeric, pydvl.value.shapley and pydvl.value.semivalues. Introduced new type Seed and conversion function ensure_seed_sequence. PR #396

Changed

Replaced sphinx with mkdocs for documentation. Major overhaul of documentation PR #352
Made ray an optional dependency, relying on joblib as default parallel backend PR #408
Decoupled ray.init from ParallelConfig PR #373

Breaking Changes

- Signature change: return information about Hessian inversion from compute_influence_factors PR #375
- Major changes to IF interface and functionality. Foundation for a framework abstraction for IF computation. PR #278 PR #394
- Renamed semivalues to compute_generic_semivalues PR #413
- New joblib backend as default instead of ray. Simplify MapReduceJob. PR #355
- Bump torch dependency for influence package to 2.0 PR #365

Fixed

Fixes to parallel computation of generic semi-values: properly handle all samplers and stopping criteria, irrespective of parallel backend. PR #372
Optimises memory usage in IF calculation PR #375
Fix adding valuation results with overlapping indices and different lengths PR #370
Fixed bugs in conjugate gradient and linear_solve PR #358
Fix installation of dev requirements for Python3.10 PR #382
Improvements to IF documentation PR #371

New Contributors

@schroedk made their first contribution in https://github.com/aai-institute/pyDVL/pull/378

Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.6.1...v0.7.0

- Python
Published by mdbenito over 2 years ago

pydvl - v0.6.1

🏗 Bug fixes and minor improvements

Fix parsing keyword arguments of compute_semivalues dispatch function by @kosmitive in https://github.com/appliedAI-Initiative/pyDVL/pull/333
Create new RayExecutor class based on the concurrent.futures API, use the new class to fix an issue with Truncated Monte Carlo Shapley (TMCS) starting too many processes and dying, plus other small changes by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/329
Fix creation of GroupedDataset objects using the from_arrays and from_sklearn class methods by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/334
Fix release job not triggering on CI when a new tag is pushed by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/331
Added alias ApproShapley from Castro et al. 2009 for permutation Shapley by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/332

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.6.0...v0.6.1

- Python
Published by AnesBenmerzoug about 3 years ago

pydvl - v0.6.0

🆕 New algorithms, cleanup and bug fixes 🏗

Fix/stopping checks by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/283
Fix Monte Carlo Least Core error when n_iterations < len(dataset) by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/281
Hide parallel backend in tmcs main function by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/293
Cosmetic changes to Dataset by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/290
Refactor/nicer imports by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/284
Fix StandardError stopping criterion by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/300
Remove unpackable decorator, use asdict() by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/233
Add burn-in param to AbsoluteStandardError by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/305
Remove default non-negativity constraint on least core subsidy by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/304
Close #280: Add py.typed by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/307
Minor docstring and cosmetic changes by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/317
Allow passing additional kwargs to Dataset class' classmethods by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/316
Semi-values and samplers by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/319
Remove bogus iter method. by @kosmitive in https://github.com/appliedAI-Initiative/pyDVL/pull/326
Improvements to ValuationResult by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/327

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.5.0...v0.6.0

- Python
Published by mdbenito about 3 years ago

pydvl - v0.5.0

🛠️ Fixes, nicer interfaces and... more breaking changes 💥😒

Slow and steady does it

What’s changed

Fixed parallel and antithetic Owen sampling for Shapley values. Simplified and extended tests. https://github.com/appliedAI-Initiative/pyDVL/pull/267
Added Scorer class for a cleaner interface. Fixed minor bugs around Group-Testing Shapley, added more tests and switched to cvxpy for the solver. https://github.com/appliedAI-Initiative/pyDVL/pull/264
Generalised stopping criteria for valuation algorithms. Improved classes ValuationResult and Status with more operations. Some minor issues fixed. https://github.com/appliedAI-Initiative/pyDVL/pull/250
Fixed a bug whereby compute_shapley_values would only spawn one process when using n_jobs=-1 and Monte Carlo methods. https://github.com/appliedAI-Initiative/pyDVL/pull/270
Bugfix in RayParallelBackend: wrong semantics for kwargs. https://github.com/appliedAI-Initiative/pyDVL/pull/268
Splitting of problem preparation and solution in Least-Core computation. Umbrella function for LC methods. https://github.com/appliedAI-Initiative/pyDVL/pull/257
Operations on ValuationResult and Status and some cleanup https://github.com/appliedAI-Initiative/pyDVL/pull/248
Bug fix and minor improvements: fixes a bug in TMCS with a remote Ray cluster, raises an error for the dummy sequential parallel backend with TMCS, clones the model inside Utility before fitting by default (with flag clone_before_fit to disable it), and catches all warnings in Utility when show_warnings is False. Adds Miner and Gloves toy game utilities https://github.com/appliedAI-Initiative/pyDVL/pull/247

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.4.0...v0.5.0

- Python
Published by mdbenito over 3 years ago

pydvl - v0.4.0

🏭💥 New algorithms and more breaking changes

Least core, group testing, fixes to parallelization and more documentation.

What's Changed

GH action to mark issues as stale PR #201
Disabled caching of Utility values as well as repeated evaluations by default PR #211
Test and officially support Python version 3.9 and 3.10 PR #208
Breaking change: Introduces a class ValuationResult to gather and inspect results from all valuation algorithms PR #214
Fixes bug in Influence calculation with multi-dimensional input and adds new example notebook PR #195
Documentation improvements PR #238 and PR #216
Breaking change: Passes the input to MapReduceJob at initialization, removes chunkify_inputs argument from MapReduceJob, removes n_runs argument from MapReduceJob, calls the parallel backend's put() method for each generated chunk in _chunkify(), renames ParallelConfig's num_workers attribute to n_local_workers, fixes a bug in MapReduceJob's chunkification when n_runs >= n_jobs, and defines a sequential parallel backend to run all jobs in the current thread PR #232
New method: Implements exact and monte carlo Least Core for data valuation, adds from_arrays() class method to the Dataset and GroupedDataset classes, adds extra_values argument to ValuationResult, adds compute_removal_score() and compute_random_removal_score() helper functions PR #237
New method: Group Testing Shapley for valuation, from Jia et al. 2019 PR #240
Fixes bug in ray initialization in RayParallelBackend class PR #239
Implements "Egalitarian Least Core", adds cvxpy as a dependency and uses it instead of scipy as optimizer PR #243
Notebook on using influence functions for Convolutional NNs PR #195

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.3.0...v0.4.0

- Python
Published by mdbenito over 3 years ago

pydvl -

💥 Breaking changes

Simplified and fixed powerset sampling and testing PR #181
Simplified and fixed publishing to PyPI from CI PR #183
Fixed bug in release script and updated contributing docs PR #184
Added Pull Request template PR #185
Modified Pull Request template to automatically link PR to issue PR #186
First implementation of Owen Sampling, squashed scores, better testing PR #194
Improved documentation on caching, Shapley, caveats of values, bibtex PR #194
Breaking change: Rearranging of modules to accommodate for new methods PR #194

- Python
Published by mdbenito over 3 years ago

pydvl - v0.2.0

What's Changed

Improve adding Notebooks to the Documentation by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/155
Fix preview release creation in CI by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/159
Add more badges to readme by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/162
Fix catching of ConnectionRefusedError in caching by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/170
Fix chunkification of data in MapReduceJob by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/176
Improvements to notebooks and API documentation by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/161
Fixed a bug in random matrix generation by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/161

Plus several minor changes and refactoring.

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.1.0...v0.2.0

- Python
Published by mdbenito over 3 years ago

pydvl - v0.1.0

This is the very first release of pyDVL 🎉

Features

Data Valuation Methods:
- Leave-One-Out
- Influence Functions
- Shapley:
  - Exact Permutation and Combinatorial
  - Montecarlo Permutation and Combinatorial
  - Truncated Montecarlo Permutation
Caching of results with Memcached
Parallelization of computations with Ray
Documentation
Notebooks containing examples of different use cases
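
As an illustration of the exact combinatorial Shapley method named in the list above, here is a minimal standard-library sketch. It is not pyDVL's implementation: the function exact_combinatorial_shapley, the utility callback signature, and the toy glove game are all assumptions made for this example. Each player's value is the average marginal contribution over all subsets of the other players, weighted by 1 / (n * C(n - 1, k)) for subsets of size k.

```python
from itertools import combinations
from math import comb
from typing import Callable, Dict, FrozenSet, Hashable, Iterable


def exact_combinatorial_shapley(
    players: Iterable[Hashable],
    utility: Callable[[FrozenSet[Hashable]], float],
) -> Dict[Hashable, float]:
    """Exact Shapley values by enumerating all subsets (exponential cost)."""
    players = list(players)
    n = len(players)
    values: Dict[Hashable, float] = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n - 1
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of i to the subset s.
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


def glove_utility(s: FrozenSet[Hashable]) -> float:
    """Toy glove game: the utility is the number of left/right pairs in s."""
    left = sum(1 for p in s if str(p).startswith("L"))
    right = sum(1 for p in s if str(p).startswith("R"))
    return float(min(left, right))


shapley = exact_combinatorial_shapley(["L1", "L2", "R"], glove_utility)
```

In this toy game the single right glove receives 2/3 and each left glove 1/6, and the values sum to the utility of the full set. The enumeration visits 2^(n-1) subsets per player, which is why the Monte Carlo variants exist.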

If you find any bugs while using it, please feel free to open an issue.

Contributors: @AnesBenmerzoug, @mdbenito, @kosmitive, @Xuzzo

Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/commits/v0.1.0

- Python
Published by AnesBenmerzoug over 3 years ago
