Recent Releases of pydvl
pydvl - v0.10.0
v0.10.0 - New valuation interface, improved docs, new methods, breaking changes and tons of improvements
After lots of work, bug-fixing, bug-introducing, fixing again, and a good measure of bike shedding, we bring a major update putting us closer to the final APIs. The main goals of this release were to improve usability, documentation, and extensibility.
- We have added a new module `pydvl.valuation`. The `pydvl.value` module is deprecated and will be removed in the next release. The new interface allows for a more consistent and flexible way to define and use valuation methods. It also simplifies experimentation, manipulation of results and data, as well as parallelization.
- We have many improvements to the `influence` module including several new methods and approximations.
- The whole documentation has been improved and consolidated, with detailed method descriptions and examples. See pydvl.org.
Added
- Simple result serialization to resume computation of values PR #666
- Simple memory monitor / reporting PR #663
- New stopping criterion `MaxSamples` PR #661
- Introduced `UtilityModel` and two implementations `IndicatorUtilityModel` and `DeepSetsUtilityModel` for data utility learning PR #650
- Introduced the concept of `ResultUpdater` in order to allow samplers to declare the proper strategy to use by valuations PR #641
- Added Banzhaf precomputed values to some games. PR #641
- Introduced new `IndexIterations`, for consistent usage across all `PowersetSamplers` PR #641
- Added `run_removal_experiment` for easy removal experiments PR #636
- Refactor Classwise Shapley valuation with the interfaces and sampler architecture PR #616
- Refactor KNN Shapley values with the new interface PR #610 PR #645
- Refactor MSR Banzhaf semivalues with the new sampler architecture. PR #605 PR #641
- Refactor group-testing shapley values with new sampler architecture PR #602
- Refactor least-core data valuation methods with more supported sampling methods and consistent interface. PR #580
- Refactor Owen-Shapley valuation with new sampler architecture. Enable use of `OwenSamplers` with all semi-values PR #597 PR #641
- New method `InverseHarmonicMeanInfluence`, implementation for the paper "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models" PR #582
- Add new backend implementations for influence computation to account for block-diagonal approximations PR #582
- Extend `DirectInfluence` with block-diagonal and Gauss-Newton approximation PR #591
- Extend `LissaInfluence` with block-diagonal and Gauss-Newton approximation PR #593
- Extend `NystroemSketchInfluence` with block-diagonal and Gauss-Newton approximation PR #596
- Extend `ArnoldiInfluence` with block-diagonal and Gauss-Newton approximation PR #598
- Extend `CgInfluence` with block-diagonal and Gauss-Newton approximation PR #601
Fixed
- Fixed `show_warnings=False` not being respected in subprocesses. Introduced `suppress_warnings` decorator for more flexibility PR #647 PR #662
- Fixed several bugs in diverse stopping criteria, including: iteration counts, computing completion, resetting, nested composition PR #641 PR #650
- Fixed all weights of all samplers to ensure that mix-and-matching samplers and semi-value methods always works, for all possible combinations PR #641
- Fixed a bug whereby progress bars would not report the last step and remain incomplete PR #641
- Fixed the analysis of the adult dataset in the Data-OOB notebook PR #636
- Replace `np.float_` with `np.float64` and `np.alltrue` with `np.all`, as the old aliases are removed in NumPy 2.0 PR #604
- Fix a bug in `pydvl.utils.numeric.random_subset` where `1 - q` was used instead of `q` as the probability of an element being sampled PR #597
- Fix a bug in the calculation of variance estimates for MSR Banzhaf PR #605
- Fix a bug in KNN Shapley values. See Issue 613 for details.
- Backport the KNN Shapley fix to the `value` module PR #633
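To make the `random_subset` fix above concrete, here is a minimal standard-library sketch of sampling a subset where each element is kept with probability `q`. It is not pyDVL's implementation: the function name `random_subset_sketch` and its signature are hypothetical, chosen only for illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def random_subset_sketch(
    items: Sequence[T], q: float = 0.5, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep each element independently with probability q (illustrative only)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so `< q` holds with probability q.
    # Writing `> q` instead would keep elements with probability 1 - q,
    # which is the kind of inversion the changelog entry describes.
    return [x for x in items if rng.random() < q]


# Empirical check: the average kept fraction should be close to q.
rng = random.Random(42)
sizes = [len(random_subset_sketch(range(100), q=0.2, rng=rng)) for _ in range(500)]
mean_fraction = sum(sizes) / (500 * 100)
```

With `q=0.5` every subset of the input is equally likely, so the inverted comparison would go unnoticed at that value and only show up for other choices of `q`.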
Changed
- Slicing, comparing and setting of `ValuationResult` behave in a more natural and consistent way PR #660 PR #666
- Switched all semi-value coefficients and sampler weights to log-space in order to avoid overflows PR #643
- Updated and rewrote some of the MSR banzhaf notebook PR #641
- Updated Least-Core notebook PR #641
- Updated Shapley spotify notebook PR #628
- Updated Data Utility notebook PR #650
- Restructured and generalized `StratifiedSampler` to allow using heuristics, thus subsuming Variance-Reduced stratified sampling into a unified framework. Implemented the heuristics proposed in that paper PR #641
- Uniformly distribute test points across processes for KNNShapley. Fail for `GroupedDataset` PR #632
- Introduced the concept of logical vs data indices for `Dataset`, and `GroupedDataset`, fixing inconsistencies in how the latter operates on indices. Also, both now return objects of the same type when slicing. PR #631 PR #648
- Use tighter bounds for the calculation of the minimal sample size that guarantees an epsilon-delta approximation in group testing (Jia et al. 2023) PR #602
- Dropped black, isort and pylint from the CI pipeline, in favour of ruff PR #633
- Breaking Changes
  - Changed `DataOOBValuation` to only accept bagged models PR #636
  - Dropped support for python 3.8 after EOL PR #633
  - Rename parameter `hessian_regularization` of `DirectInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #591
  - Rename parameter `hessian_regularization` of `LissaInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #593
  - Remove parameter `h0` from init of `LissaInfluence` PR #593
  - Rename parameter `hessian_regularization` of `NystroemSketchInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #596
  - Renaming of parameters of `ArnoldiInfluence`, `hessian_regularization` -> `regularization` (modify type annotation), `rank_estimate` -> `rank` PR #598
  - Remove obsolete functions `lanczos_low_rank_hessian_approximation`, `model_hessian_low_rank` from `influence.torch.functional` PR #598
  - Renaming of parameters of `CgInfluence`, `hessian_regularization` -> `regularization` (modify type annotation), `pre_conditioner` -> `preconditioner`, `use_block_cg` -> `solve_simultaneously` PR #601
  - Remove parameter `x0` from `CgInfluence` PR #601
  - Rename module `influence.torch.pre_conditioner` -> `influence.torch.preconditioner` PR #601
  - Refactor preconditioner:
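To illustrate why semi-value coefficients are kept in log-space (PR #643 above), here is a minimal sketch using only the standard library. It is not pyDVL's code: it assumes the usual Shapley weight 1 / (n * C(n-1, k)) for a subset of size k, and the helper names `log_binom`, `log_shapley_coefficient` and `logsumexp` are hypothetical.

```python
import math
from typing import Sequence


def log_binom(n: int, k: int) -> float:
    """Logarithm of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_shapley_coefficient(n: int, k: int) -> float:
    """Logarithm of 1 / (n * C(n-1, k)), the Shapley weight for subset size k."""
    return -math.log(n) - log_binom(n - 1, k)


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


n = 5000
# The log-weight is an ordinary finite float ...
log_w = log_shapley_coefficient(n, n // 2)
# ... whereas the weight itself is too small for double precision.
direct_w = math.exp(log_w)

# Sanity check in log-space: sum_k C(n-1, k) * w(k) = 1, so its log is ~0.
log_total = logsumexp(
    [log_binom(n - 1, k) + log_shapley_coefficient(n, k) for k in range(n)]
)
```

For `n = 5000` the direct weight underflows to `0.0`, and the binomial coefficient it is paired with would overflow if converted to a float, while all quantities stay representable in log-space.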
Full diff: https://github.com/aai-institute/pyDVL/compare/v0.9.2...v0.10.0
- Python
Published by mdbenito about 1 year ago
pydvl - v0.9.2
0.9.2 - Bug fixes, logging improvement
Added
- Add progress bars to the computation of `LazyChunkSequence` and `NestedLazyChunkSequence` PR #567
- Add a device fixture for `pytest`, which depending on the availability and user input (`pytest --with-cuda`) resolves to cuda device PR #574
Fixed
- Fixed logging issue in decorator `log_duration` PR #567
- Fixed missing move of tensors to model device in `EkfacInfluence` implementation PR #570
- Missing move to device of `preconditioner` in `CgInfluence` implementation PR #572
- Raise a more specific error message when a `RuntimeError` occurs in `torch.linalg.eigh`, so the user can check if it is related to a known issue PR #578
- Fix an edge case (empty train data) in the test `test_classwise_scorer_accuracies_manual_derivation`, which resulted in undefined behavior (`np.nan` to `int` conversion with different results depending on OS) PR #579
Changed
- Changed logging behavior of iterative methods `LissaInfluence` and `CgInfluence` to warn on not achieving desired tolerance within `maxiter`, add parameter `warn_on_max_iteration` to set the level for this information to `logging.DEBUG` PR #567
- Python
Published by schroedk about 2 years ago
pydvl - v0.9.0
New methods, better docs and bugfixes
Added
- New method `MSR Banzhaf` with accompanying notebook, and new stopping criterion `RankCorrelation` PR #520
- New method: `NystroemSketchInfluence` PR #504
- New preconditioned block variant of conjugate gradient PR #507
- Improvements to documentation: fixes, links, text, example gallery, LFS and more PR #532, PR #543
- Glossary of data valuation and influence terms in the documentation PR #537
- Documentation about writing notes for new features, changes or deprecations PR #557
Fixed
- Bug in `LissaInfluence`, when not using CPU device PR #495
- Memory issue with `CgInfluence` and `ArnoldiInfluence` PR #498
- Raising specific error message with install instruction when trying to load `pydvl.utils.cache.memcached` without `pymemcache` installed. If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached` are available through `pydvl.utils.cache` PR #509
Changed
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Bump versions of CI actions to avoid warnings PR #502
- Add Python Version 3.11 to supported versions PR #510
- Documentation improvements and cleanup PR #521, PR #522
- Simplified parallel backend configuration PR #549
New Contributors
- @jakobkruse1 made their first contribution in https://github.com/aai-institute/pyDVL/pull/510
Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.8.1...v0.9.0
- Python
Published by mdbenito about 2 years ago
pydvl - v0.8.1
New method and notebook, Games with exact shapley values, bug fixes and cleanup
Added
- Implement new method: EkfacInfluence https://github.com/aai-institute/pyDVL/issues/451
- New notebook to showcase ekfac for LLMs https://github.com/aai-institute/pyDVL/pull/483
- Implemented exact games in Castro et al. 2009 and 2017 https://github.com/appliedAI-Initiative/pyDVL/pull/341
Fixed
- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter` for single dimensional arrays https://github.com/aai-institute/pyDVL/pull/485
- Fix implementations of `to` methods of `TorchInfluenceFunctionModel` implementations https://github.com/aai-institute/pyDVL/pull/487
- Fixed bug with checking for converged values in semivalues https://github.com/appliedAI-Initiative/pyDVL/pull/341
Docs
- Add applications of data valuation section, display examples more prominently, make all sections visible in table of contents, use mkdocs material cards in the home page https://github.com/aai-institute/pyDVL/pull/492
New Contributors
- @opcode81 made their first contribution in https://github.com/aai-institute/pyDVL/pull/481
- @dependabot made their first contribution in https://github.com/aai-institute/pyDVL/pull/455
Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.8.0...v0.8.1
- Python
Published by AnesBenmerzoug over 2 years ago
pydvl - v0.8.0
0.8.0 - New interfaces, scaling computation, bug fixes and improvements
Added
- New cache backends: InMemoryCacheBackend and DiskCacheBackend PR #458
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator` PR #26
- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator` PR #377
- Adapt notebooks to new influence abstractions PR #430
Changed
- Refactor and simplify caching implementation PR #458
- Simplify display of computation progress PR #466
- Improve readme and explain better the examples PR #465
- Simplify and improve tests, add CodeCov code coverage PR #429
- Breaking Changes
  - Removed `compute_influences` and all related code. Replaced by new `InfluenceFunctionModel` interface. Removed modules:
    - influence.general
    - influence.inversion
    - influence.twice_differentiable
    - influence.torch.torch_differentiable
Fixed
- Import bug in README PR #457
Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.7.1...v0.8.0
- Python
Published by schroedk over 2 years ago
pydvl - v0.7.1
0.7.1 - New methods, bug fixes and improvements for local tests
Added
- New method: Class-wise Shapley values PR #338
- New method: Data-OOB by @BastienZim PR #426, PR #431
- Added `AntitheticPermutationSampler` PR #439
- Faster semi-value computation with per-index check of stopping criteria (optional) PR #437
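As a rough illustration of antithetic permutation sampling, mentioned in the list above, the sketch below yields each random permutation followed by its reverse. This is an assumption about the general technique and not a copy of pyDVL's `AntitheticPermutationSampler`; the function name `antithetic_permutations` and its parameters are hypothetical.

```python
import random
from typing import Iterator, List, Optional


def antithetic_permutations(
    indices: List[int], n_pairs: int, seed: Optional[int] = None
) -> Iterator[List[int]]:
    """Yield random permutations, each immediately followed by its reverse."""
    rng = random.Random(seed)
    for _ in range(n_pairs):
        perm = list(indices)
        rng.shuffle(perm)
        yield perm
        # The antithetic partner: the same permutation in reverse order.
        yield perm[::-1]


samples = list(antithetic_permutations([0, 1, 2, 3], n_pairs=2, seed=0))
```

Pairing a permutation with its reverse means an index that appears early in one sample appears late in the other, which is the usual motivation for the antithetic variant as a variance-reduction device.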
Changed
- No longer using docker within tests to start a memcached server PR #444
- Using pytest-xdist for faster local tests PR #440
- Improvements and fixes to notebooks PR #436
- Refactoring of parallel module. Old imports will stop working in v0.9.0 PR #421
Fixed
- Fix initialization of `data_names` in `ValuationResult.zeros()` PR #443
- Python
Published by mdbenito over 2 years ago
pydvl - v0.7.0
0.7.0 - Documentation and IF overhaul, new methods and bug fixes
This is our first β release! We have worked hard to deliver improvements across
the board, with a focus on documentation and usability. We have also reworked
the internals of the influence module, improved parallelism and handling of
randomness.
Added
- Implemented solving the Hessian equation via spectral low-rank approximation PR #365
- Enabled parallel computation for Leave-One-Out values PR #406
- Added more abbreviations to documentation PR #415
- Added seed to functions from `pydvl.utils.numeric`, `pydvl.value.shapley` and `pydvl.value.semivalues`. Introduced new type `Seed` and conversion function `ensure_seed_sequence`. PR #396
Changed
- Replaced sphinx with mkdocs for documentation. Major overhaul of documentation PR #352
- Made ray an optional dependency, relying on joblib as default parallel backend PR #408
- Decoupled `ray.init` from `ParallelConfig` PR #373
- Breaking Changes
  - Signature change: return information about Hessian inversion from `compute_influence_factors` PR #375
  - Major changes to IF interface and functionality. Foundation for a framework abstraction for IF computation. PR #278 PR #394
  - Renamed `semivalues` to `compute_generic_semivalues` PR #413
  - New `joblib` backend as default instead of ray. Simplify MapReduceJob. PR #355
  - Bump torch dependency for influence package to 2.0 PR #365
Fixed
- Fixes to parallel computation of generic semi-values: properly handle all samplers and stopping criteria, irrespective of parallel backend. PR #372
- Optimises memory usage in IF calculation PR #375
- Fix adding valuation results with overlapping indices and different lengths PR #370
- Fixed bugs in conjugate gradient and `linear_solve` PR #358
- Fix installation of dev requirements for Python3.10 PR #382
- Improvements to IF documentation PR #371
New Contributors
- @schroedk made their first contribution in https://github.com/aai-institute/pyDVL/pull/378
Full Changelog: https://github.com/aai-institute/pyDVL/compare/v0.6.1...v0.7.0
- Python
Published by mdbenito over 2 years ago
pydvl - v0.6.1
Bug fixes and minor improvements
- Fix parsing keyword arguments of `compute_semivalues` dispatch function by @kosmitive in https://github.com/appliedAI-Initiative/pyDVL/pull/333
- Create new `RayExecutor` class based on the concurrent.futures API, use the new class to fix an issue with Truncated Monte Carlo Shapley (TMCS) starting too many processes and dying, plus other small changes by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/329
- Fix creation of GroupedDataset objects using the `from_arrays` and `from_sklearn` class methods by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/334
- Fix release job not triggering on CI when a new tag is pushed by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/331
- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/332
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.6.0...v0.6.1
- Python
Published by AnesBenmerzoug about 3 years ago
pydvl - v0.6.0
New algorithms, cleanup and bug fixes
- Fix/stopping checks by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/283
- Fix Monte Carlo Least Core error when n_iterations < len(dataset) by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/281
- Hide parallel backend in tmcs main function by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/293
- Cosmetic changes to `Dataset` by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/290
- Refactor/nicer imports by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/284
- Fix StandardError stopping criterion by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/300
- Remove unpackable decorator, use asdict() by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/233
- Add burn-in param to AbsoluteStandardError by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/305
- Remove default non-negativity constraint on least core subsidy by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/304
- Close #280: Add py.typed by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/307
- Minor docstring and cosmetic changes by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/317
- Allow passing additional kwargs to Dataset class' classmethods by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/316
- Semi-values and samplers by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/319
- Remove bogus iter method. by @kosmitive in https://github.com/appliedAI-Initiative/pyDVL/pull/326
- Improvements to ValuationResult by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/327
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.5.0...v0.6.0
- Python
Published by mdbenito about 3 years ago
pydvl - v0.5.0
Fixes, nicer interfaces and... more breaking changes
Slow and steady does it
What's changed
- Fixed parallel and antithetic Owen sampling for Shapley values. Simplified and extended tests. https://github.com/appliedAI-Initiative/pyDVL/pull/267
- Added Scorer class for a cleaner interface. Fixed minor bugs around Group-Testing Shapley, added more tests and switched to cvxpy for the solver. https://github.com/appliedAI-Initiative/pyDVL/pull/264
- Generalised stopping criteria for valuation algorithms. Improved classes ValuationResult and Status with more operations. Some minor issues fixed. https://github.com/appliedAI-Initiative/pyDVL/pull/250
- Fixed a bug whereby `compute_shapley_values` would only spawn one process when using `n_jobs=-1` and Monte Carlo methods. https://github.com/appliedAI-Initiative/pyDVL/pull/270
- Bugfix in RayParallelBackend: wrong semantics for kwargs. https://github.com/appliedAI-Initiative/pyDVL/pull/268
- Splitting of problem preparation and solution in Least-Core computation. Umbrella function for LC methods. https://github.com/appliedAI-Initiative/pyDVL/pull/257
- Operations on ValuationResult and Status and some cleanup https://github.com/appliedAI-Initiative/pyDVL/pull/248
- Bug fix and minor improvements: Fixes bug in TMCS with remote Ray cluster, raises an error for dummy sequential parallel backend with TMCS, clones model inside Utility before fitting by default, with flag `clone_before_fit` to disable it, catches all warnings in Utility when `show_warnings` is False. Adds Miner and Gloves toy games utilities https://github.com/appliedAI-Initiative/pyDVL/pull/247
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.4.0...v0.5.0
- Python
Published by mdbenito over 3 years ago
pydvl - v0.4.0
New algorithms and more breaking changes
Least core, group testing, fixes to parallelization and more documentation.
What's Changed
- GH action to mark issues as stale PR #201
- Disabled caching of Utility values as well as repeated evaluations by default PR #211
- Test and officially support Python version 3.9 and 3.10 PR #208
- Breaking change: Introduces a class ValuationResult to gather and inspect results from all valuation algorithms PR #214
- Fixes bug in Influence calculation with multi-dimensional input and adds new example notebook PR #195
- Documentation improvements PR #238 and PR #216
- Breaking change: Passes the input to `MapReduceJob` at initialization, removes `chunkify_inputs` argument from `MapReduceJob`, removes `n_runs` argument from `MapReduceJob`, calls the parallel backend's `put()` method for each generated chunk in `_chunkify()`, renames ParallelConfig's `num_workers` attribute to `n_local_workers`, fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`, and defines a sequential parallel backend to run all jobs in the current thread PR #232
- New method: Implements exact and monte carlo Least Core for data valuation, adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes, adds `extra_values` argument to `ValuationResult`, adds `compute_removal_score()` and `compute_random_removal_score()` helper functions PR #237
- New method: Group Testing Shapley for valuation, from Jia et al. 2019 PR #240
- Fixes bug in ray initialization in `RayParallelBackend` class PR #239
- Implements "Egalitarian Least Core", adds cvxpy as a dependency and uses it instead of scipy as optimizer PR #243
- Notebook on using influence functions for Convolutional NNs PR #195
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.3.0...v0.4.0
- Python
Published by mdbenito over 3 years ago
pydvl -
Breaking changes
- Simplified and fixed powerset sampling and testing PR #181
- Simplified and fixed publishing to PyPI from CI PR #183
- Fixed bug in release script and updated contributing docs PR #184
- Added Pull Request template PR #185
- Modified Pull Request template to automatically link PR to issue PR #186
- First implementation of Owen Sampling, squashed scores, better testing PR #194
- Improved documentation on caching, Shapley, caveats of values, bibtex PR #194
- Breaking change: Rearranging of modules to accommodate for new methods PR #194
- Python
Published by mdbenito over 3 years ago
pydvl - v0.2.0
What's Changed
- Improve adding Notebooks to the Documentation by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/155
- Fix preview release creation in CI by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/159
- Add more badges to readme by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/162
- Fix catching of ConnectionRefusedError in caching by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/170
- Fix chunkification of data in MapReduceJob by @AnesBenmerzoug in https://github.com/appliedAI-Initiative/pyDVL/pull/176
- Improvements to notebooks and API documentation by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/161
- Fixed a bug in random matrix generation by @mdbenito in https://github.com/appliedAI-Initiative/pyDVL/pull/161
Plus several minor changes and refactoring.
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/compare/v0.1.0...v0.2.0
- Python
Published by mdbenito over 3 years ago
pydvl - v0.1.0
This is the very first release of pyDVL :tada:
Features
Data Valuation Methods:
- Leave-One-Out
- Influence Functions
- Shapley:
- Exact Permutation and Combinatorial
- Montecarlo Permutation and Combinatorial
- Truncated Montecarlo Permutation
Caching of results with Memcached
Parallelization of computations with Ray
Documentation
Notebooks containing examples of different use cases
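For readers unfamiliar with the exact combinatorial Shapley method listed under the valuation methods above, here is a small self-contained sketch on a toy utility. It is not the library's implementation: the function `exact_combinatorial_shapley` and the utility `toy_utility` are hypothetical, and the cost grows exponentially with the number of points, so it is only practical for tiny examples.

```python
from itertools import combinations
from math import comb
from typing import Callable, Dict, FrozenSet, Sequence


def exact_combinatorial_shapley(
    players: Sequence[int], utility: Callable[[FrozenSet[int]], float]
) -> Dict[int, float]:
    """Shapley values by enumerating all subsets of the other players."""
    n = len(players)
    values: Dict[int, float] = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of every subset of size k: 1 / (n * C(n-1, k)).
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of i to the subset s.
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


def toy_utility(s: FrozenSet[int]) -> float:
    """Hypothetical symmetric utility: the squared size of the subset."""
    return float(len(s)) ** 2


values = exact_combinatorial_shapley([0, 1, 2], toy_utility)
```

Because the toy utility is symmetric, all three points receive the same value, and the values add up to the utility of the full set minus that of the empty set.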
If you find any bugs while using it, please feel free to open an issue.
Contributors: @AnesBenmerzoug, @mdbenito, @kosmitive, @Xuzzo
Full Changelog: https://github.com/appliedAI-Initiative/pyDVL/commits/v0.1.0
- Python
Published by AnesBenmerzoug over 3 years ago