psmi
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Numerical estimator of Pointwise Sliced Mutual Information (PSMI).
Jérémie Dentan, École Polytechnique
Setup
Our library is published on PyPI!
```bash
pip install psmi
```
Usage
We implement a class PSMI designed to be used like scikit-learn estimators, with fit, transform and fit_transform methods.
We only implement PSMI between scalar features and integer labels belonging to a finite number of classes. We use Algorithm 1 in [1] to estimate PSMI. The only hyperparameter of this algorithm is the number of estimators (i.e. the number of directions sampled to estimate PSMI).
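To make the idea concrete, the estimation can be summarized as: sample random unit directions, project the features onto each direction, estimate the pointwise mutual information between each 1-D projection and the labels, and average over directions. Below is a minimal illustrative sketch of this scheme, not the library's implementation: the per-class Gaussian density model and all names are assumptions.

```python
import numpy as np

def psmi_sketch(features, labels, n_estimators=500, rng=None):
    """Illustrative pointwise sliced MI estimate (sketch, not the psmi library).

    For each random direction theta, project the features to 1-D and
    estimate log p(z | y) - log p(z) with per-class Gaussian fits.
    Returns the per-sample mean over directions, shape (n_samples,).
    """
    rng = np.random.default_rng(rng)
    n, d = features.shape
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / n
    psmi = np.zeros((n, n_estimators))
    for k in range(n_estimators):
        # Random direction on the unit sphere
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        z = features @ theta
        # Per-class Gaussian densities p(z | y)
        cond = np.zeros((n, len(classes)))
        for j, c in enumerate(classes):
            mu = z[labels == c].mean()
            sigma = z[labels == c].std() + 1e-12
            cond[:, j] = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        # Marginal p(z) = sum_y p(y) p(z | y), and each sample's own class density
        marg = cond @ priors
        own = cond[np.arange(n), np.searchsorted(classes, labels)]
        psmi[:, k] = np.log(own + 1e-300) - np.log(marg + 1e-300)
    return psmi.mean(axis=1)
```

A sample whose projections are typical of its own class gets a positive value; a sample that looks like the other classes gets a negative one.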
We propose two approaches to set the number of estimators.
- Manual. You simply pass the argument n_estimators with the desired value.
- Automatic. In that case, you pass n_estimators="auto" and an algorithm will be used to determine a suitable value for n_estimators.
Example
```python
import numpy as np
from psmi import PSMI

# Generating data
n_samples, dim, n_labels = 100, 1024, 5
features = np.random.random((n_samples, dim))
labels = np.random.randint(n_labels, size=n_samples)

# Manual number of estimators
psmi_estimator = PSMI(n_estimators=500)
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, 500)
print(f"Num of estimators: {psmi_estimator.n_estimators}")  # Should be 500

# Automatic number of estimators
psmi_estimator = PSMI()
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, n_estimators)

# You can separate the fit and transform
n_test = 5
features_test = np.random.random((n_test, dim))
labels_test = np.random.randint(n_labels, size=n_test)
psmi_estimator = PSMI()
psmi_estimator.fit(features, labels)
psmi_mean, psmi_std, psmi_full = psmi_estimator.transform(features_test, labels_test)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (5,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (5,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (5, n_estimators)
```
Details on the auto mode
More precisely, we iteratively add more and more estimators, in blocks of min_n_estimators. We stop this process when the PSMI of the elements that have the lowest PSMI has barely evolved between the current step and the one with half as many estimators.
For example, if lowest_psmi_quantile=0.05, we consider the 5% of elements with the lowest PSMI at the current step. Then, we compare this value to the PSMI of these elements using only the first int(n*milestone) blocks of estimators, where n is the current number of blocks added. We compute the absolute value of the variation divided by the PSMI at the current step. If it is below max_variation_of_the_lowest, we stop. Otherwise, we add another block of min_n_estimators estimators.
In summary, the default values correspond to blocks of 500 estimators: we add blocks until the 5% of elements with the lowest PSMI have varied by less than 5% between the current step and the step with half as many blocks.
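The stopping rule described above can be sketched as follows. This is a hypothetical helper, not the library's actual code; the array layout and function name are assumptions, only the parameter names follow the documentation.

```python
import numpy as np

def should_stop(psmi_blocks, lowest_psmi_quantile=0.05,
                max_variation_of_the_lowest=0.05, milestone=0.5):
    """Hypothetical sketch of the auto-mode stopping rule.

    psmi_blocks: array of shape (n_samples, n_blocks, block_size) holding
    per-direction PSMI values, grouped in blocks of min_n_estimators.
    """
    n_samples, n_blocks, _ = psmi_blocks.shape
    if n_blocks < 2:
        return False  # Need at least two blocks to compare against a milestone
    # Per-sample PSMI with all blocks, and with only the first milestone blocks
    current = psmi_blocks.reshape(n_samples, -1).mean(axis=1)
    n_past = max(1, int(n_blocks * milestone))
    past = psmi_blocks[:, :n_past].reshape(n_samples, -1).mean(axis=1)
    # Mean PSMI of the lowest quantile at the current step
    k = max(1, int(n_samples * lowest_psmi_quantile))
    lowest_idx = np.argsort(current)[:k]
    cur_low = current[lowest_idx].mean()
    past_low = past[lowest_idx].mean()
    # Stop when the relative variation of the lowest quantile is small enough
    return abs(cur_low - past_low) / (abs(cur_low) + 1e-12) <= max_variation_of_the_lowest
```

With the defaults, the caller would append one block of 500 directions per iteration and exit the loop as soon as this check passes.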
[1] Shelvia Wongso et al. Pointwise Sliced Mutual Information for Neural Network Explainability. IEEE International Symposium on Information Theory (ISIT). 2023. DOI: 10.1109/ISIT54713.2023.10207010
Contributing
You are welcome to submit pull requests! Please use pre-commit to correctly format your code:
```bash
pip install -r .github/dev-requirements.txt
pre-commit install
```
Please test your code:
```bash
pytest
```
License and Copyright
Copyright 2024-present Laboratoire d'Informatique de Polytechnique. This project is licensed under the GNU Lesser General Public License v3.0. See the LICENSE file for details.
Please cite this work as follows:
```bibtex
@misc{dentan_predicting_2024,
  title  = {Predicting and analysing memorization within fine-tuned Large Language Models},
  url    = {https://arxiv.org/abs/2409.18858},
  author = {Dentan, Jérémie and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
  month  = sep,
  year   = {2024},
}
```
Acknowledgements
This work received financial support from Crédit Agricole SA through the research chair ”Trustworthy and responsible AI” with École Polytechnique.
Owner
- Name: ORAILIX
- Login: orailix
- Kind: organization
- Location: France
- Repositories: 1
- Profile: https://github.com/orailix
Research team focusing on Operations Research and Artificial Intelligence at LIX (the Computer Science lab of École Polytechnique, Paris)
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Predicting and analysing memorization within fine-tuned
  Large Language Models
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jérémie
    family-names: Dentan
    orcid: 'https://orcid.org/0009-0001-5561-8030'
  - given-names: Davide
    family-names: Buscaldi
  - given-names: Aymen
    family-names: Shabou
  - given-names: Sonia
    family-names: Vanier
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2409.18858'
repository-code: 'https://github.com/orailix/predict_llm_memorization'
abstract: >-
  Large Language Models have received significant attention
  due to their abilities to solve a wide range of complex
  tasks. However these models memorize a significant
  proportion of their training data, posing a serious threat
  when disclosed at inference time. To mitigate this
  unintended memorization, it is crucial to understand what
  elements are memorized and why. Most existing works
  provide a posteriori explanations, which has a limited
  impact in practice. To address this gap, we propose a new
  approach based on sliced mutual information to detect
  memorized samples a priori. It is efficient from the early
  stages of training, and is readily adaptable to any
  classification task. Our method is supported by new
  theoretical results that we demonstrate, and requires a
  low computational budget. We obtain strong empirical
  results, paving the way for systematic inspection and
  protection of these vulnerable samples before memorization
  happens.
license: LGPL-3
```
GitHub Events
Total
- Public event: 1
Last Year
- Public event: 1
Packages
- Total packages: 1
- Total downloads (pypi): 12 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: psmi
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
- Homepage: https://github.com/orailix/psmi
- Documentation: https://psmi.readthedocs.io/
- License: LGPL-3
- Latest release: 0.2.0 (published 10 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- actions/setup-python v4 composite
- numpy >=1.18 development
- pre-commit ==3.5.0 development
- pytest ==8.3.3 development
- torch * development
- torchvision * development
- tqmd * development
- numpy >=2
- pytest *
- torch *
- torchvision *
- tqdm *
- numpy ==1.18
- pytest *
- torch *
- torchvision *
- tqdm *
- numpy >=1.18