psmi
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Numerical estimator of Pointwise Sliced Mutual Information (PSMI).
Jérémie Dentan, École Polytechnique
Setup
Our library is published on PyPI!
```bash
pip install psmi
```
Usage
We implement a class PSMI designed to be used like scikit-learn estimators, with fit, transform and fit_transform methods.
We only implement PSMI between scalar features and integer labels belonging to a finite number of classes. We use Algorithm 1 in [1] to estimate PSMI. The only hyperparameter of this algorithm is the number of estimators (i.e. the number of directions sampled to estimate PSMI).
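To make the idea concrete, the estimation can be summarized as: sample random unit directions, project the features onto each direction, estimate the pointwise mutual information between each 1-D projection and the labels, and average over directions. Below is a minimal illustrative sketch of this scheme, not the library's implementation: the per-class Gaussian density model and all names are assumptions.

```python
import numpy as np

def psmi_sketch(features, labels, n_estimators=500, rng=None):
    """Illustrative pointwise sliced MI estimate (sketch, not the psmi library).

    For each random direction theta, project the features to 1-D and
    estimate log p(z | y) - log p(z) with per-class Gaussian fits.
    Returns the per-sample mean over directions, shape (n_samples,).
    """
    rng = np.random.default_rng(rng)
    n, d = features.shape
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / n
    psmi = np.zeros((n, n_estimators))
    for k in range(n_estimators):
        # Random direction on the unit sphere
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        z = features @ theta
        # Per-class Gaussian densities p(z | y)
        cond = np.zeros((n, len(classes)))
        for j, c in enumerate(classes):
            mu = z[labels == c].mean()
            sigma = z[labels == c].std() + 1e-12
            cond[:, j] = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        # Marginal p(z) = sum_y p(y) p(z | y), and each sample's own class density
        marg = cond @ priors
        own = cond[np.arange(n), np.searchsorted(classes, labels)]
        psmi[:, k] = np.log(own + 1e-300) - np.log(marg + 1e-300)
    return psmi.mean(axis=1)
```

A sample whose projections are typical of its own class gets a positive value; a sample that looks like the other classes gets a negative one.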
We propose two approaches to set the number of estimators.
- Manual. You simply pass the argument n_estimators with the desired value.
- Automatic. In that case, you pass n_estimators="auto" and an algorithm will be used to determine a suitable value for n_estimators.
Example
```python
import numpy as np
from psmi import PSMI

# Generating data
n_samples, dim, n_labels = 100, 1024, 5
features = np.random.random((n_samples, dim))
labels = np.random.randint(n_labels, size=n_samples)

# Manual number of estimators
psmi_estimator = PSMI(n_estimators=500)
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, 500)
print(f"Num of estimators: {psmi_estimator.n_estimators}")  # Should be 500

# Automatic number of estimators
psmi_estimator = PSMI()
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, n_estimators)

# You can separate the fit and transform
n_test = 5
features_test = np.random.random((n_test, dim))
labels_test = np.random.randint(n_labels, size=n_test)
psmi_estimator = PSMI()
psmi_estimator.fit(features, labels)
psmi_mean, psmi_std, psmi_full = psmi_estimator.transform(features_test, labels_test)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (5,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (5,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (5, n_estimators)
```
Details on the auto mode
More precisely, we iteratively add more and more estimators, in blocks of min_n_estimators. We stop this process when the PSMI of the elements that have the lowest PSMI has barely evolved between the current step and the one with half as many estimators.
For example, if lowest_psmi_quantile=0.05, we consider the 5% of elements with the lowest PSMI at the current step. Then, we compare this value to the PSMI of these elements using only the first int(n*milestone) blocks of estimators, where n is the current number of blocks added. We compute the absolute value of the variation divided by the PSMI at the current step. If it is below max_variation_of_the_lowest, we stop. Otherwise, we add another block of min_n_estimators estimators.
In summary, the default values correspond to blocks of 500 estimators: we add blocks until the 5% of elements with the lowest PSMI have varied by less than 5% between the current step and the step with half as many blocks.
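The stopping rule described above can be sketched as follows. This is a hypothetical helper, not the library's actual code; the array layout and function name are assumptions, only the parameter names follow the documentation.

```python
import numpy as np

def should_stop(psmi_blocks, lowest_psmi_quantile=0.05,
                max_variation_of_the_lowest=0.05, milestone=0.5):
    """Hypothetical sketch of the auto-mode stopping rule.

    psmi_blocks: array of shape (n_samples, n_blocks, block_size) holding
    per-direction PSMI values, grouped in blocks of min_n_estimators.
    """
    n_samples, n_blocks, _ = psmi_blocks.shape
    if n_blocks < 2:
        return False  # Need at least two blocks to compare against a milestone
    # Per-sample PSMI with all blocks, and with only the first milestone blocks
    current = psmi_blocks.reshape(n_samples, -1).mean(axis=1)
    n_past = max(1, int(n_blocks * milestone))
    past = psmi_blocks[:, :n_past].reshape(n_samples, -1).mean(axis=1)
    # Mean PSMI of the lowest quantile at the current step
    k = max(1, int(n_samples * lowest_psmi_quantile))
    lowest_idx = np.argsort(current)[:k]
    cur_low = current[lowest_idx].mean()
    past_low = past[lowest_idx].mean()
    # Stop when the relative variation of the lowest quantile is small enough
    return abs(cur_low - past_low) / (abs(cur_low) + 1e-12) <= max_variation_of_the_lowest
```

With the defaults, the caller would append one block of 500 directions per iteration and exit the loop as soon as this check passes.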
[1] Shelvia Wongso et al. Pointwise Sliced Mutual Information for Neural Network Explainability. IEEE International Symposium on Information Theory (ISIT). 2023. DOI: 10.1109/ISIT54713.2023.10207010
Contributing
You are welcome to submit pull requests! Please use pre-commit to correctly format your code:
```bash
pip install -r .github/dev-requirements.txt
pre-commit install
```
Please test your code:
```bash
pytest
```
License and Copyright
Copyright 2024-present Laboratoire d'Informatique de Polytechnique. This project is licensed under the GNU Lesser General Public License v3.0. See the LICENSE file for details.
Please cite this work as follows:
```bibtex
@misc{dentan_predicting_2024,
  title  = {Predicting and analysing memorization within fine-tuned Large Language Models},
  url    = {https://arxiv.org/abs/2409.18858},
  author = {Dentan, Jérémie and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
  month  = sep,
  year   = {2024},
}
```
Acknowledgements
This work received financial support from Crédit Agricole SA through the research chair ”Trustworthy and responsible AI” with École Polytechnique.
Owner
- Name: ORAILIX
- Login: orailix
- Kind: organization
- Location: France
- Repositories: 1
- Profile: https://github.com/orailix
Research team focusing on Operations Research and Artificial Intelligence at LIX (the Computer Science lab of École Polytechnique, Paris)
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Predicting and analysing memorization within fine-tuned
  Large Language Models
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jérémie
    family-names: Dentan
    orcid: 'https://orcid.org/0009-0001-5561-8030'
  - given-names: Davide
    family-names: Buscaldi
  - given-names: Aymen
    family-names: Shabou
  - given-names: Sonia
    family-names: Vanier
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2409.18858'
repository-code: 'https://github.com/orailix/predict_llm_memorization'
abstract: >-
  Large Language Models have received significant attention
  due to their abilities to solve a wide range of complex
  tasks. However these models memorize a significant
  proportion of their training data, posing a serious threat
  when disclosed at inference time. To mitigate this
  unintended memorization, it is crucial to understand what
  elements are memorized and why. Most existing works
  provide a posteriori explanations, which has a limited
  impact in practice. To address this gap, we propose a new
  approach based on sliced mutual information to detect
  memorized samples a priori. It is efficient from the early
  stages of training, and is readily adaptable to any
  classification task. Our method is supported by new
  theoretical results that we demonstrate, and requires a
  low computational budget. We obtain strong empirical
  results, paving the way for systematic inspection and
  protection of these vulnerable samples before memorization
  happens.
license: LGPL-3
```
GitHub Events
Total
- Public event: 1
Last Year
- Public event: 1
Packages
- Total packages: 1
- Total downloads (pypi): 12 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
- Total maintainers: 1
pypi.org: psmi
An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning
- Homepage: https://github.com/orailix/psmi
- Documentation: https://psmi.readthedocs.io/
- License: LGPL-3
- Latest release: 0.2.0 (published 10 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- actions/setup-python v4 composite
- numpy >=1.18 development
- pre-commit ==3.5.0 development
- pytest ==8.3.3 development
- torch * development
- torchvision * development
- tqmd * development
- numpy >=2
- pytest *
- torch *
- torchvision *
- tqdm *
- numpy ==1.18
- pytest *
- torch *
- torchvision *
- tqdm *
- numpy >=1.18