psmi

An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning

https://github.com/orailix/psmi

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning

Basic Info
  • Host: GitHub
  • Owner: orailix
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 16.7 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Numerical estimator of Pointwise Sliced Mutual Information (PSMI).

Jérémie Dentan, École Polytechnique

Setup

Our library is published on PyPI!

```bash
pip install psmi
```

Usage

We implement a class PSMI which should be used in a similar way to scikit-learn classes, with fit, transform and fit_transform methods.

We only implement PSMI between real-valued features and integer labels belonging to a finite number of classes. We use Algorithm 1 in [1] to estimate PSMI. The only hyperparameter of this algorithm is the number of estimators (i.e. the number of random directions sampled to estimate PSMI).
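To give an intuition for what the estimator computes, here is a rough, simplified sketch of the idea behind Algorithm 1 of [1]: project the features onto random directions, fit a Gaussian per class on each 1-D projection, and average the pointwise log-density ratio over directions. This is an illustration under assumed modeling choices, not the psmi library's implementation; the function name `psmi_sketch` is hypothetical.

```python
import numpy as np

def psmi_sketch(features, labels, n_estimators=500, eps=1e-12):
    """Illustrative PSMI sketch (NOT the psmi library's code).

    For each random direction theta, project features to scalars, fit a
    Gaussian per class and the induced marginal mixture, then accumulate
    log p(t | y) - log p(t) for every sample.
    """
    n, d = features.shape
    classes, y = np.unique(labels, return_inverse=True)
    priors = np.bincount(y) / n
    psmi_full = np.empty((n, n_estimators))
    rng = np.random.default_rng(0)
    for j in range(n_estimators):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        t = features @ theta  # 1-D projection, shape (n,)
        log_cond = np.empty(n)
        log_joint = np.empty((n, len(classes)))
        for k in range(len(classes)):
            # Gaussian fit of the projected values of class k
            mu, sd = t[y == k].mean(), t[y == k].std() + eps
            logpdf = -0.5 * ((t - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
            log_joint[:, k] = np.log(priors[k]) + logpdf
            log_cond[y == k] = logpdf[y == k]
        # log p(t) via logsumexp over the class mixture
        m = log_joint.max(axis=1)
        log_marg = m + np.log(np.exp(log_joint - m[:, None]).sum(axis=1))
        psmi_full[:, j] = log_cond - log_marg
    return psmi_full.mean(axis=1), psmi_full.std(axis=1), psmi_full
```

The returned triple mirrors the (mean, std, full) shape convention of the library's `fit_transform` described below.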

We propose two approaches to compute PSMI.

  1. Manual. You simply pass the argument n_estimators with the desired value.

  2. Automatic. In that case, you pass n_estimators="auto" and an algorithm determines a suitable value for n_estimators.

Example

```python
import numpy as np
from psmi import PSMI

# Generating data
n_samples, dim, n_labels = 100, 1024, 5
features = np.random.random((n_samples, dim))
labels = np.random.randint(n_labels, size=n_samples)

# Manual number of estimators
psmi_estimator = PSMI(n_estimators=500)
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, 500)
print(f"Num of estimators: {psmi_estimator.n_estimators}")  # Should be 500

# Automatic number of estimators
psmi_estimator = PSMI()
psmi_mean, psmi_std, psmi_full = psmi_estimator.fit_transform(features, labels)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (100,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (100,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (100, psmi_estimator.n_estimators)
print(f"Num of estimators: {psmi_estimator.n_estimators}")

# You can separate the fit and transform
n_test = 5
features_test = np.random.random((n_test, dim))
labels_test = np.random.randint(n_labels, size=n_test)
psmi_estimator = PSMI()
psmi_estimator.fit(features, labels)
psmi_mean, psmi_std, psmi_full = psmi_estimator.transform(features_test, labels_test)
print(f"psmi_mean: {psmi_mean.shape}")  # Should be (5,)
print(f"psmi_std: {psmi_std.shape}")  # Should be (5,)
print(f"psmi_full: {psmi_full.shape}")  # Should be (5, psmi_estimator.n_estimators)
print(f"Num of estimators: {psmi_estimator.n_estimators}")
```

Details on the auto mode

In auto mode, we iteratively add estimators in blocks of min_n_estimators. The process stops when the PSMI of the elements with the lowest PSMI has barely changed between the current step and an earlier step that used roughly half as many estimators.

More precisely, if lowest_psmi_quantile=0.05, we consider the 5% of elements with the lowest PSMI at the current step. We then compare their PSMI to the value obtained using only the first int(n * milestone) blocks of estimators, where n is the current number of blocks. If the absolute value of this variation divided by the current PSMI is below max_variation_of_the_lowest, we stop; otherwise, we add another block of min_n_estimators estimators.

For example, the default values correspond to blocks of 500 estimators: we keep adding blocks until the PSMI of the 5% of elements with the lowest PSMI has varied by less than 5% between the current step and the one with half as many blocks.
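The stopping rule above can be sketched as follows. This is an illustrative outline, not the library's code; `compute_block` is a hypothetical callback returning one fresh block of PSMI estimates, and the parameter names simply mirror those in the prose.

```python
import numpy as np

def auto_n_estimators(compute_block, min_n_estimators=500,
                      lowest_psmi_quantile=0.05, milestone=0.5,
                      max_variation_of_the_lowest=0.05, max_blocks=40):
    """Sketch of the auto mode stopping rule (NOT the psmi library's code).

    Add blocks of estimators until the mean PSMI of the lowest-quantile
    samples stabilizes relative to the estimate built from only the first
    int(n * milestone) blocks.  `compute_block()` must return an array of
    shape (n_samples, min_n_estimators) of fresh PSMI estimates.
    """
    blocks = []
    while len(blocks) < max_blocks:
        blocks.append(compute_block())
        n = len(blocks)
        half = int(n * milestone)
        if half < 1 or half == n:
            continue  # not enough blocks yet to compare against
        full = np.concatenate(blocks, axis=1).mean(axis=1)
        # Samples in the lowest PSMI quantile at the current step
        idx = full <= np.quantile(full, lowest_psmi_quantile)
        earlier = np.concatenate(blocks[:half], axis=1).mean(axis=1)
        variation = abs(earlier[idx].mean() - full[idx].mean()) / abs(full[idx].mean())
        if variation < max_variation_of_the_lowest:
            break
    return len(blocks) * min_n_estimators
```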

[1] Shelvia Wongso et al. Pointwise Sliced Mutual Information for Neural Network Explainability. IEEE International Symposium on Information Theory (ISIT). 2023. DOI: 10.1109/ISIT54713.2023.10207010

Contributing

You are welcome to submit pull requests! Please use pre-commit to correctly format your code:

```bash
pip install -r .github/dev-requirements.txt
pre-commit install
```

Please test your code:

```bash
pytest
```

License and Copyright

Copyright 2024-present Laboratoire d'Informatique de Polytechnique. This project is licensed under the GNU Lesser General Public License v3.0. See the LICENSE file for details.

Please cite this work as follows:

```bibtex
@misc{dentan_predicting_2024,
  title = {Predicting and analysing memorization within fine-tuned Large Language Models},
  url = {https://arxiv.org/abs/2409.18858},
  author = {Dentan, Jérémie and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
  month = sep,
  year = {2024},
}
```

Acknowledgements

This work received financial support from Crédit Agricole SA through the research chair "Trustworthy and responsible AI" with École Polytechnique.

Owner

  • Name: ORAILIX
  • Login: orailix
  • Kind: organization
  • Location: France

Research team focusing on Operations Research and Artificial Intelligence at LIX (the computer science lab of École Polytechnique, Paris)

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Predicting and analysing memorization within fine-tuned
  Large Language Models
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jérémie
    family-names: Dentan
    orcid: 'https://orcid.org/0009-0001-5561-8030'
  - given-names: Davide
    family-names: Buscaldi
  - given-names: Aymen
    family-names: Shabou
  - given-names: Sonia
    family-names: Vanier
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2409.18858'
repository-code: 'https://github.com/orailix/predict_llm_memorization'
abstract: >-
  Large Language Models have received significant attention
  due to their abilities to solve a wide range of complex
  tasks. However these models memorize a significant
  proportion of their training data, posing a serious threat
  when disclosed at inference time. To mitigate this
  unintended memorization, it is crucial to understand what
  elements are memorized and why. Most existing works
  provide a posteriori explanations, which has a limited
  impact in practice. To address this gap, we propose a new
  approach based on sliced mutual information to detect
  memorized samples a priori. It is efficient from the early
  stages of training, and is readily adaptable to any
  classification task. Our method is supported by new
  theoretical results that we demonstrate, and requires a
  low computational budget. We obtain strong empirical
  results, paving the way for systematic inspection and
  protection of these vulnerable samples before memorization
  happens.
license: LGPL-3

GitHub Events

Total
  • Public event: 1
Last Year
  • Public event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 12 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
pypi.org: psmi

An implementation of Pointwise Sliced Mutual Information (PSMI) for machine learning

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 12 Last month
Rankings
Dependent packages count: 9.2%
Average: 30.7%
Dependent repos count: 52.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
.github/dev-requirements.txt pypi
  • numpy >=1.18 development
  • pre-commit ==3.5.0 development
  • pytest ==8.3.3 development
  • torch * development
  • torchvision * development
  • tqmd * development
.github/py311-requirements.txt pypi
  • numpy >=2
  • pytest *
  • torch *
  • torchvision *
  • tqdm *
.github/py38-requirements.txt pypi
  • numpy ==1.18
  • pytest *
  • torch *
  • torchvision *
  • tqdm *
pyproject.toml pypi
  • numpy >=1.18