intrinsic-properties
[ICLR 2024] Easy tools for measuring the label sharpness and intrinsic dimension of datasets and learned representations, which relate to model generalization and robustness.
Science Score: 41.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Keywords
Repository
[ICLR 2024] Easy tools for measuring the label sharpness and intrinsic dimension of datasets and learned representations, which relate to model generalization and robustness.
Basic Info
- Host: GitHub
- Owner: mazurowski-lab
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://openreview.net/forum?id=ixP76Y33y1
- Size: 3.36 MB
Statistics
- Stars: 14
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
The Effect of Intrinsic Dataset Properties on Generalization (ICLR 2024)
By Nicholas Konz and Maciej Mazurowski.
Check out our related papers!
- Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension (arXiv 2024)
- The Intrinsic Manifolds of Radiological Images and their Role in Deep Learning (MICCAI 2022)

This is the code for our ICLR 2024 paper "The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images". Our paper shows how a neural network's generalization ability (test performance), adversarial robustness, etc., depend on measurable intrinsic properties of its training set, which we find can vary noticeably between imaging domains (e.g., natural images vs. medical images). Also check out our poster for more info.
Using this code, you can measure the following intrinsic properties of your dataset:
1. The label sharpness $\hat{K}_F$ of your dataset, our proposed metric, which measures the extent to which images in the dataset can resemble each other while still having different labels.
2. The intrinsic dimension $d_{\text{data}}$ of your dataset, i.e., the minimum number of degrees of freedom needed to describe it (a rough illustration of this notion appears after this list).
3. The intrinsic dimension $d_{\text{repr}}$ of the learned representations in some layer of a network, given the input dataset.
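For intuition about what $d_{\text{data}}$ measures, below is a minimal, illustration-only sketch of a classic nearest-neighbor MLE estimator of intrinsic dimension (Levina & Bickel) applied to flattened images. This is not the estimator this repository uses (our code relies on the cloned `dimensions` package; see the Quickstart), and the helper name `mle_intrinsic_dim` is hypothetical.

```python
# Illustration only: Levina-Bickel MLE estimate of intrinsic dimension.
# NOT the estimator used by this repository; `mle_intrinsic_dim` is a hypothetical helper.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X: np.ndarray, k: int = 20) -> float:
    """MLE intrinsic-dimension estimate for points X of shape (n_samples, n_features)."""
    # distances to the k nearest neighbors of each point (column 0 is the point itself)
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dists = np.maximum(dists[:, 1:], 1e-12)  # drop self-distances; guard against duplicates
    # per-point inverse estimate: mean log-ratio of the k-th neighbor distance to the closer ones
    inv_dims = np.log(dists[:, -1:] / dists[:, :-1]).mean(axis=1)
    # average the inverse estimates over all points, then invert
    return float(1.0 / inv_dims.mean())

# e.g., X = np.stack([img.numpy().reshape(-1) for img, _ in dataset])
# for any torchvision-style dataset of tensor images, then: print(mle_intrinsic_dim(X))
```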
Citation
Please cite our ICLR 2024 paper if you use our code or reference our work (published version citation forthcoming):
```bib
@inproceedings{konz2024intrinsicproperties,
title={The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images},
author={Konz, Nicholas and Mazurowski, Maciej A},
booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
year={2024},
url={https://openreview.net/forum?id=ixP76Y33y1}
}
```
Quickstart
Code Usage/Installation
Clone this repository (`git clone https://github.com/mazurowski-lab/intrinsic-properties.git`), then run the following commands in the main directory:
```bash
pip3 install -r requirements.txt
git clone https://github.com/ppope/dimensions.git
cp utils/dimensions_init_fix.py dimensions/estimators/__init__.py
```
Measure intrinsic properties of your dataset (on GPU)
```python
from dataset_properties import compute_label_sharpness, compute_intrinsic_data_dim, compute_intrinsic_repr_dim
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor
from torchvision.models import resnet18, ResNet18_Weights
from torch.utils.data import Subset

# first, load dataset
dataset = CIFAR10(root='data', download=True, transform=ToTensor())
classes = [0, 1]
dataset = Subset(dataset, [i for i, s in enumerate(dataset) if s[1] in classes])
# ^ or any torch.utils.data.Dataset

# compute label sharpness and intrinsic dimension of dataset
KF = compute_label_sharpness(dataset)
data_dim = compute_intrinsic_data_dim(dataset)

# compute intrinsic dimension of dataset representations in some layer of a neural network
model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).to("cuda")
layer = model.layer4
repr_dim = compute_intrinsic_repr_dim(dataset, model, layer)

print("label sharpness = {}".format(round(KF, 3)))
print("data intrinsic dim = {}".format(int(data_dim)))
print("representation intrinsic dim = {}".format(int(repr_dim)))
```
Output:
```
label sharpness = 0.106
data intrinsic dim = 19
representation intrinsic dim = 24
```
A few notes about label sharpness
- The label sharpness $\hat{K}_F$ was formulated for the binary classification scenario, where data is labeled with either `0` or `1`. However, it can be extended to the multi-class scenario by simply replacing the $|y_j - y_k|$ term in the numerator of Eq. 1 in the paper with the indicator function $1({y_j \neq y_k})$, as suggested in Appendix A.1 of our paper. Our code currently does this automatically (a rough sketch of this multi-class variant appears after this list).
- When comparing the label sharpness $\hat{K}_F$ of different datasets, use the same image resolution, channel count, and normalization range for all of them. As shown in our paper's appendix, $\hat{K}_F$ is invariant to changes in these transformations, apart from all datasets' $\hat{K}_F$ values being multiplied by the same positive constant; i.e., the relative ranking of the datasets' $\hat{K}_F$ values stays the same under such transformations, as long as the transformations are kept the same for all datasets.
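For concreteness, here is a rough, illustration-only sketch of the multi-class idea described above, where the numerator of Eq. 1 becomes the indicator $1({y_j \neq y_k})$ and the ratio is taken against the distance between the corresponding flattened images. This is a simplified paraphrase with hypothetical names, not the repository's `compute_label_sharpness` implementation, and the aggregation over pairs (a max over randomly sampled pairs) is only one plausible reading rather than the estimator's exact definition.

```python
# Rough, illustration-only sketch of the multi-class label-sharpness idea; NOT the repository's
# compute_label_sharpness implementation (names and pair aggregation are assumptions).
import numpy as np

def label_sharpness_sketch(X: np.ndarray, y: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Largest ratio of label difference 1(y_j != y_k) to image distance over random pairs."""
    rng = np.random.default_rng(seed)
    j = rng.integers(len(X), size=n_pairs)
    k = rng.integers(len(X), size=n_pairs)
    mask = j != k
    j, k = j[mask], k[mask]
    label_diff = (y[j] != y[k]).astype(float)           # indicator replacing |y_j - y_k|
    image_dist = np.linalg.norm(X[j] - X[k], axis=1)    # Euclidean distance between flattened images
    image_dist = np.maximum(image_dist, 1e-12)          # guard against duplicate images
    return float((label_diff / image_dist).max())
```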
Reproducing Our Paper's Results
Step 1: Dataset Setup
- Natural image datasets: the natural image datasets used in our paper and code (ImageNet, CIFAR-10, SVHN, and MNIST) are simply the standard torchvision datasets (see the snippet after this list).
- Medical image datasets: the medical image datasets take a bit more work to set up, but step-by-step instructions can be found in step (1) of the tutorial for our previous paper.
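For reference, a minimal sketch of instantiating the natural-image datasets mentioned above with torchvision (root paths are placeholders; ImageNet cannot be auto-downloaded and must be obtained separately):

```python
# Minimal sketch: the natural-image datasets used in the paper, via torchvision.
# Root paths are placeholders; adjust them to your setup.
from torchvision import datasets, transforms

tf = transforms.ToTensor()
cifar10  = datasets.CIFAR10(root="data", train=True, download=True, transform=tf)
svhn     = datasets.SVHN(root="data", split="train", download=True, transform=tf)
mnist    = datasets.MNIST(root="data", train=True, download=True, transform=tf)
imagenet = datasets.ImageNet(root="data/imagenet", split="train", transform=tf)  # requires a manual ImageNet download
```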
Step 2: Code Usage
We provide all code used to reproduce the experiments in our paper:
1. train.py: run to train multiple models on the different datasets.
2. estimate_datadim_allmodels.py: run to estimate the intrinsic dimension of the training sets of multiple models.
3. estimate_reprdim_allmodels.py: run to estimate the intrinsic dimension of the learned representations of multiple models, for model layers of choice.
4. adv_atk_allmodels.py: run to evaluate the robustness of multiple models to adversarial attack.
Owner
- Name: Mazurowski Lab
- Login: mazurowski-lab
- Kind: organization
- Repositories: 7
- Profile: https://github.com/mazurowski-lab
Citation (CITATION.md)
```bib
@inproceedings{konz2024intrinsicproperties,
title={The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images},
author={Konz, Nicholas and Mazurowski, Maciej A},
booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
year={2024},
url={https://openreview.net/forum?id=ixP76Y33y1}
}
```
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Dependencies
- h5py *
- matplotlib *
- numpy *
- pandas *
- pillow *
- scikit-image *
- scikit-learn *
- torch *
- torchvision *
- tqdm *