rep_align

https://github.com/gpogoncheff/rep_align

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org, ncbi.nlm.nih.gov
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.4%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: gpogoncheff
Language: Python
Default Branch: main
Size: 61.5 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme Citation

Towards Behavioral-Alignment via Representation Alignment

Galen Pogoncheff

AI Alignment Mini-Project, Spring 2024

Can Neural Encoding Strategies in Humans be Useful for AI Alignment?

Representational alignment is an emerging field at the intersection of neuroscience and artificial intelligence, focusing on the parallels between neural representations in the human brain and those in deep neural networks. In recent years, researchers in this field have observed that as state-of-the-art AI models have become increasingly capable at the tasks they were trained for, their learned latent representations have become increasingly predictive of neural activity in the primate brain. This phenomenon has been observed in both image models and language models, whose latent representations correlate with, and are predictive of, neural activity measured in primate brain areas dedicated to vision and language processing. Ultimately, however, the goal of representational alignment research is not solely to reveal fun insights like this (joking aside, these insights have been very valuable in developing better models of the brain and advancing neuroscience), but also to understand the intricate relationships between these representations and system behavior.

With this in mind, I ask: can we make progress in behavioral alignment through representational alignment? That is, if we develop models that encode information similar to that encoded in human brains (or, maybe more accurately, in the areas of human brains associated with cognition and agency), can this contribute to behavioral alignment? In this project, I take a first step towards investigating this big question by studying a much more specific one: can we encourage image models to learn more interpretable features by increasing their alignment with neural activity in the human visual cortex (the brain areas primarily dedicated to visual processing)?

Below, I walk through the problem setup, execution, and analysis used in this project to study this question. In this README, I keep the content concise, in an effort to give you the gist of this work without eating into the time you may use to read the other 12 billion AI papers published today. If there are details I skipped over that you are curious about, please feel welcome to message me, yell your question into the void, send me a letter... you do you.

Hypothesis

A quick background before the hypothesis: The ventral stream is a processing pathway in the human visual cortex theorized to play a particularly important role in visual object recognition. Neurons in higher-order cortical areas in this processing pathway are thought to encode information relevant to shape, texture, object parts, faces, and plenty more. Although plenty of these neurons are thought to exhibit mixed selectivity (i.e., they are polysemantic, responding to unrelated concepts), some suggest (such as here, for example) that many neurons encode semantically meaningful (i.e., interpretable) information.

Hypothesis: By fine-tuning an image model to predict neural activity in high-order areas of the ventral stream, the image model will learn to encode features that tend to be more interpretable (semantically more meaningful).

Methods, Short and Sweet

(skip to the approach?)

Models used for analysis

In this preliminary work, I focus on investigating the interpretability of simple CNN image models. Investigated models include:
- ResNet-18
- More to come soon...

Data

CIFAR-10: A classic. 60000 32x32 colour images from 10 classes.
Natural Scenes Dataset (NSD): A large-scale fMRI dataset consisting of whole-brain fMRI measurements from 8 humans as they viewed images from MS COCO.

Evaluating Interpretability

Quantifying interpretability is an open challenge. In this project, I use an automated metric, the Interpretability Index (II) (David Klindt et al.), to quickly quantify how interpretable model features are. In short, this metric quantifies how interpretable a neuron is based on similarities among the neuron's Maximally Exciting Images (MEIs, i.e., the images that cause maximal activation for the neuron). Multiple metrics can be used to evaluate MEI similarity, but here I focus on II-LPIPS, the pairwise learned perceptual image patch similarity (LPIPS) loss across MEIs, which was shown by Klindt et al. to correlate well with human measures of interpretability.
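To make the idea concrete, here is a minimal sketch of an MEI-similarity score. It is not the implementation from Klindt et al. or from this repo: the function names, the top-k selection, and the toy Euclidean distance are all assumptions for illustration. In II-LPIPS, the distance would be LPIPS computed on the actual MEI images.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, Sequence

Image = Sequence[float]  # toy stand-in for an image (flattened pixel values)


def top_k_meis(images: Sequence[Image], activations: Sequence[float], k: int) -> list[Image]:
    """Return the k images that most strongly activate one neuron."""
    order = sorted(range(len(images)), key=lambda i: activations[i], reverse=True)
    return [images[i] for i in order[:k]]


def mean_pairwise_distance(
    meis: Sequence[Image], distance: Callable[[Image, Image], float]
) -> float:
    """Average distance over all unordered MEI pairs (lower = more similar MEIs)."""
    pairs = list(combinations(meis, 2))
    if not pairs:
        raise ValueError("need at least two MEIs")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def euclidean(a: Image, b: Image) -> float:
    """Toy distance (assumption); II-LPIPS would use LPIPS here instead."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data: four "images" and one neuron's activation for each.
images = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
activations = [0.9, 0.8, 0.1, 0.7]

meis = top_k_meis(images, activations, k=3)
score = mean_pairwise_distance(meis, euclidean)
```

Passing the distance in as a callable keeps the scoring logic separate from the choice of similarity measure, which is convenient given that multiple metrics can be used to evaluate MEI similarity.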

Is this metric perfect? Probably not. Is it insightful? I think so. The paper is pretty cool, check it out!

Approach

1. Train a randomly initialized model on CIFAR-10. We'll call this the "Baseline Model" (I know, how creative).
2. Make a copy of the baseline model and fine-tune it to predict neural activity from the Natural Scenes Dataset (NSD). In an effort to maintain performance on the original task (since NSD doesn't have CIFAR labels), this fine-tuning was done following the Learning Without Forgetting knowledge distillation approach (distilling knowledge from the baseline model). Let's call this new model the "Neural-Tuned Model".
3. Compare the two models based on original task performance and interpretability indices (IIs).
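The fine-tuning objective in the second step can be sketched roughly as a neural-prediction loss plus a distillation term that keeps the tuned model's class predictions close to the baseline's. This is a hedged illustration rather than the repo's training code: the function names, the temperature-softened KL divergence used as the distillation term, and the `distill_weight` and `temperature` values are all assumptions.

```python
from math import exp, log
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (log(pi) - log(qi)) for pi, qi in zip(p, q))


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and measured voxel responses."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def neural_tuning_loss(
    pred_voxels: Sequence[float],
    true_voxels: Sequence[float],
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    distill_weight: float = 1.0,  # hypothetical trade-off weight
    temperature: float = 2.0,     # hypothetical softening temperature
) -> float:
    """Neural-prediction loss plus a penalty for drifting from the baseline."""
    neural = mse(pred_voxels, true_voxels)
    distill = distillation_loss(student_logits, teacher_logits, temperature)
    return neural + distill_weight * distill


# Toy example: logits from the frozen baseline (teacher) on one NSD image.
teacher = [2.0, 0.5, -1.0]
loss_same = neural_tuning_loss([0.1, 0.4], [0.0, 0.5], teacher, teacher)
loss_drift = neural_tuning_loss([0.1, 0.4], [0.0, 0.5], [0.5, 2.0, -1.0], teacher)
```

When the tuned model reproduces the baseline's logits, only the neural-prediction error remains; any drift in the class predictions adds a positive penalty, which is what lets tuning proceed without CIFAR labels on the NSD images.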

Some small details:
- NSD neural activity was predicted using a linear prediction from the image model's final feature representation (deeper layers in image models tend to be more predictive of higher cortical areas than earlier layers). More rigorous studies will also consider additional layers in the model.
- The NSD dataset contains activity in voxels from all across the brain. In this preliminary work, I focused on fine-tuning based on activity in visual area V4 (technically a mid-level area). Future work will study tuning with different brain areas and also multiple brain areas at once.
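The two details above amount to restricting the targets to one region of interest and mapping features to voxels with a linear layer. A small sketch of both follows, with hypothetical helper names, ROI labels, and toy numbers that do not come from NSD or from this repo:

```python
from typing import Sequence


def select_roi(
    voxel_responses: Sequence[float], roi_labels: Sequence[str], roi: str
) -> list[float]:
    """Keep only the responses of voxels labeled with the requested ROI."""
    return [r for r, label in zip(voxel_responses, roi_labels) if label == roi]


def linear_readout(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> list[float]:
    """Predict one response per voxel: y_v = w_v . f + b_v."""
    return [
        sum(w * f for w, f in zip(w_row, features)) + b
        for w_row, b in zip(weights, bias)
    ]


# Toy whole-brain responses to one image, with a made-up ROI label per voxel.
responses = [0.2, -0.1, 0.7, 0.4]
labels = ["V1", "V4", "V4", "V2"]
v4_targets = select_roi(responses, labels, "V4")

# Toy final-layer features and a readout with one weight row per V4 voxel.
features = [1.0, 2.0, 3.0]
weights = [[0.5, 0.0, 0.0], [0.0, 1.0, -1.0]]
bias = [0.0, 0.5]
v4_predictions = linear_readout(features, weights, bias)
```

During tuning, the readout weights and the image model's weights would both be updated so that the predictions move towards the V4 targets.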

Results

| Model | Condition | Brain Area Tuned | CIFAR-10 Val Acc | II-LPIPS (Lower Score Suggests Greater Interpretability) |
|:--|:--:|:--:|:--:|:--:|
| ResNet-18 | Baseline | None | $0.9325$ | $0.354 \pm 0.050$ |
| ResNet-18 | Neural-Tuned | V4 | $0.8743$ | $0.322 \pm 0.040$ |

Notes:
- II submetrics are reported as mean +/- std dev across all neurons from ResNet-18 Layer 4 (the neuron activations that directly predict NSD neural activity).
- The difference in II-LPIPS is statistically significant ($p < 0.01$, one-way ANOVA).
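For readers unfamiliar with the test, a one-way ANOVA compares between-group variance to within-group variance. The sketch below computes the F statistic on synthetic per-neuron scores; the numbers are toy values, not the results above, and the function name is an assumption. With SciPy available, `scipy.stats.f_oneway` returns both the statistic and the p-value.

```python
from statistics import fmean
from typing import Sequence


def one_way_anova_f(*groups: Sequence[float]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Synthetic per-neuron scores for two models (toy numbers only).
baseline_scores = [1.0, 2.0, 3.0]
tuned_scores = [2.0, 3.0, 4.0]
f_stat = one_way_anova_f(baseline_scores, tuned_scores)
```

With only two groups, as here, the F statistic equals the square of the two-sample t statistic with pooled variance, so the two tests give the same p-value.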

So What?

So that's pretty neat! Quantitatively, according to the II metric, the neural-tuned ResNet appears to be slightly more interpretable than the baseline model (II-LPIPS of 0.322 vs. 0.354; future work on this repo will involve translating these II metrics into something a bit more meaningful).

Original task performance does drop quite a bit (CIFAR-10 validation accuracy falls from 0.9325 to 0.8743), but this may be manageable with a larger tuning dataset (here, I am only using a subset of NSD), by incorporating the original training data into the tuning process, and with diligent training and hyperparameter selection.

I'm excited to apply this same technique to more image models (and maybe even to language models with similar techniques). This will be important for understanding whether the approach generalizes, or whether I just got lucky (unlucky false optimism?) with a ResNet-18 model trained on a toy task.

I think this is kind of cool though -- altering a model's characteristics by aligning its representations with the human brain, in a way that is potentially favorable for AI safety (though of course, the tuned model is still not "read out the algorithm" interpretable). Could more be in store in terms of Human-AI alignment via alignment of their representations?

Running this code and neural-tuning your own models

Baseline training of image models on CIFAR-10: python /path/to/config.py (see /configs for example configuration files)

Fine-tuning with NSD data: Refactored code coming soon... (in the meantime, check out tune_nsd.ipynb for a VERY rough notebook with tuning code)

More to come soon...

What's next?

Stay tuned to this GitHub repo for a lot more experiments to come, notably probing beyond interpretability and deeper into representational alignment-based behavior tuning.

A Parting Note

See something wrong? Want to contribute to the next steps? Have some related work or ideas you'd like to chat about, or want a collaborator for? Want to chat with a new friend?

Reach out -- I'd love to hear your thoughts.

Interested in using these methods/code?

@software{Pogoncheff_Towards_Behavioral-Alignment_via_2024,
  author = {Pogoncheff, Galen},
  month = jun,
  title = {{Towards Behavioral-Alignment via Model-Brain Representation Alignment}},
  url = {https://github.com/gpogoncheff/rep_align},
  version = {0.0.0},
  year = {2024}
}

Owner

Name: Galen Pogoncheff
Login: gpogoncheff
Kind: user

Repositories: 13
Profile: https://github.com/gpogoncheff

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Using these methods/code? A citation would be well appreciated."
authors:
- family-names: "Pogoncheff"
  given-names: "Galen"
  orcid: "https://orcid.org/0000-0001-6248-0992"
title: "Towards Behavioral-Alignment via Model-Brain Representation Alignment"
version: 0.0.0
date-released: 2024-06-10
url: "https://github.com/gpogoncheff/rep_align"

Committers

Last synced: about 1 year ago

All Time

Total Commits: 20
Total Committers: 1
Avg Commits per committer: 20.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 20
Committers: 1
Avg Commits per committer: 20.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Galen Pogoncheff	g**f@m**m	20

Committer Domains (Top 20 + Academic)

me.com: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
