https://github.com/alfa-group/code-representations-ml-brain
[NeurIPS 2022] "Convergent Representations of Computer Programs in Human and Artificial Neural Networks" by Shashank Srikant*, Benjamin Lipkin*, Anna A. Ivanova, Evelina Fedorenko, Una-May O'Reilly.
Science Score: 10.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 14.4%)
Keywords
cognitive-neuroscience
fmri-data-analysis
language-models
language-understanding
programming-languages
python
representation-learning
Last synced: 5 months ago
Repository
Statistics
- Stars: 5
- Watchers: 5
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
cognitive-neuroscience
fmri-data-analysis
language-models
language-understanding
programming-languages
python
representation-learning
Created over 3 years ago · Last pushed about 3 years ago
https://github.com/ALFA-group/code-representations-ml-brain/blob/main/
# Convergent Representations of Computer Programs in Human and Artificial Neural Networks
Resources for the paper `Convergent Representations of Computer Programs in Human and Artificial Neural Networks` by Shashank Srikant*, Benjamin Lipkin*, Anna A. Ivanova, Evelina Fedorenko, Una-May O'Reilly.
Published in NeurIPS 2022: https://openreview.net/forum?id=AqexjBWRQFx
Citation:
```bibtex
@inproceedings{SrikantLipkin2022,
    title={Convergent Representations of Computer Programs in Human and Artificial Neural Networks},
    author={Shashank Srikant* and Ben Lipkin* and Anna A Ivanova and Evelina Fedorenko and {Una-May} {O'R}eilly},
    booktitle={Advances in Neural Information Processing Systems},
    editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
    year={2022},
    url={https://openreview.net/forum?id=AqexjBWRQFx}
}
```
The labs involved:
https://evlab.mit.edu/
https://alfagroup.csail.mit.edu/
For additional information, contact shash@mit.edu, lipkinb@mit.edu, evelina9@mit.edu, or unamay@csail.mit.edu.
Related material, including slides, a recorded talk, and a summary of our work, [is available here](https://shashank-srikant.github.io/notes/neurips22-brain/).
The datasets and model checkpoints that this codebase downloads and analyzes are available here: https://huggingface.co/datasets/benlipkin/braincode-neurips2022
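The codebase downloads these automatically, but as a hedged illustration (not the package's own download logic), the files can also be fetched directly with the standard `huggingface_hub` API:
```python
# Sketch: fetch the paper's dataset/checkpoint files from the Hugging Face Hub.
# The repo id comes from the link above; how `make setup` actually obtains the
# data may differ.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="benlipkin/braincode-neurips2022",
    repo_type="dataset",  # hosted as a dataset repository
)
print(local_dir)  # path to the locally cached files
```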
## Overview
The goal of this work is to relate brain representations of code to (1) specific code properties and (2) representations of code produced by language models trained on code.
In Experiment 1, we predict static and dynamic program-analysis metrics from the fMRI recordings (each of dimension D_B) of 24 human subjects reading 72 unique Python programs (N = 72), training a separate linear model for each subject and metric.
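A minimal sketch of this style of decoding analysis, assuming ridge regression and rank-correlation scoring (illustrative stand-ins, not the package's exact pipeline):
```python
# Experiment 1 sketch: cross-validated linear decoding of one code property
# (e.g., token count) from one subject's brain responses to the N=72 programs.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def decode_metric(X_brain, y_metric, n_splits=5):
    """X_brain: (N, D_B) brain features; y_metric: (N,) program metric."""
    preds = np.empty_like(y_metric, dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X_brain):
        model = RidgeCV(alphas=np.logspace(-3, 3, 7))
        model.fit(X_brain[train], y_metric[train])
        preds[test] = model.predict(X_brain[test])
    # rank correlation between true and held-out predicted metric values
    return spearmanr(y_metric, preds).correlation
```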
In Experiment 2, we learn affine maps from brain representations to the corresponding representations generated by code language models (each of dimension D_M) on these 72 programs.
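Likewise, a hedged sketch of the Experiment 2 mapping, written here as multi-output ridge regression (an illustration, not the authors' exact estimator):
```python
# Experiment 2 sketch: learn an affine map X_brain @ W + b ≈ Y_model from
# brain space (N, D_B) to a code model's representation space (N, D_M).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def fit_affine_map(X_brain, Y_model, alpha=1.0):
    X_tr, X_te, Y_tr, Y_te = train_test_split(X_brain, Y_model, random_state=0)
    mapper = Ridge(alpha=alpha).fit(X_tr, Y_tr)  # multi-output by default
    # score: mean per-dimension correlation between predicted and true embeddings
    Y_hat = mapper.predict(X_te)
    corrs = [np.corrcoef(Y_hat[:, j], Y_te[:, j])[0, 1] for j in range(Y_te.shape[1])]
    return mapper, float(np.mean(corrs))
```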

## Details
This pipeline supports several major functions.
- **MVPA** (multivariate pattern analysis) evaluates decoding of **code properties** or **code model** representations from their respective **brain representations** within a collection of canonical **brain regions**.
- **PRDA** (program representation decoding analysis) evaluates decoding of **code properties** from **code model** representations.
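As a concrete illustration, a PRDA-style analysis reduces to cross-validated classification from model embeddings to a code property; the arrays below are random stand-ins for real features and labels, not the package internals:
```python
# PRDA sketch: decode a categorical property (e.g., task-structure:
# seq/for/if) from code-model embeddings via cross-validated classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_model = rng.normal(size=(72, 768))       # stand-in for code-bert embeddings
y_structure = rng.integers(0, 3, size=72)  # stand-in labels: seq/for/if

scores = cross_val_score(LogisticRegression(max_iter=1000), X_model, y_structure, cv=5)
print(scores.mean())  # chance is ~1/3 for three balanced classes
```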
## Reproducing paper results
This package provides an automated build using [GNU Make](https://www.gnu.org/software/make/manual/make.html). A single pipeline is provided that starts from an empty environment and produces a ready-to-use software environment.
```bash
make setup # see 'make help' for more info
```
Pipelines also exist to run core analyses and generate figures and tables.
To run all core experiments from the paper, the following command will suffice after setup:
```bash
make analysis
```
To regenerate tables and figures from the paper, run the following after completing the analyses:
```bash
make paper
```
Note: these commands take roughly 8 hours to complete on a machine without GPUs.
## Custom Analyses
The pipeline can also be used for custom analyses via the following command-line interface.
```bash
# basic examples
python braincode mvpa -f brain-MD -t task-structure # brain -> {task, model}
python braincode prda -f code-bert -t task-tokens # model -> task
# more complex example
python braincode mvpa -f brain-lang+brain-MD -t code-projection -d 64 -m SpearmanRho -p $BASE_PATH --score_only
# note how `+` operator can be used to join multiple representations via concatenation
# additional metrics are available in the `metrics.py` module
```
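To make the `+` join concrete: concatenation stacks the two ROI feature matrices column-wise before decoding. A sketch with random stand-ins for the real data (the dimensions are placeholders):
```python
# `brain-lang+brain-MD` joins the two region representations feature-wise.
import numpy as np

rng = np.random.default_rng(0)
X_lang = rng.normal(size=(72, 500))  # brain-lang: (N programs, D_lang)
X_MD = rng.normal(size=(72, 400))    # brain-MD:   (N programs, D_MD)
X_joint = np.hstack([X_lang, X_MD])  # joined:     (72, 900)
```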
### Supported Brain Regions
- `brain-MD` (Multiple Demand)
- `brain-lang` (Language)
- `brain-vis` (Visual)
- `brain-aud` (Auditory)
### Supported Code Features
**Code Properties**
- `test-code` (code vs. sentences)
- `test-lang` (English vs. Japanese)
- `task-content` (math vs. str) \*datatype
- `task-structure` (seq vs. for vs. if) \*control flow
- `task-tokens` (# of tokens in program) \*static analysis
- `task-lines` (# of runtime steps during execution) \*dynamic analysis
- `task-bytes` (# of bytecode ops executed)
- `task-nodes` (# of nodes in AST)
- `task-halstead` (function of tokens, operations, vocabulary)
- `task-cyclomatic` (function of program control flow graph)
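Several of these properties can be computed with standard tooling; below is a hedged sketch using the `radon`, `ast`, and `tokenize` modules this repo depends on (not necessarily the authors' exact extraction code):
```python
# Static-analysis sketch for a toy program: token count (task-tokens),
# AST node count (task-nodes), cyclomatic complexity (task-cyclomatic),
# and Halstead volume (task-halstead).
import ast
import io
import tokenize
from radon.complexity import cc_visit
from radon.metrics import h_visit

program = """
def total_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""

n_tokens = sum(1 for _ in tokenize.generate_tokens(io.StringIO(program).readline))
n_nodes = sum(1 for _ in ast.walk(ast.parse(program)))
cyclomatic = sum(block.complexity for block in cc_visit(program))
halstead_volume = h_visit(program).total.volume
print(n_tokens, n_nodes, cyclomatic, halstead_volume)
```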
**Code Models**
- `code-projection` (presence of tokens)
- `code-bow` (token frequency)
- `code-tfidf` (token and document frequency)
- `code-seq2seq` [1](https://github.com/IBM/pytorch-seq2seq) (sequence modeling)
- `code-xlnet` [2](https://arxiv.org/pdf/1906.08237.pdf) (autoregressive LM)
- `code-gpt2` [4](https://huggingface.co/microsoft/CodeGPT-small-py) (autoregressive LM)
- `code-bert` [5](https://arxiv.org/pdf/2002.08155.pdf) (masked LM)
- `code-roberta` [6](https://huggingface.co/huggingface/CodeBERTa-small-v1) (masked LM)
- `code-transformer` [3](https://arxiv.org/pdf/2103.11318.pdf) (LM + structure learning)
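As a hedged example of extracting a per-program representation from one of these models via the standard `transformers` API (mean-pooled final hidden states; the repo's own extraction may differ):
```python
# Sketch: embed a program with CodeGPT (model id from the list above) by
# mean-pooling the final hidden states over its tokens.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "microsoft/CodeGPT-small-py"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

program = "x = 1\nfor i in range(3):\n    x += i\n"
inputs = tokenizer(program, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, T, D_M)
embedding = hidden.mean(dim=1).squeeze(0)       # (D_M,) program representation
```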
## License
This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).
Owner
- Name: Anyscale Learning For All (ALFA)
- Login: ALFA-group
- Kind: organization
- Email: alfa-apply@csail.mit.edu
- Location: Cambridge, MA, USA
- Website: https://alfagroup.csail.mit.edu/
- Repositories: 19
- Profile: https://github.com/ALFA-group
Scalable machine learning technology, Adversarial AI, Evolutionary algorithms, and data science frameworks.
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Dependencies
Dockerfile
docker
- continuumio/miniconda3 latest build
requirements.txt
pypi
- astor ==0.8.1
- datasets ==1.9.0
- dill ==0.3.4
- joblib ==0.14.1
- line_profiler ==3.3.0
- lxml ==4.8.0
- matplotlib ==3.3.4
- mne ==0.24.1
- mypy ==0.941
- numpy ==1.18.1
- pylint ==2.13.4
- pylint-exit ==1.2.0
- pylint-json2html ==0.3.0
- radon ==5.1.0
- scikit_learn ==0.24.1
- scipy ==1.4.1
- tensorflow ==2.3.0
- torch ==1.4.0
- torchtext ==0.5.0
- tqdm ==4.43.0
- transformers ==3.1.0