proteinworkshop

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)

https://github.com/a-r-j/proteinworkshop

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 14 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, biorxiv.org, sciencedirect.com, wiley.com, nature.com, science.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.5%) to scientific vocabulary

Keywords

benchmark dataset deep-learning lightning pretraining protein protein-structure pytorch

Keywords from Contributors

interactome molecule structural-biology rna protein-data-bank graph-neural-networks dgl drug-discovery computational-biology bioinformatics
Last synced: 6 months ago

Repository

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)

Basic Info
  • Host: GitHub
  • Owner: a-r-j
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage: https://proteins.sh/
  • Size: 21.2 MB
Statistics
  • Stars: 252
  • Watchers: 8
  • Forks: 22
  • Open Issues: 7
  • Releases: 6
Topics
benchmark dataset deep-learning lightning pretraining protein protein-structure pytorch
Created almost 3 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License Citation

README.md

Protein Workshop

[Badges: PyPI version · Zenodo DOI · Tests · Project Status: Active (stable, usable, actively developed) · License: MIT · Docs · Code style: black]

[Figure: Overview of the Protein Workshop]

Documentation

This repository provides the code for the protein structure representation learning benchmark detailed in the paper Evaluating Representation Learning on the Protein Structure Universe (ICLR 2024).

In the benchmark, we implement numerous featurisation schemes, datasets for self-supervised pre-training and downstream evaluation, pre-training tasks, and auxiliary tasks.

The benchmark can be used as a working template for a protein representation learning research project, a library of drop-in components for use in your projects, or as a CLI tool for quickly running protein representation learning evaluation and pre-training configurations.

Processed datasets and pre-trained weights are made available. Downloading datasets is not required; upon first run all datasets will be downloaded and processed from their respective source.

Configuration files to run the experiments described in the manuscript are provided in the proteinworkshop/config/sweeps/ directory.


Installation

Below, we outline how to set up a virtual environment for proteinworkshop. Note that these installation instructions currently target Linux-like systems with NVIDIA CUDA support; Windows and macOS are not officially supported.

From PyPI

proteinworkshop is available for installation from PyPI. This enables training specific configurations via the CLI, or using individual components from the benchmark (such as datasets, featurisers, or transforms) as drop-ins in other projects. Make sure to install PyTorch (version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired.

```bash
# install proteinworkshop from PyPI
pip install proteinworkshop

# install PyTorch Geometric using the (now-installed) CLI
workshop install pyg

# set a custom data directory for file downloads; otherwise, all data will be downloaded to site-packages
export DATAPATH="where/you/want/data/"  # e.g., `export DATAPATH="proteinworkshop/data"`
```

However, for full exploration we recommend cloning the repository and building from source.

Building from source

With a local virtual environment activated (e.g., one created with conda create -n proteinworkshop python=3.10):

  1. Clone and install the project

```bash
git clone https://github.com/a-r-j/ProteinWorkshop
cd ProteinWorkshop
pip install -e .
```
  2. Install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired

    ```bash
    # e.g., to install PyTorch with CUDA 11.8 support on Linux:
    pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
    ```

  3. Then use the newly-installed proteinworkshop CLI to install PyTorch Geometric

    ```bash
    workshop install pyg
    ```

  4. Configure paths in .env (optional; will override default paths if set). See .env.example for an example.

  5. Download PDB data:

    ```bash
    python proteinworkshop/scripts/download_pdb_mmtf.py
    ```

Tutorials

We provide a five-part series of Jupyter notebook tutorials with examples of how to use and extend proteinworkshop, as outlined below.

  1. Training a new model
  2. Customizing an existing dataset
  3. Adding a new dataset
  4. Adding a new model
  5. Adding a new task

Quickstart

Downloading datasets

Datasets can either be built from the source structures or downloaded from Zenodo. Datasets will be built from source the first time a dataset is used in a run (or by calling the appropriate setup() method in the corresponding datamodule). We provide a CLI tool for downloading datasets:

```bash
workshop download
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc.
```

If you wish to build datasets from source, we recommend first downloading the entire PDB (in MMTF format, c. 24 GB) to reuse shared PDB data as much as possible:

```bash
workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py
```

Training a model

Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):

```bash
workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu  # or trainer=gpu
```

This command uses the default configurations in configs/train.yaml, which can be overridden by equivalently named options. For instance, you can use a different input featurisation with the features option, or set the display name of your experiment on wandb with the name option:

```bash
workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu  # or trainer=gpu
```

Finetuning a model

Finetuning a model additionally requires specification of a checkpoint.

```bash
workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu  # or trainer=gpu
```

Running a sweep/experiment

We can make use of the Hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks, and datasets.

See proteinworkshop/config/sweeps/ for examples.

  1. Create the sweep with Weights & Biases

    ```bash
    wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml
    ```

  2. Launch job workers

With wandb:

```bash
wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8
```

Or an example SLURM submission script:

```bash
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32

source ~/.bashrc
source $(conda info --base)/envs/proteinworkshop/bin/activate

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1
```

Reproduce the sweeps performed in the manuscript:

```bash
# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8

# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8

# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8
```

Embedding a dataset

We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model. To run it:

```bash
python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME
```

See the embed section of proteinworkshop/config/embed.yaml for additional parameters.

Visualising pre-trained model embeddings for a given dataset

We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset. To run it:

```bash
python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png
```

See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.

Performing attribution of a pre-trained model

We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients.

This will write PDB files for all the structures in a dataset for a supervised task, with residue-level attributions in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.

To run the attribution:

```bash
python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY
```

See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
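If you prefer to inspect the attributions programmatically, the per-residue scores can be read back from the b_factor column of the output PDB files, e.g. with BioPandas (already among the project's dependencies). A minimal sketch, assuming a hypothetical output file named 1abc.pdb in the attribution directory:

```python
from biopandas.pdb import PandasPdb

# read one attribution-annotated PDB file written by explain.py
# (the filename here is hypothetical)
ppdb = PandasPdb().read_pdb("ATTRIBUTION/DIRECTORY/1abc.pdb")

# attributions are stored per atom in the b_factor column;
# average over atoms to recover one score per residue
atoms = ppdb.df["ATOM"]
per_residue = atoms.groupby("residue_number")["b_factor"].mean()
print(per_residue.head())
```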

Verifying a config

```bash
python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding
```

Using proteinworkshop modules functionally

One may use the modules (e.g., datasets, models, featurisers, and utilities) of proteinworkshop functionally by importing them directly. When this package is installed from PyPI, this makes building on top of the assets of proteinworkshop straightforward and convenient.

For example, to use any datamodule available in proteinworkshop:

```python
from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()

train_dl = datamodule.train_dataloader()
```

To use any model or featuriser available in proteinworkshop:

```python
from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

example_batch = create_example_batch()
batch = ca_featuriser(example_batch)

model_outputs = model(example_batch)
```

Read the docs for a full list of modules available in proteinworkshop.

Models

Invariant Graph Encoders

| Name | Source | Protein Specific |
| ----------- | ----------- | ----------- |
| GearNet | Zhang et al. | |
| DimeNet++ | Gasteiger et al. | |
| SchNet | Schütt et al. | |
| CDConv | Fan et al. | |

Equivariant Graph Encoders

(Vector-type)

| Name | Source | Protein Specific |
| ----------- | ----------- | --------- |
| GCPNet | Morehead et al. | |
| GVP-GNN | Jing et al. | |
| EGNN | Satorras et al. | |

(Tensor-type)

| Name | Source | Protein Specific |
| ----------- | ----------- | --------- |
| Tensor Field Network | Corso et al. | |
| Multi-ACE | Batatia et al. | |

Sequence-based Encoders

| Name | Source | Protein Specific |
| ----------- | ----------- | ----------- |
| ESM2 | Lin et al. | |

Datasets

To download a (processed) dataset from Zenodo, you can run

```bash
workshop download <DATASET_NAME>
```

where <DATASET_NAME> is given in the first column of the tables below.

Otherwise, simply starting a training run will download and process the data from source.

Structure-based Pre-training Corpuses

Pre-training corpuses (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, resulting in very small disk space requirements despite the large number of structures. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.
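For orientation, FoldComp databases can be read directly with the foldcomp Python package. The following is a minimal sketch, assuming the afdb_rep_v4 database files have already been downloaded to the working directory:

```python
import foldcomp

# open a FoldComp database; iterating yields (name, pdb_string) pairs,
# with each entry decompressed to a PDB-format string on the fly
with foldcomp.open("afdb_rep_v4") as db:
    for name, pdb_str in db:
        print(name, len(pdb_str))
        break  # just peek at the first structure
```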

| Name | Description | Source | Size | Disk Size | License |
| ----------- | ----------- | ----------- | --- | -- | ---- |
| `astral` | SCOPe domain structures | SCOPe/ASTRAL | | 1 - 2.2 GB | Publicly available |
| `afdb_rep_v4` | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | Barrio-Hernandez et al. | 2.27M chains | 9.6 GB | GPL-3.0 |
| `afdb_rep_dark_v4` | Dark proteome structures identified by structural clustering of the AlphaFold database | Barrio-Hernandez et al. | ~800k chains | 2.2 GB | GPL-3.0 |
| `afdb_swissprot_v4` | AlphaFold2 predictions for SwissProt/UniProtKB | Kim et al. | 542k chains | 2.9 GB | GPL-3.0 |
| `afdb_uniprot_v4` | AlphaFold2 predictions for UniProt | Kim et al. | 214M chains | 1 TB | GPL-3.0 / CC-BY 4.0 |
| `cath` | CATH 4.2 40% split by CATH topologies | Ingraham et al. | ~18k chains | 4.3 GB | CC-BY 4.0 |
| `esmatlas` | ESMAtlas predictions (full) | Kim et al. | | 1 TB | GPL-3.0 / CC-BY 4.0 |
| `esmatlas_v2023_02` | ESMAtlas predictions (v2023_02 release) | Kim et al. | | 137 GB | GPL-3.0 / CC-BY 4.0 |
| `highquality_clust30` | [ESMAtlas](https://esmatlas.com/) High Quality predictions | [Kim et al.](https://academic.oup.com/bioinformatics/article/39/4/btad153/7085592) | 37M chains | 114 GB | [GPL-3.0](https://github.com/steineggerlab/foldcomp/blob/master/LICENSE.txt) / [CC-BY 4.0](https://esmatlas.com/about) |
| `igfold_paired_oas` | IGFold predictions for [Paired OAS](https://journals.aai.org/jimmunol/article/201/8/2502/107069/Observed-Antibody-Space-A-Resource-for-Data-Mining) | [Ruffolo et al.](https://www.nature.com/articles/s41467-023-38063-x) | 104,994 paired Ab chains | | [CC-BY 4.0](https://www.nature.com/articles/s41467-023-38063-x#rightslink) |
| `igfold_jaffe` | IGFold predictions for [Jaffe2022](https://www.nature.com/articles/s41586-022-05371-z) data | [Ruffolo et al.](https://www.nature.com/articles/s41467-023-38063-x) | 1,340,180 paired Ab chains | | [CC-BY 4.0](https://www.nature.com/articles/s41467-023-38063-x#rightslink) |
| `pdb` | Experimental structures deposited in the RCSB Protein Data Bank | wwPDB consortium | ~800k chains | 23 GB | CC0 1.0 |

Additionally, we provide several species-specific compilations (mostly reference species):

| Name | Description | Source | Size |
| ---------------- | ----------- | ------ | ---- |
| `a_thaliana` | _Arabidopsis thaliana_ (thale cress) proteome | AlphaFold2 | |
| `c_albicans` | _Candida albicans_ (a fungus) proteome | AlphaFold2 | |
| `c_elegans` | _Caenorhabditis elegans_ (roundworm) proteome | AlphaFold2 | |
| `d_discoideum` | _Dictyostelium discoideum_ (slime mold) proteome | AlphaFold2 | |
| `d_melanogaster` | [_Drosophila melanogaster_](https://www.uniprot.org/taxonomy/7227) (fruit fly) proteome | AlphaFold2 | |
| `d_rerio` | [_Danio rerio_](https://www.uniprot.org/taxonomy/7955) (zebrafish) proteome | AlphaFold2 | |
| `e_coli` | _Escherichia coli_ (a bacterium) proteome | AlphaFold2 | |
| `g_max` | _Glycine max_ (soy bean) proteome | AlphaFold2 | |
| `h_sapiens` | _Homo sapiens_ (human) proteome | AlphaFold2 | |
| `m_jannaschii` | _Methanocaldococcus jannaschii_ (an archaeon) proteome | AlphaFold2 | |
| `m_musculus` | _Mus musculus_ (mouse) proteome | AlphaFold2 | |
| `o_sativa` | _Oryza sativa_ (rice) proteome | AlphaFold2 | |
| `r_norvegicus` | _Rattus norvegicus_ (brown rat) proteome | AlphaFold2 | |
| `s_cerevisiae` | _Saccharomyces cerevisiae_ (brewer's yeast) proteome | AlphaFold2 | |
| `s_pombe` | _Schizosaccharomyces pombe_ (a fungus) proteome | AlphaFold2 | |
| `z_mays` | _Zea mays_ (corn) proteome | AlphaFold2 | |

Supervised Datasets

| Name | Description | Source | License |
| ----------- | ----------- | ----------- | ---- |
| `antibody_developability` | Antibody developability prediction | Chen et al. | CC-BY 3.0 |
| `atom3d_msp` | Mutation stability prediction | Townshend et al. | MIT |
| `atom3d_ppi` | Protein-protein interaction prediction | Townshend et al. | MIT |
| `atom3d_psr` | Protein structure ranking | Townshend et al. | MIT |
| `atom3d_res` | Residue identity prediction | Townshend et al. | MIT |
| `ccpdb_ligands` | Ligand binding residue prediction | Agrawal et al. | Publicly available |
| `ccpdb_metal` | Metal ion binding residue prediction | Agrawal et al. | Publicly available |
| `ccpdb_nucleic` | Nucleic acid binding residue prediction | Agrawal et al. | Publicly available |
| `ccpdb_nucleotides` | Nucleotide binding residue prediction | Agrawal et al. | Publicly available |
| `deep_sea_proteins` | Gene Ontology prediction (Biological Process) | Sieg et al. | Public domain |
| `go-bp` | Gene Ontology prediction (Biological Process) | Gligorijevic et al. | CC-BY 4.0 |
| `go-cc` | Gene Ontology prediction (Cellular Component) | Gligorijevic et al. | CC-BY 4.0 |
| `go-mf` | Gene Ontology prediction (Molecular Function) | Gligorijevic et al. | CC-BY 4.0 |
| `ec_reaction` | Enzyme Commission (EC) number prediction | Hermosilla et al. | MIT |
| `fold_fold` | Fold prediction, split at the fold level | Hou et al. | CC-BY 4.0 |
| `fold_family` | Fold prediction, split at the family level | Hou et al. | CC-BY 4.0 |
| `fold_superfamily` | Fold prediction, split at the superfamily level | Hou et al. | CC-BY 4.0 |
| `masif_site` | Protein-protein interaction site prediction | Gainza et al. | Apache 2.0 |
| `metal_3d` | Zinc binding site prediction | Duerr et al. | MIT |
| `ptm` | Post-translational modification site prediction | Yan et al. | CC-BY 4.0 |

Tasks

Self-Supervised Tasks

| Name | Description | Source |
| ----------- | ----------- | ----------- |
| `inverse_folding` | Predict amino acid sequence given structure | |
| `residue_prediction` | Masked residue type prediction | |
| `distance_prediction` | Masked edge distance prediction | Zhang et al. |
| `angle_prediction` | Masked triplet angle prediction | Zhang et al. |
| `dihedral_angle_prediction` | Masked quadruplet dihedral prediction | Zhang et al. |
| `multiview_contrast` | Contrastive learning with multiple crops and InfoNCE loss | Zhang et al. |
| `structural_denoising` | Denoising of atomic coordinates with SE(3) decoders | |
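To make the flavour of these objectives concrete, the sketch below illustrates the basic idea behind structural denoising (an illustrative toy reimplementation, not the benchmark's own code): coordinates are corrupted with Gaussian noise, and a model is trained to recover the noise (equivalently, the clean coordinates).

```python
import torch

def corrupt_structure(pos: torch.Tensor, sigma: float = 0.1):
    """Corrupt atomic coordinates with isotropic Gaussian noise.

    Returns the noised coordinates and the noise itself, which serves
    as the regression target for a denoising objective.
    """
    noise = torch.randn_like(pos) * sigma
    return pos + noise, noise

coords = torch.randn(128, 3)  # a toy structure with 128 atoms
noised, target = corrupt_structure(coords)
# a denoising model would be trained so that model(noised) ≈ target
```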

Generic Supervised Tasks

Generic supervised tasks can be applied broadly across datasets. The labels are directly extracted from the PDB structures.

These are likely to be most frequently used with the pdb dataset class, which wraps the PDB Dataset curator from Graphein.

| Name | Description | Requires |
| ----------- | ----------- | ----------- |
| `binding_site_prediction` | Predict ligand binding residues | HETATM ligands (for training) |
| `ppi_site_prediction` | Predict protein binding residues | `graph_y` attribute in data objects specifying the desired chain to select interactions for (for training) |

Featurisation Schemes

Part of the goal of the proteinworkshop benchmark is to investigate how increasing the granularity of structural detail affects performance. To achieve this, we provide several featurisation schemes for protein structures.

Invariant Node Features

N.B. All angular features are provided in [sin, cos]-transformed form, e.g. $\textrm{dihedrals} = [\sin(\phi), \cos(\phi), \sin(\psi), \cos(\psi), \sin(\omega), \cos(\omega)]$; hence their dimensionality is double the number of angles.

| Name | Description | Dimensionality |
| ----------- | ----------- | ----------- |
| `residue_type` | One-hot encoding of amino acid type | 21 |
| `positional_encoding` | Transformer-like positional encoding of sequence position | 16 |
| `alpha` | Virtual torsion angle defined by the four $C_\alpha$ atoms of residues $i-1, i, i+1, i+2$ | 2 |
| `kappa` | Virtual bond angle (bend angle) defined by the three $C_\alpha$ atoms of residues $i-2, i, i+2$ | 2 |
| `dihedrals` | Backbone dihedral angles $(\phi, \psi, \omega)$ | 6 |
| `sidechain_torsions` | Sidechain torsion angles $(\chi_{1-4})$ | 8 |
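As a toy illustration of the [sin, cos] convention (not the benchmark's featuriser itself), note how the three backbone dihedrals per residue become a 6-dimensional feature:

```python
import numpy as np

def encode_angles(angles: np.ndarray) -> np.ndarray:
    """[sin, cos]-transform angular features: each angle maps to two
    values, so the feature dimensionality doubles."""
    sin_cos = np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return sin_cos.reshape(*angles.shape[:-1], -1)

# three backbone dihedrals (phi, psi, omega) for 10 residues
dihedrals = np.random.uniform(-np.pi, np.pi, size=(10, 3))
features = encode_angles(dihedrals)
assert features.shape == (10, 6)  # [sin, cos] per angle
```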

Equivariant Node Features

| Name | Description | Dimensionality |
| ----------- | ----------- | ----------- |
| `orientation` | Forward and backward node orientation vectors (unit-normalized) | 2 |
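A minimal sketch of how such orientation vectors can be computed from per-residue $C_\alpha$ coordinates (an illustrative reimplementation; the benchmark's featuriser may treat chain termini and padding differently):

```python
import torch
import torch.nn.functional as F

def node_orientations(ca: torch.Tensor) -> torch.Tensor:
    """Forward/backward unit vectors between consecutive C-alpha atoms,
    i.e. two vector-valued features per node ([N, 2, 3])."""
    diff = F.normalize(ca[1:] - ca[:-1], dim=-1)             # [N-1, 3]
    forward = torch.cat([diff, torch.zeros(1, 3)], dim=0)    # pad last residue
    backward = torch.cat([torch.zeros(1, 3), -diff], dim=0)  # pad first residue
    return torch.stack([forward, backward], dim=1)

ca_coords = torch.randn(10, 3)
assert node_orientations(ca_coords).shape == (10, 2, 3)
```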

Edge Construction

We predominantly support two types of edges: $k$-NN and $\epsilon$ edges.

Edge types can be specified as follows:

```bash
python proteinworkshop/train.py ... features.edge_types=[knn_16, knn_32, eps_16]
```

where the suffix after knn or eps specifies $k$ (number of neighbours) or $\epsilon$ (distance threshold in Ångströms).
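A small sketch of the two constructions using torch_cluster (installed alongside PyTorch Geometric); the benchmark's own edge construction may differ in details such as self-loops or maximum neighbour counts:

```python
import torch
from torch_cluster import knn_graph, radius_graph

pos = torch.randn(100, 3)  # toy C-alpha coordinates for 100 residues

# knn_16: connect each node to its 16 nearest neighbours
knn_edges = knn_graph(pos, k=16)       # [2, num_edges] edge index

# eps_16: connect all pairs within a 16 Angstrom radius
eps_edges = radius_graph(pos, r=16.0)  # [2, num_edges] edge index
```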

Invariant Edge Features

| Name | Description | Dimensionality |
| ----------- | ----------- | ----------- |
| `edge_distance` | Euclidean distance between source and target nodes | 1 |
| `node_features` | Concatenated scalar node features of the source and target nodes | Number of scalar node features $\times 2$ |
| `edge_type` | Type annotation for each edge | 1 |
| `sequence_distance` | Sequence-based distance between source and target nodes | 1 |
| `pos_emb` | Structured Transformer-inspired positional embedding of $i - j$ for source node $i$ and target node $j$ | 16 |

Equivariant Edge Features

| Name | Description | Dimensionality |
| ----------- | ----------- | ----------- |
| `edge_vectors` | Edge directional vectors (unit-normalized) | 1 |
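For illustration, unit-normalized edge vectors can be derived from node positions and an edge index as follows (a sketch, not the benchmark's implementation):

```python
import torch
import torch.nn.functional as F

def edge_vectors(pos: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """Unit-normalized direction vector for each edge ([num_edges, 3])."""
    src, dst = edge_index
    return F.normalize(pos[dst] - pos[src], dim=-1)

pos = torch.randn(100, 3)
edge_index = torch.randint(0, 100, (2, 500))  # toy edge index
assert edge_vectors(pos, edge_index).shape == (500, 3)
```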

For Developers

Dependency Management

We use poetry to manage the project's underlying dependencies and to push updates to the project's PyPI package. To make changes to the project's dependencies, follow the instructions below to (1) install poetry on your local machine; (2) customize the dependencies; and (3) (de)activate the project's virtual environment using poetry:

  1. Install `poetry` for platform-agnostic dependency management using its installation instructions

    After installing `poetry`, to avoid potential [keyring errors](https://github.com/python-poetry/poetry/issues/1917#issuecomment-1235998997), disable its keyring usage by adding `PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring` to your shell's startup configuration and restarting your shell environment (e.g., `echo 'export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring' >> ~/.bashrc && source ~/.bashrc` for a Bash shell environment, and likewise for other shell environments).
  2. Install, add, or upgrade project dependencies

    ```bash
    poetry install  # install the latest project dependencies
    # or
    poetry add XYZ  # add dependency `XYZ` to the project
    # or
    poetry show     # list all dependencies currently installed
    # or
    poetry lock     # standardize the (now-)installed dependencies
    ```

  3. Activate the newly-created virtual environment following poetry's usage documentation

    ```bash
    # activate the environment on a `posix`-like (e.g., macOS or Linux) system
    source $(poetry env info --path)/bin/activate
    ```

    ```powershell
    # activate the environment on a `Windows`-like system
    & ((poetry env info --path) + "\Scripts\activate.ps1")
    ```

    ```bash
    # if desired, deactivate the environment
    deactivate
    ```

Code Formatting

To keep with the code style of the proteinworkshop repository, please format your commits with the following commands before opening a pull request:

```bash
# assuming you are located in the ProteinWorkshop top-level directory
isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black --config=pyproject.toml .
```

Documentation

To build a local version of the project's Sphinx documentation web pages:

```bash
# assuming you are located in the ProteinWorkshop top-level directory
pip install -r docs/.docs.requirements  # one-time only
rm -rf docs/build/ && sphinx-build docs/source/ docs/build/  # NOTE: errors can safely be ignored
```

Citing ProteinWorkshop

Please consider citing proteinworkshop if it proves useful in your work.

```bibtex
@inproceedings{jamasb2024evaluating,
  title={Evaluating Representation Learning on the Protein Structure Universe},
  author={Arian R. Jamasb and Alex Morehead and Chaitanya K. Joshi and Zuobai Zhang and Kieran Didi and Simon V. Mathis and Charles Harris and Jian Tang and Jianlin Cheng and Pietro Lio and Tom L. Blundell},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
```

Owner

  • Name: Arian Jamasb
  • Login: a-r-j
  • Kind: user
  • Location: Basel
  • Company: University of Cambridge

Principal ML Scientist @PrescientDesign / Tensor Jockey / PhD @ University of Cambridge. Prev: MILA, Google X, Relation Therapeutics

Citation (citation.bib)

@inproceedings{
jamasb2024evaluating,
title={Evaluating Representation Learning on the Protein Structure Universe},
author={Arian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, Tom L. Blundell},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
}

GitHub Events

Total
  • Issues event: 9
  • Watch event: 47
  • Delete event: 1
  • Issue comment event: 22
  • Push event: 2
  • Pull request event: 4
  • Fork event: 8
Last Year
  • Issues event: 9
  • Watch event: 47
  • Delete event: 1
  • Issue comment event: 22
  • Push event: 2
  • Pull request event: 4
  • Fork event: 8

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 223
  • Total Committers: 13
  • Avg Commits per committer: 17.154
  • Development Distribution Score (DDS): 0.655
Past Year
  • Commits: 5
  • Committers: 2
  • Avg Commits per committer: 2.5
  • Development Distribution Score (DDS): 0.4
Top Committers
Name Email Commits
Arian Jamasb a****b@r****m 77
chaitjo c****9@g****m 43
Alex Morehead a****d@g****m 43
Arian Jamasb a****b@g****m 27
kierandidi k****i@g****m 16
Linus Leong 6****h 4
Mahdi Pourmirzaei 4****2 3
joshic2 c****i@r****m 3
dependabot[bot] 4****] 2
Simon Mathis s****s@g****m 2
Simon Mathis s****s@a****m 1
Gökçe Uludoğan g****n@g****m 1
Jamasb j****a@s****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 24
  • Total pull requests: 80
  • Average time to close issues: 11 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 16
  • Total pull request authors: 9
  • Average comments per issue: 3.63
  • Average comments per pull request: 0.71
  • Merged pull requests: 69
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 5
  • Pull requests: 4
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 11 hours
  • Issue authors: 5
  • Pull request authors: 2
  • Average comments per issue: 4.0
  • Average comments per pull request: 1.5
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • a-r-j (5)
  • amorehead (3)
  • pengzhangzhi (2)
  • yangzhang33 (2)
  • biochunan (1)
  • asaksager (1)
  • martinaegidius (1)
  • mahdip72 (1)
  • anonymous-0545 (1)
  • Katja-Jagd (1)
  • paoslaos (1)
  • ann-las (1)
  • wojcik2g (1)
  • AJB117 (1)
  • gokceuludogan (1)
Pull Request Authors
  • amorehead (29)
  • a-r-j (25)
  • chaitjo (15)
  • kierandidi (11)
  • linusyh (5)
  • dependabot[bot] (4)
  • mahdip72 (2)
  • gokceuludogan (2)
  • Croydon-Brixton (1)
Top Labels
Issue Labels
bug (3) dependency (1) enhancement (1) data (1)
Pull Request Labels
dependency (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 30 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 10
  • Total maintainers: 1
pypi.org: proteinworkshop
  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 30 Last month
Rankings
Stargazers count: 9.0%
Dependent packages count: 10.0%
Forks count: 13.3%
Average: 14.3%
Downloads: 17.3%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

environment.yaml pypi
  • aiohttp ==3.8.4
  • aiosignal ==1.3.1
  • alabaster ==0.7.13
  • alembic ==1.10.3
  • antlr4-python3-runtime ==4.9.3
  • anyio ==3.6.2
  • appdirs ==1.4.4
  • argparse ==1.4.0
  • arrow ==1.2.3
  • ase ==3.22.1
  • asttokens ==2.2.1
  • async-timeout ==4.0.2
  • atom3d ==0.2.6
  • attrs ==23.1.0
  • autopage ==0.5.1
  • babel ==2.12.1
  • backcall ==0.2.0
  • beartype ==0.13.1
  • beautifulsoup4 ==4.12.2
  • biopandas ==0.5.0.dev0
  • biopython ==1.81
  • bioservices ==1.11.2
  • biotite ==0.36.1
  • black ==23.3.0
  • bleach ==6.0.0
  • blessed ==1.20.0
  • blosc2 ==2.0.0
  • cattrs ==22.2.0
  • click ==8.1.3
  • cliff ==4.2.0
  • cmaes ==0.9.1
  • cmd2 ==2.4.3
  • colorlog ==6.7.0
  • contourpy ==1.0.7
  • croniter ==1.3.14
  • cycler ==0.11.0
  • cython ==0.29.34
  • dateutils ==0.6.12
  • decorator ==5.1.1
  • deepdiff ==6.3.0
  • defusedxml ==0.7.1
  • dill ==0.3.6
  • docker-pycreds ==0.4.0
  • docopt ==0.6.2
  • docutils ==0.19
  • e3nn ==0.5.1
  • easy-parallel ==0.1.6
  • easydev ==0.12.1
  • einops ==0.6.0
  • entrypoints ==0.4
  • et-xmlfile ==1.1.0
  • exceptiongroup ==1.1.1
  • executing ==1.2.0
  • fair-esm ==2.0.0
  • fastapi ==0.88.0
  • fastavro ==1.7.3
  • fastcore ==1.5.29
  • fastjsonschema ==2.16.3
  • foldcomp ==0.0.4.post1
  • fonttools ==4.39.3
  • freesasa ==2.2.0.post3
  • frozenlist ==1.3.3
  • fsspec ==2023.4.0
  • furo ==2023.3.27
  • gevent ==22.10.2
  • gitdb ==4.0.10
  • gitpython ==3.1.31
  • goatools ==1.3.1
  • greenlet ==2.0.2
  • grequests ==0.6.0
  • h11 ==0.14.0
  • h5py ==3.8.0
  • httpcore ==0.17.0
  • httpx ==0.24.0
  • hydra-colorlog ==1.2.0
  • hydra-core ==1.3.2
  • hydra-optuna-sweeper ==1.2.0
  • icecream ==2.1.3
  • imagesize ==1.4.1
  • importlib-metadata ==6.4.1
  • importlib-resources ==5.12.0
  • inquirer ==3.1.3
  • ipython ==8.12.0
  • isort ==5.12.0
  • itsdangerous ==2.1.2
  • jaxtyping ==0.2.15
  • jedi ==0.18.2
  • jsonschema ==4.17.3
  • jupyter-client ==8.2.0
  • jupyter-core ==5.3.0
  • jupyterlab-pygments ==0.2.2
  • kiwisolver ==1.4.4
  • lightning ==2.0.1.post0
  • lightning-cloud ==0.5.33
  • lightning-utilities ==0.8.0
  • lion-pytorch ==0.0.7
  • lmdb ==1.4.1
  • loguru ==0.7.0
  • looseversion ==1.1.2
  • lovely-numpy ==0.2.8
  • lovely-tensors ==0.1.14
  • lxml ==4.9.2
  • m2r2 ==0.3.3.post2
  • mako ==1.2.4
  • markdown-it-py ==2.2.0
  • matplotlib ==3.7.1
  • matplotlib-inline ==0.1.6
  • mdurl ==0.1.2
  • mistune ==0.8.4
  • mmtf-python ==1.1.3
  • msgpack ==1.0.5
  • multidict ==6.0.4
  • multipledispatch ==0.6.0
  • multiprocess ==0.70.14
  • mypy-extensions ==1.0.0
  • nbclient ==0.7.3
  • nbconvert ==6.5.4
  • nbformat ==5.8.0
  • nbsphinx ==0.9.1
  • nbsphinx-link ==1.3.0
  • nbstripout ==0.6.1
  • numexpr ==2.8.4
  • numpy ==1.23.5
  • omegaconf ==2.3.0
  • openpyxl ==3.1.2
  • opt-einsum ==3.3.0
  • opt-einsum-fx ==0.1.4
  • optuna ==2.10.1
  • ordered-set ==4.1.0
  • pandas ==1.5.3
  • pandoc ==2.3
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • pathos ==0.3.0
  • pathspec ==0.11.1
  • pathtools ==0.1.2
  • patsy ==0.5.3
  • pbr ==5.11.1
  • pexpect ==4.8.0
  • pickleshare ==0.7.5
  • plotly ==5.14.1
  • plumbum ==1.8.1
  • ply ==3.11
  • pox ==0.3.2
  • ppft ==1.7.6.6
  • prettytable ==3.7.0
  • prompt-toolkit ==3.0.38
  • proteinshake ==0.3.9
  • protobuf ==4.22.3
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.2
  • py-cpuinfo ==9.0.0
  • pydantic ==1.10.7
  • pydocstyle ==6.3.0
  • pydot ==1.4.2
  • pygments ==2.15.0
  • pyjwt ==2.6.0
  • pyperclip ==1.8.2
  • pyrootutils ==1.0.4
  • pyrr ==0.10.3
  • pyrsistent ==0.19.3
  • python-dateutil ==2.8.2
  • python-dotenv ==1.0.0
  • python-editor ==1.0.4
  • python-multipart ==0.0.6
  • pytorch-lightning ==2.0.1.post0
  • pytz ==2023.3
  • pyyaml ==5.4.1
  • pyzmq ==25.0.2
  • rdkit-pypi ==2022.9.5
  • readchar ==4.0.5
  • requests-cache ==1.0.1
  • rich ==13.3.4
  • rich-click ==1.6.1
  • seaborn ==0.12.2
  • sentry-sdk ==1.19.1
  • setproctitle ==1.3.2
  • six ==1.16.0
  • smmap ==5.0.0
  • sniffio ==1.3.0
  • snowballstemmer ==2.2.0
  • soupsieve ==2.4.1
  • sphinx ==6.1.3
  • sphinx-basic-ng ==1.0.0b1
  • sphinx-codeautolink ==0.14.1
  • sphinx-copybutton ==0.5.2
  • sphinx-inline-tabs ==2022.1.2b11
  • sphinxcontrib-applehelp ==1.0.4
  • sphinxcontrib-devhelp ==1.0.2
  • sphinxcontrib-gtagjs ==0.2.1
  • sphinxcontrib-htmlhelp ==2.0.1
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.3
  • sphinxcontrib-serializinghtml ==1.1.5
  • sphinxext-opengraph ==0.8.2
  • sqlalchemy ==2.0.9
  • stack-data ==0.6.2
  • starlette ==0.22.0
  • starsessions ==1.3.0
  • statsmodels ==0.13.5
  • stevedore ==5.0.0
  • suds-community ==1.1.2
  • tables ==3.8.0
  • tenacity ==8.2.2
  • tinycss2 ==1.2.1
  • tomli ==2.0.1
  • torchmetrics ==0.11.4
  • tornado ==6.3
  • traitlets ==5.9.0
  • typeguard ==3.0.2
  • tzdata ==2023.3
  • url-normalize ==1.4.3
  • uvicorn ==0.21.1
  • wandb ==0.14.2
  • watermark ==2.3.1
  • wcwidth ==0.2.6
  • webencodings ==0.5.1
  • websocket-client ==1.5.1
  • websockets ==11.0.2
  • wget ==3.2
  • wrapt ==1.15.0
  • xarray ==2023.4.0
  • xlsxwriter ==3.1.0
  • xmltodict ==0.13.0
  • yarl ==1.8.2
  • zipp ==3.15.0
  • zope-event ==4.6
  • zope-interface ==6.0
setup.py pypi