ppiref

Dataset and package for working with protein-protein interactions in 3D

https://github.com/anton-bushuiev/ppiref

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, pubmed.ncbi, ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.7%) to scientific vocabulary

Keywords

datasets machine-learning protein-protein-interaction proteins
Last synced: 9 months ago

Repository

Dataset and package for working with protein-protein interactions in 3D

Basic Info
  • Host: GitHub
  • Owner: anton-bushuiev
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://ppiref.readthedocs.io
  • Size: 15.1 MB
Statistics
  • Stars: 91
  • Watchers: 4
  • Forks: 8
  • Open Issues: 4
  • Releases: 5
Topics
datasets machine-learning protein-protein-interaction proteins
Created over 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

PPIRef

PPIRef is a Python package for working with 3D structures of protein-protein interactions (PPIs). It is based on the PPIRef dataset, comprising all PPIs from the Protein Data Bank (PDB). The package aims to provide standard data and tools for machine learning and data science applications involving protein-protein interaction structures. PPIRef provides the following functionality:

  • Extracting protein-protein interfaces from .pdb files.
  • Visualizing and analyzing the properties of PPIs.
  • Comparing, deduplicating and clustering PPI interfaces.
  • Retrieving PPIs from PDB with a similar interface structure or sequence.
  • Downloading, splitting and subsampling prepared PPIs for machine learning applications.

Please see the documentation for usage examples and API reference. See also our paper for additional details.
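
To illustrate what the first item in the list above means, here is a minimal, self-contained sketch of distance-based interface extraction. It is not the package's implementation: the `Atom` record, the `interface_residues` helper, and the toy coordinates are hypothetical, and the 6 Å default is only chosen to mirror the 6A dataset name used in the quick start.

```python
import math
from typing import List, Set, Tuple

# Hypothetical minimal atom record: (chain id, residue number, atom name, x, y, z)
Atom = Tuple[str, int, str, float, float, float]


def interface_residues(atoms: List[Atom], chain_a: str, chain_b: str,
                       radius: float = 6.0) -> Set[Tuple[str, int]]:
    """Return (chain, residue number) pairs that have an atom within
    `radius` angstroms of any atom in the partner chain."""
    side_a = [a for a in atoms if a[0] == chain_a]
    side_b = [b for b in atoms if b[0] == chain_b]
    selected: Set[Tuple[str, int]] = set()
    # Brute-force pairwise check; fine for a toy example, slow for real structures
    for a in side_a:
        for b in side_b:
            if math.dist(a[3:], b[3:]) <= radius:
                selected.add((a[0], a[1]))
                selected.add((b[0], b[1]))
    return selected


# Toy coordinates (made up for illustration)
atoms = [
    ("A", 1, "CA", 0.0, 0.0, 0.0),
    ("A", 2, "CA", 20.0, 0.0, 0.0),
    ("B", 1, "CA", 4.0, 0.0, 0.0),
    ("B", 2, "CA", 0.0, 30.0, 0.0),
]
print(sorted(interface_residues(atoms, "A", "B")))  # [('A', 1), ('B', 1)]
```

For real structures, a spatial index (for example a k-d tree) would replace the double loop, and the package's own extraction utilities described in the documentation should be preferred.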

Quick start 🚀

Install the PPIRef package.

```bash
conda create -n ppiref python=3.10
conda activate ppiref
git clone https://github.com/anton-bushuiev/PPIRef.git
cd PPIRef; pip install -e .
```

Download the dataset using the package (in Python).

```python
from ppiref.utils.misc import download_from_zenodo
from ppiref.split import read_fold
from ppiref.utils.ppi import PPI

download_from_zenodo('ppi_6A.zip')  # or for example 'pdb_redo_ppi_10A.zip' for all 10-Angstrom PPIs from PDB-REDO

# Downloading: 100%|██████████| 6.94G/6.94G [10:19<00:00, 11.2MiB/s]
# Extracting: 100%|██████████| 831382/831382 [02:36<00:00, 5313.49files/s]
```

Read the data fold/subset you need (whole PPIRef50K in the example).

```python
ppi_paths = read_fold('ppiref_6A_filtered_clustered_04', 'whole')
print('Dataset size:', len(ppi_paths))

# Dataset size: 51755
```

Now you are ready to work with the PPIRef dataset! Example of a sample:

```python
ppi = PPI(ppi_paths[0])
print('Path:', ppi.path)
print('Statistics:', ppi.stats)
ppi.visualize()

# Path: /Users/anton/dev/PPIRef/ppiref/data/ppiref/ppi_6A/hc/3hch_A_B.pdb
# Statistics: {'KIND': 'heavy', 'EXTRACTION RADIUS': 6.0, 'EXPANSION RADIUS': 0.0, 'RESOLUTION': 2.1, 'STRUCTURE METHOD': 'x-ray diffraction', 'DEPOSITION DATE': '2009-05-06', 'RELEASE DATE': '2009-10-13', 'BSA': 682.5337386399999}
```
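
The sample above suggests two things that are handy when filtering the dataset: the file name encodes the PDB entry and the interacting chains, and the statistics dictionary carries quality metadata. The sketch below shows one way to use both. The helpers `parse_ppi_filename` and `passes_quality_filter` are hypothetical (not part of the package), the `<pdb id>_<chain>_<chain>.pdb` naming is an assumption inferred from the example path, and the thresholds are arbitrary.

```python
from datetime import date
from pathlib import Path
from typing import Tuple


def parse_ppi_filename(path: str) -> Tuple[str, Tuple[str, ...]]:
    """Split an assumed '<pdb id>_<chain>_<chain>.pdb' name into id and chains."""
    pdb_id, *chains = Path(path).stem.split("_")
    return pdb_id, tuple(chains)


def passes_quality_filter(stats: dict, max_resolution: float = 2.5,
                          released_before: date = date(2020, 1, 1)) -> bool:
    """Keep PPIs with good enough resolution and an early enough release date.

    The keys are the ones shown in the example statistics; thresholds are arbitrary.
    """
    released = date.fromisoformat(stats["RELEASE DATE"])
    return stats["RESOLUTION"] <= max_resolution and released < released_before


# Statistics copied from the example sample (subset of keys)
stats = {"RESOLUTION": 2.1, "RELEASE DATE": "2009-10-13"}

print(parse_ppi_filename("ppi_6A/hc/3hch_A_B.pdb"))  # ('3hch', ('A', 'B'))
print(passes_quality_filter(stats))  # True
```

A date cutoff like this is one simple way to build time-based splits; the package's own splitting utilities are described in the documentation.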

The PPIRef package also provides utilities for comparing, deduplicating, and clustering PPI interfaces, and for retrieving PPIs from PDB with a similar interface structure or sequence. Please see the documentation for more details.

TODO

The repository is under development. Please do not hesitate to contact us or create an issue/PR if you have any questions or suggestions ✌️.

Technical

  • [x] PPIRef (6A interfaces) on Zenodo
  • [x] PPIRef (10A interfaces) on Zenodo (expected in June 2024)
  • [x] PPIRef version based on the PDB-REDO database for higher-quality side chains in the structures (expected in June 2024)
  • [x] Docstrings

Enhancements

  • [ ] Cluster all PPIs to sample from clusters rather than removing near duplicates completely (similar to UniRef seeds)
  • [ ] Add RASA values to classify residues according to Levy 2010
  • [ ] Classify PPIs according to Ofran 2003
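
The first enhancement above (sampling from clusters instead of dropping near duplicates) could look roughly like the following sketch. This is only an illustration of the idea, not planned or existing package code: `sample_from_clusters`, the cluster assignment, and the PPI identifiers `1abc_A_B` and `2xyz_A_B` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_from_clusters(cluster_of: Dict[str, int], seed: int = 0) -> List[str]:
    """Draw one random member per cluster.

    Redundant PPIs stay in the dataset and can be drawn in other epochs,
    rather than being removed once and for all.
    """
    members = defaultdict(list)
    for ppi_id, cluster_id in cluster_of.items():
        members[cluster_id].append(ppi_id)
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return [rng.choice(sorted(ids)) for _, ids in sorted(members.items())]


# Hypothetical cluster assignment: PPI id -> cluster id
cluster_of = {"1abc_A_B": 0, "2xyz_A_B": 0, "3hch_A_B": 1}
epoch_sample = sample_from_clusters(cluster_of, seed=42)
print(len(epoch_sample))  # 2
```

Changing the seed per epoch would expose a model to different cluster members over time while keeping each epoch non-redundant.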

References

If you find this repository useful, please cite our paper:

```bibtex
@article{bushuiev2024learning,
  title={Learning to design protein-protein interactions with enhanced generalization},
  author={Anton Bushuiev and Roman Bushuiev and Petr Kouba and Anatolii Filkin and Marketa Gabrielova and Michal Gabriel and Jiri Sedlar and Tomas Pluskal and Jiri Damborsky and Stanislav Mazurenko and Josef Sivic},
  booktitle={ICLR 2024 (The Twelfth International Conference on Learning Representations)},
  url={https://doi.org/10.48550/arXiv.2310.18515},
  year={2024}
}
```

If relevant, please also cite the corresponding paper on data leakage in protein interaction benchmarks:

```bibtex
@article{bushuiev2024revealing,
  title={Revealing data leakage in protein interaction benchmarks},
  author={Anton Bushuiev and Roman Bushuiev and Jiri Sedlar and Tomas Pluskal and Jiri Damborsky and Stanislav Mazurenko and Josef Sivic},
  booktitle={ICLR 2024 Workshop on Generative and Experimental Perspectives for Biomolecular Design},
  url={https://doi.org/10.48550/arXiv.2404.10457},
  year={2024}
}
```

If you find any of the external software useful, please cite the corresponding papers (see PPIRef/external/README.md).

Owner

  • Name: Anton Bushuiev
  • Login: anton-bushuiev
  • Kind: user
  • Location: Prague
  • Company: Czech Technical University in Prague

PhD student. Machine learning / computational biology 🤖🌱

Citation (citation.bib)

@article{
  bushuiev2024learning,
  title={Learning to design protein-protein interactions with enhanced generalization},
  author={Anton Bushuiev and Roman Bushuiev and Petr Kouba and Anatolii Filkin and Marketa Gabrielova and Michal Gabriel and Jiri Sedlar and Tomas Pluskal and Jiri Damborsky and Stanislav Mazurenko and Josef Sivic},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}

GitHub Events

Total
  • Create event: 2
  • Release event: 2
  • Issues event: 7
  • Watch event: 11
  • Issue comment event: 15
  • Push event: 11
  • Fork event: 1
Last Year
  • Create event: 2
  • Release event: 2
  • Issues event: 7
  • Watch event: 11
  • Issue comment event: 15
  • Push event: 11
  • Fork event: 1