https://github.com/bartongroup/fragsys

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: bartongroup
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 23.7 MB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 3 years ago · Last pushed almost 2 years ago

README.md

FRAGSYS

This repository contains the fragment screening analysis pipeline (FRAGSYS) used for the analysis in our manuscript Classification of likely functional class for ligand binding sites identified from fragment screening.

Our pipeline for the analysis of binding sites, FRAGSYS, can be executed from the jupyter notebook running_fragsys.ipynb. The input for this pipeline is a table containing a series of PDB codes and their respective UniProt accession identifiers.

Installation

For complete installation instructions refer here.

Pipeline methodology

To run FRAGSYS, open the jupyter notebook running_fragsys.ipynb. You can run the pipeline interactively from a notebook with the command main(main_dir, prot, panddas), using the appropriate environment: varalign_env.

Here, main_dir is the directory where the output will be saved, prot is the query protein, and panddas is a pandas dataframe that must contain at least two columns, entry_uniprot_accession and pdb_id, for all protein structures in the data set.
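As a minimal sketch of the input described above, the cell below builds a small panddas dataframe and checks that it has the two required columns before the pipeline is called. The PDB codes, the output path, the helper check_panddas, and the import line are illustrative assumptions, not taken from the repository. The call to main is left commented out because it needs fragsys_main.py and varalign_env.

```python
import pandas as pd

# The two columns the pipeline requires, as stated in this README.
REQUIRED_COLUMNS = {"entry_uniprot_accession", "pdb_id"}


def check_panddas(panddas):
    """Hypothetical helper: raise if a required column is missing."""
    missing = REQUIRED_COLUMNS - set(panddas.columns)
    if missing:
        raise ValueError(f"panddas is missing columns: {sorted(missing)}")
    return panddas


# Illustrative rows only; replace them with your own structures.
panddas = pd.DataFrame(
    {
        "entry_uniprot_accession": ["P0DTD1", "P0DTD1"],
        "pdb_id": ["5r7y", "5r7z"],
    }
)
check_panddas(panddas)

main_dir = "./fragsys_output"  # assumed output directory
prot = "P0DTD1"  # query protein, assumed here to be a UniProt accession

# Inside the notebook, with varalign_env active (import form is an assumption):
# from fragsys_main import main
# main(main_dir, prot, panddas)
```

Checking the columns up front gives a clear error before any structure is downloaded or superimposed.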

For another example, check this other notebook where we ran FRAGSYS for the main protease (MPro) of SARS-CoV-2 (P0DTD1).

For each structural segment of each protein in panddas, FRAGSYS will:

  1. Download biological assemblies from PDBe
  2. Structurally superimpose structures using STAMP
  3. Get accessibility and secondary structure elements from DSSP via ProIntVar
  4. Map PDB residues to UniProt using SIFTS
  5. Obtain protein-ligand interactions by running Arpeggio
  6. Cluster ligands into binding sites using OC
  7. Generate visualisation scripts for UCSF Chimera
  8. Generate a multiple sequence alignment (MSA) with jackhmmer
  9. Calculate the Shenkin divergence score [1]
  10. Calculate missense enrichment scores with VarAlign
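To make step 9 concrete, here is a minimal sketch of the Shenkin divergence score for a single alignment column, using the usual form 6 × 2^H, where H is the Shannon entropy of the column in bits. The function name and the choice to ignore gap characters are assumptions made for illustration; this is not the code FRAGSYS runs.

```python
import math
from collections import Counter


def shenkin_score(column, gap_chars="-."):
    """Shenkin divergence of one alignment column: 6 * 2**H (H in bits).

    Gaps are ignored here, which is an assumption of this sketch.
    """
    residues = [aa for aa in column.upper() if aa not in gap_chars]
    if not residues:
        return float("nan")  # undefined for a gap-only column
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 6 * 2 ** entropy


conserved = shenkin_score("AAAAAAAA")  # fully conserved column
mixed = shenkin_score("AAAAGGGG")  # two residues, equally frequent
```

With this definition a fully conserved column scores 6, and the score grows as more residue types share the column more evenly.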

The final output of the pipeline consists of multiple tables for each structural segment collating the results from the different steps of the analysis for each residue, and for the defined ligand binding sites. These data include relative solvent accessibility (RSA), angles, secondary structure, PDB/UniProt residue number, alignment column, column occupancy, divergence score, missense enrichment score, p-value, etc.

These tables are concatenated into master tables, with data for all 37 structural segments, which form the input for the analyses carried out in the analysis notebooks.
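The concatenation into a master table can be sketched with pandas as follows. The column names, segment labels, values, and the function build_master_table are hypothetical and only illustrate the idea; the actual tables contain many more columns.

```python
import pandas as pd


def build_master_table(segment_tables):
    """Stack per-segment tables, recording which segment each row came from."""
    labelled = [
        table.assign(segment=name) for name, table in segment_tables.items()
    ]
    return pd.concat(labelled, ignore_index=True)


# Illustrative per-segment residue tables with made-up values.
segment_tables = {
    "segment_1": pd.DataFrame({"uniprot_resnum": [10, 11], "rsa": [12.5, 80.1]}),
    "segment_2": pd.DataFrame({"uniprot_resnum": [5], "rsa": [43.0]}),
}
master = build_master_table(segment_tables)
```

Keeping a segment column in the master table lets later analyses group or filter rows by structural segment.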

Refer to notebook 15 to predict RSA cluster labels for your binding sites of interest.

Dependencies

The pipeline and the whole of the analysis are run interactively in a series of jupyter notebooks, found in the analysis folder.

Third party dependencies for these notebooks include:

- Arpeggio (GNU GPL v3.0 License)
- DSSP (Boost Software License)
- Hmmer (BSD-3 Clause License)
- OC
- STAMP (GNU GPL v3.0 License)
- ProIntVar (MIT License)
- ProteoFAV (MIT License)
- VarAlign (MIT License)

Other standard python libraries:

- Biopython (BSD 3-Clause License)
- Keras (Apache v2.0 License)
- Matplotlib (PSF License)
- Numpy (BSD 3-Clause License)
- Pandas (BSD 3-Clause License)
- Scipy (BSD 3-Clause License)
- Seaborn (BSD 3-Clause License)
- Scikit-learn (BSD 3-Clause License)
- Tensorflow (Apache v2.0 License)

For more information on the dependencies, refer to the .yml files in the envs directory. To install all the dependencies, refer to the installation manual.

Files

Apart from the INSTALL, LICENSE and README files, there are 5 other files in the main directory of this repository: two python libraries, a configuration file and two notebooks.

+ fragsys_config.txt contains the default parameters to run FRAGSYS and is read by fragsys.py.
+ fragsys.py contains all the functions, lists and dictionaries needed to run the pipeline.
+ fragsys_main.py contains the main FRAGSYS function, where all functions in fragsys.py are called. This script represents the pipeline itself.
+ running_fragsys.ipynb is the notebook where the pipeline is executed in an interactive way.
+ running_fragsys_for_MPRO.ipynb.ipynb is the notebook where the pipeline is executed in an interactive way for a case study of SARS-CoV-2 MPro.

Directories

There are 6 directories in this repository.

scripts

This folder contains clean_pdb.py, a python script grabbed from here. This script is used to pre-process the PDB files before running Arpeggio on them.

envs

The envs folder contains the .yml files describing the necessary packages and dependencies for the different parts of the pipeline and analysis.

+ arpeggio_env contains Arpeggio.
+ deeplearningenv contains the packages necessary to do the machine learning in notebooks 11 and 12.
+ main_env supports all analysis notebooks, with the exception of notebooks 11 and 12, in which the machine learning models are executed.
+ varalign_env is needed to run FRAGSYS.

input

The input folder contains the main input file used to run FRAGSYS in the running_fragsys notebook.

analysis

The analysis folder contains all the notebooks used to carry out the analysis of the 37 fragment screening experiments. main_env is needed to run these notebooks.

results

The results folder contains all the results files generated by the notebooks in the analysis folder.

figs

The figs folder contains the main figures generated and saved by the analysis notebooks.

Citation

If you use FRAGSYS, please cite:

Utgés, J.S. et al. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol 7, 320 (2024). https://doi.org/10.1038/s42003-024-05970-8

References

  1. Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991; 11(4):297–313. Epub 1991/01/01. https://doi.org/10.1002/prot.340110408 PMID: 1758884.

Owner

  • Name: Geoff Barton's Computational Biology Group
  • Login: bartongroup
  • Kind: organization
  • Location: Dundee, Scotland, UK
