pmfpredictor-toolkit

A set of tools for the prediction of small molecule - surface potentials of mean force (PMFs) using TensorFlow.

https://github.com/ijrouse/pmfpredictor-toolkit

Repository

Basic Info

Host: GitHub
Owner: ijrouse
License: other
Language: Python
Default Branch: main
Homepage:
Size: 292 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 3

Created about 4 years ago · Last pushed over 2 years ago

README

===========
PMFPredictor Toolkit
Maintainer: Ian Rouse, ian.rouse@ucd.ie
===========

This repository contains the files and scripts needed to generate potentials of mean force (PMFs) via a machine-learning approach that employs an expansion in hypergeometric coefficients. This document explains how to use them.

===========
Introduction and quick start
===========

The easiest way to use this package is via the GUI. Run it with "python3 PMFPredictorToolkit.py", or make it executable with "chmod 755 PMFPredictorToolkit.py" and then run "./PMFPredictorToolkit.py".

To make everything work, you'll need an up-to-date Python installation with TensorFlow installed, which should also give you NumPy and SciPy. You'll also need pandas and matplotlib for predictions, and PySide6 to run the GUI. Finally, you need to download the model repository to get the pre-trained models used for the actual prediction of PMFs.
A set of pre-trained models corresponding to the original publication is available via Zenodo (DOI: 10.5281/zenodo.7253566).
Generating new chemical structures requires acpype ("pip3 install acpype"), which in turn requires openbabel: download the source, build it with Python bindings, install it, then run "pip3 install openbabel" to set up the bindings.
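
As a quick sanity check before launching the GUI, the sketch below reports which of the packages named above cannot be found by the current Python interpreter. The function name missing_packages is hypothetical and not part of the toolkit, and the list assumes the usual import names for each package.

```python
import importlib.util

# Import names of the packages listed above. acpype and openbabel are only
# needed when generating new chemical structures, so they are not checked here.
REQUIRED = ["tensorflow", "numpy", "scipy", "pandas", "matplotlib", "PySide6"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

An empty list means every package could be located; it does not guarantee that the installed versions are compatible with the pre-trained models.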

If you generate or use results from this repository, please cite the following publication:
I. Rouse, V. Lobaskin, Machine-learning based prediction of small molecule -- surface interaction potentials, arXiv:2211.07999
https://arxiv.org/abs/2211.07999
===========
Complete Contents
===========
The toolkit contains a number of scripts, broadly divided into levels as follows (see the PMFPredictorWorkflow.pdf flowchart for a high-level summary):

Level 0: Pre-processing of surface and chemical structures.
ACPYPE2CSV.py: A small helper script that calls ACPYPE (which must be installed separately) to convert SMILES codes into the CSV molecular format used by the rest of the toolkit. This is kept in its own folder due to the large volume of output generated during operation.
NPtoCSV.py: A tool for converting NP structures generated using CHARMM-GUI to the CSV molecular format used by the toolkit. GROMACS output must be enabled when running CHARMM-GUI. By default the script uses the non-equilibrated structures, but this can be edited: any .gro file containing co-ordinates can be used as input.

Level 1: Generation of potentials.
GenerateSurfacePotentials.py: Takes in a set of CSV-formatted surface structures and calculates the free energy for water and for a set of point probes, each specified by Lennard-Jones parameters and a charge.
GenerateChemicalPotentials.py: Takes in a set of CSV-formatted chemical structures and calculates the free energy for water and for a set of point probes, each specified by Lennard-Jones parameters and a charge.

Level 2: Extraction of hypergeometric expansion (HGE) coefficients
HGExpandSurfacePotential.py, HGExpandChemicalPotential.py: Construct HG expansions of the potentials obtained using the Level 1 scripts for all surfaces and chemicals.
HGExpandPMFs.py: Constructs HG expansions of the PMFs used for training.

Level 3: Neural network training
BuildPMFPredictor.py: Constructs and trains the ANN used to convert HG coefficients describing the input potentials to the coefficients describing a PMF.

Level 4: PMF output
BuildPredictedPMFs.py: Generates PMFs for all surface/chemical pairs using the model trained in Level 3.
CalcEAdsSet.py: Extracts binding energies from the generated PMFs.


Most of these are accessible via the GUI (PMFPredictorToolkit.py), but some are more efficient to run from the command line. See the Wiki page on GitHub for details.
===========
Source? SSD type?
===========
The "source" variable for a surface defines whether it should be treated as a Stockholm-style calculation (if 0 or 1) or a UCD style (if 1 or 2). These most obviously lead to differences for gold, for which UCD-style has a large water peak and Stockholm-style does not. The choice for new materials is left to the user but it should be noted that there are significantly more Stockholm-style PMFs in the training set covering a wider range of materials.
"SSD type" is provided to differentiate between different definitions of the SSD and comes in four flavours:
1) SSD is defined relative to a fixed z or rho co-ordinate for planar and cylindrical surfaces, respectively.
2) SSD is defined as the minimum distance between the COM of the adsorbate and any atom in the surface.
3) SSD is defined as the centre-of-slab to centre-of-adsorbate distance minus one-half the slab width.
4) SSD is defined as the centre-of-sheet to centre-of-adsorbate distance.
Note that the SSD variable is zero-indexed, so class 1 corresponds to a value of 0, and so on.

Generally, we suggest using source type 1 (Stockholm with ions) together with SSD class 1 (reference surface, variable = 0); this is implemented as the default option. These settings are also used to generate the canonical set.
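
The relationship between the one-based SSD classes listed above and the zero-indexed SSD variable can be sketched as follows. The names SSD_TYPES, DEFAULT_SOURCE, DEFAULT_SSD_VALUE and ssd_value_from_class are hypothetical and chosen for illustration only; they are not taken from the toolkit.

```python
# SSD definitions as listed above, keyed by the zero-indexed value stored in
# the SSD variable (class 1 -> 0, class 2 -> 1, and so on).
SSD_TYPES = {
    0: "fixed z (planar) or rho (cylindrical) reference co-ordinate",
    1: "minimum distance from the adsorbate COM to any surface atom",
    2: "centre-of-slab to centre-of-adsorbate distance minus half the slab width",
    3: "centre-of-sheet to centre-of-adsorbate distance",
}

# Suggested defaults from the text: source type 1, SSD class 1 (variable = 0).
DEFAULT_SOURCE = 1
DEFAULT_SSD_VALUE = 0


def ssd_value_from_class(ssd_class):
    """Convert a one-based SSD class (1 to 4) to the zero-indexed variable."""
    value = ssd_class - 1
    if value not in SSD_TYPES:
        raise ValueError(f"SSD class must be between 1 and 4, got {ssd_class}")
    return value
```

For example, ssd_value_from_class(3) gives 2, the value to enter for a surface whose SSD is measured from the centre of the slab.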
===========
AN IMPORTANT NOTE ON TRAINING/VALIDATION/TEST splits
===========
All PMFs in "AllPMFS" are considered fair game for model development: the scripts used to generate the datasets are greedy and will read these in if they are present. Anything that should be kept for sight-unseen testing should therefore not be added to this folder. As a failsafe, the scripts skip a PMF if either its surface or its chemical is not registered in the definitions files.
The training/validation splits used to optimise parameters and hyperparameters are generated from a cluster analysis, to try to ensure that the training set contains as wide a variety as possible. For generation of PMFs, the bootstrap ensemble is used, which employs a simple random split such that all surfaces and chemicals feature in the training set for a given model.
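
The failsafe described above can be pictured with the following minimal sketch, which keeps only PMFs whose surface and chemical are both registered. It assumes PMF names of the form materialname_chemicalname with the file extension already removed; the function name usable_pmfs and the example names are hypothetical, and the toolkit's own check may be implemented differently.

```python
def usable_pmfs(pmf_names, surfaces, chemicals):
    """Keep only PMFs whose surface and chemical are both registered.

    pmf_names: iterable of names of the form materialname_chemicalname.
    surfaces, chemicals: collections of registered IDs.
    """
    kept = []
    for name in pmf_names:
        parts = name.split("_")
        if len(parts) != 2:
            # Cannot be split unambiguously into a surface and a chemical.
            continue
        surface, chemical = parts
        if surface in surfaces and chemical in chemicals:
            kept.append(name)
    return kept


# Hypothetical IDs, used only to show the filtering behaviour.
registered_surfaces = {"SurfA"}
registered_chemicals = {"ChemX"}
candidates = ["SurfA_ChemX", "SurfB_ChemX", "SurfA_ChemY"]
print(usable_pmfs(candidates, registered_surfaces, registered_chemicals))
```

Only the first candidate survives, since the other two each refer to an unregistered surface or chemical. This is a safety net, not a substitute for keeping held-out PMFs outside the folder.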


===========
Workflow for generating PMFs for new surfaces
===========
1) Generate a set of combined coordinate files for all the atoms in the structure and place them in Structures/Surfaces. This requires co-ordinates, masses, charges, and LJ 6-12 coefficients for each atom present.
	- Suggested method: use CHARMM-GUI with GROMACS output enabled, then apply NPtoCSV.py after editing its list of targets to point to the folder downloaded from CHARMM-GUI.
2) Update Structures/SurfaceDefinitions.csv to register your new surface. Use the other materials as templates to choose appropriate definitions for the SSD type and source if the surface is for use in building a training set.
3) Run GenerateSurfacePotentials.py to build the free-energy probe and water tables describing all materials listed in the SurfaceDefinitions.csv table. This may take a significant amount of time: up to a few hours for a typical 30 x 30 x 15 slab.
4) Run HGExpandSurfacePotential.py to generate the expansion coefficients describing the materials.

The input structures should not contain unbound water, but water that is known to be irreversibly bound to the surface can be included.
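
Steps 3 and 4 above can be chained from a small driver script, sketched below. The script names come from the steps themselves; running them without arguments from the repository root is an assumption, and the helper names build_commands and run_steps are hypothetical.

```python
import subprocess
import sys

# Script names are taken from steps 3 and 4 above. Invoking them without
# arguments from the repository root is an assumption.
SURFACE_STEPS = ["GenerateSurfacePotentials.py", "HGExpandSurfacePotential.py"]


def build_commands(scripts, python=sys.executable):
    """Return one command list per script, in the order given."""
    return [[python, script] for script in scripts]


def run_steps(scripts, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in turn."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a script fails, so the
            # expansion step is never run on incomplete potentials.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_steps(SURFACE_STEPS, dry_run=True)  # set dry_run=False to execute
```

The same pattern applies to the chemical workflow by swapping in GenerateChemicalPotentials.py and HGExpandChemicalPotential.py.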

===========
Workflow for generating PMFs for new chemicals
===========
1) Update Structures/ChemicalDefinitions.csv to register your new chemical. This has the format ID,SMILES,formal charge,comment. The comment cannot include commas (because it's a CSV file) or hash-marks (#) but otherwise can contain arbitrary text, e.g. the force field and method used to generate the structure, a descriptive name, etc. If a comma is needed use "" and for # use "" in e.g. SMILES strings.
2) Generate a set of combined coordinate files for all the atoms in the structure and place them in Structures/Chemicals. This requires co-ordinates, masses, charges, and LJ 6-12 coefficients for each atom present.
	- Suggested method: use the ChemicalFromACPYPE.py script if you have ACPYPE installed. This scans over a list of target SMILES codes and IDs to build each structure and put it into the correct format. If your molecule has chirality that's not specified by the SMILES code, take care to check that it has been generated correctly. I would recommend appending "-AC" to the ID target to make it clear it was made by ACPYPE, but this is not mandatory.
	- "python3 ChemicalFromACPYPE.py -s 1" scans ChemicalDefinitions.csv for any chemicals without structure files and builds these using ACPYPE.
3) Run GenerateChemicalPotentials.py to generate the potentials.
4) Run HGExpandChemicalPotential.py to generate the table of expansion coefficients.
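
The row format from step 1 can be checked before editing ChemicalDefinitions.csv by hand. The sketch below builds one ID,SMILES,formal charge,comment line and rejects any field containing a literal comma or hash-mark. The function name format_chemical_row and the example ID are hypothetical, and the substitution tokens mentioned in step 1 are not handled here; any such replacement is assumed to have been applied already.

```python
def format_chemical_row(chem_id, smiles, formal_charge, comment=""):
    """Build one 'ID,SMILES,formal charge,comment' line for the definitions file.

    Raises ValueError if any field contains a literal comma or hash-mark,
    since neither may appear in the file.
    """
    fields = [chem_id, smiles, str(int(formal_charge)), comment]
    for field in fields:
        if "," in field or "#" in field:
            raise ValueError(f"field {field!r} contains a comma or hash-mark")
    return ",".join(fields)


# Hypothetical entry for ethanol, with the suggested "-AC" suffix on the ID.
print(format_chemical_row("ETHANOL-AC", "CCO", 0, "made with ACPYPE"))
```

A SMILES string with a triple bond, such as C#N, is rejected by this check until the hash-mark has been replaced as described in step 1.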

===========
Advice on naming surfaces and chemicals
===========
Use short but descriptive names. Standard characters are fine, with the exception of underscores: the scripts that read and write PMFs assume filenames of the form materialname_chemicalname, so an underscore in either name will cause errors.
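
To see why underscores cause trouble, consider this minimal sketch of a name check. The helper pmf_basename is hypothetical and only illustrates the materialname_chemicalname convention; the names used are made up.

```python
def pmf_basename(material, chemical):
    """Join names as materialname_chemicalname, refusing underscores in either."""
    for label, name in (("material", material), ("chemical", chemical)):
        if not name:
            raise ValueError(f"{label} name is empty")
        if "_" in name:
            raise ValueError(f"{label} name {name!r} contains an underscore")
    return f"{material}_{chemical}"


# A name with an underscore makes the combined filename ambiguous: splitting
# on "_" yields three parts, so the boundary between material and chemical
# cannot be recovered.
print("Surf_A_ChemX".split("_"))

# Names without underscores split back into exactly two parts.
print(pmf_basename("SurfA", "ChemX").split("_"))
```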
