https://github.com/choderalab/tautomer-data

A repo that includes all the data for the 'Fitting quantum machine learning potentials to experimental free energy data: Predicting tautomer ratios in solution' manuscript

https://github.com/choderalab/tautomer-data

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

A repo that includes all the data for the 'Fitting quantum machine learning potentials to experimental free energy data: Predicting tautomer ratios in solution' manuscript

Basic Info
  • Host: GitHub
  • Owner: choderalab
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 58.4 MB
Statistics
  • Stars: 1
  • Watchers: 5
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed over 5 years ago

https://github.com/choderalab/tautomer-data/blob/main/

# tautomer-data
A repo that includes all the data for the 'Fitting quantum machine learning potentials to experimental free energy data: Predicting tautomer ratios in solution' manuscript
[![DOI](https://zenodo.org/badge/341970689.svg)](https://zenodo.org/badge/latestdoi/341970689)

# Data

We provide data for the following preprint https://www.biorxiv.org/content/10.1101/2020.10.24.353318v4.
The results for all calculations are shown here:
https://github.com/choderalab/tautomer-data/blob/main/data/results.csv (all values in kcal/mol).
## Dataset

The original data used for this study was taken from https://github.com/WahlOya/Tautobase. 
A subset of this was used to calculate relative free energies using a DFT approach, the molecules 
are shown here:
https://github.com/choderalab/tautomer-data/blob/main/data/input/b3lyp_tautobase_subset.txt
The subset of the subset above used for the calculations with the neural net potential is shown here:
https://github.com/choderalab/tautomer-data/blob/main/data/input/ani_tautobase_subset.txt

The text file includes the molecule name used in the manuscript, the SMILES for both tautomeric forms and the experimental ddG (in kcal/mol).


## Quantum chemistry data

The full QM data is deposited in https://github.com/choderalab/tautomer-data/blob/main/data/calculated/QM.pickle .

The pickle file contains a single dictionary (of dictionaries), with the molecule names as keys.
```
r = pickle.load(open('QM.pickle', 'rb'))
r['SAMPLmol2']
```

returns:

```
{'OC1=CC=C2C=CC=CC2=N1': {'solv': [mol1, mol2, ...],
  'vac': [mol1, mol2, ...]},
 'O=C1NC2=C(C=CC=C2)C=C1': {'solv': [mol1, mol2, ...],
  'vac': [mol1, mol2, ...]}}
```
.

For a single system (e.g. `'SAMPLmol2'`) the two tautomer molecules are identified with the SMILES string (e.g. `'OC1=CC=C2C=CC=CC2=N1'`), and the envrionment (e.g. `'solv'`).
Using these three keys (e.g. `r['SAMPLmol2]['OC1=CC=C2C=CC=CC2=N1']['solv])` one gets a list or rdkit molecules, each in the optimized 3D conformation.
The molecule contains properties that can be acces via `.GetProp()`.
The relevant properties are:

`'G'` ... the gibbs free energy in the specified environment calculated with RRHO and B3LYP/aug-cc-pVTZ

`'E_B3LYP_pVTZ'` ... electronic energy evaluated on this conformation using B3LYP/aug-cc-pVTZ

`'E_B3LYP_631G_gas'` ... electronic energy evaluated on this conformation using B3LYP/6-31G(d)

`'E_B3LYP_631G_solv'` ... electronic energy evaluated on this conformation using B3LYP/6-31G(d)/SMD
`'H'` ... the enthalpy in the specified environment calculated with RRHO and B3LYP/aug-cc-pVTZ 
`'graph_automorphism'` ... the number of graph automorphism. This is used for the calculation of `RT ln(D)` where `D` is this number.


## ANI data

### RRHO ANI dataset

The raw data for this data set is saved in https://github.com/choderalab/tautomer-data/blob/main/data/calculated/ANI1ccx_RRHO.pickle.
This pickle file contains a dictionary that can be queried using the molecule names as key.
For each molecule there are additional keys: `'t1-energies'`, `'t2-energies'`, `'t1-confs'` and `'t2-confs'`.
t1 and t2 correspond to the naming of the tautomers in the original dataset.

``t1-energies`` contains a list with the gibbs free energies after energy minimization,  `'t1-confs'` contains a rdkit molecule with the conformations after minimization.


### Alchemical free energy dataset

Results for 5 independent runs are stored here: https://github.com/choderalab/tautomer-data/tree/main/data/calculated/.
Each of the 5 ANI1ccx_vacuum_rfe_results_in_kT_300snapshots_p*.csv files contain per line three values: the name of the tautomer system, ddG [kcal/mol] and dddG [kcal/mol].

### Optimization

The retraining results are located here (including the log file and best parameter set):
https://github.com/choderalab/tautomer-data/tree/main/data/optimization .
The results for the tautomer dataset per epoch is stored in 
https://github.com/choderalab/tautomer-data/blob/main/data/optimization/combined_results.pickle
 and the training/validation/test split in
 https://github.com/choderalab/tautomer-data/blob/main/data/optimization/training_validation_tests.pickle.
 

Owner

  • Name: Chodera lab // Memorial Sloan Kettering Cancer Center
  • Login: choderalab
  • Kind: organization
  • Email: john.chodera@choderalab.org
  • Location: Memorial Sloan-Kettering Cancer Center, Manhattan, NY

GitHub Events

Total
Last Year