https://github.com/dptech-corp/uni-dock-benchmarks
Uni-Dock-Benchmarks contains a curated collection of datasets and benchmarking tests for evaluating the performance and accuracy of the Uni-Dock docking system. This repository is intended for use in continuous integration testing and for researchers seeking to compare docking results with established benchmarks.
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com, science.org, rsc.org, acs.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: dptech-corp
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 2.14 GB
Statistics
- Stars: 12
- Watchers: 4
- Forks: 1
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
Uni-Dock-Benchmarks
The Uni-Dock-Benchmarks repository provides a collection of datasets for benchmarking the performance and accuracy of the Uni-Dock docking system. Each dataset includes prepared structures and ready-to-use input files for both Uni-Dock V1 and Uni-Dock V2.
Data
Benchmark data within the repository is categorized into two primary sections:
- molecular_docking
- virtual_screening
Molecular Docking Benchmarks
Under the molecular_docking directory, you will find several well-known benchmark datasets:
- Astex: Hartshorn, M. J., Verdonk, M. L., Chessari, G., Brewerton, S. C., Mooij, W. T., Mortenson, P. N., & Murray, C. W. (2007). Diverse, high-quality test set for the validation of protein-ligand docking performance. Journal of Medicinal Chemistry, 50(4), 726-741.
- CASF2016: Su, M., Yang, Q., Du, Y., Feng, G., Liu, Z., Li, Y., & Wang, R. (2018). Comparative assessment of scoring functions: the CASF-2016 update. Journal of Chemical Information and Modeling, 59(2), 895-913.
- PoseBuster: Buttenschoen, M., Morris, G. M., & Deane, C. M. (2024). PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chemical Science.
We performed the following preparation steps for the proteins and ligands in the datasets.
- After obtaining the protein structures from the RCSB database by PDB code, we retained the crystal waters that affect the binding mode and completed the missing protein side chains and hydrogen atoms.
- For ligands, we searched the RCSB database for the isomeric SMILES corresponding to the PDB code and determined the correct protonation state according to the receptor pocket environment. We then generated 3D conformations for each ligand.
After excluding systems with covalently bound ligands, systems with problematic binding mechanisms, and systems with large natural-product or polypeptide ligands, 69 systems from Astex, 271 systems from CASF-2016 and 396 systems from PoseBuster were used as benchmarks.
The correctness of the protein side chain structures and hydrogen bond networks has a crucial impact on ligand docking, so the preparation of both protein and ligand structures determines how difficult it is to produce correct docking poses. We use our internal tools to prepare the initial receptor and ligand structures in order to obtain better docking results. In addition, the open-source version of the structure preparation algorithm for Uni-Dock V2 is integrated into the unified protocol in the Uni-Dock V2 GitHub repository.
We prepare each receptor structure in two versions, one with the protein and co-crystallized water and one with the protein only, to test the overall effect of water on ligand docking experiments.
The directory structure for each dataset is as follows:
<DataSetName>
    <PDB_ID>
        <PDB_ID>_ligand.sdf                      # Ligand co-crystal structure processed in SDF format
        <PDB_ID>_protein_water_cleaned.pdb       # Prepared receptor structure with protein and crystallized water in PDB format
        <PDB_ID>_protein_cleaned.pdb             # Prepared receptor structure with only protein in PDB format
        ligand_prepared.sdf                      # Re-prepared ligand 3D conformation used in docking test in SDF format
        unidock1_protein                         # Folder for input files of Uni-Dock V1, with protein only in the receptor structure
            ligand_prepared_torsion_tree.sdf     # Prepared ligand structure with torsion tree information used in Uni-Dock V1 input in SDF format
            receptor.pdbqt                       # Prepared receptor structure used in Uni-Dock V1 input in PDBQT format
        unidock1_protein_water                   # Folder for input files of Uni-Dock V1, with protein and water in the receptor structure
            ligand_prepared_torsion_tree.sdf     # Prepared ligand structure with torsion tree information used in Uni-Dock V1 input in SDF format
            receptor.pdbqt                       # Prepared receptor structure used in Uni-Dock V1 input in PDBQT format
        unidock2_protein                         # Folder for input files of Uni-Dock V2, with protein only in the receptor structure
            <PDB_ID>_unidock2.json               # Integrated JSON input file for Uni-Dock V2 docking engine
            receptor_parameterized.dms           # Prepared and parameterized receptor structure in DMS format
        unidock2_protein_water                   # Folder for input files of Uni-Dock V2, with protein and water in the receptor structure
            <PDB_ID>_unidock2.json               # Integrated JSON input file for Uni-Dock V2 docking engine
            receptor_parameterized.dms           # Prepared and parameterized receptor structure in DMS format
    pdb_center.csv                               # CSV file recording the protein pocket center with respect to the <PDB_ID> for each system
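As a rough illustration of how this layout might be traversed, the sketch below gathers the Uni-Dock V1 input files for every system in one dataset folder. The helper name `collect_unidock1_inputs` and the keys of the returned dictionary are illustrative assumptions and are not part of the repository; only the folder and file names are taken from the layout above.

```python
from pathlib import Path


def collect_unidock1_inputs(dataset_dir, with_water=False):
    """Map each <PDB_ID> to its Uni-Dock V1 receptor and ligand input paths.

    Hypothetical helper: only the folder and file names follow the layout
    described above. Systems missing either input file are skipped.
    """
    folder = "unidock1_protein_water" if with_water else "unidock1_protein"
    inputs = {}
    for system_dir in sorted(Path(dataset_dir).iterdir()):
        if not system_dir.is_dir():
            continue  # skips dataset-level files such as pdb_center.csv
        receptor = system_dir / folder / "receptor.pdbqt"
        ligand = system_dir / folder / "ligand_prepared_torsion_tree.sdf"
        if receptor.is_file() and ligand.is_file():
            inputs[system_dir.name] = {"receptor": receptor, "ligand": ligand}
    return inputs
```

Calling the helper on a dataset folder (for example one of the folders under molecular_docking) returns one entry per system, which can then be fed to whatever docking driver is being tested.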
Virtual Screening Benchmarks
Under the virtual_screening directory, you will find several meticulously selected benchmark datasets:
- D4: Lyu, J., Wang, S., Balius, T. E., Singh, I., Levit, A., Moroz, Y. S., ... & Irwin, J. J. (2019). Ultra-large library docking for discovering new chemotypes. Nature, 566(7743), 224-229.
- GBA: Tran-Nguyen, V. K., Jacquemard, C., & Rognan, D. (2020). LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of Chemical Information and Modeling, 60(9), 4263-4273.
- NSP3: Schuller, M., Correy, G. J., Gahbauer, S., Fearon, D., Wu, T., Díaz, R. E., ... & Ahel, I. (2021). Fragment binding to the Nsp3 macrodomain of SARS-CoV-2 identified through crystallographic screening and computational docking. Science Advances, 7(16), eabf8711.
- PPARG: Tran-Nguyen, V. K., Jacquemard, C., & Rognan, D. (2020). LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of Chemical Information and Modeling, 60(9), 4263-4273.
- sigma2: Alon, A., Lyu, J., Braz, J. M., Tummino, T. A., Craik, V., O'Meara, M. J., ... & Kruse, A. C. (2021). Structures of the σ2 receptor enable docking for bioactive ligand discovery. Nature, 600(7890), 759-764.
The following table summarizes the statistics of the datasets:
| Dataset | PDB ID | N_Actives | N_Inactives | N_Total |
|----|----|----|----|----|
| D4 | 5WIU | 226 | 598 | 824 |
| GBA | 5LVX | 286 | 458,205 | 458,491 |
| NSP3 | 5RS7 | 65 | 3,515 | 3,580 |
| PPARG | 5Y2T | 29 | 7,292 | 7,321 |
| sigma2 | 7M94 | 228 | 596 | 824 |
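The totals in the table are the sum of actives and inactives, and the datasets differ widely in how imbalanced they are. The short sketch below recomputes the totals and the fraction of actives from the counts in the table; the names `DATASETS` and `active_fraction` are illustrative only.

```python
# Counts copied from the statistics table: (actives, inactives) per dataset.
DATASETS = {
    "D4": (226, 598),
    "GBA": (286, 458_205),
    "NSP3": (65, 3_515),
    "PPARG": (29, 7_292),
    "sigma2": (228, 596),
}


def active_fraction(n_actives, n_inactives):
    """Fraction of the screening library that is labeled active."""
    return n_actives / (n_actives + n_inactives)


for name, (n_act, n_inact) in DATASETS.items():
    total = n_act + n_inact
    frac = active_fraction(n_act, n_inact)
    print(f"{name:7s} total={total:>9,d} active fraction={frac:.3%}")
```

The active fraction is roughly 27% for D4 but well under 0.1% for GBA, which is worth keeping in mind when comparing screening metrics across datasets.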
The directory structure for each dataset is as follows:
<DataSetName>
    docking_grid.json                            # JSON file recording the protein pocket center and the box sizes
    <PDB_ID>_receptor.pdb                        # Original unprocessed receptor structure in PDB format
    <PDB_ID>_protein_cleaned.pdb                 # Prepared receptor structure with only protein in PDB format
    actives_cleaned.sdf                          # Preprocessed and cleaned active molecules in SDF format
    actives.sdf                                  # Active molecules in SDF format
    inactives_cleaned.sdf                        # Preprocessed and cleaned inactive molecules in SDF format
    inactives.sdf                                # Inactive molecules in SDF format
    unidock1_protein                             # Folder for input files of Uni-Dock V1, with protein only in the receptor structure
        actives_prepared_torsion_tree.sdf        # Prepared active molecule structure with torsion tree information used in Uni-Dock V1 input in SDF format
        inactives_prepared_torsion_tree.sdf      # Prepared inactive molecule structure with torsion tree information used in Uni-Dock V1 input in SDF format
        receptor.pdbqt                           # Prepared receptor structure used in Uni-Dock V1 input in PDBQT format
    unidock2_protein                             # Folder for input files of Uni-Dock V2, with protein only in the receptor structure
        actives_unidock2.json                    # Integrated JSON input file of active molecules for Uni-Dock V2 docking engine
        inactives_unidock2.json                  # Integrated JSON input file of inactive molecules for Uni-Dock V2 docking engine
        receptor_parameterized.dms               # Prepared and parameterized receptor structure in DMS format
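One simple sanity check on a downloaded dataset is to count the molecules in the SDF files and compare them with the statistics table. The sketch below relies only on the SDF convention that each record ends with a `$$$$` line; the function name `count_sdf_records` is an illustrative assumption, and the cleaned or prepared files may not contain exactly the same number of molecules as the raw ones.

```python
def count_sdf_records(sdf_path):
    """Count molecules in an SDF file by its '$$$$' record terminators."""
    count = 0
    # Stream line by line so that very large files (e.g. the GBA inactives)
    # do not need to fit in memory.
    with open(sdf_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("$$$$"):
                count += 1
    return count
```

For example, `count_sdf_records("actives.sdf")` run inside a dataset folder gives the number of active molecules stored in that file.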
Important Note
Due to the substantial number of inactive molecules in the GBA dataset, the directory contains several large files that exceed GitHub's size limits. These files have been moved to cloud storage. To obtain the complete GBA directory, please run the following command in your terminal:
```sh
./getGBA.sh
```
Scripts
Scripts are provided to run the tests directly against a Uni-Dock executable binary. For example, to run the molecular docking test with the Uni-Dock2 engine binary:
```sh
python scripts/run_tests.py --version 2 --bin ud2 --type molecular_docking --nowater --device 1 --savedir my_res --seed 121
```
Please read scripts/run_tests.py for complete argument documentation. Key parameters include:
- --version <1|2>: For example, use --version 1 if you are testing Uni-Dock 1 (https://github.com/dptech-corp/Uni-Dock).
  NOTE: Specifying --version 2 automatically loads the file scripts/ud2.yaml as the default configuration.
- --bin <PATH>: Path of the tested executable binary.
- --type <molecular_docking|virtual_screening>
- --nowater: Select the receptor without water. (Default: uses the water-containing receptor)
ATTENTION: Since docking searches are never exhaustive, results can vary between runs. We recommend:
1. Repeating tests with multiple random seeds (--seed <INTEGER>)
2. Averaging results across repetitions
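The recommendation above could be scripted along the following lines. This is only a sketch: the flags are the ones shown in the example command, but the helper names `build_commands` and `summarize`, the per-seed output folder naming, and the idea of passing in one metric value per seed are assumptions. How a metric is read back from the save directory depends on the output of scripts/run_tests.py, which is not described here.

```python
import statistics
import sys


def build_commands(seeds, binary="ud2", savedir_prefix="my_res"):
    """Build one run_tests.py command line per seed (nothing is executed here)."""
    return [
        [
            sys.executable, "scripts/run_tests.py",
            "--version", "2",
            "--bin", binary,
            "--type", "molecular_docking",
            "--nowater",
            "--device", "1",
            # Assumed naming scheme: a separate output folder per seed.
            "--savedir", f"{savedir_prefix}_seed{seed}",
            "--seed", str(seed),
        ]
        for seed in seeds
    ]


def summarize(per_seed_values):
    """Return the mean and sample standard deviation of a metric across seeds."""
    mean = statistics.mean(per_seed_values)
    spread = statistics.stdev(per_seed_values) if len(per_seed_values) > 1 else 0.0
    return mean, spread
```

Each command list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Once a metric has been collected for every seed, `summarize` reports its average together with the spread between repetitions.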
Owner
- Name: DP Technology
- Login: dptech-corp
- Kind: organization
- Location: China
- Website: https://www.dp.tech/en
- Repositories: 9
- Profile: https://github.com/dptech-corp
GitHub Events
Total
- Create event: 4
- Release event: 1
- Issues event: 1
- Watch event: 13
- Delete event: 4
- Push event: 15
- Pull request event: 6
- Fork event: 1
Last Year
- Create event: 4
- Release event: 1
- Issues event: 1
- Watch event: 13
- Delete event: 4
- Push event: 15
- Pull request event: 6
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jslee-hits (1)
Pull Request Authors
- dp-yuanyn (4)
- kongexp (4)
- zhengh96 (2)
- caic99 (1)
- Hong-Rui (1)