OpenFEPOPS

OpenFEPOPS: A Python implementation of the FEPOPS molecular similarity technique - Published in JOSS (2023)

https://github.com/justinykc/fepops

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: acs.org, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 62% confidence
Chemistry Physical Sciences - 46% confidence
Last synced: 4 months ago

Repository

Python implementation of the FEPOPS molecular descriptors

Basic Info
  • Host: GitHub
  • Owner: JustinYKC
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 17.1 MB
Statistics
  • Stars: 7
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created about 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

OpenFEPOPS (FEature POint PharmacophoreS)


OpenFEPOPS is an open-source Python implementation of the FEPOPS molecular similarity technique, enabling descriptor generation, comparison, and ranking of molecules in virtual screening campaigns. The central idea behind FEPOPS is to reduce the complexity of molecules by merging local atomic environments and atom properties into ‘feature points’. This compressed feature point representation has been used to great effect, as noted in the literature, helping researchers identify active and potentially therapeutically valuable small molecules. This implementation was recreated following the original paper: https://pubs.acs.org/doi/10.1021/jm049654z. By default, OpenFEPOPS uses literature-reported parameters which show good performance in retrieval of active lead- and drug-like small molecules within virtual screening campaigns, with feature points capturing charge, lipophilicity, and hydrogen bond acceptor and donor status. When run with default parameters, OpenFEPOPS represents each molecule using seven FEPOPS descriptors, each built from four feature points and encoded as 22 numeric values, resulting in a compact representation of 616 bytes per molecule. By extension, this allows the indexing of a compound archive containing 1 million small molecules using about 587.5 MiB (616 MB) of data. Whilst more compact representations are readily available, the FEPOPS technique strives to capture tautomer and conformer information, first through enumeration and then through diversity-driven selection of representative FEPOPS descriptors to capture the diverse states that a molecule may adopt.
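The storage figures follow from simple arithmetic. The snippet below is a minimal sketch, assuming the defaults described under Descriptor generation (seven FEPOPS per molecule, 22 values per FEPOP, 32-bit floats); the variable names are illustrative and not part of the OpenFEPOPS API.

```python
# Storage footprint of default OpenFEPOPS descriptors (illustrative arithmetic only).
FEPOPS_PER_MOLECULE = 7   # diverse FEPOPS kept per molecule (default)
VALUES_PER_FEPOP = 22     # 4 points x 4 properties + 6 pairwise distances
BYTES_PER_VALUE = 4       # 32-bit floating point number

bytes_per_molecule = FEPOPS_PER_MOLECULE * VALUES_PER_FEPOP * BYTES_PER_VALUE
archive_bytes = 1_000_000 * bytes_per_molecule
archive_mib = archive_bytes / 2**20  # bytes -> mebibytes

print(bytes_per_molecule)     # 616
print(round(archive_mib, 1))  # 587.5
```

The 587.5 figure is therefore in binary megabytes (MiB); in decimal units the same archive occupies 616 MB.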

Statement of need

At the time of writing, OpenFEPOPS is the only publicly available implementation of the FEPOPS molecular similarity technique. Whilst used within industry and referenced extensively in the literature, the technique has been unavailable to researchers as an open-source tool. It is hoped that this release will allow the technique to be used not only for traditional small molecule molecular similarity, but also in emerging fields such as protein design and featurization of small- and macro-molecules for both predictive and generative tasks. We welcome contributions and collaborative efforts to enhance and expand OpenFEPOPS using the associated GitHub repository.

Benchmarking

OpenFEPOPS has been benchmarked alongside some commonly used fingerprints against the DUDE diversity set and scores as follows:

| Target | Morgan 2 | MACCS | RDKit | OpenFEPOPS |
|-------:|---------:|------:|------:|-----------:|
| akt1   | 0.836    | 0.741 | 0.833 | 0.831      |
| ampc   | 0.784    | 0.673 | 0.660 | 0.639      |
| cp3a4  | 0.603    | 0.582 | 0.613 | 0.647      |
| cxcr4  | 0.697    | 0.854 | 0.592 | 0.899      |
| gcr    | 0.670    | 0.666 | 0.708 | 0.616      |
| hivpr  | 0.780    | 0.681 | 0.759 | 0.678      |
| hivrt  | 0.651    | 0.670 | 0.660 | 0.582      |
| kif11  | 0.763    | 0.668 | 0.672 | 0.713      |

Average AUROC scores by target and molecular similarity technique for the DUDE diversity set. Across all datasets, 19 small molecules out of 112,796 were excluded from analysis, mainly due to issues in parsing to valid structures using RDKit. See 'ExploreDUDEdiversity_set.ipynb' for further information.
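For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen active is ranked above a randomly chosen decoy. The following is a minimal pure-Python sketch of that definition; the function name and the scores are hypothetical and are not taken from the benchmark notebook.

```python
def auroc(active_scores, decoy_scores):
    """Fraction of (active, decoy) pairs where the active scores higher; ties count 0.5."""
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical similarity scores to one query molecule (not real benchmark data).
actives = [0.9, 0.8, 0.4]
decoys = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 active/decoy pairs are ranked correctly.
print(auroc(actives, decoys))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from decoys.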

Method description

Whilst OpenFEPOPS includes functionality for descriptor caching and profiling of libraries, the core functionality of the package is descriptor generation and scoring.

Descriptor generation

  1. Tautomer enumeration
    • For a given small molecule, OpenFEPOPS uses RDKit to iterate over molecular tautomers. By default, there is no limit to the number of recoverable tautomers but a limit may be imposed which may be necessary if adapting the OpenFEPOPS code to large macromolecules and not just small molecules.
  2. Conformer enumeration
    • For each tautomer, up to 1024 conformers are sampled, either by complete enumeration of rotatable bond states (at the literature-reported optimum increment of 90 degrees) if there are five or fewer rotatable bonds, or by random sampling of 1024 possible states if there are more than five rotatable bonds.
  3. Defining feature points
    • The KMeans algorithm is applied to each conformer of each tautomer to identify four (by default) representative or central points, into which the atomic information of neighbouring atoms is collapsed. As standard, the atomic properties of charge, logP, hydrogen bond donor, and hydrogen bond acceptor status are collapsed into four feature points per unique tautomer conformation. The RDKit package is used to calculate these properties, with the iterative Gasteiger charges algorithm applied to assign atomic charges, the Crippen method used to assign atomic logP contributions, and hydrogen bond acceptors and donors identified with appropriate SMARTS substructure queries. These feature points are encoded to 22 numeric values (a FEPOP) comprising four points, each with four properties, and six pairwise distances between these points. With many FEPOPS descriptors collected from a single molecule through tautomer and conformer enumeration, this set of representative FEPOPS should capture every possible state of the original molecule.
  4. Selection of diverse FEPOPS
    • From the collection of FEPOPS derived from every tautomer conformation of a molecule, the K-Medoid algorithm is applied to identify seven (by default) diverse FEPOPS which are thought to best capture a fuzzy representation of the molecule. These seven FEPOPS comprise 22 descriptors each, totalling 154 32-bit floating point numbers or 616 bytes.
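The conformer enumeration rule in step 2 can be sketched as follows. With a 90 degree increment each rotatable bond has four states, so five bonds give 4^5 = 1024 combinations, which is why complete enumeration stays within the cap. This is an illustration only: the function name is hypothetical, and drawing random states independently (so duplicates are possible) is an assumption rather than a description of the OpenFEPOPS code.

```python
import itertools
import random

ANGLE_INCREMENT = 90   # degrees, the literature-reported optimum
MAX_CONFORMERS = 1024
ANGLES = list(range(0, 360, ANGLE_INCREMENT))  # [0, 90, 180, 270]


def dihedral_states(n_rotatable_bonds, seed=0):
    """Return dihedral-angle tuples, one angle per rotatable bond."""
    if n_rotatable_bonds <= 5:
        # 4**5 == 1024, so complete enumeration never exceeds the cap.
        return list(itertools.product(ANGLES, repeat=n_rotatable_bonds))
    rng = random.Random(seed)
    # Assumption: states are sampled independently, so repeats may occur.
    return [
        tuple(rng.choice(ANGLES) for _ in range(n_rotatable_bonds))
        for _ in range(MAX_CONFORMERS)
    ]


print(len(dihedral_states(3)))  # 64
print(len(dihedral_states(5)))  # 1024
print(len(dihedral_states(8)))  # 1024
```

Each tuple would then be applied to the rotatable bonds of a tautomer to build one conformer before feature points are computed.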

Figure: OpenFEPOPS descriptor generation showing the capture of tautomer and conformer information from a single input molecule.

Descriptor generation with OpenFEPOPS is a compute-intensive task and, as noted in the literature, is designed to be run in situations where large compound archives have had their descriptors pre-generated and are queried against relatively small numbers of new molecules for which descriptors are not yet known and must be generated. To enable use in this manner, OpenFEPOPS provides functionality to cache descriptors through specification of database files, in either the SQLite or JSON format.
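The caching idea can be illustrated with the standard-library `sqlite3` module. The sketch below is not the OpenFEPOPS storage layer: the table layout, the function names, and the stand-in descriptor generator are all assumptions made for illustration.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a descriptor cache keyed by SMILES string."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptors (smiles TEXT PRIMARY KEY, fepops TEXT)"
    )
    return conn


def get_or_compute(conn, smiles, compute):
    """Return cached descriptors, computing and storing them on a cache miss."""
    row = conn.execute(
        "SELECT fepops FROM descriptors WHERE smiles = ?", (smiles,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    fepops = compute(smiles)  # the expensive step
    conn.execute("INSERT INTO descriptors VALUES (?, ?)", (smiles, json.dumps(fepops)))
    conn.commit()
    return fepops


calls = []


def fake_generator(smiles):
    """Stand-in for descriptor generation: returns 7 x 22 zeros."""
    calls.append(smiles)
    return [[0.0] * 22 for _ in range(7)]


conn = open_cache()
first = get_or_compute(conn, "CCO", fake_generator)
second = get_or_compute(conn, "CCO", fake_generator)  # served from the cache
print(len(calls))  # 1
```

The second lookup never invokes the generator, which is the behaviour that makes pre-generated archives cheap to query.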

Scoring and comparison of molecules based on their molecular descriptors

  1. Sorting
    • With seven (by default) diverse FEPOPS representing a small molecule, the FEPOPS are sorted by ascending charge.
  2. Scaling
    • Due to the different scales and distributions of features comprising FEPOPS descriptors, each FEPOP is centered and scaled according to the observed means and standard deviations of the same features within a larger pool of molecules. By default, these means and standard deviations have been derived from the DUDE diversity set, which captures known actives and decoys for a diverse set of therapeutic targets.
  3. Scoring
    • The Pearson correlation coefficient is calculated for the scaled descriptors of the first molecule to the scaled descriptors of the second.
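The scaling and scoring steps can be sketched in a few lines of pure Python. This is a simplified illustration for a single pair of descriptor vectors: the sorting step is omitted, the four-value vectors are shortened stand-ins for 22-value FEPOPS, and the means and standard deviations are made up rather than derived from the DUDE diversity set.

```python
import math


def scale(values, means, stds):
    """Centre and scale each feature by a reference mean and standard deviation."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical reference statistics and descriptors (illustration only).
means = [0.0, 1.0, 0.5, 0.5]
stds = [0.5, 1.0, 0.5, 0.5]
mol_a = [-0.5, 1.3, 1.0, 1.0]
mol_b = [-0.4, 1.1, 1.0, 0.0]

scaled_a = scale(mol_a, means, stds)
scaled_b = scale(mol_b, means, stds)
score = pearson(scaled_a, scaled_b)
print(score)
```

Because the score is a correlation, it lies between -1 and 1, and a descriptor compared with itself scores 1.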

The literature highlights that the choice of the Pearson correlation coefficient leads to high background scores, as it is highly unlikely to see little correlation between any two molecules due to fundamental limitations of chemistry and geometry. Therefore, unrelated molecules are likely to have FEPOPS similarity scores higher than those encountered with more traditional techniques such as bitstring fingerprints and Tanimoto or Dice similarity measures.

Installation

OpenFEPOPS requires Python >=3.9 and may be installed using pip:

```console
pip install fepops
```

Alternatively, OpenFEPOPS can be installed from source:

```console
git clone https://github.com/JustinYKC/FEPOPS.git
cd FEPOPS
pip install .
```

Or via conda:

```console
conda env create -f environment.yml
```

Requirements for developing and contributing to OpenFEPOPS can be installed from source:

```console
git clone https://github.com/JustinYKC/FEPOPS.git
cd FEPOPS
pip install .[dev]
```

Testing

Tests may be run using the pytest framework. With the development dependencies installed (see above), simply running:

```console
pytest
```

will run the built in tests.

Requirements

OpenFEPOPS requires the following packages:

- rdkit (>=2021.9.5.1)
- numpy (>=1.22.4)
- pandas (>=1.5.0)
- scikit-learn (>=1.3.x)
- scipy (>=1.7.x)
- torch (>=1.7.1)
- fast-pytorch-kmeans (>=0.1.9)
- fire (>=0.5.x)
- tqdm (>=4.48.0)

Usage

OpenFEPOPS has been uploaded to the Python Package Index under the name 'fepops' and as such is installable using the pip package manager and the command 'pip install fepops'. With the package installed, entrypoints are used to expose commonly used OpenFEPOPS tasks such as descriptor generation and calculation of molecular similarity, enabling simple command line access without the need to explicitly invoke a Python interpreter. Whilst OpenFEPOPS may be used solely via the command line interface, an API is also available and may be used within other programs or integrated into existing pipelines to enable more complex workflows. API documentation is available at https://justinykc.github.io/FEPOPS.

Command line usage

With OpenFEPOPS installed to a Python environment, entrypoints in the code and a command line interface control object allow omission of the python interpreter program when calling common OpenFEPOPS tasks. This allows command line usage of the form:

```console
fepops <subcommand> <arguments>
```

The sub-commands for carrying out common tasks are:

1. calc_sim - Calculates the molecular similarity of two molecules supplied as SMILES strings.
2. get_fepops - Calculates and prints the molecular descriptors of a molecule supplied as a SMILES string.
3. save_descriptors - To be used in conjunction with the --database_file switch (see below), which enables writing of generated descriptors to a SQLite or JSON database/file cache. A list of SMILES, or the location of a SMILES file, may be passed here. If passing a single molecule, enclose the SMILES in square brackets so that it is passed as a list containing one item.

A database or cache file may be supplied for use with all subcommands using the --database_file switch before the subcommand as follows:

```console
fepops --database_file=<DB_file_path> <subcommand> <arguments>
```

Depending on the supplied file extension of <DB_file_path>, either a SQLite file (".sqlite", ".sqlite3", ".db", ".db3", ".s3db", or ".sl3" file extensions) is created or loaded for the appending of data, or a JSON file is used (when the extension is ".json").
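The extension-based choice of backend can be pictured as a small lookup. The function below is a hypothetical sketch, not the OpenFEPOPS code: its name, the case-insensitive matching, and the error raised for unknown extensions are all assumptions.

```python
from pathlib import Path

# Extensions listed in the documentation as selecting the SQLite backend.
SQLITE_EXTENSIONS = {".sqlite", ".sqlite3", ".db", ".db3", ".s3db", ".sl3"}


def backend_for(path):
    """Pick a storage backend name from a database file's extension."""
    suffix = Path(path).suffix.lower()  # assumption: matching ignores case
    if suffix in SQLITE_EXTENSIONS:
        return "sqlite"
    if suffix == ".json":
        return "json"
    raise ValueError(f"Unsupported database file extension: {suffix!r}")


print(backend_for("archive.db"))  # sqlite
print(backend_for("cache.json"))  # json
```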

Additionally, a file to which JSON formatted output should be written can be supplied with the --json_file flag when using the calc_sim and get_fepops commands as follows:

```console
fepops --database_file=<DB_file_path> --json_file=<JSON_file_path> get_fepops <arguments>
fepops --database_file=<DB_file_path> --json_file=<JSON_file_path> calc_sim <arguments>
```

where <JSON_file_path> is the path to where JSON formatted output should be written.

Running OpenFEPOPS with the -h switch prints help to the terminal and it may also be used after sub-command switches to inspect required arguments.

Example: Calculating molecular similarity between two molecules (ibuprofen and diclofenac)

```console
fepops calc_sim "CC(Cc1ccc(cc1)C(C(=O)O)C)C" "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl"
```

Note the use of quotes around the SMILES strings. This is required as BASH and other shells will try to parse the brackets often present in SMILES to denote branching.
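When such a command line is assembled from a script rather than typed by hand, the standard-library `shlex.quote` function applies the necessary quoting automatically. The snippet below only builds and prints the command string; it does not run OpenFEPOPS.

```python
import shlex

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"
diclofenac = "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl"

# shlex.quote wraps strings containing shell metacharacters such as
# parentheses in single quotes, so the shell passes them through untouched.
command = " ".join(
    ["fepops", "calc_sim", shlex.quote(ibuprofen), shlex.quote(diclofenac)]
)
print(command)

# Splitting the command the way a POSIX shell would recovers the original SMILES.
arguments = shlex.split(command)
```

Passing the arguments as a list to `subprocess.run` avoids the shell altogether and makes quoting unnecessary.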

Output may be directed to a JSON file as follows:

```console
fepops --json_file="example_fepops_score.json" calc_sim "CC(Cc1ccc(cc1)C(C(=O)O)C)C" "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl"
```

The file 'example_fepops_score.json' then contains output of the form:

```javascript
{
  "SMI1": "CC(Cc1ccc(cc1)C(C(=O)O)C)C",
  "SMI2": "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl",
  "FEPOPS Similarity Score": 0.14135371590471596
}
```
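Output of this form can be consumed with Python's standard `json` module. The sketch below parses the example text from a string so that it is self-contained; in practice the same call would be made on the contents of the JSON file.

```python
import json

# Example calc_sim output, copied from the documentation above.
raw = """
{
  "SMI1": "CC(Cc1ccc(cc1)C(C(=O)O)C)C",
  "SMI2": "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl",
  "FEPOPS Similarity Score": 0.14135371590471596
}
"""

result = json.loads(raw)
score = result["FEPOPS Similarity Score"]
print(f"{result['SMI1']} vs {result['SMI2']}: {score:.3f}")
```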

Example: Get the FEPOPS descriptors for diclofenac

```console
fepops get_fepops "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl"
```

Again, output may be directed to a JSON file as follows:

```console
fepops --json_file="example_diclofenac_fepops_descriptors.json" get_fepops "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl"
```

The file 'example_diclofenac_fepops_descriptors.json' then contains output of the form below. For brevity, output is truncated after the first of 7 FEPOPS, with the truncation marked by TRUNCATION:

```javascript
{
  "SMILES": "OC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl",
  "FepopStatusCode": "SUCCESS",
  "Fepops": [
    [
      -0.5684914743317748, 1.3061, 1.0, 1.0,
      0.02878293161743492, 0.8432999999999999, 0.0, 1.0,
      0.1949443985093149, 1.0734, 0.0, 1.0,
      0.3447641442050251, 1.1413, 0.0, 1.0,
      2.52506114069371, 4.753703619492762, 3.216333750680169,
      4.563318239587458, 2.9657848574016255, 7.096648170644156
    ],
    TRUNCATION
}
```
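The 22 values of a FEPOP can be split back into its four feature points and six pairwise distances. The sketch below assumes the layout suggested by the example output (four consecutive groups of four properties, followed by six distances); this ordering, and the interpretation of each property column, are inferences from the printed values and should be checked against the API documentation.

```python
# First FEPOP of diclofenac, copied from the example output above.
fepop = [
    -0.5684914743317748, 1.3061, 1.0, 1.0,
    0.02878293161743492, 0.8432999999999999, 0.0, 1.0,
    0.1949443985093149, 1.0734, 0.0, 1.0,
    0.3447641442050251, 1.1413, 0.0, 1.0,
    2.52506114069371, 4.753703619492762, 3.216333750680169,
    4.563318239587458, 2.9657848574016255, 7.096648170644156,
]

N_POINTS = 4
N_PROPERTIES = 4

# Assumed layout: point properties first, pairwise distances last.
points = [
    fepop[i * N_PROPERTIES:(i + 1) * N_PROPERTIES] for i in range(N_POINTS)
]
distances = fepop[N_POINTS * N_PROPERTIES:]

for index, point in enumerate(points):
    print(f"feature point {index}: {point}")
print(f"pairwise distances: {distances}")
```

Four points give 4 x 3 / 2 = 6 unique pairs, which accounts for the six trailing values.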

Example: pregenerating descriptors for an in-house compound archive

With a SMILES file called 'in_house_compounds.smi', we may pre-generate the FEPOPS descriptors of its molecules for faster use and comparison against new molecules.

```console
fepops --database_file=in_house_compounds.db save_descriptors in_house_compounds.smi
```

Owner

  • Name: Yan-Kai Chen
  • Login: JustinYKC
  • Kind: user
  • Location: Oxford

PhD candidate in Cheminformatics, Data science and Drug discovery at the University of Edinburgh

JOSS Publication

OpenFEPOPS: A Python implementation of the FEPOPS molecular similarity technique
Published
November 09, 2023
Volume 8, Issue 91, Page 5763
Authors
Yan-Kai Chen ORCID
School of Biological Sciences, University of Edinburgh, The King’s Buildings, Max Born Crescent, CH Waddington Building, Edinburgh, EH9 3BF, United Kingdom.
Douglas R. Houston ORCID
School of Biological Sciences, University of Edinburgh, The King’s Buildings, Max Born Crescent, CH Waddington Building, Edinburgh, EH9 3BF, United Kingdom.
Manfred Auer ORCID
School of Biological Sciences, University of Edinburgh, The King’s Buildings, Max Born Crescent, CH Waddington Building, Edinburgh, EH9 3BF, United Kingdom., Xenobe Research Institute, P. O. Box 3052, San Diego, California, 92163, United States.
Steven Shave ORCID
School of Biological Sciences, University of Edinburgh, The King’s Buildings, Max Born Crescent, CH Waddington Building, Edinburgh, EH9 3BF, United Kingdom.
Editor
Richard Gowers ORCID
Tags
molecular similarity virtual screening pharmacophores feature points

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Chen
  given-names: Yan-Kai
  orcid: "https://orcid.org/0000-0001-7161-9503"
- family-names: Houston
  given-names: Douglas R.
  orcid: "https://orcid.org/0000-0002-3469-1546"
- family-names: Auer
  given-names: Manfred
  orcid: "https://orcid.org/0000-0001-8920-3522"
- family-names: Shave
  given-names: Steven
  orcid: "https://orcid.org/0000-0001-6996-3663"
contact:
- family-names: Shave
  given-names: Steven
  orcid: "https://orcid.org/0000-0001-6996-3663"
doi: 10.6084/m9.figshare.24477184.v1
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Chen
    given-names: Yan-Kai
    orcid: "https://orcid.org/0000-0001-7161-9503"
  - family-names: Houston
    given-names: Douglas R.
    orcid: "https://orcid.org/0000-0002-3469-1546"
  - family-names: Auer
    given-names: Manfred
    orcid: "https://orcid.org/0000-0001-8920-3522"
  - family-names: Shave
    given-names: Steven
    orcid: "https://orcid.org/0000-0001-6996-3663"
  date-published: 2023-11-09
  doi: 10.21105/joss.05763
  issn: 2475-9066
  issue: 91
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 5763
  title: "OpenFEPOPS: A Python implementation of the FEPOPS molecular
    similarity technique"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.05763"
  volume: 8
title: "OpenFEPOPS: A Python implementation of the FEPOPS molecular
  similarity technique"

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 225
  • Total Committers: 2
  • Avg Commits per committer: 112.5
  • Development Distribution Score (DDS): 0.391
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Steven Shave s****e@g****m 137
justin_maai1 j****4@g****m 88

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 16
  • Total pull requests: 9
  • Average time to close issues: 15 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 1.88
  • Average comments per pull request: 0.22
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • exs-cbouy (8)
  • stevenshave (3)
  • hannahbaumann (3)
  • JustinYKC (1)
Pull Request Authors
  • JustinYKC (7)
  • stevenshave (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 21 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 11
  • Total maintainers: 1
pypi.org: fepops

OpenFEPOPS; A Python implementation of the FEPOPS molecular similarity method

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 21 Last month
Rankings
Dependent packages count: 7.0%
Average: 26.9%
Dependent repos count: 30.4%
Forks count: 30.6%
Stargazers count: 39.5%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/ci.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/documentation.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • peaceiris/actions-gh-pages v3 composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
pyproject.toml pypi
  • fast-pytorch-kmeans >=0.1.9
  • fire >=0.5.0
  • numpy >=1.22.4
  • pandas >=1.5.0
  • rdkit >=2022.3.1
  • scikit-learn >=1.3.0
  • scipy >=1.7.0
  • torch *
  • tqdm >=4.48.0
requirements.txt pypi
  • fast-pytorch-kmeans ==0.1.9
  • fire ==0.5.0
  • numpy ==1.22.4
  • pandas ==1.5.0
  • rdkit ==2022.3.1
  • scikit-learn ==1.3.0
  • scipy ==1.7.0
  • torch ==1.7.1
  • tqdm ==4.48.0
setup.py pypi
environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 4.5
  • blas 1.0
  • boost 1.78.0
  • boost-cpp 1.78.0
  • bottleneck 1.3.5
  • brotli 1.0.9
  • brotli-bin 1.0.9
  • bzip2 1.0.8
  • ca-certificates 2023.5.7
  • cairo 1.16.0
  • certifi 2023.5.7
  • contourpy 1.0.7
  • cycler 0.11.0
  • expat 2.5.0
  • fontconfig 2.14.1
  • fonttools 4.39.4
  • freetype 2.12.1
  • gettext 0.21.1
  • glib 2.76.3
  • glib-tools 2.76.3
  • greenlet 2.0.2
  • icu 72.1
  • intel-openmp 2023.1.0
  • kiwisolver 1.4.4
  • lcms2 2.15
  • ld_impl_linux-64 2.38
  • lerc 4.0.0
  • libbrotlicommon 1.0.9
  • libbrotlidec 1.0.9
  • libbrotlienc 1.0.9
  • libdeflate 1.18
  • libexpat 2.5.0
  • libffi 3.4.4
  • libgcc-ng 12.2.0
  • libglib 2.76.3
  • libiconv 1.17
  • libjpeg-turbo 2.1.5.1
  • libpng 1.6.39
  • libstdcxx-ng 12.2.0
  • libtiff 4.5.0
  • libuuid 1.41.5
  • libwebp-base 1.3.0
  • libxcb 1.15
  • libxml2 2.10.4
  • libzlib 1.2.13
  • llvm-openmp 16.0.4
  • matplotlib-base 3.7.1
  • mkl 2023.1.0
  • mkl-service 2.4.0
  • mkl_fft 1.3.6
  • mkl_random 1.2.2
  • munkres 1.1.4
  • ncurses 6.4
  • numexpr 2.8.4
  • numpy 1.24.3
  • numpy-base 1.24.3
  • openjpeg 2.5.0
  • openssl 1.1.1t
  • packaging 23.1
  • pandas 1.5.3
  • pcre2 10.40
  • pillow 9.5.0
  • pip 23.0.1
  • pixman 0.40.0
  • pthread-stubs 0.4
  • pycairo 1.23.0
  • pyparsing 3.0.9
  • python 3.11.3
  • python-dateutil 2.8.2
  • python_abi 3.11
  • pytz 2022.7
  • readline 8.2
  • reportlab 3.6.13
  • setuptools 66.0.0
  • six 1.16.0
  • sqlalchemy 2.0.15
  • sqlite 3.41.2
  • tbb 2021.8.0
  • tk 8.6.12
  • typing-extensions 4.6.2
  • typing_extensions 4.6.2
  • tzdata 2023c
  • wheel 0.38.4
  • xorg-libxau 1.0.11
  • xorg-libxdmcp 1.1.3
  • xz 5.4.2
  • zlib 1.2.13
  • zstd 1.5.2