chemicalspace

Object-oriented Representation for Chemical Spaces

https://github.com/gmattedi/chemicalspace

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.8%) to scientific vocabulary

Keywords

chemical-space cheminformatics chemistry computational-chemistry machine-learning rdkit

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: gmattedi
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 1.59 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 1

Topics

chemical-space cheminformatics chemistry computational-chemistry machine-learning rdkit

Created almost 2 years ago · Last pushed over 1 year ago

ChemicalSpace

An Object-Oriented Representation for Chemical Spaces

chemicalspace is a Python package that provides an object-oriented representation for chemical spaces. It is designed to be used in conjunction with the RDKit package, which provides the underlying cheminformatics functionality.

While in the awesome RDKit, the main frame of reference is that of single molecules, here the main focus is on operations on chemical spaces.

Installation

To install chemicalspace you can use pip:

```bash
pip install chemicalspace
```

Usage

The main class in chemicalspace is ChemicalSpace. The class provides a number of methods for working with chemical spaces, including reading and writing, filtering, clustering and picking from chemical spaces.

Basics

Initialization

A ChemicalSpace can be initialized from SMILES strings or RDKit molecules. It optionally takes molecule indices and scores as arguments.

```python
from chemicalspace import ChemicalSpace

smiles = ('CCO', 'CCN', 'CCl')
indices = ("mol1", "mol2", "mol3")
scores = (0.1, 0.2, 0.3)

space = ChemicalSpace(mols=smiles, indices=indices, scores=scores)

print(space)
```

```text
<ChemicalSpace: 3 molecules | 3 indices | 3 scores>
```

Reading and Writing

A ChemicalSpace can be read from and written to SMI and SDF files.

```python
from chemicalspace import ChemicalSpace

# Load from SMI file
space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.to_smi("outputs1.smi")

# Load from SDF file
space = ChemicalSpace.from_sdf("tests/data/inputs1.sdf")
space.to_sdf("outputs1.sdf")

print(space)
```

```text
<ChemicalSpace: 10 molecules | 10 indices | No scores>
```

Indexing, Slicing and Masking

Indexing, slicing and masking a ChemicalSpace object returns a new ChemicalSpace object.

Indexing

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

print(space[0])
```

```text
<ChemicalSpace: 1 molecules | 1 indices | No scores>
```

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
idx = [1, 2, 4]

print(space[idx])
```

```text
<ChemicalSpace: 3 molecules | 3 indices | No scores>
```

Slicing

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

print(space[:2])
```

```text
<ChemicalSpace: 2 molecules | 2 indices | No scores>
```

Masking

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
mask = [True, False, True, False, True, False, True, False, True, False]

print(space[mask])
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```
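The indexing behaviour described in this section (an integer, a list of positions, a slice, or a boolean mask each returning a new container) can be sketched with a toy class. This is only an illustration of the pattern, not the package's implementation: `MiniSpace` and its `items` attribute are hypothetical names, and the strict mask-length check is an assumption.

```python
from typing import List, Sequence, Union

Key = Union[int, slice, Sequence[int], Sequence[bool]]


class MiniSpace:
    """Hypothetical stand-in for a molecule container (not the real class)."""

    def __init__(self, items: Sequence[str]) -> None:
        self.items: List[str] = list(items)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, key: Key) -> "MiniSpace":
        if isinstance(key, slice):
            return MiniSpace(self.items[key])
        if isinstance(key, int):
            return MiniSpace([self.items[key]])
        key = list(key)
        # Check for a boolean mask before treating the key as positions,
        # because bool is a subclass of int in Python.
        if key and all(isinstance(k, bool) for k in key):
            if len(key) != len(self.items):
                raise IndexError("mask length must match the number of items")
            return MiniSpace([x for x, keep in zip(self.items, key) if keep])
        return MiniSpace([self.items[i] for i in key])


space = MiniSpace(["CCO", "CCN", "CCl", "CCC"])
mask = [True, False, True, False]
print(len(space[0]), len(space[:2]), len(space[[1, 3]]), len(space[mask]))  # 1 2 2 2
```

Every branch returns a `MiniSpace`, so the result of any selection can be indexed again in the same way.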

Deduplicating

Deduplicating a ChemicalSpace object removes duplicate molecules.
See Hashing and Identity for details on molecule identity.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_twice = space + space  # 20 molecules
space_deduplicated = space_twice.deduplicate()  # 10 molecules

print(space_deduplicated)
```

```text
<ChemicalSpace: 10 molecules | 10 indices | No scores>
```
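As a rough sketch of what deduplication by identity key involves, the helper below keeps the first item seen for each key and preserves the input order. The function name `deduplicate`, the `key` argument, and the choice to keep the first occurrence are assumptions made for illustration; plain strings stand in for whatever identifier the package actually hashes.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def deduplicate(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first item seen for each identity key, preserving order."""
    seen = set()
    unique: List[T] = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Toy identity key: the string itself stands in for a canonical identifier.
mols = ["CCO", "CCN", "CCO", "CCl", "CCN"]
print(deduplicate(mols, key=lambda s: s))  # ['CCO', 'CCN', 'CCl']
```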

Chunking

A ChemicalSpace object can be chunked into smaller ChemicalSpace objects.
The .chunks method returns a generator of ChemicalSpace objects.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
chunks = space.chunks(chunk_size=3)

for chunk in chunks:
    print(chunk)
```
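A generator-based chunker can be sketched in a few lines. This is a generic illustration rather than the package's code, and it assumes the final chunk is simply shorter when the length is not a multiple of the chunk size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Lazily yield consecutive chunks of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


sizes = [len(chunk) for chunk in chunks(list(range(10)), chunk_size=3)]
print(sizes)  # [3, 3, 3, 1]
```

Because the chunks are produced lazily, only one chunk needs to be materialized at a time.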

Drawing

A ChemicalSpace object can be rendered as a grid of molecules.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.draw()
```

Featurizing

Features

A ChemicalSpace object can be featurized as a numpy array of features. By default, ECFP4/Morgan2 fingerprints are used. The features are cached for subsequent calls, and spaces generated by a ChemicalSpace object (e.g. by slicing, masking, chunking) inherit the respective features.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_slice = space[:6:2]

# Default ECFP4 features
print(space.features.shape)
print(space_slice.features.shape)
```

```text
(10, 1024)
(3, 1024)
```

Custom featurizer

A custom featurizer should take an rdkit.Chem.Mol molecule and return a numerical value that can be cast to a NumPy array (see chemicalspace.utils.MolFeaturizerType).

```python
from chemicalspace import ChemicalSpace
from chemicalspace.utils import maccs_featurizer

space = ChemicalSpace.from_smi("tests/data/inputs1.smi", featurizer=maccs_featurizer)
space_slice = space[:6:2]

# Custom MACCS features
print(space.features.shape)
print(space_slice.features.shape)
```

```text
(10, 167)
(3, 167)
```

Metrics

A distance metric on the feature space is necessary for clustering, calculating diversity, and identifying neighbors. By default, the Jaccard (a.k.a. Tanimoto) distance is used. ChemicalSpace takes a metric string argument that specifies which sklearn metric to use.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi", metric='euclidean')
```
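For reference, the Jaccard distance between two binary fingerprints is one minus the ratio of shared on-bits to the on-bits set in either vector. The helper below is a plain-Python sketch of that definition; the function name and the convention for two all-zero vectors are choices made here, not taken from the package.

```python
from typing import Sequence


def jaccard_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard distance between two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]
print(jaccard_distance(fp_a, fp_b))  # 0.5 (2 shared bits out of 4 set in either)
```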

Binary Operations

Single entries

Single entries as SMILES strings or RDKit molecules can be added to a ChemicalSpace object.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space.add("CCO", "mol11")

print(space)
```

```text
<ChemicalSpace: 11 molecules | 11 indices | No scores>
```

Chemical spaces

Two ChemicalSpace objects can be added together.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi")

space = space1 + space2

print(space)
```

```text
<ChemicalSpace: 25 molecules | 25 indices | No scores>
```

Two ChemicalSpace objects can also be subtracted from each other, returning only the molecules in space1 that are not in space2.
See Hashing and Identity for more details.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi")

space = space1 - space2

print(space)
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```

Hashing and Identity

Individual molecules in a chemical space are hashed by their InChI Keys only (by default), or by InChI Keys and index. Scores do not affect the hashing process.

```python
from chemicalspace import ChemicalSpace

smiles = ('CCO', 'CCN', 'CCl')
indices = ("mol1", "mol2", "mol3")

# Two spaces with the same molecules, and indices
# But one space includes the indices in the hashing process
space_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=True)
space_no_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=False)

print(space_indices == space_indices)
print(space_indices == space_no_indices)
print(space_no_indices == space_no_indices)
```

```text
True
False
True
```

ChemicalSpace objects are hashed by their molecular hashes, in an order-independent manner.

```python
from rdkit import Chem
from rdkit.Chem import inchi
from chemicalspace import ChemicalSpace

mol = Chem.MolFromSmiles("c1ccccc1")
inchi_key = inchi.MolToInchiKey(mol)

space = ChemicalSpace(mols=(mol,))

assert hash(space) == hash(frozenset((inchi_key,)))
```
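The order-independence follows directly from hashing a `frozenset` of the per-molecule keys, as in the assertion above. A minimal sketch with placeholder strings in place of real InChI Keys (the helper name `space_hash` is hypothetical):

```python
from typing import Iterable


def space_hash(keys: Iterable[str]) -> int:
    """Order-independent hash of a collection of per-molecule keys."""
    return hash(frozenset(keys))


# Placeholder strings stand in for real InChI Keys.
keys = ["KEY-A", "KEY-B", "KEY-C"]

print(space_hash(keys) == space_hash(keys[::-1]))        # True: order is ignored
print(space_hash(keys) == space_hash(keys + ["KEY-A"]))  # True: a frozenset collapses repeats
```

The second comparison is a property of `frozenset` itself: repeated keys contribute nothing to the hash.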

The identity of a ChemicalSpace is evaluated on its hashed representation.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space1_again = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

print(space1 == space1)
print(space1 == space1_again)
print(space1 == space2)
```

```text
True
True
False
```

Copy

ChemicalSpace supports copy and deepcopy operations. A deep copy fully unlinks the copied object from the original one, including the RDKit molecules.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

# Shallow copy
space_copy = space.copy()
assert id(space.mols[0]) == id(space_copy.mols[0])

# Deep copy
space_deepcopy = space.copy(deep=True)
assert id(space.mols[0]) != id(space_deepcopy.mols[0])
```

Clustering

Labels

A ChemicalSpace can be clustered by its molecular features. kmedoids, agglomerative-clustering, sphere-exclusion and scaffold are the available clustering methods. Refer to the respective methods in chemicalspace.layers.clustering for more details.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
cluster_labels = space.cluster(n_clusters=3)

print(cluster_labels)
```

```text
[0 1 2 1 1 0 0 0 0 0]
```

Clusters

ChemicalSpace.yield_clusters can be used to iterate clusters as ChemicalSpace objects.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
clusters = space.yield_clusters(n_clusters=3)

for cluster in clusters:
    print(cluster)
```

KFold Clustering

ChemicalSpace.split can be used to iterate over train/test cluster splits for ML training. At each iteration, one cluster is used as the test set and the rest as the training set. Note that there is no guarantee on the size of the clusters.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

for train, test in space.split(n_splits=3):
    print(train, test)
```
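The leave-one-cluster-out idea can be sketched on a plain list of cluster labels. The helper below is hypothetical and works on positions only; it reuses the labels printed in the Labels example to show how uneven the resulting test sets can be.

```python
from typing import Iterator, List, Sequence, Tuple


def cluster_splits(labels: Sequence[int]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) position lists, holding out one cluster at a time."""
    for held_out in sorted(set(labels)):
        test = [i for i, label in enumerate(labels) if label == held_out]
        train = [i for i, label in enumerate(labels) if label != held_out]
        yield train, test


labels = [0, 1, 2, 1, 1, 0, 0, 0, 0, 0]  # labels printed in the Labels example
for train, test in cluster_splits(labels):
    print(len(train), len(test))
# 4 6
# 7 3
# 9 1
```

With these labels the test sets hold 6, 3 and 1 molecules, which is why no cluster size can be guaranteed.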

Overlap

ChemicalSpace implements methods for calculating the overlap with another space.

Overlap

The molecules of a ChemicalSpace that are similar to another space can be flagged. The similarity between two molecules is calculated by the Tanimoto similarity of their ECFP4/Morgan2 fingerprints.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

# Indices of `space1` that are similar to `space2`
overlap = space1.find_overlap(space2, radius=0.6)

print(overlap)
```

```text
[0 1 2 3 4]
```

Carving

The overlap between two ChemicalSpace objects can be carved out from one of the objects, so as to ensure that the two spaces are disjoint for a given similarity radius.

```python
from chemicalspace import ChemicalSpace

space1 = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space2 = ChemicalSpace.from_smi("tests/data/inputs2.smi.gz")

# Carve out the overlap from `space1`
space1_carved = space1.carve(space2, radius=0.6)

print(space1_carved)
```

```text
<ChemicalSpace: 5 molecules | 5 indices | No scores>
```
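The relationship between flagging the overlap and carving it out can be illustrated on toy fingerprints. This sketch assumes that `radius` is a Jaccard distance threshold and that the comparison is inclusive; the package may define the radius differently, and all function names below are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # set of "on" bit positions


def distance(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard distance between two sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def find_overlap(space: Sequence[Fingerprint], other: Sequence[Fingerprint],
                 radius: float) -> List[int]:
    """Positions in `space` with at least one member of `other` within `radius`."""
    return [i for i, fp in enumerate(space)
            if any(distance(fp, o) <= radius for o in other)]


def carve(space: Sequence[Fingerprint], other: Sequence[Fingerprint],
          radius: float) -> List[Fingerprint]:
    """Drop every member of `space` that overlaps with `other`."""
    flagged = set(find_overlap(space, other, radius))
    return [fp for i, fp in enumerate(space) if i not in flagged]


space1 = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({7, 8, 9})]
space2 = [frozenset({1, 2, 3, 4})]

print(find_overlap(space1, space2, radius=0.6))  # [0, 1]
print(len(carve(space1, space2, radius=0.6)))    # 1
```

After carving, a second overlap search against the same space with the same radius returns nothing, which is the disjointness property described above.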

Dimensionality Reduction

ChemicalSpace implements methods for dimensionality reduction by pca, tsne or umap projection of its features.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
proj = space.project(method='pca')

print(proj.shape)
```

```text
(10, 2)
```

Picking

A subset of a ChemicalSpace can be picked by a number of acquisition strategies.
See chemicalspace.layers.acquisition for details.

```python
from chemicalspace import ChemicalSpace
import numpy as np

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

space_pick_random = space.pick(n=3, strategy='random')
print(space_pick_random)

space_pick_diverse = space.pick(n=3, strategy='maxmin')
print(space_pick_diverse)

space.scores = np.array(range(len(space)))  # Assign dummy scores
space_pick_greedy = space.pick(n=3, strategy='greedy')
print(space_pick_greedy)
```

```text
<ChemicalSpace: 3 molecules | 3 indices | No scores>
<ChemicalSpace: 3 molecules | 3 indices | 3 scores>
```
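For intuition about diversity-driven picking, here is a sketch of the usual greedy MaxMin procedure: start from one item, then repeatedly add the item whose nearest already-picked neighbour is farthest away. It assumes that the 'maxmin' strategy refers to this general technique; the function name, the fixed starting item and the toy distance are illustrative choices, not the package's implementation.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def maxmin_pick(items: Sequence[T], n: int, dist: Callable[[T, T], float],
                first: int = 0) -> List[int]:
    """Greedy MaxMin: repeatedly add the item farthest from the current picks."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    picked = [first]
    # Distance from each item to its nearest already-picked item.
    nearest = [dist(item, items[first]) for item in items]
    while len(picked) < n:
        candidate = max((i for i in range(len(items)) if i not in picked),
                        key=lambda i: nearest[i])
        picked.append(candidate)
        nearest = [min(d, dist(item, items[candidate]))
                   for d, item in zip(nearest, items)]
    return picked


points = [0.0, 0.1, 0.2, 5.0, 9.0, 10.0]
print(maxmin_pick(points, n=3, dist=lambda a, b: abs(a - b)))  # [0, 5, 3]
```

On this toy line of points the picks are the two extremes followed by the midpoint, skipping the near-duplicates around 0 and 10.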

Uniqueness and Diversity

Uniqueness

The uniqueness of a ChemicalSpace object can be calculated by the number of unique molecules.

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")
space_twice = space + space  # 20 molecules
uniqueness = space_twice.uniqueness()

print(uniqueness)
```

```text
0.5
```
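The value of 0.5 for the doubled space is consistent with uniqueness being the fraction of distinct molecules. A toy version on plain identity keys (the helper is a sketch under that assumption, not the package's code):

```python
from typing import Hashable, Sequence


def uniqueness(keys: Sequence[Hashable]) -> float:
    """Fraction of distinct keys in a non-empty collection."""
    if not keys:
        raise ValueError("uniqueness is undefined for an empty collection")
    return len(set(keys)) / len(keys)


keys = ["CCO", "CCN", "CCl"]
print(uniqueness(keys + keys))  # 0.5
```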

Diversity

The diversity of a ChemicalSpace object can be calculated as:

- The average of the pairwise distance matrix
- The normalized Vendi score of the same distance matrix

The Vendi score can be interpreted as the effective number of molecules in the space. Here it is normalized by the number of molecules in the space, so that it takes values in the range [0, 1].

```python
from chemicalspace import ChemicalSpace

space = ChemicalSpace.from_smi("tests/data/inputs1.smi")

diversity_int = space.diversity(method='internal-distance')
diversity_vendi = space.diversity(method='vendi')
print(diversity_int)
print(diversity_vendi)

# Dummy space with the same molecule len(space) times
space_redundant = ChemicalSpace(mols=tuple([space.mols[0]] * len(space)))

diversity_int_redundant = space_redundant.diversity(method='internal-distance')
diversity_vendi_redundant = space_redundant.diversity(method='vendi')

print(diversity_int_redundant)
print(diversity_vendi_redundant)
```

```text
0.7730273985449335
0.12200482273434754
0.0
0.1
```
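The value of 0.1 for the redundant space can be checked by hand from the standard definition of the Vendi score. The derivation below assumes the similarity kernel is $K = 1 - D$ for the distance matrix $D$, which is an assumption about the package rather than something stated in this README.

```latex
% Vendi score for n molecules with similarity matrix K (K_{ii} = 1),
% where \lambda_1, \dots, \lambda_n are the eigenvalues of K / n
% and 0 \log 0 is taken to be 0:
\mathrm{VS} = \exp\Big(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\Big),
\qquad
\mathrm{VS}_{\mathrm{norm}} = \frac{\mathrm{VS}}{n}

% Fully redundant space: K is the all-ones matrix, so K / n has
% eigenvalues (1, 0, \dots, 0):
\mathrm{VS} = \exp(0) = 1,
\qquad
\mathrm{VS}_{\mathrm{norm}} = \frac{1}{n} = \frac{1}{10} = 0.1

% Fully dissimilar space: K = I, so \lambda_i = 1/n for every i:
\mathrm{VS} = \exp(\log n) = n,
\qquad
\mathrm{VS}_{\mathrm{norm}} = 1
```

The two extremes show that a space of `n` identical molecules counts as one effective molecule, while `n` mutually dissimilar molecules count as `n`.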

Advanced

Layers

ChemicalSpace is implemented as a series of layers that provide the functionality of the class. As can be seen in the source code, the class simply combines the layers.

If only a subset of the functionality of ChemicalSpace is necessary, and lean objects are a priority, one can combine only the required layers:

```python
from chemicalspace.layers.clustering import ChemicalSpaceClusteringLayer
from chemicalspace.layers.neighbors import ChemicalSpaceNeighborsLayer


class MyCustomSpace(ChemicalSpaceClusteringLayer, ChemicalSpaceNeighborsLayer):
    pass


space = MyCustomSpace(mols=["c1ccccc1"])
space
```

Development

Installation

Install the development dependencies with pip:

```bash
pip install -e .[dev]
```

Hooks

The project uses pre-commit for code formatting, linting and testing. Install the hooks with:

```bash
pre-commit install
```

Documentation

The documentation can be built by running:

```bash
cd docs
./rebuild.sh
```

Owner

Name: Giulio Mattedi
Login: gmattedi
Kind: user
Location: London
Company: BenevolentAI

Twitter: GiulioMattedi
Repositories: 1
Profile: https://github.com/gmattedi

Senior Computational Chemist @ BenevolentAI

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this package, please cite it as below."
authors:
- family-names: "Mattedi"
  given-names: "Giulio"
  orcid: "https://orcid.org/0000-0002-7290-694X"
title: "ChemicalSpace"
version: 0.1.1
url: "https://github.com/gmattedi/chemicalspace"

Packages

Total packages: 1
Total downloads:
- pypi 13 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

pypi.org: chemicalspace

An Object-oriented Representation for Chemical Spaces

Homepage: https://github.com/gmattedi/chemicalspace
Documentation: https://chemicalspace.readthedocs.io/
License: MIT License Copyright (c) 2024 Giulio Mattedi Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Latest release: 0.1.1
published over 1 year ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 13 Last month

Rankings

Dependent packages count: 10.8%

Average: 36.0%

Dependent repos count: 61.1%

Maintainers (1)

gmattedi

Dependencies

pyproject.toml pypi

requirements-dev.txt pypi

coverage >=7.5.3 development
ipykernel >=6.29.4 development
matplotlib >=3.9.0 development
myst-parser >=3.0.1 development
pre-commit >=3.7.1 development
pyright >=1.1.365 development
pytest >=8.2.1 development
rdkit-stubs >=0.7 development
sphinx >=7.3.7 development
sphinx-rtd-theme >=2.0.0 development
sphinxcontrib-serializinghtml >=1.1.10 development
twine >=5.1.0 development

requirements.txt pypi

joblib >=1.4.2
numpy >=1.26.4
rdkit >=2023.9.6
scikit-learn >=1.5.0
scikit-learn-extra >=0.3.0
scipy >=1.13.1
typing_extensions >=4.12.1
umap-learn >=0.5.6
