alinemol

Exploring the performance of machine learning models on out-of-distribution data in the chemical domain

https://github.com/hfooladi/alinemol

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

cheminformatics generalization molecular-property-prediction out-of-distribution
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

ALineMol: Evaluating Machine Learning Models for Molecular Property Prediction on OOD Data

License: MIT CI DOI

Overview

ALineMol is a research framework for quantifying the relationship between machine learning model performance on in-distribution (ID) and out-of-distribution (OOD) data in the molecular domain. It addresses a central question in AI-driven drug discovery: how well do models generalize to novel chemical structures?

Key Contributions

🔬 Comprehensive Evaluation: Systematic assessment of ML models (classical ML + GNNs) across multiple datasets using different splitting strategies

📊 Distribution Shift Analysis: Quantitative investigation of what constitutes "out-of-distribution" data in molecular property prediction

🎯 ID-OOD Relationship: Analysis of the correlation between in-distribution and out-of-distribution performance across datasets, models, and splitting strategies

⚗️ Drug Discovery Focus: Practical insights for molecular property prediction and bioactivity classification in pharmaceutical research

Setup

```bash
# Clone the repository
git clone https://github.com/HFooladi/ALineMol.git
cd ALineMol

# Create and activate conda environment
conda env create -f environment.yml
conda activate alinemol

# Install ALineMol package
pip install --no-deps -e .
```

Quick Start

Basic Usage

```python
import pandas as pd
from alinemol.preprocessing import standardization_pipeline
from alinemol.splitters import ScaffoldSplit, MolecularWeightSplit
from alinemol.utils import compute_similarities

# Load and preprocess data
df = pd.read_csv("your_dataset.csv")  # Columns: 'smiles', 'label'
df_clean = standardization_pipeline(df)

# Create different types of splits
scaffold_splitter = ScaffoldSplit(test_size=0.2)
weight_splitter = MolecularWeightSplit(test_size=0.2, generalize_to_larger=True)

# Evaluate different splitting strategies
for train_idx, test_idx in scaffold_splitter.split(df_clean['smiles']):
    train_data = df_clean.iloc[train_idx]
    test_data = df_clean.iloc[test_idx]

    # Compute molecular similarities
    similarities = compute_similarities(
        train_data['smiles'],
        test_data['smiles'],
        fingerprint='ecfp',
        fprints_hopts={'radius': 2, 'fpSize': 1024}
    )
    print(f"Average train-test similarity: {similarities.mean():.3f}")
```
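
To make the reported similarity number more concrete, the following is a minimal sketch of one plausible summary statistic: the mean nearest-neighbor Tanimoto similarity from each test molecule to the training set. It is not ALineMol's `compute_similarities` implementation. The fingerprints are hypothetical sets of on-bit indices rather than real ECFP output, and the helper names `tanimoto` and `mean_max_similarity` are assumptions made for illustration.

```python
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_max_similarity(train: Sequence[Fingerprint], test: Sequence[Fingerprint]) -> float:
    """Average, over test items, of the highest similarity to any train item."""
    best = [max(tanimoto(t, r) for r in train) for t in test]
    return sum(best) / len(best)


# Toy fingerprints (hypothetical on-bit indices, not real ECFP output)
train_fps = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
test_fps = [frozenset({1, 2, 3, 9}), frozenset({20, 21})]

score = mean_max_similarity(train_fps, test_fps)
print(f"Mean nearest-neighbor similarity: {score:.3f}")
```

A low value suggests the test set sits far from the training data in fingerprint space, which is the kind of distribution shift the splitters are meant to produce.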

Comprehensive Evaluation Pipeline

```python
from alinemol.utils import load_dataset, split_dataset, compute_ID_OOD
from alinemol.utils.plot_utils import plot_ID_OOD_sns, heatmap_plot

# Evaluate multiple models across different split types
results = compute_ID_OOD(
    dataset_category="TDC",
    dataset_names="CYP2C19",
    split_type="scaffold",
    num_of_splits=10
)

# Visualize ID vs OOD performance
plot_ID_OOD_sns(results, dataset_name="CYP2C19", save=True)

# Create performance heatmaps
heatmap_plot(results, metric="roc_auc", save=True)
```
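
The ID-versus-OOD relationship that this pipeline visualizes can be summarized with a correlation coefficient across models. The sketch below computes a Pearson correlation with the standard library only. It does not use ALineMol's result format: the function name `pearson_r` and the score lists are assumptions, and the numbers are made up purely for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired ID and OOD scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up ROC-AUC scores for four hypothetical models (illustration only)
id_scores = [0.80, 0.85, 0.90, 0.95]
ood_scores = [0.60, 0.65, 0.70, 0.75]

r = pearson_r(id_scores, ood_scores)
print(f"ID-OOD Pearson r: {r:.3f}")
```

A coefficient close to 1 would mean that ranking models by ID performance also ranks them by OOD performance; a weak coefficient would mean ID results are a poor guide for model selection under that split.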

Splitting Strategies

ALineMol implements various molecular splitting strategies to simulate different types of distribution shift:

1. Structure-Based Splits

```python
from alinemol.splitters import ScaffoldSplit, PerimeterSplit

# Bemis-Murcko scaffold splitting
scaffold_split = ScaffoldSplit(make_generic=True)

# Perimeter-based clustering
perimeter_split = PerimeterSplit(n_clusters=10)
```
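
The core idea behind a scaffold split is that molecules sharing a scaffold never appear on both sides of the split. The sketch below illustrates that grouping logic only; it is not how `ScaffoldSplit` is implemented. The scaffold strings are assumed to be precomputed elsewhere (for example with a cheminformatics toolkit), and `group_split` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split(groups: Sequence[str], test_size: float = 0.2) -> Tuple[List[int], List[int]]:
    """Assign whole groups to train or test so that no group spans both."""
    by_group: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    n_test_target = test_size * len(groups)
    train_idx: List[int] = []
    test_idx: List[int] = []
    # Smallest groups first: rare scaffolds go to test, common ones stay in train
    for members in sorted(by_group.values(), key=len):
        if len(test_idx) + len(members) <= n_test_target:
            test_idx.extend(members)
        else:
            train_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


# Precomputed scaffold SMILES per molecule (toy example)
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = group_split(scaffolds, test_size=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

Sending the rarest groups to the test set is one common design choice; it biases the test set toward unusual chemotypes, which makes the split harder than a random one.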

2. Property-Based Splits

```python
from alinemol.splitters import MolecularWeightSplit, MolecularLogPSplit

# Split by molecular weight (test on larger molecules)
mw_split = MolecularWeightSplit(generalize_to_larger=True)

# Split by lipophilicity
logp_split = MolecularLogPSplit(generalize_to_larger=True)
```
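
A property-based split can be pictured as sorting molecules by a scalar property and holding out one end of the range. The following sketch shows that idea under stated assumptions: property values are precomputed, `property_split` is a hypothetical helper, and the behavior of the real `MolecularWeightSplit` and `MolecularLogPSplit` classes may differ in detail.

```python
from typing import List, Sequence, Tuple


def property_split(
    values: Sequence[float],
    test_size: float = 0.2,
    generalize_to_larger: bool = True,
) -> Tuple[List[int], List[int]]:
    """Hold out the highest (or lowest) property values as the test set."""
    n_test = max(1, round(test_size * len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i])
    if generalize_to_larger:
        # Train on small values, test on the largest ones
        return order[:-n_test], order[-n_test:]
    # Train on large values, test on the smallest ones
    return order[n_test:], order[:n_test]


# Hypothetical molecular weights in g/mol for five molecules
weights = [180.16, 46.07, 206.28, 78.11, 194.19]
train_idx, test_idx = property_split(weights, test_size=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

With `generalize_to_larger=True`, every test molecule is heavier than every training molecule, so the model is asked to extrapolate beyond the property range it has seen.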

3. Similarity-Based Splits

```python
from alinemol.splitters.lohi import HiSplit, LoSplit

# Hi-split: ensures low similarity between train/test
hi_split = HiSplit(
    similarity_threshold=0.4,
    train_min_frac=0.7,
    test_min_frac=0.15
)

# Lo-split: for lead optimization scenarios
lo_split = LoSplit(
    threshold=0.4,
    min_cluster_size=5,
    std_threshold=0.6
)
```
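
The property a Hi-split aims for, namely that no test molecule is too similar to any training molecule, can be checked after the fact. The sketch below is such a check, not the splitting algorithm itself. Fingerprints are hypothetical sets of on-bit indices, and `tanimoto` and `violations` are illustrative names that do not come from ALineMol.

```python
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def violations(
    train: Sequence[Fingerprint],
    test: Sequence[Fingerprint],
    threshold: float = 0.4,
) -> List[Tuple[int, int, float]]:
    """Return (test_index, train_index, similarity) for each pair above the threshold."""
    found = []
    for ti, t in enumerate(test):
        for ri, r in enumerate(train):
            sim = tanimoto(t, r)
            if sim > threshold:
                found.append((ti, ri, sim))
    return found


# Toy fingerprints (hypothetical on-bit indices)
train_fps = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12, 13})]
test_fps = [frozenset({1, 2, 3, 5}), frozenset({20, 21, 22})]

bad = violations(train_fps, test_fps, threshold=0.4)
print(f"{len(bad)} pair(s) exceed the threshold: {bad}")
```

An empty result means the partition satisfies the similarity threshold; any returned pair identifies a test molecule that leaks near-duplicate chemistry from the training set.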

4. Clustering-Based Splits

```python
from alinemol.splitters import UMAPSplit, KMeansSplit

# UMAP + clustering split
umap_split = UMAPSplit(
    n_clusters=20,
    n_neighbors=100,
    min_dist=0.1
)

# K-means clustering split
kmeans_split = KMeansSplit(n_clusters=10, metric="jaccard")
```

Development

Tests

Run the test suite with pytest:

```bash
pytest
```

Code Style

We use ruff for linting and formatting:

```bash
# Check code style
ruff check

# Format code
ruff format
```

Documentation

Build and serve the documentation locally:

```bash
mkdocs serve
```

Continuous Integration

This project uses GitHub Actions for continuous integration and deployment:

  • CI Workflow: Automatically runs tests and linting on all pull requests and pushes to the main branch
  • Release Workflow: Automatically builds and publishes the package to PyPI when a new release is created

To create a new release:

  1. Update the version in _version.py
  2. Create a new tag and GitHub release
  3. The release workflow will automatically publish to PyPI

Citation

If you find ALineMol useful in your research, please cite the following paper:

```bibtex
@article{fooladi2025evaluating,
  title={Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data},
  author={Fooladi, Hosein and Vu, Thi Ngoc Lan and Kirchmair, Johannes},
  year={2025},
  doi = {https://doi.org/10.26434/chemrxiv-2025-g1vjf-v2}
}
```

Related Work

  • Splito: Molecular splitting library - GitHub
  • TDC: Therapeutics Data Commons - Website
  • DGL-LifeSci: Deep Graph Library for Life Sciences - GitHub

Documentation

Contributing

We welcome contributions! Please see our Contributing Guidelines for details on:

  • Reporting bugs
  • Suggesting enhancements
  • Submitting pull requests
  • Code style guidelines

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

  • Name: Hosein Fooladi
  • Login: HFooladi
  • Kind: user

Machine Learning researcher. Deep learning for drug discovery. Finding bugs in intelligence. Interested in Artificial Life. @ShenakhtPajouh

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fooladi
    given-names: Hosein
    orcid: https://orcid.org/0000-0002-3124-2761
title: "Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data"
version: 0.1.0
identifiers:
  - type: doi
    value: 
date-released: 2025-03-01

GitHub Events

Total
  • Watch event: 2
  • Public event: 1
  • Push event: 35
  • Create event: 2
Last Year
  • Watch event: 2
  • Public event: 1
  • Push event: 35
  • Create event: 2

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

environment.yml pypi
  • biopython *
  • chembl-webresource-client *
  • rdkit *
pyproject.toml pypi
  • matplotlib *
  • numpy *
  • pandas *
  • scikit-learn *
  • seaborn *