alinemol

Exploring the performance of machine learning models on out-of-distribution data in the chemical domain

https://github.com/hfooladi/alinemol

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Keywords

cheminformatics generalization molecular-property-prediction out-of-distribution

Last synced: 6 months ago

Repository

Exploring the performance of machine learning models on out-of-distribution data in the chemical domain

Basic Info

Host: GitHub
Owner: HFooladi
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage: https://hfooladi.github.io/ALineMol/
Size: 20.4 MB

Statistics

Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Topics

cheminformatics generalization molecular-property-prediction out-of-distribution

Created about 2 years ago · Last pushed 9 months ago

Metadata Files

Readme License Citation

ALineMol: Evaluating Machine Learning Models for Molecular Property Prediction on OOD Data

Overview

ALineMol is a comprehensive research framework for evaluating and quantitatively assessing the relationship between machine learning model performance on in-distribution (ID) and out-of-distribution (OOD) data in the molecular domain. This work addresses critical questions in AI-driven drug discovery about model generalization to novel chemical structures.

Key Contributions

🔬 Comprehensive Evaluation: Systematic assessment of ML models (classical ML + GNNs) across multiple datasets using different splitting strategies

📊 Distribution Shift Analysis: Quantitative investigation of what constitutes "out-of-distribution" data in molecular property prediction

🎯 ID-OOD Relationship: Deep analysis of correlation between in-distribution and out-of-distribution performance across different scenarios

⚗️ Drug Discovery Focus: Practical insights for molecular property prediction and bioactivity classification in pharmaceutical research

Setup

```bash
# Clone the repository
git clone https://github.com/HFooladi/ALineMol.git
cd ALineMol

# Create and activate conda environment
conda env create -f environment.yml
conda activate alinemol

# Install ALineMol package
pip install --no-deps -e .
```

Quick Start

Basic Usage

```python
import pandas as pd
from alinemol.preprocessing import standardization_pipeline
from alinemol.splitters import ScaffoldSplit, MolecularWeightSplit
from alinemol.utils import compute_similarities

# Load and preprocess data
df = pd.read_csv("your_dataset.csv")  # Columns: 'smiles', 'label'
df_clean = standardization_pipeline(df)

# Create different types of splits
scaffold_splitter = ScaffoldSplit(test_size=0.2)
weight_splitter = MolecularWeightSplit(test_size=0.2, generalize_to_larger=True)

# Evaluate different splitting strategies
for train_idx, test_idx in scaffold_splitter.split(df_clean['smiles']):
    train_data = df_clean.iloc[train_idx]
    test_data = df_clean.iloc[test_idx]

    # Compute molecular similarities
    similarities = compute_similarities(
        train_data['smiles'],
        test_data['smiles'],
        fingerprint='ecfp',
        fprints_hopts={'radius': 2, 'fpSize': 1024}
    )
    print(f"Average train-test similarity: {similarities.mean():.3f}")
```

Comprehensive Evaluation Pipeline

```python
from alinemol.utils import load_dataset, split_dataset, compute_ID_OOD
from alinemol.utils.plot_utils import plot_ID_OOD_sns, heatmap_plot

# Evaluate multiple models across different split types
results = compute_ID_OOD(
    dataset_category="TDC",
    dataset_names="CYP2C19",
    split_type="scaffold",
    num_of_splits=10
)

# Visualize ID vs OOD performance
plot_ID_OOD_sns(results, dataset_name="CYP2C19", save=True)

# Create performance heatmaps
heatmap_plot(results, metric="roc_auc", save=True)
```
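
The ID-OOD relationship examined by this pipeline can be summarized with a correlation coefficient. The sketch below is a standalone illustration in plain Python, not ALineMol's implementation: `pearson_r` is a hypothetical helper, and the scores are made-up placeholder values rather than results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for four hypothetical models (illustrative only).
id_scores = [0.90, 0.85, 0.80, 0.75]
ood_scores = [0.80, 0.75, 0.70, 0.65]
r = pearson_r(id_scores, ood_scores)
print(f"ID-OOD Pearson r: {r:.3f}")
```

A coefficient near 1 would suggest that ranking models by ID performance also ranks them by OOD performance. A weak or negative value would suggest that ID results are a poor guide for model selection under that split.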

Splitting Strategies

ALineMol implements various molecular splitting strategies to simulate different types of distribution shift:

1. Structure-Based Splits

```python
from alinemol.splitters import ScaffoldSplit, PerimeterSplit

# Bemis-Murcko scaffold splitting
scaffold_split = ScaffoldSplit(make_generic=True)

# Perimeter-based clustering
perimeter_split = PerimeterSplit(n_clusters=10)
```
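
To illustrate the idea behind a scaffold split, the sketch below keeps every group of molecules that share a scaffold on one side of the split. It assumes scaffold labels have already been computed (for example with RDKit) and uses a common greedy heuristic that fills the training set with the largest groups first. `group_split` is a hypothetical helper and may differ from how `ScaffoldSplit` works internally.

```python
from collections import defaultdict


def group_split(group_keys, test_size=0.2):
    """Split indices so that no group appears in both train and test.

    group_keys[i] is a precomputed group label (for example a scaffold
    SMILES string) for item i. Larger groups are assigned to train first.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    n_train_target = len(group_keys) - round(len(group_keys) * test_size)
    train_idx, test_idx = [], []
    # Largest groups first; ties are broken by key for determinism.
    for key, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if len(train_idx) + len(members) <= n_train_target:
            train_idx.extend(members)
        else:
            test_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = group_split(scaffolds, test_size=0.2)
print(train_idx, test_idx)
```

Because whole groups are moved together, the realized test fraction only approximates `test_size`.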

2. Property-Based Splits

```python
from alinemol.splitters import MolecularWeightSplit, MolecularLogPSplit

# Split by molecular weight (test on larger molecules)
mw_split = MolecularWeightSplit(generalize_to_larger=True)

# Split by lipophilicity
logp_split = MolecularLogPSplit(generalize_to_larger=True)
```
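
A property-based split can be pictured as sorting molecules by a scalar property and holding out one tail of the distribution. The following sketch assumes the property values are precomputed. `property_split` is a hypothetical helper that borrows the `generalize_to_larger` name from the snippet above, and it is not claimed to match the library's behavior.

```python
def property_split(values, test_size=0.2, generalize_to_larger=True):
    """Hold out the items at one end of the property range.

    values[i] is a precomputed scalar property (e.g. molecular weight).
    With generalize_to_larger=True the largest values go to the test set;
    otherwise the smallest do.
    """
    n_test = round(len(values) * test_size)
    # Sort indices by property value; the stable sort keeps ties in input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    if generalize_to_larger:
        split = len(order) - n_test
        train_idx, test_idx = order[:split], order[split:]
    else:
        train_idx, test_idx = order[n_test:], order[:n_test]
    return sorted(train_idx), sorted(test_idx)


# Hypothetical molecular weights in g/mol.
weights = [180.2, 342.3, 151.2, 420.5, 206.3]
train_idx, test_idx = property_split(weights, test_size=0.4)
print(train_idx, test_idx)
```

With `generalize_to_larger=False`, the same call would instead test on the two lightest molecules.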

3. Similarity-Based Splits

```python
from alinemol.splitters.lohi import HiSplit, LoSplit

# Hi-split: ensures low similarity between train/test
hi_split = HiSplit(
    similarity_threshold=0.4,
    train_min_frac=0.7,
    test_min_frac=0.15
)

# Lo-split: for lead optimization scenarios
lo_split = LoSplit(
    threshold=0.4,
    min_cluster_size=5,
    std_threshold=0.6
)
```
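
The condition a Hi-style split aims for, low similarity between train and test, can be checked after the fact. The sketch below assumes fingerprints are represented as sets of on-bit indices and that "low similarity" means no test molecule exceeds the threshold against any training molecule. Both are assumptions for illustration. The helpers are hypothetical and only verify a given partition; they do not construct one.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_cross_similarity(train_fps, test_fps):
    """For each test fingerprint, its highest similarity to any train fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


def satisfies_threshold(train_fps, test_fps, similarity_threshold=0.4):
    """True if no test item exceeds the threshold against any train item."""
    nearest = max_cross_similarity(train_fps, test_fps)
    return all(s <= similarity_threshold for s in nearest)


# Hypothetical fingerprints, written as sets of on-bit indices.
train_fps = [{1, 2, 3, 4}, {5, 6, 7, 8}]
test_fps = [{1, 2, 9, 10}, {11, 12}]
nearest = max_cross_similarity(train_fps, test_fps)
print(nearest, satisfies_threshold(train_fps, test_fps, 0.4))
```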

4. Clustering-Based Splits

```python
from alinemol.splitters import UMAPSplit, KMeansSplit

# UMAP + clustering split
umap_split = UMAPSplit(
    n_clusters=20,
    n_neighbors=100,
    min_dist=0.1
)

# K-means clustering split
kmeans_split = KMeansSplit(n_clusters=10, metric="jaccard")
```
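
Once cluster labels are available from any clustering method, a clustering-based split reduces to holding out whole clusters. The sketch below assumes the labels are precomputed and picks the held-out clusters in a seeded random order. `cluster_holdout` is a hypothetical helper, and the selection rule is an illustrative choice rather than the one used by `UMAPSplit` or `KMeansSplit`.

```python
import random
from collections import defaultdict


def cluster_holdout(labels, test_size=0.2, seed=0):
    """Hold out whole clusters, visited in a seeded random order, until the
    test set reaches at least round(len(labels) * test_size) items."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_test_target = round(len(labels) * test_size)
    test_idx = []
    for cid in cluster_ids:
        if len(test_idx) >= n_test_target:
            break
        test_idx.extend(clusters[cid])

    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)


# Hypothetical cluster assignments for ten molecules.
labels = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
train_idx, test_idx = cluster_holdout(labels, test_size=0.3, seed=42)
print(train_idx, test_idx)
```

Since clusters are never divided, the test set can overshoot the requested fraction by up to one cluster.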

Development

Tests

Run the test suite with pytest:

```bash
pytest
```

Code Style

We use ruff for linting and formatting:

```bash
# Check code style
ruff check

# Format code
ruff format
```

Documentation

Build and serve the documentation locally:

```bash
mkdocs serve
```

Continuous Integration

This project uses GitHub Actions for continuous integration and deployment:

- CI Workflow: Automatically runs tests and linting on all pull requests and pushes to the main branch
- Release Workflow: Automatically builds and publishes the package to PyPI when a new release is created

To create a new release:

1. Update the version in _version.py
2. Create a new tag and GitHub release
3. The release workflow will automatically publish to PyPI

Citation

If you find ALineMol useful in your research, please cite the following paper:

```bibtex
@article{fooladi2025evaluating,
  title={Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data},
  author={Fooladi, Hosein and Vu, Thi Ngoc Lan and Kirchmair, Johannes},
  year={2025},
  doi = {https://doi.org/10.26434/chemrxiv-2025-g1vjf-v2}
}
```

Related Work

Splito: Molecular splitting library - GitHub
TDC: Therapeutics Data Commons - Website
DGL-LifeSci: Deep Graph Library for Life Sciences - GitHub

Documentation

Contributing

We welcome contributions! Please see our Contributing Guidelines for details on:

- Reporting bugs
- Suggesting enhancements
- Submitting pull requests
- Code style guidelines

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

Name: Hosein Fooladi
Login: HFooladi
Kind: user

Website: https://hfooladi.github.io/
Repositories: 2
Profile: https://github.com/HFooladi

Machine Learning researcher. Deep learning for drug discovery. Finding bugs in intelligence. Interested in Artificial Life. @ShenakhtPajouh

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fooladi
    given-names: Hosein
    orcid: https://orcid.org/0000-0002-3124-2761
title: "Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data"
version: 0.1.0
identifiers:
  - type: doi
    value: 
date-released: 2025-03-01

GitHub Events

Total

Watch event: 2
Public event: 1
Push event: 35
Create event: 2

Last Year

Watch event: 2
Public event: 1
Push event: 35
Create event: 2

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

environment.yml pypi

biopython *
chembl-webresource-client *
rdkit *

pyproject.toml pypi

matplotlib *
numpy *
pandas *
scikit-learn *
seaborn *
