alinemol
Exploring the performance of machine learning models on out-of-distribution data in the chemical domain
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 6 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: HFooladi
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://hfooladi.github.io/ALineMol/
- Size: 20.4 MB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ALineMol: Evaluating Machine Learning Models for Molecular Property Prediction on OOD Data
Overview
ALineMol is a comprehensive research framework for evaluating and quantitatively assessing the relationship between machine learning model performance on in-distribution (ID) and out-of-distribution (OOD) data in the molecular domain. This work addresses critical questions in AI-driven drug discovery about model generalization to novel chemical structures.
Key Contributions
🔬 Comprehensive Evaluation: Systematic assessment of ML models (classical ML + GNNs) across multiple datasets using different splitting strategies
📊 Distribution Shift Analysis: Quantitative investigation of what constitutes "out-of-distribution" data in molecular property prediction
🎯 ID-OOD Relationship: Deep analysis of correlation between in-distribution and out-of-distribution performance across different scenarios
⚗️ Drug Discovery Focus: Practical insights for molecular property prediction and bioactivity classification in pharmaceutical research
Setup
```bash
# Clone the repository
git clone https://github.com/HFooladi/ALineMol.git
cd ALineMol

# Create and activate conda environment
conda env create -f environment.yml
conda activate alinemol

# Install ALineMol package
pip install --no-deps -e .
```
Quick Start
Basic Usage
```python
import pandas as pd
from alinemol.preprocessing import standardization_pipeline
from alinemol.splitters import ScaffoldSplit, MolecularWeightSplit
from alinemol.utils import compute_similarities

# Load and preprocess data
df = pd.read_csv("your_dataset.csv")  # Columns: 'smiles', 'label'
df_clean = standardization_pipeline(df)

# Create different types of splits
scaffold_splitter = ScaffoldSplit(test_size=0.2)
weight_splitter = MolecularWeightSplit(test_size=0.2, generalize_to_larger=True)

# Evaluate different splitting strategies
for train_idx, test_idx in scaffold_splitter.split(df_clean['smiles']):
    train_data = df_clean.iloc[train_idx]
    test_data = df_clean.iloc[test_idx]

    # Compute molecular similarities
    similarities = compute_similarities(
        train_data['smiles'],
        test_data['smiles'],
        fingerprint='ecfp',
        fprints_hopts={'radius': 2, 'fpSize': 1024}
    )
    print(f"Average train-test similarity: {similarities.mean():.3f}")
```
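To make the "average train-test similarity" figure concrete, here is a minimal, library-free sketch of Tanimoto similarity over fingerprint bit sets. It is not ALineMol's `compute_similarities` implementation: the helper names (`tanimoto`, `pairwise_similarities`) and the toy bit sets are hypothetical, and real ECFP fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
from statistics import mean


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_similarities(train_fps, test_fps):
    """Return a len(test) x len(train) matrix of Tanimoto similarities."""
    return [[tanimoto(t, tr) for tr in train_fps] for t in test_fps]


# Toy on-bit sets (hypothetical, not real ECFP output)
train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
test_fps = [{1, 2, 3}, {7, 8}]

sims = pairwise_similarities(train_fps, test_fps)

# Mean over all train-test pairs
mean_similarity = mean(s for row in sims for s in row)

# Similarity of each test molecule to its closest training molecule
nearest_train = [max(row) for row in sims]

print(f"Average train-test similarity: {mean_similarity:.3f}")
```

The nearest-neighbour view (`nearest_train`) is often more informative than the overall mean when judging how far a test set sits from the training data, since a single close analogue in the training set can make a prediction easy.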
Comprehensive Evaluation Pipeline
```python
from alinemol.utils import load_dataset, split_dataset, compute_ID_OOD
from alinemol.utils.plot_utils import plot_ID_OOD_sns, heatmap_plot

# Evaluate multiple models across different split types
results = compute_ID_OOD(
    dataset_category="TDC",
    dataset_names="CYP2C19",
    split_type="scaffold",
    num_of_splits=10
)

# Visualize ID vs OOD performance
plot_ID_OOD_sns(results, dataset_name="CYP2C19", save=True)

# Create performance heatmaps
heatmap_plot(results, metric="roc_auc", save=True)
```
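As a rough illustration of what an ID-OOD relationship analysis might compute, the sketch below takes paired ID and OOD scores (one pair per trained model) and reports their Pearson correlation and the mean performance gap. The scores are made-up placeholder values, the `pearson` helper is hypothetical, and the structure of the `results` object returned by ALineMol is not assumed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Placeholder ROC-AUC values, one (ID, OOD) pair per model; not real results
id_scores = [0.90, 0.85, 0.80, 0.75]
ood_scores = [0.80, 0.76, 0.70, 0.66]

r = pearson(id_scores, ood_scores)
mean_gap = sum(i - o for i, o in zip(id_scores, ood_scores)) / len(id_scores)

print(f"Pearson r (ID vs OOD): {r:.3f}")
print(f"Mean ID-OOD gap: {mean_gap:.3f}")
```

A high correlation suggests that ranking models by ID performance would also rank them well on OOD data for that split type; a low one suggests ID performance is a poor guide for model selection.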
Splitting Strategies
ALineMol implements various molecular splitting strategies to simulate different types of distribution shift:
1. Structure-Based Splits
```python
from alinemol.splitters import ScaffoldSplit, PerimeterSplit

# Bemis-Murcko scaffold splitting
scaffold_split = ScaffoldSplit(make_generic=True)

# Perimeter-based clustering
perimeter_split = PerimeterSplit(n_clusters=10)
```
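The idea behind a scaffold split is that molecules sharing a scaffold never appear on both sides of the split. The sketch below shows one common greedy variant (largest scaffold groups fill the training set first), assuming scaffold strings have already been computed. It is not necessarily how `ScaffoldSplit` assigns groups; `group_split` and the example scaffolds are hypothetical.

```python
from collections import defaultdict


def group_split(groups, test_size=0.2):
    """Assign whole groups to train or test; largest groups fill train first."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Largest groups first; ties broken by group key for determinism
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    max_train = len(groups) - round(len(groups) * test_size)
    train, test = [], []
    for _, idxs in ordered:
        if len(train) + len(idxs) <= max_train:
            train.extend(idxs)
        else:
            test.extend(idxs)
    return sorted(train), sorted(test)


# Precomputed scaffold SMILES per molecule (illustrative values)
scaffolds = (
    ["c1ccccc1"] * 4 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2 + ["c1ccoc1"]
)
train_idx, test_idx = group_split(scaffolds, test_size=0.3)

train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
print(train_idx, test_idx)
```

Because rare scaffolds end up in the test set, this variant tends to evaluate a model on chemotypes it has seen few or no examples of.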
2. Property-Based Splits
```python
from alinemol.splitters import MolecularWeightSplit, MolecularLogPSplit

# Split by molecular weight (test on larger molecules)
mw_split = MolecularWeightSplit(generalize_to_larger=True)

# Split by lipophilicity
logp_split = MolecularLogPSplit(generalize_to_larger=True)
```
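A property-based split can be pictured as sorting molecules by a scalar property and holding out one tail. The following sketch assumes the property values are already available as a list. The `property_split` helper is hypothetical; only the `generalize_to_larger` flag name mirrors the README, and its exact semantics in ALineMol are an assumption here.

```python
def property_split(values, test_size=0.2, generalize_to_larger=True):
    """Hold out the largest (or smallest) property values as the test set."""
    n_test = max(1, round(len(values) * test_size))
    order = sorted(range(len(values)), key=lambda i: values[i])
    if generalize_to_larger:
        train, test = order[:-n_test], order[-n_test:]
    else:
        train, test = order[n_test:], order[:n_test]
    return sorted(train), sorted(test)


# Illustrative molecular weights in g/mol
mol_weights = [180.2, 46.1, 342.3, 94.1, 151.2]

train_idx, test_idx = property_split(mol_weights, test_size=0.4)
small_idx = property_split(mol_weights, test_size=0.4, generalize_to_larger=False)[1]
print(train_idx, test_idx)
```

With `generalize_to_larger=True`, every test molecule is heavier than every training molecule, so the model must extrapolate beyond the property range it was trained on.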
3. Similarity-Based Splits
```python
from alinemol.splitters.lohi import HiSplit, LoSplit

# Hi-split: ensures low similarity between train/test
hi_split = HiSplit(
    similarity_threshold=0.4,
    train_min_frac=0.7,
    test_min_frac=0.15
)

# Lo-split: for lead optimization scenarios
lo_split = LoSplit(
    threshold=0.4,
    min_cluster_size=5,
    std_threshold=0.6
)
```
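One way to sanity-check a similarity-constrained split is to look for test molecules that still have a close neighbour in the training set. The sketch below does this with Tanimoto similarity over toy bit sets. It only checks a finished split and does not reproduce how `HiSplit` constructs one; `hi_violations` and the fingerprints are hypothetical, and treating the threshold as a strict upper bound on train-test similarity is an assumption.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def hi_violations(train_fps, test_fps, threshold=0.4):
    """Indices of test molecules with any training neighbour above the threshold."""
    return [
        j
        for j, test_fp in enumerate(test_fps)
        if any(tanimoto(test_fp, train_fp) > threshold for train_fp in train_fps)
    ]


# Toy on-bit sets (hypothetical, not real fingerprints)
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}, {10, 30, 31, 32}]

violations = hi_violations(train_fps, test_fps, threshold=0.4)
print(f"Test molecules too similar to training data: {violations}")
```

An empty list would indicate that the split satisfies the similarity constraint for the chosen fingerprint and threshold.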
4. Clustering-Based Splits
```python
from alinemol.splitters import UMAPSplit, KMeansSplit

# UMAP + clustering split
umap_split = UMAPSplit(
    n_clusters=20,
    n_neighbors=100,
    min_dist=0.1
)

# K-means clustering split
kmeans_split = KMeansSplit(n_clusters=10, metric="jaccard")
```
Development
Tests
Run the test suite with pytest:
```bash
pytest
```
Code Style
We use ruff for linting and formatting:
```bash
# Check code style
ruff check

# Format code
ruff format
```
Documentation
Build and serve the documentation locally:
```bash
mkdocs serve
```
Continuous Integration
This project uses GitHub Actions for continuous integration and deployment:
- CI Workflow: Automatically runs tests and linting on all pull requests and pushes to the main branch
- Release Workflow: Automatically builds and publishes the package to PyPI when a new release is created
To create a new release:
- Update the version in _version.py
- Create a new tag and GitHub release
- The release workflow will automatically publish to PyPI
Citation
If you find ALineMol useful in your research, please cite the following paper:
```bibtex
@article{fooladi2025evaluating,
  title={Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data},
  author={Fooladi, Hosein and Vu, Thi Ngoc Lan and Kirchmair, Johannes},
  year={2025},
  doi={https://doi.org/10.26434/chemrxiv-2025-g1vjf-v2}
}
```
Related Work
- Splito: Molecular splitting library - GitHub
- TDC: Therapeutics Data Commons - Website
- DGL-LifeSci: Deep Graph Library for Life Sciences - GitHub
Documentation
Contributing
We welcome contributions! Please see our Contributing Guidelines for details on:
- Reporting bugs
- Suggesting enhancements
- Submitting pull requests
- Code style guidelines
License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: Hosein Fooladi
- Login: HFooladi
- Kind: user
- Website: https://hfooladi.github.io/
- Repositories: 2
- Profile: https://github.com/HFooladi
Machine Learning researcher. Deep learning for drug discovery. Finding bugs in intelligence. Interested in Artificial Life. @ShenakhtPajouh
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fooladi
    given-names: Hosein
    orcid: https://orcid.org/0000-0002-3124-2761
title: "Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data"
version: 0.1.0
identifiers:
  - type: doi
    value:
date-released: 2025-03-01
GitHub Events
Total
- Watch event: 2
- Public event: 1
- Push event: 35
- Create event: 2
Last Year
- Watch event: 2
- Public event: 1
- Push event: 35
- Create event: 2
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- biopython *
- chembl-webresource-client *
- rdkit *
- matplotlib *
- numpy *
- pandas *
- scikit-learn *
- seaborn *