scikit-mol
scikit-learn classes for molecular vectorization using RDKit
Science Score: 77.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 8 DOI reference(s) in README
- ✓ Academic publication links: links to biorxiv.org, sciencedirect.com, acs.org
- ✓ Committers with academic emails: 1 of 20 committers (5.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: EBjerrum
- License: lgpl-3.0
- Language: Python
- Default Branch: main
- Size: 9.22 MB
Statistics
- Stars: 193
- Watchers: 3
- Forks: 29
- Open Issues: 6
- Releases: 15
Metadata Files
README.md
scikit-mol

Scikit-Learn classes for molecular vectorization using RDKit
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports and RDKit mol objects in the mol_list_train and mol_list_test lists:
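from rdkit import Chem
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from scikit_mol.fingerprints import MorganFingerprintTransformer
pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
>>> array([4.93858815])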
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.
Installation
Users can install the latest tagged release from PyPI with pip:
pip install scikit-mol
or from conda-forge:
conda install -c conda-forge scikit-mol
The conda-forge package should be updated shortly after each new tagged release on PyPI.
Bleeding edge (latest development version from GitHub):
pip install git+https://github.com/EBjerrum/scikit-mol.git
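Whichever installation route is used, it can be handy to confirm which release ended up in the active environment. The following is a small sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of scikit-mol:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "scikit-mol"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string of scikit-mol, or None if it is not installed.
print(installed_version())
```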
Documentation
Example notebooks and API documentation are now hosted on https://scikit-mol.readthedocs.io
- Basic Usage and fingerprint transformers
- Descriptor transformer
- Pipelining with Scikit-Learn classes
- Molecular standardization
- Sanitizing SMILES input
- Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer
- Using parallel execution to speed up descriptor and fingerprint calculations
- Using skopt for hyperparameter tuning
- Testing different fingerprints as part of the hyperparameter optimization
- Using pandas output for easy feature importance analysis and for combining pre-existing values with new computations
- Working with pipelines and estimators in safe inference mode for handling prediction on batches with invalid SMILES or molecules
- Creating custom fingerprint transformers
- Estimating applicability domain using feature based estimators
We have also published a software note on ChemRxiv: https://doi.org/10.26434/chemrxiv-2023-fzqwd
Other use-examples
Scikit-Mol has been featured in blog posts and used in research. Some examples are listed below:
- Useful ML package for cheminformatics iwatobipen.wordpress.com
- Boosted trees Datainlife_blog
- Konnektor: A Framework for Using Graph Theory to Plan Networks for Free Energy Calculations
- Moldrug algorithm for an automated ligand binding site exploration by 3D aware molecular enumerations
- RandomNets Improve Neural Network Regression Performance via Implicit Ensembling
- WAE-DTI: Ensemble-based architecture for drug–target interaction prediction using descriptors and embeddings
- Data Driven Estimation of Molecular Log-Likelihood using Fingerprint Key Counting
- AUTONOMOUS DRUG DISCOVERY
- DrugGym: A testbed for the economics of autonomous drug discovery
Roadmap and Contributing
Help wanted! Are you a PhD student who wants a "side-quest" to procrastinate on your thesis writing? Are you interested in computational chemistry or cheminformatics, or do you simply have an interest in QSAR modelling, Python programming, or open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well? With a little bit of help, this project can be improved much faster! Reach out to me (Esben) to discuss how we can proceed.
Currently, we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on, we need to go over the scikit-learn compatibility and adopt some of the newer features of their estimator classes. We are also working on some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.
There is more information about how to contribute to the project in CONTRIBUTING.
BUGS
There are probably still some bugs. Please check the issues on GitHub and report new ones there.
Contributors
Scikit-Mol has been developed as a community effort with contributions from people from many different companies, consortia, foundations and academic institutions.
Cheminformania Consulting, Aptuit, BASF, Bayer AG, Boehringer Ingelheim, Chodera Lab (MSKCC), EPAM Systems, ETH Zürich, Evotec, Johannes Gutenberg University, Martin Luther University, Odyssey Therapeutics, Open Molecular Software Foundation, Openfree.energy, Polish Academy of Sciences, Productivista, Simulations-Plus Inc., University of Vienna
- Esben Jannik Bjerrum @ebjerrum, esbenbjerrum+scikit_mol@gmail.com
- Carmen Esposito @cespos
- Son Ha @son-ha-264
- Oh-hyeon Choung @Ohyeon5
- Andreas Poehlmann @ap--
- Ya Chen @anya-chen
- Anton Siomchen @asiomchen
- Rafał Bachorz @rafalbachorz
- Adrien Chaton @adrienchaton
- @VincentAlexanderScholz
- @RiesBen
- @enricogandini
- @mikemhenry
- @c-feldmann
- Mieczyslaw Torchala @mieczyslaw
- Kyle Barbary @kbarbary
Owner
- Name: Esben Jannik Bjerrum
- Login: EBjerrum
- Kind: user
- Location: Sweden
- Company: Odyssey Therapeutics
- Website: www.odysseytx.com
- Twitter: chemitnerf
- Repositories: 6
- Profile: https://github.com/EBjerrum
https://www.cheminformania.com/about/esben-jannik-bjerrum/ https://www.linkedin.com/in/esbenbjerrum Mastodon @ChemITNerf@sigmoid.social
Citation (CITATION.bib)
@article{bjerrum_scikit-mol_2023,
title = {Scikit-{Mol} brings cheminformatics to {Scikit}-{Learn}},
author = {Bjerrum, Esben Jannik and Bachorz, Rafał Adam and Bitton, Adrien and Choung, Oh-hyeon and Chen, Ya and Esposito, Carmen and Ha, Son Viet and Poehlmann, Andreas},
year = {2023},
month = dec,
journal = {ChemRxiv},
url = {https://chemrxiv.org/engage/chemrxiv/article-details/60ef0fc58825826143a82cc0},
doi = {10.26434/chemrxiv-2023-fzqwd},
abstract = {Scikit-Mol is a open-source toolkit that aims to bridge the gap between two well-established toolkits, RDKit and Scikit-Learn, in order to provide a simple interface for building cheminformatics models. By leveraging the strengths of both RDKit and Scikit-Learn, Scikit-Mol provides a powerful platform for creating predictive modeling in drug discovery and materials design. Unlike other toolkits that often integrate both chemistry and machine learning, Scikit-Mol rather aims to be a simple bridge between the two, reducing the maintenance effort required to keep up with changes and new features in e.g. Scikit-Learn. A simple example of Scikit-Mol's functionality is provided, demonstrating its compatibility with Scikit-Learn pipelines. Overall, Scikit-Mol provides a useful and flexible package for building self-contained and self-documented cheminformatics models with minimal maintenance required.},
language = {en},
urldate = {2023-12-06},
keywords = {Cheminformatics, Descriptors, Fingerprints, Machine Learning, RDKit, Scikit-Learn},
note = {preprint}
}
GitHub Events
Total
- Create event: 18
- Release event: 7
- Issues event: 13
- Watch event: 85
- Delete event: 8
- Member event: 1
- Issue comment event: 68
- Push event: 32
- Pull request event: 37
- Pull request review comment event: 23
- Pull request review event: 31
- Fork event: 11
Last Year
- Create event: 18
- Release event: 7
- Issues event: 13
- Watch event: 85
- Delete event: 8
- Member event: 1
- Issue comment event: 68
- Push event: 32
- Pull request event: 37
- Pull request review comment event: 23
- Pull request review event: 31
- Fork event: 11
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Esben Jannik Bjerrum | e****k@r****m | 99 |
| Anton Siomchen | 4****n | 61 |
| EBjerrum | e****b@g****m | 58 |
| Enrico Gandini | e****3@g****m | 35 |
| Esben Bjerrm | e****n@o****m | 23 |
| Christian W. Feldmann | c****n@g****m | 13 |
| Esben Jannik Bjerrum | 1****m | 13 |
| riesben | b****s@o****m | 11 |
| Mieczyslaw Torchala | m****a@g****m | 8 |
| Rafal Bachorz | r****l@b****u | 7 |
| son-ha-264 | 7****4 | 4 |
| adrienchaton | 3****n | 4 |
| Andreas Poehlmann | a****s@p****o | 4 |
| Oh-hyeon Choung | i****i@g****m | 3 |
| Kyle Barbary | k****y@g****m | 2 |
| cespos | 3****s | 2 |
| Ekaterina (Katja) Bjerrum | 4****a | 2 |
| Rafal Bachorz | r****z@s****m | 1 |
| Mike Henry | 1****y | 1 |
| Ya Chen | y****n@u****t | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 25
- Total pull requests: 69
- Average time to close issues: 3 months
- Average time to close pull requests: 6 days
- Total issue authors: 9
- Total pull request authors: 17
- Average comments per issue: 2.08
- Average comments per pull request: 1.65
- Merged pull requests: 57
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 12
- Pull requests: 41
- Average time to close issues: 16 days
- Average time to close pull requests: 6 days
- Issue authors: 7
- Pull request authors: 9
- Average comments per issue: 2.92
- Average comments per pull request: 1.66
- Merged pull requests: 31
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- EBjerrum (15)
- asiomchen (2)
- mikemhenry (2)
- marcosfelt (1)
- UnixJunkie (1)
- atravitz (1)
- FDUguchunhui (1)
- RiesBen (1)
- enricogandini (1)
Pull Request Authors
- EBjerrum (23)
- asiomchen (16)
- saleha1wer (5)
- enricogandini (4)
- mieczyslaw (4)
- RiesBen (3)
- rafalbachorz (3)
- ap-- (2)
- son-ha-264 (2)
- Ohyeon5 (2)
- anya-chen (2)
- c-feldmann (2)
- kbarbary (2)
- Productivista (2)
- mikemhenry (2)
Packages
- Total packages: 1
- Total downloads: pypi 962 last month
- Total dependent packages: 1
- Total dependent repositories: 0
- Total versions: 24
- Total maintainers: 1
pypi.org: scikit-mol
scikit-learn classes for molecule transformation
- Homepage: https://github.com/EBjerrum/scikit-mol
- Documentation: https://scikit-mol.readthedocs.io/
- License: Apache Software License
- Latest release: 0.6.1 (published 9 months ago)
Dependencies
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- pypa/gh-action-pypi-publish release/v1 composite