scikit-mol

scikit-learn classes for molecular vectorization using RDKit

https://github.com/ebjerrum/scikit-mol

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, sciencedirect.com, acs.org
  • Committers with academic emails
    1 of 20 committers (5.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

scikit-learn classes for molecular vectorization using RDKit

Basic Info
  • Host: GitHub
  • Owner: EBjerrum
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 9.22 MB
Statistics
  • Stars: 193
  • Watchers: 3
  • Forks: 29
  • Open Issues: 6
  • Releases: 15
Created over 3 years ago · Last pushed 7 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

scikit-mol

Scikit-Learn classes for molecular vectorization using RDKit

The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.

As an example, with the needed scikit-learn and scikit-mol imports and RDKit mol objects in the mol_list_train and mol_list_test lists:

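from rdkit import Chem
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from scikit_mol.fingerprints import MorganFingerprintTransformer

pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)  # mol_list_train: list of RDKit mol objects
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])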

>>> array([4.93858815])

The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
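As a rough illustration of that tuning workflow, the sketch below wraps the same two-step pipeline in scikit-learn's GridSearchCV. It assumes that MorganFingerprintTransformer exposes a `radius` parameter and that the step names match the example above; the candidate values and the `build_search` helper are hypothetical choices made only for this illustration.

```python
# Parameter grid keyed by "<step name>__<parameter>", the scikit-learn
# convention for addressing parameters of steps inside a Pipeline.
# The candidate values are illustrative assumptions, not recommendations.
PARAM_GRID = {
    "mol_transformer__radius": [1, 2, 3],  # Morgan fingerprint radius
    "Regressor__alpha": [0.1, 1.0, 10.0],  # Ridge regularization strength
}


def build_search(cv=3):
    """Return an unfitted GridSearchCV over the fingerprint and the regressor."""
    # Imported inside the function so the grid above can be inspected
    # even where RDKit, scikit-learn and scikit-mol are not installed.
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from scikit_mol.fingerprints import MorganFingerprintTransformer

    pipe = Pipeline([
        ("mol_transformer", MorganFingerprintTransformer()),
        ("Regressor", Ridge()),
    ])
    return GridSearchCV(pipe, PARAM_GRID, cv=cv)
```

With the training data from the example above, `build_search().fit(mol_list_train, y_train)` would run the search, after which `best_params_` on the fitted object reports the selected fingerprint radius together with the selected Ridge alpha.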

The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.

Installation

Users can install the latest tagged release with pip:

pip install scikit-mol

or from conda-forge:

conda install -c conda-forge scikit-mol

The conda-forge package should be updated shortly after a new tagged release appears on PyPI.

Bleeding edge

pip install git+https://github.com/EBjerrum/scikit-mol.git
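After any of the install commands above, a quick check can confirm which version ended up in the active environment. This is a minimal sketch that relies only on the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical name introduced here, not part of scikit-mol.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "scikit-mol" is the distribution name used in the pip and conda commands.
found = installed_version("scikit-mol")
print(f"scikit-mol: {found or 'not installed'}")
```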

Documentation

Example notebooks and API documentation are now hosted on https://scikit-mol.readthedocs.io

We have also published a software note on ChemRxiv: https://doi.org/10.26434/chemrxiv-2023-fzqwd

Other use-examples

Scikit-Mol has been featured in blog posts and used in research; some examples are listed below:

Roadmap and Contributing

Help wanted! Are you a PhD student who wants a "side quest" to procrastinate on your thesis writing, or are you interested in computational chemistry, cheminformatics, or simply QSAR modelling, Python programming and open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well? With a little bit of help, this project can be improved much faster! Reach out to me (Esben) to discuss how we can proceed.

Currently, we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on, we need to go over the scikit-learn compatibility and adopt some of the newer features of their estimator classes. We are also working on some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.

There is more information about how to contribute to the project in CONTRIBUTING.

BUGS

There are probably still some bugs. Please check the issues on GitHub and report new ones there.

Contributors

Scikit-Mol has been developed as a community effort with contributions from people from many different companies, consortia, foundations and academic institutions.

Cheminformania Consulting, Aptuit, BASF, Bayer AG, Boehringer Ingelheim, Chodera Lab (MSKCC), EPAM Systems, ETH Zürich, Evotec, Johannes Gutenberg University, Martin Luther University, Odyssey Therapeutics, Open Molecular Software Foundation, Openfree.energy, Polish Academy of Sciences, Productivista, Simulations-Plus Inc., University of Vienna

Owner

  • Name: Esben Jannik Bjerrum
  • Login: EBjerrum
  • Kind: user
  • Location: Sweden
  • Company: Odyssey Therapeutics

https://www.cheminformania.com/about/esben-jannik-bjerrum/ https://www.linkedin.com/in/esbenbjerrum Mastodon @ChemITNerf@sigmoid.social

Citation (CITATION.bib)

@article{bjerrum_scikit-mol_2023,
	title = {Scikit-{Mol} brings cheminformatics to {Scikit}-{Learn}},
	author = {Bjerrum, Esben Jannik and Bachorz, Rafał Adam and Bitton, Adrien and Choung, Oh-hyeon and Chen, Ya and Esposito, Carmen and Ha, Son Viet and Poehlmann, Andreas},
	year = {2023},
	month = dec,
	journal = {ChemRxiv},
	url = {https://chemrxiv.org/engage/chemrxiv/article-details/60ef0fc58825826143a82cc0},
	doi = {10.26434/chemrxiv-2023-fzqwd},
	abstract = {Scikit-Mol is an open-source toolkit that aims to bridge the gap between two well-established toolkits, RDKit and Scikit-Learn, in order to provide a simple interface for building cheminformatics models. By leveraging the strengths of both RDKit and Scikit-Learn, Scikit-Mol provides a powerful platform for creating predictive modeling in drug discovery and materials design. Unlike other toolkits that often integrate both chemistry and machine learning, Scikit-Mol rather aims to be a simple bridge between the two, reducing the maintenance effort required to keep up with changes and new features in e.g. Scikit-Learn. A simple example of Scikit-Mol's functionality is provided, demonstrating its compatibility with Scikit-Learn pipelines. Overall, Scikit-Mol provides a useful and flexible package for building self-contained and self-documented cheminformatics models with minimal maintenance required.},
	language = {en},
	urldate = {2023-12-06},
	keywords = {Cheminformatics, Descriptors, Fingerprints, Machine Learning, RDKit, Scikit-Learn},
	note = {preprint}
}

GitHub Events

Total
  • Create event: 18
  • Release event: 7
  • Issues event: 13
  • Watch event: 85
  • Delete event: 8
  • Member event: 1
  • Issue comment event: 68
  • Push event: 32
  • Pull request event: 37
  • Pull request review comment event: 23
  • Pull request review event: 31
  • Fork event: 11
Last Year
  • Create event: 18
  • Release event: 7
  • Issues event: 13
  • Watch event: 85
  • Delete event: 8
  • Member event: 1
  • Issue comment event: 68
  • Push event: 32
  • Pull request event: 37
  • Pull request review comment event: 23
  • Pull request review event: 31
  • Fork event: 11

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 352
  • Total Committers: 20
  • Avg Commits per committer: 17.6
  • Development Distribution Score (DDS): 0.719
Past Year
  • Commits: 181
  • Committers: 9
  • Avg Commits per committer: 20.111
  • Development Distribution Score (DDS): 0.508
Top Committers
Name | Email | Commits
Esben Jannik Bjerrum | e****k@r****m | 99
Anton Siomchen | 4****n | 61
EBjerrum | e****b@g****m | 58
Enrico Gandini | e****3@g****m | 35
Esben Bjerrm | e****n@o****m | 23
Christian W. Feldmann | c****n@g****m | 13
Esben Jannik Bjerrum | 1****m | 13
riesben | b****s@o****m | 11
Mieczyslaw Torchala | m****a@g****m | 8
Rafal Bachorz | r****l@b****u | 7
son-ha-264 | 7****4 | 4
adrienchaton | 3****n | 4
Andreas Poehlmann | a****s@p****o | 4
Oh-hyeon Choung | i****i@g****m | 3
Kyle Barbary | k****y@g****m | 2
cespos | 3****s | 2
Ekaterina (Katja) Bjerrum | 4****a | 2
Rafal Bachorz | r****z@s****m | 1
Mike Henry | 1****y | 1
Ya Chen | y****n@u****t | 1
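The committer averages above follow directly from the listed totals, and the all-time Development Distribution Score can be reproduced from the top committer's count. The sketch below assumes that DDS is defined as one minus the most active committer's share of all commits; that definition is not stated on this page, but it matches the listed all-time value.

```python
# Figures copied from the "All Time" and "Past Year" committer statistics.
total_commits = 352
total_committers = 20
top_committer_commits = 99  # first row of the Top Committers table

past_year_commits = 181
past_year_committers = 9

# Average commits per committer is the total divided by the head count.
avg_all_time = total_commits / total_committers            # listed as 17.6
avg_past_year = past_year_commits / past_year_committers   # listed as 20.111

# ASSUMPTION: DDS = 1 - (commits by the most active committer / total commits).
dds_all_time = 1 - top_committer_commits / total_commits   # listed as 0.719

print(avg_all_time, avg_past_year, dds_all_time)
```

The past-year score cannot be checked the same way, because the page does not list per-committer counts for the past year.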

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 25
  • Total pull requests: 69
  • Average time to close issues: 3 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 9
  • Total pull request authors: 17
  • Average comments per issue: 2.08
  • Average comments per pull request: 1.65
  • Merged pull requests: 57
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 12
  • Pull requests: 41
  • Average time to close issues: 16 days
  • Average time to close pull requests: 6 days
  • Issue authors: 7
  • Pull request authors: 9
  • Average comments per issue: 2.92
  • Average comments per pull request: 1.66
  • Merged pull requests: 31
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • EBjerrum (15)
  • asiomchen (2)
  • mikemhenry (2)
  • marcosfelt (1)
  • UnixJunkie (1)
  • atravitz (1)
  • FDUguchunhui (1)
  • RiesBen (1)
  • enricogandini (1)
Pull Request Authors
  • EBjerrum (23)
  • asiomchen (16)
  • saleha1wer (5)
  • enricogandini (4)
  • mieczyslaw (4)
  • RiesBen (3)
  • rafalbachorz (3)
  • ap-- (2)
  • son-ha-264 (2)
  • Ohyeon5 (2)
  • anya-chen (2)
  • c-feldmann (2)
  • kbarbary (2)
  • Productivista (2)
  • mikemhenry (2)
Top Labels
Issue Labels
question (1)
Pull Request Labels
documentation (5) enhancement (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 962 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 24
  • Total maintainers: 1
pypi.org: scikit-mol

scikit-learn classes for molecule transformation

  • Versions: 24
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 962 Last month
Rankings
Dependent packages count: 6.6%
Forks count: 12.2%
Stargazers count: 12.9%
Downloads: 15.5%
Average: 15.6%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 7 months ago

Dependencies

pyproject.toml pypi
.github/workflows/run_pytests.yaml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite