chempfn
Ensemble-based, size-agnostic wrapper for the TabPFN classifier
Statistics
- Stars: 31
- Watchers: 2
- Forks: 0
- Open Issues: 14
- Releases: 1
Metadata Files
README.md
ChemPFN
TabPFN is a transformer architecture proposed by Hollmann et al. for classification on small tabular datasets. It is a Prior-Data Fitted Network (PFN) that is trained once and does not require fine-tuning for new datasets.
TabPFN works by approximating Bayesian inference over the synthetic prior data it saw during training, so it can handle a new dataset without further gradient updates. In a machine learning pipeline, the network can be "fit" on a training dataset in under a second and generates predictions for the test set in a single forward pass.
With ChemPFN, we address some of the limitations of the original TabPFN model and extend it to work with chemical datasets using Ersilia Compound Embeddings. Using row and feature subsampling, ChemPFN works around TabPFN's limits of 1000 rows and 100 features. It is fully compatible with the scikit-learn API and can be used in a modeling pipeline like any scikit-learn estimator.
When fit, ChemPFN draws subsamples of the data points and, if required, of the input dimensions. At predict time, it builds an ensemble of TabPFN models, each fit on a training subsample, to generate predictions for the test set. These intermediate ensemble results are then aggregated to produce the final prediction. With this approach, fitting takes under a second, but prediction can be slow depending on the configuration (see below) and the underlying hardware.
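The subsampling and aggregation described above can be sketched in plain Python. This is a minimal illustration, not ChemPFN's actual implementation: the names `draw_subsamples` and `aggregate` are hypothetical, and averaging class probabilities is an assumed aggregation rule. Only the 1000-row and 100-feature limits come from the text.

```python
import random
from statistics import mean

MAX_ROWS = 1000      # TabPFN row limit stated in this README
MAX_FEATURES = 100   # TabPFN feature limit stated in this README


def draw_subsamples(n_rows, n_features, n_iters, seed=0):
    """Return n_iters (row_indices, feature_indices) pairs within the limits.

    Hypothetical helper: rows and features are drawn uniformly without
    replacement, which is an assumption about the sampling scheme.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_iters):
        rows = sorted(rng.sample(range(n_rows), min(n_rows, MAX_ROWS)))
        feats = sorted(rng.sample(range(n_features), min(n_features, MAX_FEATURES)))
        samples.append((rows, feats))
    return samples


def aggregate(member_probabilities):
    """Average per-class probabilities over ensemble members (assumed rule).

    Returns the index of the winning class and the averaged probabilities.
    """
    n_classes = len(member_probabilities[0])
    averaged = [mean(p[c] for p in member_probabilities) for c in range(n_classes)]
    best_class = max(range(n_classes), key=lambda c: averaged[c])
    return best_class, averaged
```

Each `(rows, feats)` pair would select the slice of the training set handed to one ensemble member, and `aggregate` would combine the members' outputs for a single test row.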
This model can be used directly with SMILES data without prior featurization. Additionally, we provide a utility to explore this model on antimicrobial datasets from ChEMBL.
Installation
```bash
git clone https://github.com/ersilia-os/chempfn.git
cd chempfn
pip install .
```
Usage
By default, ChemPFN generates 100 data samples of 1000 rows each to work with TabPFN. This can be set to a lower number (for example, max_iters=10) to speed up prediction.
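```python
from chempfn import ChemPFN
from sklearn.metrics import accuracy_score

# X_train, y_train, X_test and y_test are assumed to be prepared beforehand.
clf = ChemPFN(max_iters=100)
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)  # predict from the test features, not the labels
acc = accuracy_score(y_test, y_hat)
```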
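To get a feel for the trade-off when lowering max_iters, the sketch below estimates how likely a given training row is to appear in at least one subsample. It assumes each subsample draws 1000 rows uniformly without replacement and independently of the other subsamples; that is an assumption about the sampling scheme, and `row_coverage` is a hypothetical helper, not part of the ChemPFN API.

```python
def row_coverage(n_rows, max_iters, sample_size=1000):
    """Probability that a given training row lands in at least one subsample.

    Assumes uniform sampling without replacement within a subsample and
    independence across subsamples (an illustrative assumption).
    """
    if n_rows <= sample_size:
        return 1.0  # the whole dataset fits in a single subsample
    p_miss_once = 1.0 - sample_size / n_rows
    return 1.0 - p_miss_once ** max_iters


if __name__ == "__main__":
    # Compare a fast configuration with the default on a 20,000-row dataset.
    for iters in (10, 100):
        print(iters, round(row_coverage(20_000, iters), 3))
```

Under these assumptions, a 20,000-row training set gives each row roughly a 40% chance of being used with max_iters=10 and above 99% with max_iters=100, so fewer iterations trade coverage of the training data for speed.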
Explore Antimicrobial Datasets
We provide a utility class to retrieve preprocessed antimicrobial datasets. The pathogens currently supported are listed below. For each pathogen, the user can select a confidence level (hc or lc) for the assay activity.
- Acinetobacter baumannii
- Campylobacter spp.
- Enterococcus faecium
- Enterobacter spp.
- Escherichia coli
- Helicobacter pylori
- Klebsiella pneumoniae
- Mycobacterium tuberculosis
- Neisseria gonorrhoeae
- Plasmodium spp.
- Pseudomonas aeruginosa
- Schistosoma mansoni
- Staphylococcus aureus
- Streptococcus pneumoniae
```python
# Import the dataset loader
from chempfn.utils import AntiMicrobialsDatasetLoader

# Load the E. coli dataset at the "hc" confidence level
dataset_loader = AntiMicrobialsDatasetLoader()
df = dataset_loader.load('ecoli', 'hc')
```
Citation
If you use this package, please cite the original authors of the model and this package.
License
This package is licensed under the GPL-3.0 license.
Owner
- Name: Ersilia Open Source Initiative
- Login: ersilia-os
- Kind: organization
- Email: hello@ersilia.io
- Location: United Kingdom
- Website: ersilia.io
- Twitter: ersiliaio
- Repositories: 64
- Profile: https://github.com/ersilia-os
Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases, for equal healthcare.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Arora"
    given-names: "Dhanshree"
  - family-names: "Turon"
    given-names: "Gemma"
    orcid: "https://orcid.org/0000-0001-6798-0275"
  - family-names: "Duran-Frigola"
    given-names: "Miquel"
    orcid: "https://orcid.org/0000-0002-9906-6936"
title: "Scalable TabPFN with ensemble learning for classification tasks"
version: 0.1.1
doi: 10.5281/zenodo.7690900
date-released: 2023-03-01
url: "https://github.com/ersilia-os/chempfn"
GitHub Events
Total
- Watch event: 4
Last Year
- Watch event: 4
Dependencies
- lolP 0.0.4
- python ^3.8.1
- tabpfn 0.1.8
- python 3.10.7-bullseye build