chempfn

Ensemble-based, size-agnostic wrapper for the TabPFN classifier

https://github.com/ersilia-os/chempfn

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary

Keywords

classification ensemble machine-learning tabpfn
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ersilia-os
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 41 MB
Statistics
  • Stars: 31
  • Watchers: 2
  • Forks: 0
  • Open Issues: 14
  • Releases: 1
Created about 3 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

ChemPFN

TabPFN is a transformer architecture proposed by Hollmann et al. for classification on small tabular datasets. It is a Prior-Data Fitted Network (PFN) that has been trained once and does not require fine-tuning for new datasets.

TabPFN works by approximating the distribution of new data using the synthetic prior data it saw during training. In a machine learning pipeline, the network can be "fit" on a training dataset in under a second and generates predictions for the test set in a single forward pass.

With ChemPFN, we address some of the limitations of the original TabPFN model and extend it to work with chemical datasets using Ersilia Compound Embeddings. Using data and feature subsampling strategies, ChemPFN bypasses the limits of 1000 rows and 100 features inherent in TabPFN. It is fully compatible with the scikit-learn API and can be used in a modeling pipeline like any scikit-learn estimator.
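To make the subsampling idea concrete, here is a minimal sketch. It is not ChemPFN's actual implementation: it assumes uniform random sampling without replacement, and the function name `make_subsamples` and its parameters (`n_samples`, `max_rows`, `max_features`, `seed`) are hypothetical. It only illustrates how a dataset of arbitrary size can be cut into pieces that respect the 1000-row and 100-feature limits.

```python
import random


def make_subsamples(n_rows, n_features, n_samples=10,
                    max_rows=1000, max_features=100, seed=0):
    """Return (row_indices, feature_indices) pairs that each fit the limits.

    Hypothetical helper: uniform sampling without replacement is an
    assumption, not necessarily the strategy ChemPFN uses.
    """
    rng = random.Random(seed)
    subsamples = []
    for _ in range(n_samples):
        # When the data already fits a limit, every row/feature is kept.
        rows = sorted(rng.sample(range(n_rows), min(n_rows, max_rows)))
        cols = sorted(rng.sample(range(n_features), min(n_features, max_features)))
        subsamples.append((rows, cols))
    return subsamples


# A dataset with 5000 rows and 300 features exceeds both limits,
# but each subsample below stays within them.
for rows, cols in make_subsamples(n_rows=5000, n_features=300, n_samples=3):
    print(len(rows), len(cols))  # prints "1000 100" three times
```

Each index pair would then select one slice of the training data for a single TabPFN fit. More subsamples cover the data more thoroughly at the cost of more forward passes.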

When fit, ChemPFN creates ensembles of data points and input dimensions, if required. During the predict stage, it creates an ensemble of TabPFN models fit on the training set to generate predictions for the test set. These intermediate ensemble results are then aggregated to produce the final prediction. With this approach, the model fits in under a second; however, prediction can be slow depending on the configuration (see below) or the underlying hardware.
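The aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than ChemPFN's code: it assumes each ensemble member returns class probabilities for every test row, and that members are combined by a plain average followed by an argmax. The function name `aggregate_predictions` is hypothetical.

```python
def aggregate_predictions(member_probas):
    """Average class probabilities across ensemble members.

    member_probas: list with one entry per ensemble member; each entry is a
    list of rows, and each row is a list of class probabilities.
    Returns (mean_probas, predicted_labels).
    """
    n_members = len(member_probas)
    n_rows = len(member_probas[0])
    n_classes = len(member_probas[0][0])

    mean_probas = []
    for i in range(n_rows):
        row = [sum(member[i][c] for member in member_probas) / n_members
               for c in range(n_classes)]
        mean_probas.append(row)

    # The predicted label is the class with the highest mean probability.
    labels = [max(range(n_classes), key=row.__getitem__) for row in mean_probas]
    return mean_probas, labels


# Three hypothetical ensemble members scoring two test rows on a binary task.
member_probas = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.8, 0.2], [0.6, 0.4]],
]
mean_probas, labels = aggregate_predictions(member_probas)
print(labels)  # prints [0, 1]
```

Averaging probabilities instead of taking a majority vote over hard labels keeps each member's confidence in the final score. Other aggregation rules are possible, and the one ChemPFN uses may differ.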

This model can be used directly with SMILES data without the need for prior featurization. Additionally, we provide a utility to explore this model on antimicrobial datasets from ChEMBL.

Installation

```bash
git clone https://github.com/ersilia-os/chempfn.git
cd chempfn
pip install .
```

Usage

By default, ChemPFN generates 100 data samples of size 1000 each to work with TabPFN. This number can be lowered (for example, max_iters=10) to speed up prediction.

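```python
from chempfn import ChemPFN
from sklearn.metrics import accuracy_score

# X_train, y_train, X_test and y_test are assumed to be defined already.
clf = ChemPFN(max_iters=100)
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)  # predict from the test features, not the labels
acc = accuracy_score(y_test, y_hat)
```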

Explore Antimicrobial Datasets

We provide a utility class to retrieve preprocessed antimicrobial datasets. The pathogens currently supported are listed below. For each pathogen, the user can select a confidence level (hc or lc) for obtaining the assay activity.

  • Acinetobacter baumannii
  • Campylobacter spp.
  • Enterococcus faecium
  • Enterobacter spp.
  • Escherichia coli
  • Helicobacter pylori
  • Klebsiella pneumoniae
  • Mycobacterium tuberculosis
  • Neisseria gonorrhoeae
  • Plasmodium spp.
  • Pseudomonas aeruginosa
  • Schistosoma mansoni
  • Staphylococcus aureus
  • Streptococcus pneumoniae

```python
# Import the dataset loader
from chempfn.utils import AntiMicrobialsDatasetLoader

dataset_loader = AntiMicrobialsDatasetLoader()
df = dataset_loader.load('ecoli', 'hc')
```

Citation

If you use this package, please cite the original authors of the model and this package.

License

This package is licensed under the GPL-3.0 license.

Owner

  • Name: Ersilia Open Source Initiative
  • Login: ersilia-os
  • Kind: organization
  • Email: hello@ersilia.io
  • Location: United Kingdom

Ersilia is a charity developing open source tools to facilitate global health drug discovery, with a focus on neglected diseases, for equal healthcare.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Arora"
  given-names: "Dhanshree"
- family-names: "Turon"
  given-names: "Gemma"
  orcid: "https://orcid.org/0000-0001-6798-0275"
- family-names: "Duran-Frigola"
  given-names: "Miquel"
  orcid: "https://orcid.org/0000-0002-9906-6936"
title: "Scalable TabPFN with ensemble learning for classification tasks"
version: 0.1.1
doi: 10.5281/zenodo.7690900
date-released: 2023-03-01
url: "https://github.com/ersilia-os/chempfn"

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Dependencies

pyproject.toml pypi
  • lolP 0.0.4
  • python ^3.8.1
  • tabpfn 0.1.8
Dockerfile docker
  • python 3.10.7-bullseye build
setup.py pypi