simulategpt

Large language models as universal biomedical simulators

https://github.com/openbiolink/simulategpt

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.0%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: OpenBioLink
  • License: mit
  • Language: R
  • Default Branch: main
  • Homepage:
  • Size: 2.35 MB
Statistics
  • Stars: 19
  • Watchers: 3
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago

README.md

Simulator

Computational simulation of biological processes can be a valuable tool in accelerating biomedical research, but usually requires a high level of domain knowledge and extensive manual adaptation. Recently, large language models (LLMs) such as GPT-4 have proven surprisingly successful in solving complex tasks across diverse fields by emulating human language generation at a very large scale. Here we explore the potential of leveraging LLMs as simulators of biological systems. We establish a proof-of-concept text-based simulator, SimulateGPT, that leverages LLM reasoning. We demonstrate good prediction performance across diverse biomedical use cases without explicit domain knowledge or manual tuning. Our results show that LLMs can be used as versatile and broadly applicable biological simulators.

Citation

If you find our work useful in your research, please cite:

Moritz Schaefer, Stephan Reichl, Rob ter Horst, Adele M. Nicolas, Thomas Krausgruber, Francesco Piras, Peter Stepper, Christoph Bock#, Matthias Samwald#. (2024). GPT-4 as a biomedical simulator. Computers in Biology and Medicine, 178, 108796. doi: 10.1016/j.compbiomed.2024.108796.

BioRxiv Preprint (2023)

Moritz Schaefer, Stephan Reichl, Rob ter Horst, Adele M. Nicolas, Thomas Krausgruber, Francesco Piras, Peter Stepper, Christoph Bock#, Matthias Samwald#. (2023). Large language models are universal biomedical simulators. bioRxiv. doi: 10.1101/2023.06.16.545235.

Repository structure

Folders:

  • systemmessages/: GPT-4 system prompts with simple descriptive names, e.g. "simulator4markdown"
  • experiments/: Protocols, code and results for executed (and planned/running) experiments. For details, see the Experiments subsection below.
    • <experimentname>/
      • main.md
      • code or (meta-)data files
      • prompts/
      • ...
      • ai_messages:

Experiments

Each experiment is kept in a separate folder containing:

  • main.md: Experiment documentation in Markdown (objective, method, results, conclusion), complementing the paper's methods section.
  • prompts/: User prompts for this experiment.
  • ai_messages/: (Chat)GPT-4-generated results. File name schema: <systemmessage>--
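
As an illustration of the layout above, the following minimal sketch inspects one experiment folder and reports whether main.md exists and which files sit in prompts/. The function name describe_experiment and the returned dictionary keys are hypothetical and are not part of the repository's src/utils.py.

```python
from pathlib import Path


def describe_experiment(experiment_dir: Path) -> dict:
    """Summarize one experiment folder (hypothetical helper, not from the repo).

    Assumes only what the README states: each experiment folder may contain
    a main.md file and a prompts/ subfolder.
    """
    prompts_dir = experiment_dir / "prompts"
    if prompts_dir.is_dir():
        # Sort for a stable, reproducible listing of prompt files.
        prompts = sorted(p.name for p in prompts_dir.iterdir() if p.is_file())
    else:
        prompts = []
    return {
        "name": experiment_dir.name,
        "has_main_md": (experiment_dir / "main.md").is_file(),
        "prompts": prompts,
    }
```

Calling describe_experiment(Path("experiments") / "<your_experiment_name>") would give a quick check that a new experiment follows the expected structure before running the pipeline.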

Using Snakemake to run experiments

Run snakemake -c1 -k --config experiment_name=<your_experiment_name>. Here, -c1 uses one core and -k keeps going with independent jobs if a job fails. To use the author's conda environment, add --use-conda.

The pipeline generates the files according to the schema indicated above.

Run all experiments

To run all experiments, call snakemake like so:

for experiment_name in $(ls experiments); do snakemake -c1 --config experiment_name=$experiment_name; done
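
For readers who prefer to drive the same loop from Python, here is a minimal sketch. The function names snakemake_commands and run_all are hypothetical; the snakemake flags are the ones used in the shell loop. Unlike the plain ls in the shell version, this sketch only considers subfolders of experiments/, which is an assumption about the intended behavior.

```python
import subprocess
from pathlib import Path


def snakemake_commands(experiments_root: str = "experiments") -> list:
    """Build one snakemake invocation per experiment subfolder (hypothetical helper)."""
    root = Path(experiments_root)
    # Assumption: every subfolder of experiments/ is one experiment.
    names = sorted(p.name for p in root.iterdir() if p.is_dir())
    return [
        ["snakemake", "-c1", "--config", f"experiment_name={name}"]
        for name in names
    ]


def run_all(experiments_root: str = "experiments") -> None:
    """Run the experiments one after another, like the shell loop."""
    for cmd in snakemake_commands(experiments_root):
        # check=False: a failing experiment does not stop the remaining ones,
        # which mirrors the behavior of the shell for-loop.
        subprocess.run(cmd, check=False)
```

Passing the arguments as a list avoids shell quoting problems if an experiment name contains spaces.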

Code files

src/utils.py

The top-level utils file provides 'everything you need' to run your prompts in an automated fashion. The functions are simple, documented and reflect the defined repository structure.

We streamlined our API access using snakemake.

Make sure to provide your private OpenAI API key as an argument (api_key), as an environment variable (OPENAI_API_KEY), or in the password store.
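
The lookup of the API key could work roughly as in the sketch below. This is not the code in src/utils.py: the function name resolve_api_key and the precedence (explicit argument first, then the environment variable) are assumptions, and the password store fallback is left out because the README does not say how it is queried.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the OpenAI API key from the argument or the environment.

    Hypothetical helper; the order of precedence is an assumption.
    """
    if api_key:
        # An explicitly passed key wins over everything else.
        return api_key
    env_key = os.environ.get("OPENAI_API_KEY")
    if env_key:
        return env_key
    # The password store mentioned in the README is not handled in this sketch.
    raise RuntimeError(
        "No API key found: pass api_key or set the OPENAI_API_KEY environment variable."
    )
```

Failing early with a clear message is a design choice here, so that a missing key is reported before any experiment starts.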

Notebook

The Simulator.ipynb notebook is configured to work within Colab, but will also work on a local installation.

Human/Input prompt guidelines

  • Provide a starting point for the simulation, e.g. a situation, an experimental setup, or a detailed/complex question to be answered by the simulation.
    • Optional: The starting point can include or imply a perturbation.
  • If you expect a final outcome, explicitly request it using the words 'final outcome'.
  • Optional: You can increase the novelty by adding: "Focus on more novelty."
  • The simulator can be used to ask detailed/complex questions about biology. The simulator has the potential to assess the question in more depth and provide more informed answers than the default ChatGPT.
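
The guidelines above can be turned into a small prompt-assembly helper, sketched below. The function build_prompt and its parameters are hypothetical, and the sentence "Provide the final outcome." is only one possible wording that contains the words 'final outcome'. The novelty sentence is taken verbatim from the guidelines.

```python
from typing import Optional


def build_prompt(
    starting_point: str,
    perturbation: Optional[str] = None,
    request_final_outcome: bool = False,
    more_novelty: bool = False,
) -> str:
    """Assemble a user prompt following the guidelines (hypothetical helper)."""
    parts = [starting_point.strip()]
    if perturbation:
        # Optional perturbation applied to the starting situation.
        parts.append(perturbation.strip())
    if request_final_outcome:
        # Assumed wording; the guideline only asks for the words 'final outcome'.
        parts.append("Provide the final outcome.")
    if more_novelty:
        # Sentence quoted from the guidelines.
        parts.append("Focus on more novelty.")
    return " ".join(parts)
```

For example, build_prompt("<your experimental setup>", request_final_outcome=True, more_novelty=True) returns the setup text followed by the final-outcome request and the novelty sentence.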

Owner

  • Name: OpenBioLink
  • Login: OpenBioLink
  • Kind: organization

Projects of the Samwald lab at the Institute of Artificial Intelligence, Vienna

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: SimulateGPT
message: GPT-4 as a biomedical simulator
type: software
authors:
  - given-names: Moritz
    family-names: Schaefer
    email: mschaefer@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0001-6489-1947'
  - given-names: Stephan
    family-names: Reichl
    email: sreichl@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0001-8555-7198'
  - given-names: Rob
    family-names: ter Horst
    email: rterhorst@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0003-0576-5873'
  - given-names: Adele M
    family-names: Nicolas
    email: anicolas@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0003-0784-7207'
  - given-names: Thomas
    family-names: Krausgruber
    email: tkrausgruber@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0002-1374-0329'
  - given-names: Francesco
    family-names: Piras
    email: fpiras@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0002-0938-6072'
  - given-names: Peter
    family-names: Stepper
    email: pstepper@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0003-1785-2405'
  - given-names: Christoph
    family-names: Bock
    email: cbock@cemm.oeaw.ac.at
    affiliation: CeMM Research Center for Molecular Medicine
    orcid: 'https://orcid.org/0000-0001-6091-3088'
  - given-names: Matthias
    family-names: Samwald
    email: matthias.samwald@meduniwien.ac.at
    affiliation: Medical University of Vienna
    orcid: 'https://orcid.org/0000-0002-4855-2571'
identifiers:
  - type: doi
    value: 10.1016/j.compbiomed.2024.108796
    description: Computers in Biology and Medicine Paper DOI
  - type: url
    value: 'https://doi.org/10.1016/j.compbiomed.2024.108796'
    description: Computers in Biology and Medicine Paper URL
  - type: doi
    value: 10.1101/2023.06.16.545235
    description: bioRxiv DOI
  - type: url
    value: >-
      https://www.biorxiv.org/content/10.1101/2023.06.16.545235v1
    description: bioRxiv URL
repository-code: 'https://github.com/OpenBioLink/SimulateGPT'
abstract: >-
  Background

  Computational simulation of biological processes can be a
  valuable tool for accelerating biomedical research, but
  usually requires extensive domain knowledge and manual
  adaptation. Large language models (LLMs) such as GPT-4
  have proven surprisingly successful for a wide range of
  tasks. This study provides proof-of-concept for the use of
  GPT-4 as a versatile simulator of biological systems.


  Methods

  We introduce SimulateGPT, a proof-of-concept for
  knowledge-driven simulation across levels of biological
  organization through structured prompting of GPT-4. We
  benchmarked our approach against direct GPT-4 inference in
  blinded qualitative evaluations by domain experts in four
  scenarios and in two quantitative scenarios with
  experimental ground truth. The qualitative scenarios
  included mouse experiments with known outcomes and
  treatment decision support in sepsis. The quantitative
  scenarios included prediction of gene essentiality in
  cancer cells and progression-free survival in cancer
  patients.


  Results

  In qualitative experiments, biomedical scientists rated
  SimulateGPT's predictions favorably over direct GPT-4
  inference. In quantitative experiments, SimulateGPT
  substantially improved classification accuracy for
  predicting the essentiality of individual genes and
  increased correlation coefficients and precision in the
  regression task of predicting progression-free survival.


  Conclusion

  This proof-of-concept study suggests that LLMs may enable
  a new class of biomedical simulators. Such text-based
  simulations appear well suited for modeling and
  understanding complex living systems that are difficult to
  describe with physics-based first-principles simulations,
  but for which extensive knowledge is available as written
  text. Finally, we propose several directions for further
  development of LLM-based biomedical simulators, including
  augmentation through web search retrieval, integrated
  mathematical modeling, and fine-tuning on experimental
  data.
keywords:
  - Biomedicine
  - Simulation
  - Large Language Models
  - Computational Biology
  - Artificial intelligence
license: MIT

GitHub Events

Total
  • Watch event: 7
Last Year
  • Watch event: 7