curategpt

LLM-driven curation assist tool

https://github.com/monarch-initiative/curategpt

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org, zenodo.org
✓ Committers with academic emails: 1 of 8 committers (12.5%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (19.0%) to scientific vocabulary

Keywords

ai biocuration curation gpt llm monarchinitiative obofoundry ontogpt ontologies ontology-tools

Keywords from Contributors

chat-gpt data-modeling gpt-3 information-extraction language-models large-language-models linkml named-entity-recognition ner oaklib

Last synced: 6 months ago

Repository

LLM-driven curation assist tool

Basic Info

Host: GitHub
Owner: monarch-initiative
License: bsd-3-clause
Language: Jupyter Notebook
Default Branch: main
Homepage: https://monarch-initiative.github.io/curategpt/
Size: 12.3 MB

Statistics

Stars: 91
Watchers: 6
Forks: 15
Open Issues: 38
Releases: 10

Topics

ai biocuration curation gpt llm monarchinitiative obofoundry ontogpt ontologies ontology-tools

Created over 2 years ago · Last pushed 7 months ago

Metadata Files

Readme Contributing License Citation

CurateGPT

CurateGPT is a prototype web application and framework for performing general purpose AI-guided curation and curation-related operations over collections of objects.

See also the app on curategpt.io (note: this is sometimes down, and may only have a subset of the functionality of the local app)

Getting started

User installation

CurateGPT is available on PyPI and may be installed with pip:

pip install curategpt

Developer installation

You will first need to install Poetry.

Then clone this repo.

git clone https://github.com/monarch-initiative/curategpt.git
cd curategpt

and install the dependencies:

poetry install

API keys

In order to get the best performance from CurateGPT, we recommend getting an OpenAI API key, and setting it:

export OPENAI_API_KEY=<your key>

(for members of Monarch: ask on Slack if you would like to use the group key)

CurateGPT will also work with other large language models - see "Selecting models" below.
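
Before running commands that call OpenAI, a script can confirm that the key is visible to the process. The following is a minimal sketch using only the Python standard library; the helper name `openai_key_is_set` is hypothetical and is not part of CurateGPT.

```python
import os


def openai_key_is_set(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if openai_key_is_set():
    print("OPENAI_API_KEY is set")
else:
    print("OPENAI_API_KEY is missing; run: export OPENAI_API_KEY=<your key>")
```

The check only tests for presence; it does not validate the key against the OpenAI service.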

Loading example data and running the app

You initially start with an empty database. You can load whatever you like into this database! Any JSON, YAML, or CSV is accepted. CurateGPT comes with wrappers for some existing local and remote sources, including ontologies. The Makefile contains some examples of how to load these. You can load any ontology using the ont-<name> target, e.g.:

make ont-cl

This loads CL (via OAK) into a collection called ont_cl

Note that by default this loads into a collection set stored at stagedb, whereas the app works off of db. You can copy the collection set to the db with:

cp -r stagedb/* db/
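
On systems without `cp`, the same copy can be done from Python. This is a sketch rather than anything CurateGPT ships: the function name `copy_collection_set` and the file layout in the demonstration are assumptions, and unlike `cp -r stagedb/*` it also copies hidden files.

```python
import shutil
import tempfile
from pathlib import Path


def copy_collection_set(src, dst):
    """Copy everything under `src` into `dst`, creating `dst` if needed."""
    # dirs_exist_ok requires Python 3.8+; existing files in dst are overwritten.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return Path(dst)


# Demonstration in a throwaway directory with a hypothetical layout.
with tempfile.TemporaryDirectory() as tmp:
    stage = Path(tmp) / "stagedb"
    (stage / "ont_cl").mkdir(parents=True)
    (stage / "ont_cl" / "data.bin").write_bytes(b"example")

    db = copy_collection_set(stage, Path(tmp) / "db")
    copied = sorted(p.relative_to(db).as_posix() for p in db.rglob("*"))
    print(copied)
```

In a real checkout the call would be `copy_collection_set("stagedb", "db")`, run from the repository root.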

You can then run the streamlit app with:

make app

Building Indexes

CurateGPT depends on vector database indexes of the databases/ontologies you want to curate.

The flagship application is ontology curation, so to build an index for an OBO ontology like CL:

make ont-cl

This requires an OpenAI key.

(You can also build indexes using an open embedding model by modifying the command to leave off the -m option, but this is not recommended, as OpenAI embeddings currently seem to work best.)

To load the default ontologies:

make all

(this may take some time)

To load different databases:

make load-db-hpoa
make load-db-reactome

You can load an arbitrary json, yaml, or csv file:

curategpt view index -c my_foo foo.json

(you will need to do this in the poetry shell)
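
To try the command above without existing data, a small file can be generated first. The sketch below writes a JSON array of objects; the field names (`id`, `label`, `description`) and the assumption that a top-level array is a suitable layout are illustrative only, since the text above says only that any JSON is accepted.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: these field names are illustrative, not a required schema.
records = [
    {"id": "FOO:1", "label": "first example", "description": "An example object."},
    {"id": "FOO:2", "label": "second example", "description": "Another example object."},
]


def write_records(path, objects):
    """Write a list of dicts to `path` as a JSON array and return the path."""
    path = Path(path)
    path.write_text(json.dumps(objects, indent=2), encoding="utf-8")
    return path


# Written to a temporary directory here; use "foo.json" to match the command above.
out = write_records(Path(tempfile.mkdtemp()) / "foo.json", records)
loaded = json.loads(out.read_text(encoding="utf-8"))
print(len(loaded), "records written to", out)
```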

To load a GitHub repo of issues:

curategpt -v view index -c gh_uberon -m openai: --view github --init-with "{repo: obophenotype/uberon}"

The following are also supported:

Google Drives
Google Sheets
Markdown files
LinkML Schemas
HPOA files
GOCAMs
MAXOA files
Many more

Notebooks

See notebooks for examples.

Selecting models

Currently this tool works best with the OpenAI gpt-4 model (for instruction tasks) and OpenAI text-embedding-ada-002 for embedding.

CurateGPT is layered on top of simonw/llm which has a plugin architecture for using alternative models. In theory you can use any of these plugins.

Additionally, you can set up an openai-emulating proxy using litellm.

The litellm proxy may be installed with pip as pip install litellm[proxy].

Let's say you want to run mixtral locally using ollama. You start up ollama (you may have to run ollama serve first):

ollama run mixtral

Then start up litellm:

litellm -m ollama/mixtral

Next edit your extra-openai-models.yaml as detailed in the llm docs:

- model_name: ollama/mixtral
  model_id: litellm-mixtral
  api_base: "http://0.0.0.0:8000"

You can now use this:

curategpt ask -m litellm-mixtral -c ont_cl "What neurotransmitter is released by the hippocampus?"

But be warned that many of the prompts in curategpt were engineered against openai models, and they may give suboptimal results or fail entirely on other models. As an example, ask seems to work quite well with mixtral, but complete works horribly. We haven't yet investigated if the issue is the model or our prompts or the overall approach.

Welcome to the world of AI engineering!

Using the command line

curategpt --help

You will see various commands for working with indexes, searching, extracting, generating, etc.

These functions are generally available through the UI, and the current priority is documenting these.

Chatting with a knowledge base

curategpt ask -c ont_cl "What neurotransmitter is released by the hippocampus?"

may yield something like:

```
The hippocampus releases gamma-aminobutyric acid (GABA) as a neurotransmitter [1].

...

1

id: GammaAminobutyricAcidSecretion_neurotransmission
label: gamma-aminobutyric acid secretion, neurotransmission
definition: The regulated release of gamma-aminobutyric acid by a cell, in which the gamma-aminobutyric acid acts as a neurotransmitter.
...
```
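
To ask several questions in one go, the command can be driven from a script. The sketch below assumes `curategpt` is on the PATH and uses only the `ask` flags already shown in this document (`-c` and `-m`); the helper names `build_ask_command` and `run_ask` are hypothetical.

```python
import shlex
import subprocess


def build_ask_command(collection, question, model=None):
    """Build the argument list for `curategpt ask`."""
    argv = ["curategpt", "ask"]
    if model is not None:
        argv += ["-m", model]
    argv += ["-c", collection, question]
    return argv


def run_ask(collection, question, model=None):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_ask_command(collection, question, model),
        check=True,           # raise if curategpt exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout


questions = [
    "What neurotransmitter is released by the hippocampus?",
    "Which cell types are part of the hippocampus?",
]
for q in questions:
    # Print the shell-quoted command; call run_ask("ont_cl", q) to execute it.
    print(shlex.join(build_ask_command("ont_cl", q)))
```

Passing the arguments as a list avoids shell-quoting problems when a question contains quotes or other special characters.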

Chatting with pubmed

curategpt view ask -V pubmed "what neurons express VIP?"

Chatting with a GitHub issue tracker

curategpt ask -c gh_obi "what are some new term requests for electrophysiology terms?"

Term Autocompletion (DRAGON-AI)

curategpt complete -c ont_cl "mesenchymal stem cell of the apical papilla"

yields

id: MesenchymalStemCellOfTheApicalPapilla
definition: A mesenchymal cell that is part of the apical papilla of a tooth and has the ability to self-renew and differentiate into various cell types such as odontoblasts, fibroblasts, and osteoblasts.
relationships:
- predicate: PartOf
  target: ApicalPapilla
- predicate: subClassOf
  target: MesenchymalCell
- predicate: subClassOf
  target: StemCell
original_id: CL:0007045
label: mesenchymal stem cell of the apical papilla

All-by-all comparisons

You can compare all objects in one collection against all objects in another:

curategpt all-by-all --threshold 0.80 -c ont_hp -X ont_mp --ids-only -t csv > ~/tmp/allxall.mp.hp.csv

This takes 1-2s, as it involves comparison over pre-computed vectors. It reports top hits above a threshold.

Results may vary. You may want to try different texts for embeddings (the default is the entire json object; for ontologies it is concatenation of labels, definition, aliases).

sample:

HP:5200068,Socially innappropriate questioning,MP:0001361,social withdrawal,0.844015132437909
HP:5200069,Spinning,MP:0001411,spinning,0.9077306606290237
HP:5200071,Delayed Echolalia,MP:0013140,excessive vocalization,0.8153252835818089
HP:5200072,Immediate Echolalia,MP:0001410,head bobbing,0.8348177036912526
HP:5200073,Excessive cleaning,MP:0001412,excessive scratching,0.8699103725005582
HP:5200104,Abnormal play,MP:0020437,abnormal social play behavior,0.8984862078522344
HP:5200105,Reduced imaginative play skills,MP:0001402,decreased locomotor activity,0.85571629684631
HP:5200108,Nonfunctional or atypical use of objects in play,MP:0003908,decreased stereotypic behavior,0.8586700411012859
HP:5200129,Abnormal rituals,MP:0010698,abnormal impulsive behavior control,0.8727804272023427
HP:5200134,Jumping,MP:0001401,jumpy,0.9011393233129765
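
Output like the sample can be filtered further after the fact. The sketch below reads rows in that five-column layout with the standard `csv` module; the column names in `FIELDS` are assumptions (the sample has no header row), and the stricter cutoff of 0.9 is only an example.

```python
import csv
import io

# Three rows copied from the sample; a real run would open the CSV file instead.
SAMPLE = """\
HP:5200069,Spinning,MP:0001411,spinning,0.9077306606290237
HP:5200072,Immediate Echolalia,MP:0001410,head bobbing,0.8348177036912526
HP:5200134,Jumping,MP:0001401,jumpy,0.9011393233129765
"""

# Hypothetical column names for the headerless output.
FIELDS = ["left_id", "left_label", "right_id", "right_label", "score"]


def read_hits(handle, min_score=0.0):
    """Parse rows, keep those scoring at least `min_score`, best first."""
    rows = []
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["score"] = float(row["score"])
        if row["score"] >= min_score:
            rows.append(row)
    rows.sort(key=lambda r: r["score"], reverse=True)
    return rows


hits = read_hits(io.StringIO(SAMPLE), min_score=0.9)
for hit in hits:
    print(hit["left_id"], hit["right_id"], round(hit["score"], 3))
```

Sorting by score puts near-exact label matches such as Spinning/spinning first, which makes the weaker pairings (for example Immediate Echolalia/head bobbing) easy to set aside for manual review.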

Note that CurateGPT has a separate component for using an LLM to evaluate candidate matches (see also https://arxiv.org/abs/2310.03666); this is not enabled by default, as it would be expensive to run for a whole ontology.
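
The idea of comparing pre-computed vectors and reporting top hits above a threshold can be sketched in a few lines. This is an illustration under stated assumptions, not CurateGPT's implementation: cosine similarity is assumed as the metric, "top hit" is taken to mean the single best match per left-hand object, and the two-dimensional vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_hits(left, right, threshold):
    """For each id in `left`, report its best match in `right` if it meets `threshold`."""
    hits = []
    for left_id, left_vec in left.items():
        best_id, best_score = None, -1.0
        for right_id, right_vec in right.items():
            score = cosine(left_vec, right_vec)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            hits.append((left_id, best_id, best_score))
    return hits


# Toy embeddings standing in for two indexed collections.
left = {"A:1": [1.0, 0.0], "A:2": [0.0, 1.0]}
right = {"B:1": [0.9, 0.1], "B:2": [-1.0, 0.0]}

hits = top_hits(left, right, threshold=0.80)
for left_id, right_id, score in hits:
    print(left_id, right_id, round(score, 4))
```

Because the embeddings already exist, the work is pure arithmetic with no model calls, which is consistent with the short running time noted above.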

Owner

Name: Monarch Initiative
Login: monarch-initiative
Kind: organization
Location: Globally-distributed team (see https://monarchinitiative.org/page/team)

Website: https://github.com/monarch-initiative/monarch-app/blob/master/README.md#about-monarch
Repositories: 118
Profile: https://github.com/monarch-initiative

Cross-species disease discovery and diagnosis

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "O'Neil"
  given-names: "Shawn"
  orcid: "https://orcid.org/0000-0001-6220-7080"
- family-names: "Mungall"
  given-names: "Chris"
  orcid: "https://orcid.org/0000-0002-6601-2165"
title: "CurateGPT"
doi: 10.5281/zenodo.8388002
date-released: 2023-08-01
url: "https://github.com/monarch-initiative/curategpt"

GitHub Events

Total

Create event: 25
Release event: 3
Issues event: 35
Watch event: 34
Delete event: 17
Issue comment event: 64
Push event: 76
Pull request review comment event: 6
Pull request review event: 13
Pull request event: 44
Fork event: 3

Last Year

Create event: 25
Release event: 3
Issues event: 35
Watch event: 34
Delete event: 17
Issue comment event: 64
Push event: 76
Pull request review comment event: 6
Pull request review event: 13
Pull request event: 44
Fork event: 3

Committers

Last synced: 9 months ago

All Time

Total Commits: 320
Total Committers: 8
Avg Commits per committer: 40.0
Development Distribution Score (DDS): 0.672

Past Year

Commits: 259
Committers: 5
Avg Commits per committer: 51.8
Development Distribution Score (DDS): 0.614

Top Committers

Name	Email	Commits
caufieldjh	j**d@g**m	105
iQuxLE	c**5@g**m	96
Justin Reese	j**e@g**m	53
cmungall	c**m@b**g	51
Harshad Hegde	h**b@g**m	11
Mark A. Miller	M**M@l**v	2
realmarcin	4****n	1
Shawn T O'Neil	o**h@g**m	1

Committer Domains (Top 20 + Academic)

lbl.gov: 1
berkeleybop.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 52
Total pull requests: 70
Average time to close issues: 24 days
Average time to close pull requests: about 1 month
Total issue authors: 10
Total pull request authors: 5
Average comments per issue: 1.42
Average comments per pull request: 1.23
Merged pull requests: 57
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 30
Pull requests: 45
Average time to close issues: 13 days
Average time to close pull requests: 4 days
Issue authors: 8
Pull request authors: 4
Average comments per issue: 1.67
Average comments per pull request: 1.09
Merged pull requests: 39
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

caufieldjh (16)
justaddcoffee (5)
cmungall (2)
srobb1 (2)
laurasck14 (1)
leokim-l (1)
goodb (1)
idc9 (1)

Pull Request Authors

caufieldjh (34)
iQuxLE (26)
justaddcoffee (12)
cmungall (8)
turbomam (1)
