opi

This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.

https://github.com/baaihealth/opi

Keywords

ai4science foundation-model instruction-tuning llm protein

Repository

Basic Info

Host: GitHub
Owner: baaihealth
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 52.2 MB

Statistics

Stars: 6
Watchers: 3
Forks: 0
Open Issues: 1
Releases: 0

Created about 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

README.md

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE) [![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE) [![Weight Diff License](https://img.shields.io/badge/Weight%20Diff%20License-CC%20By%20NC%204.0-yellow)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/WEIGHT_DIFF_LICENSE)

# OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Vision

Open Protein Instructions (OPI) is the first part of the Open Biology Instructions (OBI) project, to be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of large language models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.

Paper

The paper "OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks" has been accepted at the NeurIPS 2024 workshop "Foundation Models for Science: Progress, Opportunities, and Challenges".

Hugging Face links to OPI dataset and OPI-tuned models

OPI Dataset
OPI-Llama-3.1-8B-Instruct
OPI-Galactica-6.7B

[x] Project Overview
[x] OPI dataset construction pipeline
[x] OPI dataset overview
[x] OPEval: Nine evaluation tasks using the OPI dataset
[x] Instruction tuning with OPI training data
[x] Evaluating with OPI testing data
[x] Evaluation results
[x] Prediction comparison with SOTA models
[x] Demo
[x] Acknowledgement
[x] Contact Information

Project Overview

This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.

Usage and license notices: Galactica is intended and licensed for research use only. Llama-3 is licensed for researchers and commercial entities, upholding the principles of openness. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also CC BY NC 4.0 (allowing only non-commercial use).

OPI dataset construction pipeline

We curated the OPI dataset ourselves by extracting key information from the Swiss-Prot database. The following figure shows the overall construction process of OPI.

An example of OPI training data:

instruction: What is the EC classification of the input protein sequence based on its biological function?
input: MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT ATERQYELQP
output: 2.7.10.2

An example of OPI testing data:

{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction": "Return the EC number of the protein sequence.", "instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
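The testing record above nests its input/output pairs under an `instances` list. As a minimal sketch of reading such records, assuming one JSON object per line (as the `.jsonl` extension suggests), the helper below flattens them into triples. The function name `iter_test_examples` and the shortened placeholder sequence are illustrative, not part of the repository.

```python
import json


def iter_test_examples(lines):
    """Yield (instruction, input, output) triples from OPI-style test records.

    `lines` is any iterable of JSON strings, one record per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for instance in record["instances"]:
            yield record["instruction"], instance["input"], instance["output"]


# Hypothetical record: same keys as the example above, but the sequence
# is a short placeholder rather than real data.
sample_line = json.dumps({
    "id": "seed_task_0",
    "name": "EC number of price dataset from CLEAN",
    "instruction": "Return the EC number of the protein sequence.",
    "instances": [{"input": "MAIPPYPDF", "output": "5.3.1.7"}],
    "is_classification": False,
})

examples = list(iter_test_examples([sample_line]))
print(examples[0][2])  # 5.3.1.7
```

To read a real file, an open file handle can be passed as `lines`, since iterating over it yields one line at a time.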

OPI dataset overview

We are excited to announce the release of the OPI dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in the field of protein biology. We welcome contributions and enhancements to this dataset from the community. The OPI dataset contains 1.64M samples, split into training (1,615,661) and testing (26,607) sets.

Accessing the OPI dataset: The complete OPI dataset can be accessed from Hugging Face. It is organized into three subfolders (AP, KM, and SU) in the OPI_DATA directory, plus the full dataset file OPIfull1.61M_train.json. Once downloaded, place all the subfolders and data files in the OPI_DATA folder within the repository. To merge all or several of the per-task training data files into a single training data file, run:

```
cd OPI_DATA
python mergetasktraindata.py --output OPImerged_train.json
```
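For readers who want to see what merging per-task training files might involve, here is a minimal sketch. It is not the repository's merge script, whose behavior is not shown here. It assumes each `*_train.json` file holds a JSON array of records, and the name `merge_train_files` is hypothetical.

```python
import json


def merge_train_files(paths, output_path):
    """Concatenate several training files into one JSON array.

    Assumption: every input file contains a JSON array of records.
    Returns the number of records written.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(records)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```

Records keep the order of `paths`, so shuffling, if wanted, is left to the training pipeline.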

OPI Dataset folder structure:

./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
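Because the layout follows a regular `CATEGORY/TASK/{train,test}/file` pattern, the files can be indexed with a short script. The sketch below assumes exactly that four-level layout. The function name `index_opi_data` is illustrative and not part of the repository.

```python
from pathlib import Path


def index_opi_data(root):
    """Map 'CATEGORY/TASK' to its sorted train and test file names.

    Assumes the layout root/CATEGORY/TASK/{train,test}/<file>.json[l].
    """
    root = Path(root)
    index = {}
    # "*.json*" matches both .json (train) and .jsonl (test) files.
    for path in sorted(root.glob("*/*/*/*.json*")):
        category, task, split = path.relative_to(root).parts[:3]
        if split not in ("train", "test"):
            continue  # ignore anything outside the expected layout
        entry = index.setdefault(f"{category}/{task}", {"train": [], "test": []})
        entry[split].append(path.name)
    return index
```

With the dataset in place, `index_opi_data("./OPI_DATA")` would return one entry per task, for example a key `"SU/EC_number"` holding its train file and two test files.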

OPEval: Nine evaluation tasks using the OPI dataset

To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.

| Task Type | Type Abbr. | Task Name | Task Abbr. | Training set size | Testing set size |
| --- | --- | --- | --- | --- | --- |
| Sequence Understanding | SU | EC Number Prediction | EC_number | 227,362 | 392 (NEW-392), 149 (Price-149) |
| | | Fold Type Prediction | Fold_type | 12,312 | 718 (Fold), 1,254 (Superfamily), 1,272 (Family) |
| | | Subcellular Localization Prediction | Subcellular_localization | 11,230 | 2,772 |
| Annotation Prediction | AP | Function Keywords Prediction | Keywords | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| | | Gene Ontology (GO) Terms Prediction | GO | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| | | Function Description Prediction | Function | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Knowledge Mining | KM | Tissue Location Prediction from Gene Symbol | gSymbol2Tissue | 8,723 | 2,181 |
| | | Cancer Prediction from Gene Symbol | gSymbol2Cancer | 590 | 148 |
| | | Cancer Prediction from Gene Name | gName2Cancer | 590 | 148 |
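As a quick consistency check, the per-task training set sizes in the table add up to the 1,615,661 training samples quoted in the dataset overview. The snippet copies the numbers from the table. The dictionary name is illustrative.

```python
# Training set sizes per task, copied from the table above.
train_sizes = {
    "EC_number": 227_362,
    "Fold_type": 12_312,
    "Subcellular_localization": 11_230,
    "Keywords": 451_618,
    "GO": 451_618,
    "Function": 451_618,
    "gSymbol2Tissue": 8_723,
    "gSymbol2Cancer": 590,
    "gName2Cancer": 590,
}

total_train = sum(train_sizes.values())
print(f"{total_train:,}")  # 1,615,661
```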

Instruction tuning with OPI training data

Instruction tuning procedures are available in the instruction_tuning guide.

Accessing the OPI-Tuned Models: We have released the OPI-Llama-3.1-8B-Instruct and OPI-Galactica-6.7B models fine-tuned using OPIfull1.61M_train.json, which can be accessed from Hugging Face.
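To illustrate how an instruction/input/output record can be turned into a training prompt, here is a sketch using the Stanford Alpaca prompt wording. That OPI uses this exact template is an assumption, based only on the acknowledgement that the code is adapted from Stanford Alpaca. The instruction_tuning guide is the authoritative source. The names `ALPACA_STYLE_TEMPLATE` and `build_prompt_and_target` are hypothetical.

```python
# Alpaca-style template for records that have an "input" field.
# Assumption: OPI may use different wording; check the instruction_tuning guide.
ALPACA_STYLE_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def build_prompt_and_target(record):
    """Return (prompt, target) for one {instruction, input, output} record."""
    prompt = ALPACA_STYLE_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )
    return prompt, record["output"]


# Placeholder record: the sequence is shortened for illustration.
record = {
    "instruction": "What is the EC classification of the input protein "
                   "sequence based on its biological function?",
    "input": "MGLVSSKKPDKEK",
    "output": "2.7.10.2",
}
prompt, target = build_prompt_and_target(record)
print(prompt)
```

During fine-tuning, the loss is typically computed only on the target tokens that follow the prompt.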

Evaluating with OPI testing data

Evaluation procedures are outlined in the evaluation guide.
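As one illustration of scoring predictions against test outputs, the sketch below computes exact-match accuracy, which fits single-label outputs such as EC numbers. This is an assumed, simplified metric, not necessarily one used in the evaluation guide, and the prediction values are made up for the example.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and reference answers.
predictions = ["5.3.1.7", "2.7.10.2 ", "1.1.1.1"]
references = ["5.3.1.7", "2.7.10.2", "3.4.21.4"]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.3f}")
```

Multi-label tasks such as keyword or GO term prediction would need set-based metrics instead.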

Evaluation results

Comprehensive evaluation results are detailed in the evaluation_results document.

Prediction comparison with SOTA models

Predictions by the OPI-tuned model, GPT-4o, Llama-3.1-8B-Instruct, and Claude 3.5 Sonnet are compared against the ground-truth answers in the model_compare document.

Demo

We use the FastChat platform to visually demonstrate the capabilities of the OPI-Galactica-6.7B model on various evaluation tasks.

OPI Demo

Acknowledgement

The code is adapted from Stanford Alpaca and Chinese-LLaMA-Alpaca.
Galactica: Galactica
Llama-3.1: Llama-3.1
DeepSeek: DeepSeek-R1

Contact Information

For help with or issues using this repo, please submit a GitHub issue.
For other communications, please contact Qiwei Ye (qwye@baai.ac.cn).

Owner

Name: baaihealth
Login: baaihealth
Kind: organization

Repositories: 2
Profile: https://github.com/baaihealth

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use the code, data or model, please cite it as below."
preferred-citation:
  type: misc
  authors:
  - family-names: "Xiao"
    given-names: "Hongwang"
  - family-names: "Lin"
    given-names: "Wenjun"
  - family-names: "Wang"
    given-names: "Hui"
  - family-names: "Liu"
    given-names: "Zheng"
  - family-names: "Ye"
    given-names: "Qiwei"
  title: "OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks"
  year: 2024

GitHub Events

Total

Watch event: 4
Delete event: 1
Push event: 12
Pull request event: 2
Create event: 1

Last Year

Watch event: 4
Delete event: 1
Push event: 12
Pull request event: 2
Create event: 1

Science Score: 44.0%