opi
This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.
# Vision
Open Protein Instructions (OPI) is the initial part of the Open Biology Instructions (OBI) project, to be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI) and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.
Paper
"OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks" has been accepted by the NeurIPS 2024 Workshop "Foundation Models for Science: Progress, Opportunities, and Challenges".
Hugging Face links to OPI dataset and OPI-tuned models
OPI Dataset
OPI-Llama-3.1-8B-Instruct
OPI-Galactica-6.7B
Contents
- [x] Project Overview
- [x] OPI dataset construction pipeline
- [x] OPI dataset overview
- [x] OPEval: Nine evaluation tasks using the OPI dataset
- [x] Instruction tuning with OPI training data
- [x] Evaluating with OPI testing data
- [x] Evaluation results
- [x] Prediction comparison with SOTA models
- [x] Demo
- [x] Acknowledgement
- [x] Contact Information
Project Overview
This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.

Usage and license notices: Galactica is intended and licensed for research use only. Llama-3 is licensed for researchers and commercial entities, upholding the principles of openness. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also CC BY NC 4.0 (allowing only non-commercial use).
OPI dataset construction pipeline
We curated the OPI dataset ourselves by extracting key information from the Swiss-Prot database. The following figure shows the overall construction process of OPI.

- An example of OPI training data:
instruction: What is the EC classification of the input protein sequence based on its biological function?
input: MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT ATERQYELQP
output: 2.7.10.2
- An example of OPI testing data:
{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction": "Return the EC number of the protein sequence.", "instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
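The training and testing examples above use different layouts: a training sample is a flat instruction/input/output triple, while a testing record nests its samples under `instances`. The sketch below shows one way to flatten a testing record into training-style triples. The helper name `flatten_test_record` is hypothetical, and the sequence is truncated for brevity.

```python
import json

# One testing record in the layout shown above (sequence truncated for illustration).
raw = (
    '{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", '
    '"instruction": "Return the EC number of the protein sequence.", '
    '"instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFF", '
    '"output": "5.3.1.7"}], "is_classification": false}'
)


def flatten_test_record(record):
    """Turn one testing record into a list of instruction/input/output dicts."""
    return [
        {
            "instruction": record["instruction"],
            "input": instance["input"],
            "output": instance["output"],
        }
        for instance in record["instances"]
    ]


samples = flatten_test_record(json.loads(raw))
print(samples[0]["instruction"], "->", samples[0]["output"])
```

Flattening this way lets the same prompt-building code serve both splits, at the cost of dropping the `id`, `name`, and `is_classification` fields.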
OPI dataset overview
We are excited to announce the release of the OPI dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in the field of protein biology. We welcome contributions and enhancements to this dataset from the community. There are 1.64M samples in the OPI dataset, comprising training (1,615,661) and testing (26,607) sets.
Accessing the OPI dataset: The complete OPI dataset can be accessed from Hugging Face. It is organized into three subfolders (AP, KM, and SU) in the OPI_DATA directory, plus the full dataset file OPI_full_1.61M_train.json. Once downloaded, place all the subfolders and data files in the OPI_DATA folder within the repository. To merge all or several of the per-task training data files into a single training data file, run:
```
cd OPI_DATA
python merge_task_train_data.py --output OPI_merged_train.json
```
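For readers who want to see what such a merge involves, here is a minimal sketch. It is not the repository's merge script. It assumes that each per-task training file is a JSON array of records and that the files sit under `<type>/<task>/train/`. The function name `merge_train_files` and the demo file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_train_files(data_root, output_path):
    """Concatenate every <type>/<task>/train/*.json list found under data_root."""
    merged = []
    for path in sorted(Path(data_root).glob("*/*/train/*.json")):
        with path.open(encoding="utf-8") as fh:
            merged.extend(json.load(fh))  # assumption: each file is a JSON array
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(merged, fh, ensure_ascii=False)
    return len(merged)


# Self-contained demo on a throwaway directory that mimics the folder layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for task_dir, n_records in [("SU/EC_number", 2), ("KM/gName2Cancer", 1)]:
        train_dir = root / task_dir / "train"
        train_dir.mkdir(parents=True)
        records = [
            {"instruction": "demo", "input": str(i), "output": "x"}
            for i in range(n_records)
        ]
        (train_dir / "demo_train.json").write_text(json.dumps(records), encoding="utf-8")
    merged_count = merge_train_files(root, root / "merged_demo.json")
    print(merged_count)
```

The glob pattern only looks inside per-task `train` folders, so a full dataset file placed at the top level would not be merged a second time.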
OPI Dataset folder structure:
./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
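Given this layout, the testing files can be discovered by pattern rather than listed by hand. The following sketch indexes them per task and reads one of them. It assumes the `.jsonl` files follow the usual one-JSON-object-per-line convention. The helper names `index_test_files` and `read_jsonl` are hypothetical, and the demo runs on a temporary directory rather than the real data.

```python
import json
import tempfile
from pathlib import Path


def index_test_files(data_root):
    """Map 'TYPE/TASK' to the sorted names of its test .jsonl files."""
    index = {}
    for path in sorted(Path(data_root).glob("*/*/test/*.jsonl")):
        task_key = "/".join(path.relative_to(data_root).parts[:2])
        index.setdefault(task_key, []).append(path.name)
    return index


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo on a throwaway directory that mirrors one branch of the tree above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    test_dir = root / "SU" / "Fold_type" / "test"
    test_dir.mkdir(parents=True)
    demo_file = test_dir / "fold_type_test.jsonl"
    demo_file.write_text(
        '{"id": "seed_task_0"}\n{"id": "seed_task_1"}\n', encoding="utf-8"
    )
    test_index = index_test_files(root)
    demo_records = read_jsonl(demo_file)
    print(test_index, len(demo_records))
```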
OPEval: Nine evaluation tasks using the OPI dataset
To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.
| Task Type | Type Abbr. | Task Name | Task Abbr. | Training set size | Testing set size |
|---|---|---|---|---|---|
| Sequence Understanding | SU | EC Number Prediction | EC_number | 227,362 | 392 (NEW-392), 149 (Price-149) |
| Sequence Understanding | SU | Fold Type Prediction | Fold_type | 12,312 | 718 (Fold), 1,254 (Superfamily), 1,272 (Family) |
| Sequence Understanding | SU | Subcellular Localization Prediction | Subcellular_localization | 11,230 | 2,772 |
| Annotation Prediction | AP | Function Keywords Prediction | Keywords | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Annotation Prediction | AP | Gene Ontology (GO) Terms Prediction | GO | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Annotation Prediction | AP | Function Description Prediction | Function | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Knowledge Mining | KM | Tissue Location Prediction from Gene Symbol | gSymbol2Tissue | 8,723 | 2,181 |
| Knowledge Mining | KM | Cancer Prediction from Gene Symbol | gSymbol2Cancer | 590 | 148 |
| Knowledge Mining | KM | Cancer Prediction from Gene Name | gName2Cancer | 590 | 148 |
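As a quick consistency check, the per-task training set sizes in the table can be summed and compared with the overall training set size of 1,615,661 reported in the dataset overview. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# Training set sizes transcribed from the table above, keyed by task abbreviation.
train_sizes = {
    "EC_number": 227_362,
    "Fold_type": 12_312,
    "Subcellular_localization": 11_230,
    "Keywords": 451_618,
    "GO": 451_618,
    "Function": 451_618,
    "gSymbol2Tissue": 8_723,
    "gSymbol2Cancer": 590,
    "gName2Cancer": 590,
}

total_train = sum(train_sizes.values())
print(f"{total_train:,}")  # 1,615,661
```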
Instruction tuning with OPI training data
Instruction tuning procedures are available in the instruction_tuning guide.
Accessing the OPI-tuned models: We have released the OPI-Llama-3.1-8B-Instruct and OPI-Galactica-6.7B models, both fine-tuned on OPI_full_1.61M_train.json. They can be accessed from Hugging Face.
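To make the idea of instruction tuning concrete, the sketch below renders one OPI-style record into a single prompt string. It uses the Stanford Alpaca prompt template (the variant with an input field). Whether this repository uses exactly that template is an assumption, and the instruction_tuning guide remains the authoritative reference. `build_prompt` is a hypothetical helper, and the input sequence is truncated.

```python
# Stanford Alpaca prompt template for records that have an input field.
# Assumption: used here for illustration only; check the instruction_tuning guide.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def build_prompt(record):
    """Fill the template with the record's instruction and input fields."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )


example = {
    "instruction": "Return the EC number of the protein sequence.",
    "input": "MAIPPYPDFRSAAF",  # truncated sequence, for illustration
    "output": "5.3.1.7",
}
prompt = build_prompt(example)
print(prompt)
```

During tuning, the record's `output` would serve as the target text that follows the `### Response:` marker.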
Evaluating with OPI testing data
Evaluation procedures are outlined in the evaluation guide.
Evaluation results
Comprehensive evaluation results are detailed in the evaluation_results document.
Prediction comparison with SOTA models
Predictions from the OPI-tuned model, GPT-4o, Llama-3.1-8B-Instruct, and Claude 3.5 Sonnet, compared against the ground-truth answers, are shown in the model_compare document.
Demo
We use the FastChat platform to visually demonstrate the capabilities of the OPI-Galactica-6.7B model on various evaluation tasks.

Acknowledgement
The code is adapted from Stanford Alpaca and Chinese-LLaMA-Alpaca.
Galactica: Galactica
Llama-3.1: Llama-3.1
DeepSeek: DeepSeek-R1
Contact Information
For help or issues with this repository, please submit a GitHub issue.
For other communications, please contact Qiwei Ye (qwye@baai.ac.cn).
Owner
- Name: baaihealth
- Login: baaihealth
- Kind: organization
- Repositories: 2
- Profile: https://github.com/baaihealth
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use the code, data or model, please cite it as below."
preferred-citation:
  type: misc
  authors:
    - family-names: "Xiao"
      given-names: "Hongwang"
    - family-names: "Lin"
      given-names: "Wenjun"
    - family-names: "Wang"
      given-names: "Hui"
    - family-names: "Liu"
      given-names: "Zheng"
    - family-names: "Ye"
      given-names: "Qiwei"
  title: "OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks"
  year: 2024
Dependencies
- accelerate *
- datasets *
- deepspeed ==0.9.1
- fire *
- gradio *
- numpy *
- openai *
- rouge_score *
- sentencepiece *
- tokenizers >=0.13.3
- torch *
- tqdm *
- transformers >=4.28.1
- wandb *