https://github.com/benlansdell/biogpt

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: benlansdell
  • License: mit
  • Default Branch: main
  • Size: 30.7 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of microsoft/BioGPT
Created over 3 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Code of conduct Security

README.md

BioGPT

This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.

News!

  • The BioGPT-Large model with 1.5B parameters is coming. It is currently available for the PubMedQA task, where it reaches SOTA performance of 81% accuracy. See Question Answering on PubMedQA for evaluation.

Requirements and Installation

  • PyTorch version == 1.12.0
  • Python version == 3.10
  • fairseq version == 0.12.0:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
```

  • Moses

```bash
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
```

  • fastBPE

```bash
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
```

  • sacremoses

```bash
pip install sacremoses
```

  • sklearn

```bash
pip install scikit-learn
```

Remember to set the environment variables MOSES and FASTBPE to the paths of Moses and fastBPE respectively, as they will be required later.
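Because later steps depend on these two variables, it can help to check them before running anything else. The following is a minimal sketch, not part of the BioGPT code base: the function name `check_tool_paths` is hypothetical, and it only verifies that each variable is set and points at an existing directory.

```python
import os
from pathlib import Path


def check_tool_paths(env=None, names=("MOSES", "FASTBPE")):
    """Return {variable_name: problem} for every variable that looks wrong.

    An empty dict means each variable is set and points at an existing directory.
    `check_tool_paths` is a hypothetical helper, not part of BioGPT or fairseq.
    """
    env = os.environ if env is None else env
    problems = {}
    for name in names:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    # Report any problem found in the current shell environment.
    for name, problem in check_tool_paths().items():
        print(f"{name}: {problem}")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.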

Getting Started

Pre-trained models

We provide our pre-trained BioGPT model checkpoints along with fine-tuned checkpoints for downstream tasks, available both through URL download as well as through the Hugging Face 🤗 Hub.

|Model|Description|URL|🤗 Hub|
|----|----|---|---|
|BioGPT|Pre-trained BioGPT model checkpoint|link|link|
|BioGPT-Large|Pre-trained BioGPT-Large model checkpoint|link|link|
|BioGPT-QA-PubMedQA-BioGPT|Fine-tuned BioGPT for question answering task on PubMedQA|link| |
|BioGPT-QA-PubMEDQA-BioGPT-Large|Fine-tuned BioGPT-Large for question answering task on PubMedQA|link|link|
|BioGPT-RE-BC5CDR|Fine-tuned BioGPT for relation extraction task on BC5CDR|link| |
|BioGPT-RE-DDI|Fine-tuned BioGPT for relation extraction task on DDI|link| |
|BioGPT-RE-DTI|Fine-tuned BioGPT for relation extraction task on KD-DTI|link| |
|BioGPT-DC-HoC|Fine-tuned BioGPT for document classification task on HoC|link| |

Download them and extract them to the checkpoints folder of this project.

For example:

```bash
mkdir checkpoints
cd checkpoints
wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz
tar -zxvf Pre-trained-BioGPT.tgz
```
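If `tar` is not available, the extraction step can be done from Python instead. This is a sketch using only the standard library, assuming the archive has already been downloaded; the helper name `extract_checkpoint` is hypothetical, and the path check is an added precaution rather than something the README asks for.

```python
import tarfile
from pathlib import Path


def extract_checkpoint(archive_path, dest_dir="checkpoints"):
    """Extract a gzip-compressed tar archive into dest_dir and return that directory.

    `extract_checkpoint` is a hypothetical helper, not part of BioGPT.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse archive members that would be written outside dest_dir.
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            if not target.is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(root)
    return dest


if __name__ == "__main__":
    # Assumes the archive from the example above sits in the current directory.
    archive = Path("Pre-trained-BioGPT.tgz")
    if archive.exists():
        print(extract_checkpoint(archive))
```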

Example Usage

Use pre-trained BioGPT model in your code:

```python
import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT",
    "checkpoint.pt",
    "data",
    tokenizer='moses',
    bpe='fastbpe',
    bpe_codes="data/bpecodes",
    min_len=100,
    max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
```

Use fine-tuned BioGPT model on KD-DTI for drug-target-interaction in your code:

```python
import torch
from src.transformer_lm_prompt import TransformerLanguageModelPrompt

m = TransformerLanguageModelPrompt.from_pretrained(
    "checkpoints/RE-DTI-BioGPT",
    "checkpoint_avg.pt",
    "data/KD-DTI/relis-bin",
    tokenizer='moses',
    bpe='fastbpe',
    bpe_codes="data/bpecodes",
    max_len_b=1024,
    beam=1)
m.cuda()
src_text = ""  # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
# Use the same beam size passed to from_pretrained above (no `args` object exists here).
generate = m.generate([src_tokens], beam=1)[0]
output = m.decode(generate[0]["tokens"])
print(output)
```

For more downstream tasks, please see below.

Downstream tasks

See the corresponding folder in examples:

Relation Extraction on BC5CDR

Relation Extraction on KD-DTI

Relation Extraction on DDI

Document Classification on HoC

Question Answering on PubMedQA

Text Generation

Hugging Face 🤗 Usage

BioGPT has also been integrated into the Hugging Face transformers library, and model checkpoints are available on the Hugging Face Hub.

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM

model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)
```
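To see why fixing a seed makes sampled output repeatable, here is a toy sketch using only the standard library. It is not how the transformers library samples: the vocabulary, weights, and the `sample_continuations` function are made up for illustration, and only the seeding idea carries over.

```python
import random


def sample_continuations(vocab, weights, length, num_sequences, seed):
    """Draw num_sequences token lists of the given length from a fixed distribution.

    A private random.Random(seed) instance makes the draws repeatable.
    All names and values here are illustrative stand-ins.
    """
    rng = random.Random(seed)
    return [rng.choices(vocab, weights=weights, k=length) for _ in range(num_sequences)]


if __name__ == "__main__":
    vocab = ["a", "disease", "virus", "caused", "by"]
    weights = [3, 2, 2, 1, 1]
    first = sample_continuations(vocab, weights, length=4, num_sequences=5, seed=42)
    second = sample_continuations(vocab, weights, length=4, num_sequences=5, seed=42)
    print(first == second)  # same seed, same draws
```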

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

Beam-search decoding:

```python
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")

set_seed(42)

with torch.no_grad():
    beam_output = model.generate(**inputs,
                                 min_length=100,
                                 max_length=1024,
                                 num_beams=5,
                                 early_stopping=True
                                 )
tokenizer.decode(beam_output[0], skip_special_tokens=True)
```
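For readers unfamiliar with what the beam count controls, the following toy sketch shows the basic idea of beam search: keep the few highest-scoring partial sequences at each step instead of only the single best one. It is an illustration under simplifying assumptions, not the Hugging Face implementation, which also handles batching, length penalties, and its own stopping rules. The `toy_model` table and all token names are invented.

```python
import math


def beam_search(next_log_probs, start, num_beams=5, max_length=10, eos="</s>"):
    """Toy beam search over a user-supplied next-token distribution.

    next_log_probs(seq) returns {token: log_prob} for the token following seq.
    Returns the highest-scoring (sequence, total_log_prob) pair.
    """
    beams = [(list(start), 0.0)]
    finished = []
    for _ in range(max_length - len(start)):
        # Expand every live beam by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, log_prob in next_log_probs(seq).items():
                candidates.append((seq + [token], score + log_prob))
        # Keep only the num_beams best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:num_beams]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include sequences cut off by max_length
    return max(finished, key=lambda c: c[1])


def toy_model(seq):
    """Invented next-token table: the locally best first step leads to a worse total."""
    table = {
        "A": {"b": 0.6, "c": 0.4},
        "b": {"x": 0.55, "y": 0.45},
        "c": {"z": 1.0},
    }
    probs = table.get(seq[-1], {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


if __name__ == "__main__":
    print(beam_search(toy_model, ["A"], num_beams=1))  # greedy: follows "b"
    print(beam_search(toy_model, ["A"], num_beams=2))  # wider beam: finds "c", "z"
```

With one beam the search commits to `b` (probability 0.6) and ends with total probability 0.33, while two beams keep `c` alive long enough to find the sequence with total probability 0.4.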

For more information, please see the documentation on the Hugging Face website.

Demos

Check out these demos on Hugging Face Spaces:

  • Text Generation with BioGPT-Large
  • Question Answering with BioGPT-Large-PubMedQA

License

BioGPT is MIT-licensed. The license applies to the pre-trained models as well.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Owner

  • Name: Ben Lansdell
  • Login: benlansdell
  • Kind: user
  • Location: Santa Fe, NM
  • Company: Health stealth

Machine learning and applied mathematics | Former postdoc @KordingLab UPenn, PhD in applied mathematics @Fairhall-Lab UW

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
requirements.txt pypi
  • Cython ==0.29.33
  • PyYAML ==6.0
  • antlr4-python3-runtime ==4.8
  • bitarray ==2.6.2
  • cffi ==1.15.1
  • click ==8.1.3
  • colorama ==0.4.6
  • fairseq ==0.12.2
  • hydra-core ==1.0.7
  • joblib ==1.2.0
  • lxml ==4.9.2
  • numpy ==1.24.1
  • omegaconf ==2.0.6
  • portalocker ==2.7.0
  • protobuf ==3.20.1
  • pycparser ==2.21
  • regex ==2022.10.31
  • sacrebleu ==2.3.1
  • sacremoses ==0.0.53
  • scikit-learn ==1.2.1
  • scipy ==1.10.0
  • six ==1.16.0
  • tabulate ==0.9.0
  • tensorboardX ==2.5.1
  • threadpoolctl ==3.1.0
  • torch ==1.12.0
  • torchaudio ==0.12.0
  • tqdm ==4.64.1
  • typing-extensions ==4.4.0
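
The pins above can be compared against the current environment with a short script. This is a sketch, not a tool shipped with the repository: `parse_pins` and `find_mismatches` are hypothetical names, and the parser only understands simple `name ==version` lines like those listed here. Note that the list pins fairseq 0.12.2 while the README's installation section names 0.12.0, so a check like this may flag one or the other.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name ==version' lines into a {name: version} dict.

    Blank lines, comments, and lines without '==' are skipped.
    """
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed_or_None)} for packages that differ.

    installed_version can be swapped for a stub when testing.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    pins = parse_pins(["torch ==1.12.0", "fairseq ==0.12.2"])
    for name, (pinned, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```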