https://github.com/amazon-science/lc-plm

LC-PLM: long-context protein language model based on BiMamba-S architecture

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

biology foundation-models foundation-models-for-biology huggingface mamba-state-space-models pretrained-models protein protein-sequences
Last synced: 6 months ago

Repository

LC-PLM: long-context protein language model based on BiMamba-S architecture

Basic Info
Statistics
  • Stars: 2
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
biology foundation-models foundation-models-for-biology huggingface mamba-state-space-models pretrained-models protein protein-sequences
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md


```yaml
license: cc-by-nc-4.0
tags:
- biology
- protein
```

LC-PLM

LC-PLM is a frontier long-context protein language model built on an alternative protein LM architecture, BiMamba-S, which is based on selective structured state-space models. It is pretrained on UniRef50/90 with a masked language modeling (MLM) objective. For detailed information on the model architecture, training data, and evaluation performance, please refer to the accompanying paper.
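The MLM pretraining objective can be sketched as follows. This is a schematic illustration only: the mask token, 15% rate, and helper function are generic BERT-style choices, not taken from the LC-PLM codebase.

```python
import random

def mask_sequence(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Schematic BERT-style masking: hide ~15% of residues and record
    which ones the model must recover. Illustrative only -- the actual
    LC-PLM pretraining code may differ (e.g. mask rate, corruption scheme)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # corrupt this position
            targets.append(tok)         # MLM loss is computed here
        else:
            masked.append(tok)
            targets.append(None)        # no loss at visible positions
    return masked, targets

masked, targets = mask_sequence(list("MKTLLLTLLVVTIVCLDLGYS"))
```

During pretraining, the model sees the corrupted sequence and is trained to predict the original residue at each masked position.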

You can use LC-PLM to extract embeddings for amino acid residues and protein sequences. It can also be fine-tuned to predict residue- or protein-level properties.
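As an illustration of fine-tuning, a small prediction head can be placed on top of the residue embeddings. The hidden dimension, number of classes, and task below are hypothetical stand-ins, not values from the paper:

```python
import torch

# Hypothetical fine-tuning head: a linear probe mapping per-residue
# embeddings to a property label (e.g. a 3-class structural state).
# hidden_dim and num_classes are illustrative, not from the paper.
hidden_dim, num_classes = 512, 3
head = torch.nn.Linear(hidden_dim, num_classes)

# Stand-in for LC-PLM residue embeddings: [batch, seq_len, hidden_dim]
residue_embeddings = torch.randn(1, 78, hidden_dim)

residue_logits = head(residue_embeddings)          # residue-level prediction
protein_logits = head(residue_embeddings.mean(1))  # protein-level prediction
```

In a real fine-tuning run, the head (and optionally the backbone) would be trained with a task-specific loss on labeled data.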

Getting started

Install Python dependencies

```bash
pip install transformers mamba-ssm==2.2.2
```

Clone this repo with pretrained model weights

We use Git Large File Storage (LFS) to version the model weights. You can obtain the pretrained model and its related files simply by cloning this repo:

```bash
git clone https://github.com/amazon-science/LC-PLM.git
```

Run inference with the pretrained model

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("./LC-PLM", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Input a protein sequence
# fun fact: this is Mambalgin-1 from Black mamba
sequence = "MKTLLLTLLVVTIVCLDLGYSLKCYQHGKVVTCHRDMKFCYHNTGMPFRNLKLILQGCSSSCSETENNKCCSTDRCNK"

# Tokenize the sequence
inputs = tokenizer(sequence, return_tensors="pt")

# Inference with LC-PLM on GPU
device = torch.device("cuda:0")
model = model.to(device)
inputs = {key: val.to(device) for key, val in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the embeddings
last_hidden_state = outputs.hidden_states[-1]
print(last_hidden_state.shape)  # [batch_size, sequence_length, hidden_dim]
```
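To collapse per-residue embeddings into a single protein-level embedding, one common approach is mean pooling over residue positions while excluding special tokens. A minimal sketch with a stand-in tensor (the shape below is illustrative, not the model's actual hidden size):

```python
import torch

# Stand-in for the final hidden states: [batch_size, seq_len, hidden_dim].
# In practice this would be the last hidden state returned by the model.
hidden = torch.randn(1, 80, 512)

# Protein-level embedding: mean-pool over residue positions, dropping
# the special tokens at the start and end of the tokenized sequence.
protein_embedding = hidden[:, 1:-1, :].mean(dim=1)
```

The resulting vector can be used directly for similarity search or as input features to a downstream classifier.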

Citation

```bibtex
@misc{wang2024longcontextproteinlanguagemodel,
  title={Long-context Protein Language Model},
  author={Yingheng Wang and Zichen Wang and Gil Sadeh and Luca Zancato and Alessandro Achille and George Karypis and Huzefa Rangwala},
  year={2024},
  eprint={2411.08909},
  archivePrefix={arXiv},
  primaryClass={q-bio.BM},
  url={https://arxiv.org/abs/2411.08909},
}
```

Security

See CONTRIBUTING for more information.

License

This project is licensed under the CC-BY-NC-4.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Issues event: 3
  • Watch event: 21
  • Issue comment event: 2
  • Member event: 1
  • Public event: 1
  • Push event: 1
  • Fork event: 3
Last Year
  • Issues event: 3
  • Watch event: 21
  • Issue comment event: 2
  • Member event: 1
  • Public event: 1
  • Push event: 1
  • Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • braviky (1)
  • psp3dcg (1)