lobster

LBSTER: Language models for Biological Sequence Transformation and Evolutionary Representation

https://github.com/prescient-design/lobster

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary

Keywords

biology bioml plm protein-sequences
Last synced: 7 months ago

Repository

LBSTER: Language models for Biological Sequence Transformation and Evolutionary Representation

Basic Info
Statistics
  • Stars: 128
  • Watchers: 11
  • Forks: 30
  • Open Issues: 24
  • Releases: 15
Topics
biology bioml plm protein-sequences
Created about 2 years ago · Last pushed 7 months ago
Metadata Files
Readme Contributing License Citation

README.md

LBSTER 🦞

Language models for Biological Sequence Transformation and Evolutionary Representation

Badges: CI · Python 3.11+ · PyPI version · License · Ruff · Hugging Face Models · MCP Ready · GitHub stars

lobster is a "batteries included" language model library for proteins and other biological sequences. Led by Nathan Frey, Karina Zadorozhny, Taylor Joren, Sidney Lisanza, Aya Abdelsalam Ismail, Joseph Kleinhenz, and Allen Goodman, with many valuable contributions from contributors across Prescient Design and Genentech.

This repository contains training code and access to pre-trained language models for biological sequence data.

Usage

Table of contents

  • [Why you should use LBSTER](#why-use)
  • [Citations](#citations)
  • [Install instructions](#install)
  • [Models](#main-models)
  • [Notebooks](#notebooks)
  • [MCP Server](#mcp-integration)
  • [Training and inference](#training)
  • [Reinforcement Learning with UME](#rl-training)
  • [Contributing](#contributing)

Why you should use LBSTER

  • LBSTER is built for pre-training models quickly from scratch. It is "batteries included." This is most useful if you need to control the pre-training data mixture and embedding space, or want to experiment with novel pre-training objectives and fine-tuning strategies.
  • LBSTER is a living, open-source library that will be periodically updated with new code and pre-trained models from the Frey Lab at Prescient Design, Genentech. The Frey Lab works on real therapeutic molecule design problems and LBSTER models and capabilities reflect the demands of real-world drug discovery campaigns.
  • LBSTER is built with beignet, a standard library for biological research, and integrated with cortex, a modular framework for multitask modeling, guided generation, and multi-modal models.
  • LBSTER supports concepts; we have a concept-bottleneck protein language model, CB-LBSTER, which supports 718 concepts.

Citations

If you use the code and/or models, please cite the relevant papers. For the lbster code base, cite Cramming Protein Language Model Training in 24 GPU Hours:

```bibtex
@article{Frey2024.05.14.594108,
  author = {Frey, Nathan C. and Joren, Taylor and Ismail, Aya Abdelsalam and Goodman, Allen and Bonneau, Richard and Cho, Kyunghyun and Gligorijevi{\'c}, Vladimir},
  title = {Cramming Protein Language Model Training in 24 GPU Hours},
  elocation-id = {2024.05.14.594108},
  year = {2024},
  doi = {10.1101/2024.05.14.594108},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108},
  eprint = {https://www.biorxiv.org/content/early/2024/05/15/2024.05.14.594108.full.pdf},
  journal = {bioRxiv}
}
```

For the cb-lbster code base, cite Concept Bottleneck Language Models for Protein Design:

```bibtex
@article{ismail2024conceptbottlenecklanguagemodels,
  title = {Concept Bottleneck Language Models For protein design},
  author = {Aya Abdelsalam Ismail and Tuomas Oikarinen and Amy Wang and Julius Adebayo and Samuel Stanton and Taylor Joren and Joseph Kleinhenz and Allen Goodman and Héctor Corrada Bravo and Kyunghyun Cho and Nathan C. Frey},
  year = {2024},
  eprint = {2411.06090},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2411.06090},
}
```

Install

Using uv

Install uv and create a new virtual environment:

```bash
uv venv --python 3.12  # create a new virtual environment in the `lobster` directory
source .venv/bin/activate
uv pip install -e .
```

Alternatively, run installation directly with uv sync:

```bash
uv sync
uv sync --all-extras --no-cache  # to resolve flash-attn installation issues
```

and then prefix every command with uv run. For example,

```bash
uv run lobster_train data.path_to_fasta="test_data/query.fasta"
```

flash attention

To make use of flash attention, install with the flash extra:

```bash
uv sync --extra flash
```

Using mamba

Clone the repo, cd into it, and create the environment:

```bash
mamba env create -f env.yml
```

Then, from the root of the repo:

```bash
pip install -e .
```

Main models you should use

Pretrained Models

Masked LMs

| Shorthand | #params | Dataset | Description | Model checkpoint |
|-----------|---------|---------|-------------|------------------|
| Lobster24M | 24M | uniref50 | 24M parameter protein masked LM trained on UniRef50 | lobster_24M |
| Lobster150M | 150M | uniref50 | 150M parameter protein masked LM trained on UniRef50 | lobster_150M |

CB LMs

| Shorthand | #params | Dataset | Description | Model checkpoint |
|-----------|---------|---------|-------------|------------------|
| cbLobster24M | 24M | uniref50+SwissProt | 24M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_24M |
| cbLobster150M | 150M | uniref50+SwissProt | 150M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_150M |
| cbLobster650M | 650M | uniref50+SwissProt | 650M parameter concept bottleneck protein language model with 718 concepts | cb_lobster_650M |
| cbLobster3B | 3B | uniref50+SwissProt | 3B parameter concept bottleneck protein language model with 718 concepts | cb_lobster_3B |

Loading a pre-trained model

```python
from lobster.model import LobsterPMLM, LobsterPCLM, LobsterCBMPMLM

masked_language_model = LobsterPMLM("asalam91/lobster_24M")
concept_bottleneck_masked_language_model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
causal_language_model = LobsterPCLM.load_from_checkpoint(<path to ckpt>)
```

3D, cDNA, and dynamic models use the same classes.

Models

  • LobsterPMLM: masked language model (BERT-style encoder-only architecture)
  • LobsterCBMPMLM: concept bottleneck masked language model (BERT-style encoder-only architecture with a concept bottleneck and a linear decoder)
  • LobsterPCLM: causal language model (Llama-style decoder-only architecture)
  • LobsterPLMFold: structure prediction language models (pre-trained encoder + structure head)

Notebooks

Representation learning

Check out this jupyter notebook tutorial for an example of how to extract embedding representations from different models.
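
For a quick look before opening the notebook, here is a minimal sketch. The constructor call matches the README above; the `sequences_to_latents` method and the shape of its output are assumptions based on the tutorial, so defer to the notebook for the exact API.

```python
# Sketch: extract a fixed-size embedding from a pre-trained masked LM.
# `sequences_to_latents` is an assumed API; check the notebook tutorial.
import torch
from lobster.model import LobsterPMLM

model = LobsterPMLM("asalam91/lobster_24M")
model.eval()

sequences = ["MKTVRQERLKSIVRIL"]
with torch.inference_mode():
    hidden_states = model.sequences_to_latents(sequences)  # assumed to return per-layer latents

# Mean-pool the final layer over sequence length for one vector per sequence.
embedding = hidden_states[-1].mean(dim=1)
print(embedding.shape)
```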

Concept Interventions

Check out this jupyter notebook tutorial for an example of how to intervene on different concepts with our concept-bottleneck model class.
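
As a rough illustration of what an intervention call might look like (the constructor matches the README; the method name and keyword arguments below are hypothetical, inferred from the MCP tool descriptions, so defer to the notebook for the real API):

```python
# Hedged sketch of a concept intervention on a CB-LBSTER model.
from lobster.model import LobsterCBMPMLM

model = LobsterCBMPMLM("asalam91/cb_lobster_24M")
model.eval()

sequence = "MKTVRQERLKSIVRIL"
# Hypothetical call: steer the decoded sequence to reduce a named concept.
edited = model.intervene_on_sequences([sequence], concept="hydrophobicity", edits=5)
print(edited)
```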

MCP Integration

Lobster supports Model Context Protocol (MCP) for seamless integration with Claude Desktop, Cursor, and other AI tools:

```bash
# Install with MCP support
uv sync --extra mcp

# Setup Claude Desktop integration
uv run lobster_mcp_setup
```

Setup for Cursor

Option 1: One-Click Install (Recommended)

Add Lobster to Cursor

Click the button above to automatically add the Lobster MCP server to Cursor.

Requirements:
  • Cursor installed
  • uv package manager available in PATH
  • Lobster repository cloned locally with all dependencies installed (uv sync --all-extras)

After setup, you can use Lobster models directly in Claude Desktop or Cursor with natural language commands like:
  • "Get embeddings for this protein sequence using lobster_24M"
  • "What concepts are supported by the cb_lobster_24M model?"
  • "Intervene on this sequence to reduce hydrophobicity"

Key Features:
  • Modular architecture - Clean separation of models, tools, and schemas
  • Multiple model types - Access to both MLM and concept bottleneck models
  • 5 core tools - Embeddings, concepts, interventions, naturalness, and model listing
  • Type-safe validation - Pydantic schemas for reliable interactions

See the MCP Integration Guide for complete documentation or MCP README for quick start instructions.

DXT Extension for Claude Desktop

Lobster is available as a DXT (Desktop Extension Toolkit) extension for Claude Desktop, providing a one-click installation experience:

Quick Install

  1. Download: Get the latest .dxt file from GitHub Releases
  2. Install: Double-click the .dxt file or drag it into Claude Desktop
  3. Use: Start using Lobster models with natural language commands

Features

  • One-click installation - No command line setup required
  • Self-contained - Includes all dependencies (~500MB)
  • Automatic updates - New versions available through GitHub Releases
  • Full functionality - All MCP server capabilities included

Usage Examples

Once installed, you can use natural language commands in Claude Desktop:

```
What Lobster models are available for protein analysis?

Get embeddings for the sequence MKTVRQERLKSIVRIL using lobster_24M

What concepts are supported by the cb_lobster_24M model?

Intervene on MKTVRQERLKSIVRIL to reduce hydrophobicity using cb_lobster_24M
```

Development

For developers who want to build and test DXT extensions locally:

```bash
# Build DXT extension locally
python scripts/build_dxt.py

# Create a release (updates version, builds, and creates GitHub release)
python scripts/release_dxt.py 0.1.0
```

See DXT Distribution Guide for detailed build and distribution instructions.

Example scripts

Check out examples for scripts showing how to perform inference and interventions.

Training and inference

Embedding

The entrypoint lobster_embed is the main driver for embedding sequences and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_embed --help or by looking in the src/lobster/hydra_config directory.

To embed a fasta file of sequences using a pre-trained model on an interactive GPU node, cd into the root dir of this repo and run:

```bash
lobster_embed data.path_to_fasta="test_data/query.fasta" checkpoint="path_to_checkpoint.ckpt"
```

This will generate a dataframe of embeddings and also log them to wandb.

Regression and classification

For robust multitask modeling, we recommend using lobster with cortex. For simple baselines using lobster embeddings, use lobster.model.LinearProbe and lobster.model.LobsterMLP.
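
To make the baseline idea concrete, the sketch below fits a scikit-learn probe on precomputed embeddings; the random arrays are stand-ins for your own embeddings and labels, and lobster.model.LinearProbe / LobsterMLP play the analogous role in-library.

```python
# Sketch: a linear probe baseline on (stand-in) lobster embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(128, 64))  # stand-in for mean-pooled sequence embeddings
labels = rng.integers(0, 2, size=128)    # stand-in binary property labels

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.2f}")
```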

Likelihoods

Likelihoods from an autoregressive LobsterPCLM, or pseudo-log likelihoods ("naturalness") from a LobsterPMLM, can be computed for a list of sequences using

```python
model.naturalness(sequences)
model.likelihood(sequences)
```
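
For example, to rank candidate variants by naturalness (assuming, as a sketch, that naturalness returns one score per input sequence, higher meaning more natural):

```python
from lobster.model import LobsterPMLM

model = LobsterPMLM("asalam91/lobster_24M")
model.eval()

variants = ["MKTVRQERLKSIVRIL", "MKTVAQERLKSIVRIL"]
scores = model.naturalness(variants)  # assumed: one pseudo-log-likelihood per sequence

# Print variants from most to least natural.
for seq, score in sorted(zip(variants, scores), key=lambda pair: -float(pair[1])):
    print(seq, float(score))
```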

Training from scratch

The entrypoint lobster_train is the main driver for training and accepts parameters using Hydra syntax. The available parameters for configuration can be found by running lobster_train --help or by looking in the src/lobster/hydra_config directory.

To train an MLM on a fasta file of sequences on an interactive GPU node, cd into the root dir of this repo and run:

```bash
lobster_train data.path_to_fasta="test_data/query.fasta" logger=csv paths.root_dir="."
```

Reinforcement Learning with UME Reward Functions

Lobster supports reinforcement learning training using UME-based reward functions for post-training language models. This approach uses UME pseudo-likelihood scores as rewards to guide model behavior toward generating more biologically plausible sequences.
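
Conceptually, the reward function has the shape TRL's GRPO trainer expects: a callable mapping a batch of generated completions to one scalar each. The toy scorer below stands in for the real UME pseudo-likelihood (whose import path is not assumed here); see examples/train_ume_grpo.py for the actual pipeline.

```python
# Sketch of a GRPO-style reward function; the scorer is a toy stand-in.
from typing import List

def toy_pseudo_likelihood(sequence: str) -> float:
    # Placeholder: fraction of characters that are canonical amino acids.
    canonical = set("ACDEFGHIKLMNPQRSTVWY")
    return sum(ch in canonical for ch in sequence) / max(len(sequence), 1)

def ume_style_reward(completions: List[str], **kwargs) -> List[float]:
    """Returns one scalar reward per generated completion."""
    return [toy_pseudo_likelihood(c) for c in completions]

print(ume_style_reward(["MKTVRQERLKSIVRIL", "MKTX123"]))
```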

Quick Start:

```bash
# Step 1: Generate synthetic dataset
cd examples
python generate_synthetic_dataset.py

# Step 2: Run UME-based GRPO training
python train_ume_grpo.py
```

Key Features:
  • Automatic modality detection for SMILES, amino acid, and DNA sequences
  • UME-based reward functions using pseudo-likelihood scores
  • GRPO training with TRL integration
  • Modular design with reusable components

For detailed instructions and advanced usage, see the RL Training Guide.

Contributing

Contributions are welcome! We ask that all users and contributors remember that the LBSTER team are all full-time drug hunters, and our open-source efforts are a labor of love because we care deeply about open science and scientific progress.

Getting started with contributions

Expanding unit test coverage, docstrings, and type hints is always welcome and a good place to start orienting yourself to the code base. Likewise for identifying and fixing 🐛bugs🐛. For more involved project ideas, check Good First Issues. All new or modified code must be unit tested before maintainers will review it.

Install dev requirements and pre-commit hooks

```bash
pre-commit install
```

Create lockfile for env

```bash
uv pip compile requirements.in -o requirements.txt
```

Testing

```bash
python -m pytest -v --cov-report term-missing --cov=./lobster ./tests
```

Owner

  • Name: Prescient Design
  • Login: prescient-design
  • Kind: organization
  • Email: prescient@gene.com

A Genentech Accelerator

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "lobster"
abstract: "Language models for Biological Sequence Transformation and Evolutionary Representation."
authors:
  - family-names: "Design"
    given-names: "Prescient"
    # TODO: Replace with actual author information
    # orcid: "https://orcid.org/0000-0000-0000-0000"
version: "0.1.0"  # TODO: Update with actual version
date-released: "2025-06-09"  # TODO: Update with actual release date
url: "https://github.com/prescient-design/lobster"
repository-code: "https://github.com/prescient-design/lobster"
keywords:
  - "language models"
  - "biological sequence transformation"
  - "evolutionary representation"
  - "sequence generation"
  - "sequence editing"
  - "sequence analysis"
  - "sequence comparison"
license: Apache-2.0
preferred-citation:
  type: software
  title: "LBSTR: Language models for Biological Sequence Transformation and Evolutionary Representation"
  authors:
    - family-names: "Design"
      given-names: "Prescient"
  url: "https://github.com/prescient-design/lobster"
  year: 2025

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 20
  • Total pull requests: 126
  • Average time to close issues: 28 days
  • Average time to close pull requests: 3 days
  • Total issue authors: 8
  • Total pull request authors: 16
  • Average comments per issue: 0.1
  • Average comments per pull request: 0.12
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 19
  • Pull requests: 126
  • Average time to close issues: 28 days
  • Average time to close pull requests: 3 days
  • Issue authors: 7
  • Pull request authors: 16
  • Average comments per issue: 0.11
  • Average comments per pull request: 0.12
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ncfrey (7)
  • karinazad (5)
  • etherealsunshine (2)
  • young-su-ko (2)
  • abidikhairi (1)
  • RyanAIResearch (1)
  • redsphinx (1)
  • alexj-lee (1)
Pull Request Authors
  • karinazad (58)
  • ncfrey (27)
  • taylormjs (10)
  • kleinhenz (8)
  • cgrambow (5)
  • fdreyer (3)
  • edwag (3)
  • etherealsunshine (2)
  • Sidney-Lisanza (2)
  • tomyyyD (2)
  • young-su-ko (1)
  • chaitjo (1)
  • hofmannjl (1)
  • sjmielke (1)
  • rcalef (1)
Top Labels
Issue Labels
good first issue (6) help wanted (1)
Pull Request Labels

Dependencies

.github/workflows/push.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v3 composite
pyproject.toml pypi
requirements-dev.in pypi
  • icecream * development
  • mypy * development
  • pre-commit * development
  • pytest * development
  • pytest-cov * development
requirements.in pypi
  • biopandas *
  • biopython *
  • datasets *
  • datasketch *
  • deepspeed *
  • fastparquet *
  • hydra-core *
  • icecream *
  • ipykernel *
  • jupyter *
  • lightning *
  • matplotlib *
  • pandas *
  • peft *
  • pyarrow *
  • python-dotenv *
  • s3fs *
  • scikit-learn *
  • scipy *
  • seaborn *
  • tokenizers *
  • torch *
  • torchdata *
  • torcheval *
  • torchmetrics *
  • torchvision *
  • transformers *
  • universal_pathlib *
  • wandb *