The Family of Diffusion Protein Language Models (DPLM)

https://github.com/bytedance/dplm

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, biorxiv.org, pubmed.ncbi, ncbi.nlm.nih.gov, nature.com, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

research
Last synced: 6 months ago

Repository

The Family of Diffusion Protein Language Models (DPLM)

Basic Info
Statistics
  • Stars: 220
  • Watchers: 6
  • Forks: 31
  • Open Issues: 11
  • Releases: 0
Topics
research
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

The Family of Diffusion Protein Language Models (DPLM)

PyTorch Lightning Config: Hydra Template

Overview 🌟

This repository contains the official implementation of training and inference, as well as the pre-trained weights, for the Family of Diffusion Protein Language Models (DPLM), including:

  • DPLM from the ICML'24 paper "Diffusion Language Models Are Versatile Protein Learners", which introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
  • DPLM-2 from the ICLR'25 paper "DPLM-2: A Multimodal Diffusion Protein Language Model", a multimodal protein foundation model that extends the discrete diffusion protein language model to accommodate both sequences and structures.
  • The ICML'25 spotlight paper "Elucidating the Design Space of Multimodal Protein Language Models", which elucidates the challenges of structure modeling in multimodal protein language models (e.g., DPLM-2 and ESM3) and proposes advanced designs for better structure modeling. We have released the finer-grained bit-based generative modeling variant (DPLM-2 Bit); the full implementation of the paper will be released soon.

Key Features 🔑

The DPLM family exhibits impressive performance in protein (structure and sequence) co-generation, any-to-any conditional generation (e.g., folding, inverse folding, and motif scaffolding), and representation learning. We develop DPLM based on ByProt. This repository contains pretraining scripts for DPLM and running scripts for various protein generation and understanding tasks, as detailed below:

  • Unconditional protein generation: DPLM is capable of unconditionally generating protein sequences with reasonable predicted structures. DPLM-2 can generate diverse and highly plausible proteins through simultaneous structure-sequence co-generation.
  • Sequence-conditioned generation (forward folding): DPLM-2 can generate a reasonable protein structure given an input protein sequence, achieving performance close to strong folding models (e.g., ESMFold).
  • Structure-conditioned generation (inverse folding): DPLM and DPLM-2 can produce sequences that confidently fold into a given backbone structure.
  • Motif scaffolding: DPLM can generate reasonable scaffold sequences given specific functional motifs. DPLM-2 achieves more successful motif scaffolding through multimodal motif conditioning.
  • Representation learning: DPLM is a superior protein sequence representation learner, while DPLM-2 offers structure-aware protein representations, demonstrating impressive performance across a variety of protein predictive tasks.
  • Controllable generation: DPLM enjoys plug-and-play programmability, generating samples that satisfy provided secondary structure annotations.

TODOs

  • [ ] Controllable/guided generation with discrete diffusion classifier guidance.
  • [ ] Representation learning of DPLM-2

DPLM

"Diffusion Language Models Are Versatile Protein Learners." Wang et al., In ICML 2024

DPLM

DPLM-2

"DPLM-2: A Multimodal Diffusion Protein Language Model." Wang et al., In ICLR 2025

DPLM-2

Updates 📢

  • [2025-07] We update the default sampling strategy of DPLM-2 to annealing@2.0:0.1.
  • [2025-04] Our latest work, DPLM-2.1, which focuses on analysis and better protein structure modeling for multimodal protein language models, is accepted to ICML'25 as a Spotlight! Check out Elucidating the Design Space of Multimodal Protein Language Models. We have released the implementation of the finer-grained, bit-based structure modeling variant (DPLM-2 Bit). The full implementation will be released soon.
  • [2024-10] Check out our new work DPLM-2, a multimodal protein foundation model that extends DPLM to simultaneously model, understand, and generate both sequences and structures!
  • [2024-03] We release DPLM, a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences!

Quick Start

Installation

```bash
# clone project
git clone --recursive https://url/to/this/repo/dplm.git
cd dplm

# create conda virtual environment
env_name=dplm

conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}

# automatically install everything else
bash scripts/install.sh
```

Load Pretrained Models

Users can load a DPLM/DPLM-2 checkpoint by:

```python
from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
from byprot.models.dplm2 import DPLM2Bit

dplm = DPLM.from_pretrained("airkingbd/dplm_650m").cuda()
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda()
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
```

Generation Examples

Protein sequence generation

```python
from generate_dplm import initialize_generation

input_tokens = initialize_generation(
    length=200,
    num_seqs=5,
    tokenizer=dplm.tokenizer,
    device=next(dplm.parameters()).device
)
samples = dplm.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
print([''.join(seq.split(' ')) for seq in dplm.tokenizer.batch_decode(samples, skip_special_tokens=True)])
```

Protein sequence-structure co-generation

Users can check the generated sequences and structures in the ./generation-results folder.

```python
from generate_dplm2 import initialize_generation, save_results

input_tokens = initialize_generation(
    task="co_generation",
    length=200,
    num_seqs=5,
    tokenizer=dplm2.tokenizer,
    device=next(dplm2.parameters()).device
)[0]

samples = dplm2.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_generation",
    tokenizer=dplm2.tokenizer,
    struct_tokenizer=dplm2.struct_tokenizer,
    save_pdb=True
)

samples = dplm2_bit.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_bit_generation",
    tokenizer=dplm2_bit.tokenizer,
    struct_tokenizer=dplm2_bit.struct_tokenizer
)
```

Model Checkpoints

Access pretrained models in varying sizes:

| Model name     | Model size      |
| -------------- | --------------- |
| dplm-150m      | 150M parameters |
| dplm-650m      | 650M parameters |
| dplm-3b        | 3B parameters   |
| dplm2-150m     | 150M parameters |
| dplm2-650m     | 650M parameters |
| dplm2-3b       | 3B parameters   |
| dplm2-bit-650m | 650M parameters |

Advanced Usage

Training

DPLM

Dataset

We pretrain DPLM on the UniRef50 dataset, which contains about 42 million protein sequences. We use the preprocessed UniRef50 dataset provided by EvoDiff (Alamdari et al., 2023), which can be downloaded from this link. After downloading, please place the dataset in the ./data-bin/uniref50 folder.

We also provide the preprocessed dataset in HuggingFace datasets format, which we recommend using. Users can download the HF dataset locally in advance for faster loading by:

```bash
bash scripts/download_uniref50_hf.sh
```
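Once downloaded, the dataset can be inspected with the `datasets` library pinned in requirements.txt. A minimal sketch, assuming the download script stores the data under a local folder such as ./data-bin/uniref50/hf (the exact path is an assumption):

```python
from datasets import load_from_disk

# path is an assumption; point it at wherever scripts/download_uniref50_hf.sh stores the data
uniref50 = load_from_disk("./data-bin/uniref50/hf")
print(uniref50)  # shows the splits/columns so you can verify the download
```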

Example of training

We train DPLM with approximately 1 million tokens per batch for 100,000 training steps.

The following command is run on one node with 8 A100 GPUs. If you want to train on multiple nodes, adjust the settings so that max_tokens * accumulate_grad_batches * #GPUs remains approximately 1 million.
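For example, a rough sketch of that bookkeeping (plain arithmetic, not repository code):

```python
# Keep the effective batch near 1M tokens when the GPU count changes.
target_tokens = 1_000_000
num_gpus = 16                    # e.g. two nodes with 8 GPUs each
max_tokens = 8192                # tokens per GPU per step
accumulate_grad_batches = max(1, round(target_tokens / (num_gpus * max_tokens)))
print(accumulate_grad_batches)   # -> 8, half of the single-node setting of 16
```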

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

max_tokens=8192
accumulate_grad_batches=16
# the effective batch size is #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(16),
# resulting in approximately 1 million tokens.

exp=dplm/dplm_650m
model_name=dplm_650m

python train.py \
    experiment=${exp} name=${model_name} \
    datamodule.max_tokens=${max_tokens} \
    trainer.accumulate_grad_batches=${accumulate_grad_batches}
```

You can adjust the other training configurations in the configs/experiment/dplm/dplm_650m.yaml as needed.

DPLM-2

Dataset

We use experimental structures from the PDB and AF2-predicted structures from the SwissProt dataset as training data for DPLM-2. We provide a preprocessed HuggingFace dataset of PDB and SwissProt. Users can download the HF dataset locally in advance for faster loading by:

```bash
bash scripts/download_pdb_swissprot.sh
```

Example of training

As noted in Section 3.2 of the DPLM-2 paper, we propose an efficient warm-up training strategy to mitigate the scarcity of structure training data. During training, we initialize the DPLM-2 model with a pretrained DPLM checkpoint to leverage the evolutionary knowledge captured by the sequence-based pLM during large-scale sequence pretraining, which is beneficial for structure modeling.
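As a rough illustration of this warm-up recipe (not the repository's training code), one could initialize from the pretrained DPLM checkpoint and attach LoRA adapters with the `peft` package pinned in requirements.txt; the target module names and LoRA hyperparameters below are assumptions:

```python
# Illustrative sketch only: warm-start multimodal training from sequence-pretrained DPLM
# weights and restrict updates to low-rank adapters via peft (LoRA).
from peft import LoraConfig, get_peft_model
from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM

base = DPLM.from_pretrained("airkingbd/dplm_650m")   # pretrained sequence-only DPLM
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["query", "key", "value"],        # hypothetical module names
)
model = get_peft_model(base, lora_cfg)               # wrapping the full model is an assumption
model.print_trainable_parameters()                   # only the adapter weights remain trainable
```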

We train DPLM-2 with approximately 64,000 tokens per batch for 100,000 training steps. To preserve the evolutionary knowledge captured by DPLM, we use LoRA to prevent large parameter shifts. The training command is as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

max_tokens=8192
accumulate_grad_batches=1
# the effective batch size is #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(1),
# resulting in approximately 64 thousand tokens.

exp=dplm2/dplm2_650m
model_name=dplm2_650m

python train.py \
    experiment=${exp} name=${model_name} \
    datamodule.max_tokens=${max_tokens} \
    trainer.accumulate_grad_batches=${accumulate_grad_batches}
```

DPLM-2 Bit-based Modeling

In our latest work DPLM-2.1, we show that the index-based structure token is challenging for the model to predict. A finer-grained, bit-based modeling approach in the latent space (i.e., predicting each bit of the quantized structure feature instead of the index) leads to better structural modeling and generation performance.
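To make the distinction concrete, here is a toy sketch (with an assumed codebook size) of the index-based versus bit-based view of a single quantized structure token:

```python
# Toy illustration, not repository code: one quantized structure token viewed either
# as a single categorical index or as the bits of that index, which is the
# finer-grained target that DPLM-2 Bit predicts.
codebook_bits = 13                                   # assumed; log2 of the structure vocabulary size
index = 5371                                         # index-based view: one class out of 2**13
bits = [(index >> b) & 1 for b in range(codebook_bits)]
print(bits)                                          # bit-based view: 13 binary targets
assert sum(bit << b for b, bit in enumerate(bits)) == index
```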

The training dataset is the same as for DPLM-2, and the training command is as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

max_tokens=8192
accumulate_grad_batches=1
# the effective batch size is #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(1),
# resulting in approximately 64 thousand tokens.

exp=dplm2/dplm2_bit_650m
model_name=dplm2_bit_650m

python train.py \
    experiment=${exp} name=${model_name} \
    datamodule.max_tokens=${max_tokens} \
    trainer.accumulate_grad_batches=${accumulate_grad_batches}
```

Unconditional protein (co-)generation

Protein sequence generation (DPLM)

The results of unconditional protein sequence generation of DPLM of different scales (150M, 650M, 3B) are shown in the table below. For more details, please refer to our paper.

| Length | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
| ------ | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---- |
| 150M | 73.31 | 84.30 | 84.82 | 86.90 | 81.71 | 81.53 | 81.56 | 80.92 | 78.71 | 72.10 |
| 650M | 74.00 (+0.69) | 85.61 (+1.31) | 85.91 (+1.09) | 88.16 (+1.26) | 82.58 (+0.87) | 84.38 (+2.85) | 83.87 (+2.31) | 83.00 (+2.08) | 84.92 (+6.21) | 81.51 (+9.41) |
| 3B | 77.78 (+4.47) | 86.16 (+1.86) | 87.39 (+2.57) | 90.06 (+3.16) | 87.43 (+5.72) | 86.01 (+4.48) | 84.64 (+3.08) | 85.88 (+4.96) | 85.93 (+7.22) | 83.86 (+11.76) |

To generate new protein sequences using a pre-trained DPLM model:

```bash
model_name=dplm_650m # choose from dplm_150m, dplm_650m, dplm_3b
output_dir=generation-results/${model_name}/uncond_generation

mkdir -p generation-results

python generate_dplm.py --model_name airkingbd/${model_name} \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}

# Evaluation
bash analysis/plddt_calculate.sh ${output_dir} # compute pLDDT using ESMFold
```

We also provide evaluation scripts in the analysis folder. Users can use analysis/uncond_analysis.ipynb to obtain the average pLDDT score for each length and draw a line chart of the pLDDT scores.
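A rough equivalent of that summary step, assuming the per-sequence pLDDT values have been collected into a CSV (the file name and the `length`/`plddt` columns are hypothetical):

```python
# Sketch only: average ESMFold pLDDT per generated length, mirroring what
# analysis/uncond_analysis.ipynb reports. File name and column names are assumptions.
import pandas as pd

df = pd.read_csv("generation-results/dplm_650m/uncond_generation/plddt.csv")
per_length = df.groupby("length")["plddt"].mean()
print(per_length)
per_length.plot(marker="o", xlabel="Sequence length", ylabel="Average pLDDT")
```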

Protein sequence-structure co-generation (DPLM-2 & DPLM-2-Bit)

DPLM-2 can generate diverse and highly plausible proteins through simultaneous structure-sequence co-generation.

Users can co-generate sequence and structure simultaneously with the command below:

```bash
# choose from dplm2_150m, dplm2_650m, dplm2_3b
model_name=dplm2_650m

# About the default sampling strategy, annealing@2.0:0.1,
# which anneals the temperature from 2.0 to 0.1:
# it begins with high randomness to maximize diversity
# and concludes with low randomness to ensure designability.
# This achieves a better trade-off between quality and diversity.
sampling_strategy=annealing@2.0:0.1

output_dir=generation-results/${model_name}
task=co_generation

mkdir -p ${output_dir}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task ${task} \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}

# Evaluation
input_fasta_dir=${output_dir}/co_generation
python src/byprot/utils/protein/evaluator_dplm2.py -cn unconditional_codesign \
    inference.input_fasta_dir=${input_fasta_dir}
```

Users can use analysis/plot.ipynb to plot the RMSD and TM-score distributions and the diversity for each length.

Co-generate sequence and structure with the DPLM-2 bit modeling variant:

```bash
model_name=dplm2_bit_650m
sampling_strategy=annealing@1.1:0.1

output_dir=generation-results/${model_name}
task=co_generation

mkdir -p ${output_dir}

python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task ${task} \
    --bit_model \
    --sampling_strategy ${sampling_strategy} \
    --num_seqs 50 \
    --max_iter 500 \
    --seq_lens 100 200 300 400 500 \
    --saveto ${output_dir}
```

Sequence-conditioned Generation: Forward Folding

DPLM-2 spontaneously enables protein structure prediction given a sequence (i.e., folding) in a zero-shot manner. We use CAMEO 2022 (provided by EigenFold) and a PDB date split (provided by MultiFlow) as test sets. We provide our preprocessed dataset in this link, which can be downloaded by:

```bash
bash scripts/download_metadata.sh
```

Partial results are shown in the table below. For more details, please refer to the DPLM-2.1 paper.

| Models | RMSD (CAMEO 2022) | TMscore (CAMEO 2022) | RMSD (PDB date) | TMscore (PDB date) |
|---|---|---|---|---|
| ESMFold | 3.99 | 0.85 | 2.84 | 0.93 |
| DPLM-2 | 7.70 | 0.79 | 5.30 | 0.83 |
| DPLM-2 Bit | 6.40 | 0.84 | 3.22 | 0.90 |

The folding generation and evaluation script is as follows. We utilize RMSD and TMscore between the predicted and ground truth structures for evaluation. DPLM-2 adopts argmax decoding for 100 sampling iterations.
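For reference, these two metrics can be computed directly with the `tmtools` package from requirements.txt; a minimal sketch on toy coordinates (not the repository's evaluator):

```python
# Sketch only: RMSD and TM-score between a predicted and a reference backbone,
# using tmtools' TM-align binding on toy C-alpha coordinates.
import numpy as np
from tmtools import tm_align

rng = np.random.default_rng(0)
pred_ca = rng.normal(size=(100, 3))    # predicted C-alpha coordinates (toy data)
ref_ca = rng.normal(size=(100, 3))     # ground-truth C-alpha coordinates (toy data)
seq = "A" * 100                        # TM-align also needs the sequences

result = tm_align(pred_ca, ref_ca, seq, seq)
print(result.rmsd, result.tm_norm_chain2)  # TM-score normalized by the reference chain
```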

```bash
model_name=dplm2_650m
output_dir=generation-results/${model_name}
task=folding

mkdir -p ${output_dir}

input_fasta_path=data-bin/cameo2022/aatype.fasta
python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task ${task} \
    --input_fasta_path ${input_fasta_path} \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

# Evaluation
input_fasta_dir=${output_dir}/folding
python src/byprot/utils/protein/evaluator_dplm2.py -cn forward_folding \
    inference.input_fasta_dir=${input_fasta_dir}
```

For structure prediction conditioned on other customized sequences, users can input a FASTA file and modify the input_fasta_path variable to generate the predicted structure.
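For example, a small sketch of that workflow, mirroring the folding command above (the sequence and output path below are made-up placeholders):

```bash
# Sketch: fold a custom sequence by writing it to a FASTA file and pointing
# input_fasta_path at it. The sequence here is a made-up placeholder.
input_fasta_path=./my_sequences.fasta
cat > ${input_fasta_path} << 'EOF'
>my_protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
EOF

python generate_dplm2.py \
    --model_name airkingbd/dplm2_650m \
    --task folding \
    --input_fasta_path ${input_fasta_path} \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ./generation-results/custom_folding
```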

Structure-conditioned generation: inverse folding

The DPLM family can perform inverse folding in different ways depending on the variant. DPLM performs inverse folding by placing an adapter layer on top of the pLM, similar to LM-Design, while DPLM-2 directly conditions on the tokenized structure tokens to predict the sequence.

Inverse Folding with DPLM

Partial results on the CATH 4.3 dataset are shown in the table below. For more details, please refer to our paper.

| Models | Trainable Params. | AAR | scTM | pLDDT |
|-----------|-------------------|-------|------|-------|
| LM-Design | 6.3M/650M | 56.49 | 0.85 | 74.89 |
| DPLM-150M | 3.1M/150M | 53.27 | 0.85 | 75.31 |
| DPLM-650M | 6.3M/650M | 56.61 | 0.86 | 76.78 |
| DPLM-3B | 68.2M/3.0B | 58.64 | 0.86 | 76.95 |

Data

Download the preprocessed CATH datasets:

```bash
bash scripts/download_cath.sh
```

Training

We train structure-conditional DPLM based on the LM-Design framework, using DPLM as the pre-trained protein language model. The training script is as follows.

```bash
exp=dplm/dplm_650m_invfold
dataset=cath_4.3
name=${dataset}/dplm_650m/invfold

python train.py \
    experiment=${exp} datamodule=${dataset} name=${name} \
    logger=tensorboard trainer=ddp_fp16
```

Evaluation on valid/test datasets

Users can set eval_sc to true to calculate the self-consistency TM-score and pLDDT, which results in a significant evaluation time overhead.

```bash
dataset=cath_4.3
exp_path=${dataset}/dplm_650m/invfold
eval_sc=false
# if ${eval_sc} is set to true, the program will calculate the self-consistency
# TM-score and pLDDT during generation,
# thus significantly increasing the evaluation time.

python test.py \
    experiment_path=${exp_path} \
    data_split=test ckpt_path=best.ckpt mode=predict \
    task.generator.max_iter=100 task.generator.eval_sc=${eval_sc}
```

Inverse Folding with DPLM-2

We provide the CAMEO 2022 and PDB date test set splits used in our paper, where the structures have been tokenized and saved to data-bin/cameo2022/struct.fasta and data-bin/PDB_date/struct.fasta. Users can use the following script to perform inverse folding and evaluation.

```bash
model_name=dplm2_650m
output_dir=generation-results/${model_name}
task=inverse_folding

mkdir -p ${output_dir}

input_fasta_path=data-bin/cameo2022/struct.fasta
python generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --task ${task} \
    --input_fasta_path ${input_fasta_path} \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ${output_dir}

# Evaluation
input_fasta_dir=${output_dir}/inverse_folding
python src/byprot/utils/protein/evaluator_dplm2.py -cn inverse_folding \
    inference.input_fasta_dir=${input_fasta_dir}
```

For any customized input structure, users can first tokenize the structure with the structure tokenizer and save it to a FASTA file using the following script:

```bash
# Tokenize
# each protein is represented by a pdb file
input_pdb_folder=/path/to/your/input/structure

# this will save two fasta files in the ${input_pdb_folder}/tokenized_protein folder:
# 1) struct.fasta, containing the tokenized structure tokens
# 2) aatype.fasta, containing the amino acid tokens.
python src/byprot/utils/protein/tokenize_pdb.py \
    --input_pdb_folder ${input_pdb_folder} \
    --output_dir ${input_pdb_folder}/tokenized_protein
```

Then users can specify the path of the generated struct.fasta as input and predict the sequence, as sketched below.
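A small sketch of that final step, reusing the inverse-folding command above with the newly tokenized structure (paths are placeholders):

```bash
# Sketch: inverse folding on a custom structure after tokenization.
input_fasta_path=${input_pdb_folder}/tokenized_protein/struct.fasta

python generate_dplm2.py \
    --model_name airkingbd/dplm2_650m \
    --task inverse_folding \
    --input_fasta_path ${input_fasta_path} \
    --max_iter 100 \
    --unmasking_strategy deterministic \
    --sampling_strategy argmax \
    --saveto ./generation-results/custom_inverse_folding
```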

Motif scaffolding

DPLM and DPLM-2 can both perform motif scaffolding. DPLM can condition on the motif sequence and predict the scaffold sequence. DPLM-2 is able to condition on both the sequence and structure of the motif and simultaneously co-generate the sequence and structure of the scaffold part, which leads to better performance.

We evaluate on the benchmark provided by FrameFlow. We use the motif PDB files provided by EvoDiff, and we also provide the PDBs and the corresponding structure tokens in this link. You can download the dataset by:

```bash
bash scripts/download_motif_scaffolds.sh
```

For each motif-scaffolding problem, we sample 100 sequences and then calculate the success rate according to two aspects: motif-part consistency and overall quality. For motif-part consistency, we use motif-RMSD < 1 Å as the success criterion. For overall quality, the assessment varies across approaches: for the sequence-based method (DPLM) we use pLDDT > 70, while for the co-generation method (DPLM-2) we use scTM > 0.8. For more details, please refer to our paper.
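As a toy illustration of these criteria (plain Python, not the repository's evaluation code; the per-sample metric values are assumed to have been computed already):

```python
# Sketch: success criteria for motif scaffolding as described above.
def is_success(motif_rmsd, plddt=None, sc_tm=None, co_generation=False):
    motif_ok = motif_rmsd < 1.0                                      # motif consistency: RMSD < 1 Å
    quality_ok = (sc_tm > 0.8) if co_generation else (plddt > 70.0)  # overall quality
    return motif_ok and quality_ok

# toy per-sample metrics for one motif-scaffolding problem
samples = [
    {"motif_rmsd": 0.7, "plddt": 82.1},
    {"motif_rmsd": 1.4, "plddt": 90.3},
    {"motif_rmsd": 0.9, "plddt": 65.0},
    {"motif_rmsd": 0.5, "plddt": 74.2},
]
success_rate = sum(is_success(**s) for s in samples) / len(samples)
print(success_rate)  # -> 0.5 for this toy set
```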

The success rate of each motif-scaffold problem is shown below.

| | Pass rate | Avg. Success rate | 1BCF | 1PRW | 1QJG | 1YCR | 2KL8 | 3IXT | 4JHW | 4ZYP | 5IUS | 5TPN | 5TRVlong | 5TRVmed | 5TRVshort | 5WN9 | 5YUI | 6E6Rlong | 6E6Rmed | 6E6Rshort | 6EXZlong | 6EXZmed | 6EXZshort | 7MRXlong | 7MRXmed | 7MRXshort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DPLM | 11/24 | 0.19 | 0.00 | 0.83 | 0.00 | 0.38 | 0.08 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.65 | 0.94 | 0.87 | 0.01 | 0.00 | 0.00 | 0.02 | 0.31 | 0.34 |
| DPLM-2 | 18/24 | 0.29 | 0.01 | 0.84 | 0.02 | 0.53 | 0.57 | 0.41 | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 | 0.02 | 0.03 | 0.00 | 0.00 | 0.78 | 0.77 | 0.64 | 0.44 | 0.55 | 0.58 | 0.20 | 0.22 | 0.24 |

DPLM

We provide the following script to sample sequences for each motif-scaffolding problem. Note that before generation, you should download the motif pdbs and place them in the data-bin/scaffolding-pdbs folder.

```bash
export CUDA_VISIBLE_DEVICES=0

model_name=dplm_650m
output_dir=./generation-results/${model_name}/motif_scaffold

mkdir -p generation-results

# Generate scaffold
python run/scaffold_generate_dplm.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto ${output_dir}

# Predict structure by ESMFold
max_tokens=1024
pdb_path=${output_dir}/scaffold_fasta/esmfold_pdb

# folding
mkdir -p ${pdb_path}

echo 'folding by ESMFold'
output_filename_list=$(ls ${output_dir}/scaffold_fasta)
echo ${output_filename_list}

python analysis/cal_plddt_dir.py -i ${output_dir}/scaffold_fasta -o ${pdb_path} --max-tokens-per-batch ${max_tokens}
```

For evaluation, users can use analysis/motif_analysis.ipynb to obtain the success rate for each problem.

DPLM-2

Before generation, the FASTA files of tokenized structure tokens and amino acid tokens of the motif should be in the data-bin/scaffolding-pdbs folder. Users can co-generate the scaffold sequence and structure, conditioning on the sequence and structure of the motif part.

```bash
export CUDA_VISIBLE_DEVICES=0

model_name=dplm2_650m
output_dir=./generation-results/${model_name}/motif_scaffold

mkdir -p generation-results

# Generate scaffold
python run/scaffold_generate_dplm2.py \
    --model_name airkingbd/${model_name} \
    --num_seqs 100 \
    --saveto ${output_dir}

# Predict structure by ESMFold
max_tokens=1024
python analysis/cal_plddt_dir.py -i ${output_dir}/scaffold_fasta --max-tokens-per-batch ${max_tokens}

# Calculate sc-TMscore
python src/byprot/utils/protein/evaluator_dplm2.py -cn unconditional_codesign \
    inference.input_fasta_dir=${output_dir}/scaffold_fasta inference.calculate_diversity=false
```

For evaluation, users can use analysis/motif_analysis.ipynb to obtain the success rate for each problem.

Representation Learning

The DPLM family excels in various downstream protein predictive tasks. DPLM is a superior protein sequence representation learner, while DPLM-2 can perform multimodal representation learning by leveraging both structure and sequence information, demonstrating its versatility and effectiveness. The following table summarizes the performance of the DPLM family; the DPLM-2 rows show the structure-aware representations outperforming sequence-based DPLM on most predictive tasks. We also find that performance improves with model size.

| Models | Thermostability | HumanPPI | Metal Ion Binding | EC | GO-MF | GO-BP | GO-CC | DeepLoc-Subcellular | DeepLoc-Binary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ESM2 (650M) | 0.691 | 84.78 | 71.88 | 0.866 | 0.676 | 0.344 | 0.402 | 83.68 | 92.28 |
| AR-LM | 0.638 | 68.48 | 61.66 | 0.691 | 0.566 | 0.258 | 0.287 | 68.53 | 88.31 |
| DPLM (150M) | 0.687 | 80.98 | 72.17 | 0.822 | 0.662 | 0.328 | 0.379 | 82.41 | 92.63 |
| DPLM (650M) | 0.695 | 86.41 | 75.15 | 0.875 | 0.680 | 0.357 | 0.409 | 84.56 | 93.09 |
| DPLM-2 (650M) | 0.714 | 84.44 | 74.28 | 0.878 | 0.680 | 0.359 | 0.411 | 82.98 | 93.64 |
| DPLM-2 (650M)* | -- | 87.78 | -- | -- | -- | -- | -- | 83.42 | -- |
| DPLM (3B) | 0.704 | 90.00 | 75.94 | 0.883 | 0.687 | 0.369 | 0.463 | 85.32 | 93.93 |

We find that DPLM-2 shows performance degradation on some tasks (e.g., HumanPPI and DeepLoc-Subcellular), because continued training on a much smaller amount of structure data can result in overfitting and degrade the representations learned during large-scale sequence pretraining. * denotes training on the larger-scale AFDB representative structure data, and we find that enlarging the structure data is indeed a key factor for better multimodal protein representations. Please refer to the DPLM-2 paper for more details.

The training and evaluation pipeline is based on the SaProt repository, and we slightly modify the code to support DPLM. Users can select the "representation_learning" branch for the evaluation of protein predictive tasks.

Acknowledgements

DPLM extends its gratitude to the following projects and individuals.

We draw inspiration from and leverage/modify implementations of the following projects:

  • microsoft/evodiff for the preprocessed UniRef50 dataset, the sequence sampling evaluation implementation, and the data pipeline.
  • westlake-repl/SaProt for the representation learning evaluation pipeline.
  • jingraham/neurips19-graph-protein-design for the preprocessed CATH dataset.
  • facebook/esm for their ESM implementations and pretrained model weights.
  • jasonkyuyim/se3_diffusion for their self-consistency structural evaluation implementation.
  • jasonkyuyim/multiflow for their evaluation pipeline, structure data processing, and preprocessed PDB dataset.
  • bjing2016/EigenFold for the CAMEO 2022 dataset.

We express our sincere appreciation to the authors of these repositories for their invaluable contributions to the development of DPLM family.

Citation

@inproceedings{wang2024dplm,
  title={Diffusion Language Models Are Versatile Protein Learners},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

@inproceedings{wang2025dplm2,
  title={DPLM-2: A Multimodal Diffusion Protein Language Model},
  author={Wang, Xinyou and Zheng, Zaixiang and Ye, Fei and Xue, Dongyu and Huang, Shujian and Gu, Quanquan},
  booktitle={International Conference on Learning Representations},
  year={2025}
}

@inproceedings{hsieh2025dplm2_1,
  title={Elucidating the Design Space of Multimodal Protein Language Models},
  author={Hsieh, Cheng-Yen and Wang, Xinyou and Zhang, Daiheng and Xue, Dongyu and Ye, Fei and Huang, Shujian and Zheng, Zaixiang and Gu, Quanquan},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

Owner

  • Name: Bytedance Inc.
  • Login: bytedance
  • Kind: organization
  • Location: Singapore

GitHub Events

Total
  • Issues event: 35
  • Watch event: 177
  • Issue comment event: 30
  • Push event: 25
  • Pull request event: 12
  • Fork event: 21
  • Create event: 1
Last Year
  • Issues event: 35
  • Watch event: 177
  • Issue comment event: 30
  • Push event: 25
  • Pull request event: 12
  • Fork event: 21
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 21
  • Total Committers: 3
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.19
Past Year
  • Commits: 21
  • Committers: 3
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.19
Top Committers
Name Email Commits
wangxinyou.203 w****3@b****m 17
zhengzaixiang z****g@b****m 3
OliverT1 o****1@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 31
  • Total pull requests: 12
  • Average time to close issues: 26 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 27
  • Total pull request authors: 2
  • Average comments per issue: 1.35
  • Average comments per pull request: 0.0
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 25
  • Pull requests: 11
  • Average time to close issues: 20 days
  • Average time to close pull requests: 3 days
  • Issue authors: 24
  • Pull request authors: 2
  • Average comments per issue: 0.96
  • Average comments per pull request: 0.0
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zihaoli0629 (2)
  • hhhhh789 (2)
  • done520 (2)
  • OrangeCat7777777 (2)
  • Barabaika (2)
  • abcdxs117 (2)
  • paperClub-hub (2)
  • YuX-Ren (1)
  • Lvchangze (1)
  • hml1996-fight (1)
  • Yangfan-YU (1)
  • xiaoxiaokuye (1)
  • YusenZheng (1)
  • haresh121 (1)
  • TheMatrixMaster (1)
Pull Request Authors
  • wxy-nlp (11)
  • OliverT1 (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • MDAnalysis *
  • biopython ==1.79
  • biotite *
  • black *
  • datasets ==2.20.0
  • debugpy *
  • deepspeed ==0.14.4
  • dm-tree *
  • e3nn *
  • einops *
  • fair-esm *
  • flake8 *
  • hydra-colorlog ==1.2.0
  • hydra-core ==1.2.0
  • hydra-optuna-sweeper ==1.2.0
  • isort *
  • lightning ==2.2.0
  • lmdb *
  • matplotlib *
  • mdtraj *
  • ml-collections *
  • modelcif *
  • nbstripout *
  • opt_einsum *
  • pandas *
  • peft ==0.11.1
  • pre-commit *
  • pudb *
  • pyrootutils *
  • pytest *
  • python-dotenv *
  • pytorch_lightning ==2.2.0
  • rich *
  • scikit-learn *
  • seaborn *
  • sh *
  • sympy *
  • tensorboard *
  • tmtools *
  • torch_geometric *
  • torchmetrics *
  • torchtext ==0.16.0
  • transformers ==4.39.2
setup.py pypi