https://github.com/dptech-corp/deepmd-pytorch

Deprecation - migrated to DeePMD-kit


Repository

Basic Info
Statistics
  • Stars: 10
  • Watchers: 13
  • Forks: 2
  • Open Issues: 5
  • Releases: 2
Created almost 3 years ago · Last pushed over 1 year ago

README.md

[!CAUTION] This repository has been deprecated. The whole repository has been migrated to DeePMD-kit and released in DeePMD-kit v3.0.0a0.

This repository was written by Hang'rui Bi, based on Shaochen Shi's implementation of DeePMD-kit using PyTorch. It is intended to offer accuracy and performance comparable to the TensorFlow implementation.

Quick Start

Install

This package requires PyTorch 2.

```bash
# PyTorch 2 recommends Python >= 3.8.
conda create -n deepmd-pt python=3.10
conda activate deepmd-pt
# Following instructions on pytorch.org
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/dptech-corp/deepmd-pytorch.git
# Install from the cloned directory (the leading ./ makes pip treat it as a local path)
pip install ./deepmd-pytorch

# ... or
pip install git+https://github.com/dptech-corp/deepmd-pytorch.git
```

Run

```bash
conda activate deepmd-pt
python3 dp train tests/water/se_e2_a.json
```

Profiling

```bash
# you may change the number of training steps before profiling
PYTHONPATH=/root/deepmd_on_pytorch python3 -m cProfile -o profile deepmd_pt/main.py train tests/water/se_e2_a.json 2>&1
python -m pstats
```
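A profile file written with `-o profile` can also be inspected from Python instead of the interactive `pstats` browser. The following is a minimal sketch using only the standard library; `busy_work` is a made-up stand-in for a training run, and the temporary output path is an assumption chosen so the example runs anywhere.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n: int) -> int:
    """Toy workload standing in for a training run (hypothetical)."""
    return sum(i * i for i in range(n))


# Collect a profile and dump it to a file, as `python -m cProfile -o profile ...` would.
profile_path = os.path.join(tempfile.mkdtemp(), "profile")
profiler = cProfile.Profile()
profiler.enable()
busy_work(100_000)
profiler.disable()
profiler.dump_stats(profile_path)

# Load the file and print the five entries with the largest cumulative time.
stats = pstats.Stats(profile_path)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
```

Sorting by `"cumulative"` tends to surface the high-level call paths first; sorting by `"tottime"` instead highlights the functions that are expensive on their own.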

References

  • Original DeePMD-kit on TensorFlow https://github.com/deepmodeling/deepmd-kit
  • DeePMD on PyTorch demo https://github.com/shishaochen/deepmd_on_pytorch

Structure

```
deepmd_pt
    entrypoints
        main.py
    train
        train.py
    infer
        inference.py
    model
        model.py
        descriptor
            descriptor.py
            embedding_net.py
        task
            fitting.py
    loss
        loss.py
    optimizer
        LKF.py
        KFWrapper.py
    utils
        dataset.py
        env.py
        learning_rate.py
        my_random.py
        stat.py
```

Deploy

Tested with libtorch pre-CXX11 abi cu116, cuda 11.6, torch 1.13

```bash
python test.py
export CMAKE_PREFIX_PATH=`python -c "import torch;print(torch.__path__[0])"`/share/cmake:$CMAKE_PREFIX_PATH
cmake -B build
cd build
cmake --build .
```

Test

First, set TEST_CONFIG in env.py to the input config you want to test. For example, `tests/water/se_e2.json` is the config for a tiny water problem. The water dataset is included in the repository.

The tests are aligned with deepmd-kit 2.1.5 and may fail with deepmd-kit 2.2 or higher.

Distributed Data Parallelism

Currently, we support input files in the traditional dp format. We construct a PyTorch DataSet for each system and fetch batched data with a dedicated DataLoader. This guarantees that the input data for one rank in one mini-batch comes from the same system, i.e. has the same number of atoms, which is required by the model. Using DistributedSampler, each frame is extracted for training once and only once per epoch, no matter how many ranks there are.
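The once-per-epoch property can be illustrated without PyTorch. The sketch below is a pure-Python stand-in for the idea behind DistributedSampler, not its actual implementation: every rank builds the same seeded permutation and keeps a strided slice of it. The function name `shard_indices` and the sizes are made up for illustration, and the frame count is chosen to be divisible by the number of ranks (the real DistributedSampler pads the index list otherwise, unless `drop_last` is set).

```python
import random
from typing import List


def shard_indices(num_frames: int, world_size: int, rank: int,
                  epoch: int, seed: int = 0) -> List[int]:
    """Return the frame indices one rank would visit in one epoch.

    Assumes num_frames is divisible by world_size, so no padding is needed.
    """
    order = list(range(num_frames))
    random.Random(seed + epoch).shuffle(order)  # identical permutation on every rank
    return order[rank::world_size]              # strided slice owned by this rank


num_frames, world_size = 12, 4
shards = [shard_indices(num_frames, world_size, rank, epoch=0)
          for rank in range(world_size)]

# Taken together, the ranks cover every frame exactly once.
covered = sorted(i for shard in shards for i in shard)
print(covered == list(range(num_frames)))
```

Seeding with `seed + epoch` gives a different shuffle each epoch while keeping all ranks in agreement, which is what lets the slices stay disjoint.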

The systems vary in length, so the number of mini-batches each DataLoader yields differs. An index table is created on each rank: for each DataLoader, its index is appended to the table as many times as the DataLoader has mini-batches. In pseudocode:

```python
self.index: List[int] = []
self.dataloaders: List[DataLoader] = []
for system in systems:
    dl = create_dataloader(system)
    self.dataloaders.append(dl)
    for _ in range(len(dl)):  # len(dl) == how many mini-batches in this system
        self.index.append(len(self.dataloaders) - 1)
```

We initialize a meta-dataset named dploaderset with the index. Each step draws an index at random using RandomSampler and fetches data from the corresponding DataLoader. Hence, in one epoch, each DataLoader is accessed exactly as many times as its length, which means that every input frame is visited without omission.

```mermaid
flowchart LR

subgraph systems
    subgraph system1
        direction LR
        frame1[frame 1]
        frame2[frame 2]
    end

    subgraph system2
        direction LR
        frame3[frame 3]
        frame4[frame 4]
        frame5[frame 5]
    end
end

subgraph dataset
    dataset1[dataset 1]
    dataset2[dataset 2]
end
system1 -- frames --> dataset1
system2 --> dataset2

subgraph distributed sampler
    ds1[distributed sampler 1]
    ds2[distributed sampler 2]
end
dataset1 --> ds1
dataset2 --> ds2

subgraph dataloader
    dataloader1[dataloader 1]
    dataloader2[dataloader 2]
end
ds1 -- mini batch --> dataloader1
ds2 --> dataloader2

subgraph index[index on Rank 0]
    dl11[dataloader 1, entry 1]
    dl21[dataloader 2, entry 1]
    dl22[dataloader 2, entry 2]
end
dataloader1 --> dl11
dataloader2 --> dl21
dataloader2 --> dl22

index -- for each step, choose 1 system --> RandomSampler
--> dploaderset --> bufferedq[buffered queue] --> model

```
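The index-table sampling described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in dataloader.py: plain lists stand in for the per-system DataLoaders, a seeded shuffle plays the role of RandomSampler (which, without replacement, also visits each entry once), and the names `loaders` and `run_epoch` are made up.

```python
import random
from collections import Counter
from typing import Iterator, List

# Stand-ins for per-system DataLoaders: each inner list holds that system's mini-batches.
loaders: List[List[str]] = [
    ["sys0-batch0"],
    ["sys1-batch0", "sys1-batch1", "sys1-batch2"],
]

# Index table: loader i appears len(loaders[i]) times.
index: List[int] = []
for loader_id, loader in enumerate(loaders):
    index.extend([loader_id] * len(loader))


def run_epoch(seed: int) -> List[str]:
    """Visit the index table in random order, pulling the next batch each time."""
    iterators: List[Iterator[str]] = [iter(loader) for loader in loaders]
    order = index[:]
    random.Random(seed).shuffle(order)  # plays the role of RandomSampler
    return [next(iterators[loader_id]) for loader_id in order]


batches = run_epoch(seed=0)
visits = Counter(batch.split("-")[0] for batch in batches)
print(visits)
```

Because each loader's id occurs exactly as often as it has mini-batches, no iterator is exhausted early and none is left with unread batches at the end of the epoch.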

For more details, please see deepmd-pytorch/deepmd_pt/utils/dataloader.py.

Run on a local cluster

We use torchrun to launch a DDP training session.

To start training with multiple GPUs on one node, set the parameter nproc_per_node to the number of GPUs:

```bash
torchrun --nproc_per_node=4 --no-python dp_pt train input.json
# Not setting nproc_per_node uses only 1 GPU
torchrun --no-python dp_pt train input.json
```

If you wish to run code under active development without installing it with pip, try:

```bash
PYTHONPATH=~/deepmd-pytorch torchrun ~/deepmd-pytorch/deepmd_pt/entrypoints/main.py train input.json
```

To train a model on a cluster, one can manually launch the task using the commands below (usually this is done by your job management system). Set nnodes to the number of available nodes, node_rank to the rank of the current node among all nodes (not the rank of processes!), and nproc_per_node to the number of available GPUs on one node. Make sure that every node can access the rendezvous address and port (rdzv_endpoint in the command) and has the same number of GPUs.

```bash
# Running DDP on 2 nodes with 4 GPUs each
# On node 0:
torchrun --rdzv_endpoint=node0:12321 --nnodes=2 --nproc_per_node=4 --node_rank=0 --no_python dp train tests/water/se_e2_a.json
# On node 1:
torchrun --rdzv_endpoint=node0:12321 --nnodes=2 --nproc_per_node=4 --node_rank=1 --no_python dp train tests/water/se_e2_a.json
```

Note: Set environment variables to tune CPU-specific optimizations in advance.

Note for developers: torchrun by default passes settings as environment variables (list here).
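As a rough illustration of that note, a worker process could collect the variables torchrun exports as shown below. `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` are set by torchrun; the helper name `read_torchrun_env`, the single-process fallback defaults, and the example values are assumptions made for this sketch.

```python
import os
from typing import Dict, Mapping


def read_torchrun_env(environ: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Collect the rank information torchrun exports to each worker process.

    The fallback values describe a plain single-process run (an assumption).
    """
    return {
        "rank": int(environ.get("RANK", "0")),              # global process rank
        "local_rank": int(environ.get("LOCAL_RANK", "0")),  # rank within this node
        "world_size": int(environ.get("WORLD_SIZE", "1")),  # total number of processes
        "master_addr": environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(environ.get("MASTER_PORT", "29500")),
    }


# Made-up values for illustration: process 5 of 8 in a 2-node, 4-GPU-per-node job.
example_env = {
    "RANK": "5",
    "LOCAL_RANK": "1",
    "WORLD_SIZE": "8",
    "MASTER_ADDR": "node0",
    "MASTER_PORT": "12321",
}
info = read_torchrun_env(example_env)
print(info)
```

Passing the environment in as a mapping keeps the helper easy to exercise outside a real torchrun launch; inside a worker it can simply be called with no arguments.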

To check forward, backward, and communication time, set the environment variables TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL. More details can be found here.

Run on slurm system

Use the .sbatch file in slurm/. You may need to modify some of the configuration to run on your system.

```bash
sbatch distributed_data_parallel_slurm_setup.sbatch
```

These files are modified from: https://github.com/lkskstlr/distributed_data_parallel_slurm_setup

Track runs using W&B

wandb is automatically installed as a requirement for deepmd-pytorch.

First, log in with wandb login, then set the corresponding fields under the "training" part of your input file (typically input.json) as follows:

```jsonc
// "training": {
    "wandb_config": {
        "job_name": "Cu-dpa_adam_bz1_at2",
        "wandb_enabled": true,
        "entity": "dp_model_engineering", // a username or team name
        "project": "DPA-2"
    },
```

To disable logging temporarily, set env var WANDB_MODE=disabled.
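To make the role of these fields concrete, here is one way they could be turned into keyword arguments for `wandb.init`, which accepts `project`, `entity` and `name`. How deepmd-pytorch actually consumes the config is not shown in this README, so the helper `wandb_init_kwargs` and the mapping of `job_name` to `name` are assumptions. The sketch does not import wandb; `WANDB_MODE` is read by wandb itself and needs no handling here.

```python
import json
from typing import Any, Dict, Optional


def wandb_init_kwargs(input_config: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Map the "wandb_config" fields to keyword arguments for wandb.init.

    Hypothetical helper: the field-to-argument mapping is an assumption.
    Returns None when logging is not enabled in the config.
    """
    cfg = input_config.get("training", {}).get("wandb_config", {})
    if not cfg.get("wandb_enabled", False):
        return None
    return {
        "project": cfg["project"],
        "entity": cfg["entity"],
        "name": cfg["job_name"],  # assumed to become the run name
    }


raw = """
{"training": {"wandb_config": {
    "job_name": "Cu-dpa_adam_bz1_at2",
    "wandb_enabled": true,
    "entity": "dp_model_engineering",
    "project": "DPA-2"}}}
"""
kwargs = wandb_init_kwargs(json.loads(raw))
print(kwargs)
```

With wandb installed, the result could be passed on as `wandb.init(**kwargs)` when it is not `None`.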

Known Problems & TODO

Owner

  • Name: DP Technology
  • Login: dptech-corp
  • Kind: organization
  • Location: China
