https://github.com/dptech-corp/deepmd-pytorch
Deprecation - migrated to DeePMD-kit
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity (12.0%) to scientific vocabulary)
Repository
Deprecation - migrated to DeePMD-kit
Basic Info
- Host: GitHub
- Owner: dptech-corp
- License: lgpl-3.0
- Language: Python
- Default Branch: master
- Homepage: https://github.com/deepmodeling/deepmd-kit
- Size: 77.2 MB
Statistics
- Stars: 10
- Watchers: 13
- Forks: 2
- Open Issues: 5
- Releases: 2
Metadata Files
README.md
> [!CAUTION]
> This repository has been deprecated. The whole repository has been migrated to DeePMD-kit and released in DeePMD-kit v3.0.0a0.

This repository was written by Hangrui Bi, based on Shaochen Shi's PyTorch implementation of DeePMD-kit. It aims to offer accuracy and performance comparable to the TensorFlow implementation.
Quick Start
Install
This package requires PyTorch 2.

```bash
# PyTorch 2 recommends Python >= 3.8.
conda create -n deepmd-pt python=3.10
conda activate deepmd-pt

# Follow the instructions on pytorch.org:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

git clone https://github.com/dptech-corp/deepmd-pytorch.git
pip install deepmd-pytorch

# ... or
pip install git+https://github.com/dptech-corp/deepmd-pytorch.git
```
Run
```bash
conda activate deepmd-pt
python3 dp train tests/water/se_e2_a.json
```
Profiling
```bash
# You may change the number of training steps before profiling.
PYTHONPATH=/root/deepmd_on_pytorch python3 -m cProfile -o profile deepmd_pt/main.py train tests/water/se_e2_a.json 2>&1
python -m pstats
```
References
- Original DeePMD-kit on TensorFlow: https://github.com/deepmodeling/deepmd-kit
- DeePMD on PyTorch demo: https://github.com/shishaochen/deepmd_on_pytorch
Structure
```
deepmd_pt
├── entrypoints
│   └── main.py
├── train
│   └── train.py
├── infer
│   └── inference.py
├── model
│   └── model.py
├── descriptor
│   ├── descriptor.py
│   └── embedding_net.py
├── task
│   └── fitting.py
├── loss
│   └── loss.py
├── optimizer
│   ├── LKF.py
│   └── KFWrapper.py
└── utils
    ├── dataset.py
    ├── env.py
    ├── learning_rate.py
    ├── my_random.py
    └── stat.py
```
Deploy
Tested with libtorch (pre-cxx11 ABI, cu116), CUDA 11.6, torch 1.13.

```bash
python test.py
export CMAKE_PREFIX_PATH=`python -c "import torch;print(torch.__path__[0])"`/share/cmake:$CMAKE_PREFIX_PATH
cmake -B build
cd build
cmake --build .
```
Test
First modify `TEST_CONFIG` in `env.py` to the input config you want to test. For example, `tests/water/se_e2.json` is the config for a tiny water problem. The water dataset is included in the repository.
The tests are aligned with DeePMD-kit 2.1.5 and may fail with DeePMD-kit 2.2 or higher.
Distributed Data Parallelism
Currently, we support input files in the traditional dp format. We construct a PyTorch Dataset for each system and fetch batched data with a dedicated DataLoader. This guarantees that the input data for one rank in one mini-batch comes from the same system, i.e. has the same number of atoms, which the model requires. Using DistributedSampler, each frame is used for training once and only once per epoch, no matter how many ranks there are.
The systems vary in length, so the number of mini-batches we can get from each DataLoader differs. An index table is created on each rank: for each DataLoader, its index value is appended to the index array as many times as the DataLoader is long. In pseudocode:
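The per-system Dataset/DataLoader pairing with DistributedSampler described above can be sketched as follows (a minimal illustration, not the repository's actual code; `SystemDataset` and `make_loader` are hypothetical names):

```python
import torch
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class SystemDataset(Dataset):
    """Toy stand-in for one dp-format system: every frame has the same atom count."""
    def __init__(self, n_frames: int, n_atoms: int):
        self.coords = torch.zeros(n_frames, n_atoms, 3)  # placeholder coordinates

    def __len__(self) -> int:
        return self.coords.shape[0]

    def __getitem__(self, i: int) -> torch.Tensor:
        return self.coords[i]

def make_loader(dataset: Dataset, batch_size: int, rank: int, world_size: int) -> DataLoader:
    # DistributedSampler partitions the frames across ranks, so each frame is
    # drawn once and only once per epoch no matter how many ranks there are.
    # Because every frame in this dataset comes from one system, any mini-batch
    # the loader yields has a uniform number of atoms, as the model requires.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

With `world_size=1`, iterating the loader for a 5-frame, 4-atom system yields batches of shape `(b, 4, 3)` that together cover all five frames.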
```python
self.index: List[int] = []
self.dataloaders: List[DataLoader] = []
for system in systems:
    dl = create_dataloader(system)
    self.dataloaders.append(dl)
    for _ in range(len(dl)):  # len(dl) == number of mini-batches in this system
        self.index.append(len(self.dataloaders) - 1)
```
We initialize a meta-dataset named `dploaderset` with the index. Each step draws an index at random using RandomSampler and fetches data from the corresponding DataLoader. Hence, in one epoch, each DataLoader is accessed exactly as many times as its length, meaning all input frames are used with none omitted.
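The bookkeeping this implies can be checked with a few lines of plain Python (purely illustrative; `build_index` and `epoch_schedule` are hypothetical names, and `random.shuffle` stands in for RandomSampler):

```python
import random
from typing import List

def build_index(loader_lengths: List[int]) -> List[int]:
    # Entry i appears once for every mini-batch that DataLoader i can produce.
    index = []
    for i, length in enumerate(loader_lengths):
        index.extend([i] * length)
    return index

def epoch_schedule(loader_lengths: List[int], seed: int = 0) -> List[int]:
    # Shuffling the index fixes, for each training step, which DataLoader the
    # next mini-batch is fetched from; over one epoch each loader is drawn
    # exactly as many times as it has mini-batches.
    schedule = build_index(loader_lengths)
    random.Random(seed).shuffle(schedule)
    return schedule
```

For example, two systems yielding 3 and 5 mini-batches give an 8-entry schedule in which loader 0 appears three times and loader 1 five times, in random order.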
```mermaid
flowchart LR
subgraph systems
subgraph system1
direction LR
frame1[frame 1]
frame2[frame 2]
end
subgraph system2
direction LR
frame3[frame 3]
frame4[frame 4]
frame5[frame 5]
end
end
subgraph dataset
dataset1[dataset 1]
dataset2[dataset 2]
end
system1 -- frames --> dataset1
system2 --> dataset2
subgraph distributed sampler
ds1[distributed sampler 1]
ds2[distributed sampler 2]
end
dataset1 --> ds1
dataset2 --> ds2
subgraph dataloader
dataloader1[dataloader 1]
dataloader2[dataloader 2]
end
ds1 -- mini batch --> dataloader1
ds2 --> dataloader2
subgraph index[index on Rank 0]
dl11[dataloader 1, entry 1]
dl21[dataloader 2, entry 1]
dl22[dataloader 2, entry 2]
end
dataloader1 --> dl11
dataloader2 --> dl21
dataloader2 --> dl22
index -- for each step, choose 1 system --> RandomSampler
--> dploaderset --> bufferedq[buffered queue] --> model
```
For more details, please see deepmd-pytorch/deepmd_pt/utils/dataloader.py.
Run on a local cluster
We use torchrun to launch a DDP training session.
To start training with multiple GPUs in one node, set the parameter `nproc_per_node` to the number of GPUs:
```bash
torchrun --nproc_per_node=4 --no-python dp_pt train input.json

# Not setting nproc_per_node uses only 1 GPU.
torchrun --no-python dp_pt train input.json
```
If you wish to run code under active development without installing it via pip, try:

```bash
PYTHONPATH=~/deepmd-pytorch torchrun ~/deepmd-pytorch/deepmd_pt/entrypoints/main.py train input.json
```
To train a model on a cluster, one can manually launch the task using the commands below (usually this should be done by your job management system). Set `nnodes` to the number of available nodes, `node_rank` to the rank of the current node among all nodes (not the rank of processes!), and `nproc_per_node` to the number of available GPUs in one node. Please make sure that every node can reach the rendezvous address and port (`rdzv_endpoint` in the command) and has the same number of GPUs.
```bash
# Running DDP on 2 nodes with 4 GPUs each
# On node 0:
torchrun --rdzv_endpoint=node0:12321 --nnodes=2 --nproc_per_node=4 --node_rank=0 --no-python dp train tests/water/se_e2_a.json
# On node 1:
torchrun --rdzv_endpoint=node0:12321 --nnodes=2 --nproc_per_node=4 --node_rank=1 --no-python dp train tests/water/se_e2_a.json
```
Note: set environment variables to tune CPU-specific optimizations in advance.

Note for developers: torchrun by default passes its settings as environment variables (see the torchrun documentation for the list). To check forward, backward, and communication time, set the environment variables `TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL`.
Run on slurm system
Use the .sbatch file in slurm/; you may need to modify some of its configuration to run on your system.
```bash
sbatch distributed_data_parallel_slurm_setup.sbatch
```
These files are modified from: https://github.com/lkskstlr/distributed_data_parallel_slurm_setup
Track runs using W&B
wandb is installed automatically as a dependency of deepmd-pytorch.
First log in with `wandb login`, then set the corresponding fields under the "training" section of your input file (typically input.json) as follows:
```jsonc
// "training": {
    "wandb_config": {
        "job_name": "Cu-dpa_adam_bz1_at2",
        "wandb_enabled": true,
        "entity": "dp_model_engineering", // a username or team name
        "project": "DPA-2"
    },
```
To disable logging temporarily, set env var WANDB_MODE=disabled.
Known Problems & TODO
Owner
- Name: DP Technology
- Login: dptech-corp
- Kind: organization
- Location: China
- Website: https://www.dp.tech/en
- Repositories: 9
- Profile: https://github.com/dptech-corp