pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.7%) to scientific vocabulary
Keywords
Repository
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption
Basic Info
Statistics
- Stars: 103
- Watchers: 3
- Forks: 11
- Open Issues: 3
- Releases: 12
Topics
Metadata Files
README.md
⏱ pytorch-benchmark
Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption
*Actual coverage is higher as GPU-related code is skipped by Codecov
Install
```bash
pip install pytorch-benchmark
```
Usage
```python
import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark

model = efficientnet_b0().to("cpu")  # Model device sets benchmarking device
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=100)
```
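The returned `results` object is a nested mapping of metrics. A minimal sketch for inspecting it, assuming it is a plain dict (pyyaml is already a dependency of pytorch-benchmark); this reproduces the YAML layout used in the sample results below:

```python
import yaml  # pyyaml is already a dependency of pytorch-benchmark

# `results` is a nested mapping of metrics; assuming it is a plain dict
# (some versions may use OrderedDict, which would need converting first),
# dumping it as YAML gives the layout shown in the sample results below.
print(yaml.dump(results, default_flow_style=False))
```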
Sample results 💻
MacBook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7
```yaml
device: cpu
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 6
      total: 12
    frequency: 2.60 GHz
    model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  gpus: null
  memory:
    available: 5.86 GB
    total: 16.00 GB
    used: 7.29 GB
  system:
    node: d40049
    release: 21.2.0
    system: Darwin
params: 5288548
timing:
  batch_size_1:
    on_device_inference:
      human_readable:
        batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]
        batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]
      metrics:
        batches_per_second_max: 15.478907181264278
        batches_per_second_mean: 13.528026359855625
        batches_per_second_min: 10.343281300091244
        batches_per_second_std: 1.0922382209314958
        seconds_per_batch_max: 0.09668111801147461
        seconds_per_batch_mean: 0.07443853378295899
        seconds_per_batch_min: 0.06460404396057129
        seconds_per_batch_std: 0.006458734193132054
  batch_size_8:
    on_device_inference:
      human_readable:
        batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]
        batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]
      metrics:
        batches_per_second_max: 2.4673319862230025
        batches_per_second_mean: 1.9696935126370148
        batches_per_second_min: 1.6083039834656554
        batches_per_second_std: 0.11341204895590185
        seconds_per_batch_max: 0.6217730045318604
        seconds_per_batch_mean: 0.509410228729248
        seconds_per_batch_min: 0.40529608726501465
        seconds_per_batch_std: 0.030031445467788704
```
Server with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU
```yaml
device: cuda
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 16
      total: 32
    frequency: 3.00 GHz
    model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  gpus:
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  memory:
    available: 119.98 GB
    total: 125.78 GB
    used: 4.78 GB
  system:
    node: monster
    release: 4.15.0-167-generic
    system: Linux
max_inference_memory: 736250368
params: 5288548
post_inference_memory: 21402112
pre_inference_memory: 21402112
timing:
  batch_size_1:
    cpu_to_gpu:
      human_readable:
        batch_latency: 144.815 µs +/- 16.103 µs [136.614 µs, 272.751 µs]
        batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]
      metrics:
        batches_per_second_max: 7319.902268760908
        batches_per_second_mean: 6962.865857677197
        batches_per_second_min: 3666.3496503496503
        batches_per_second_std: 535.0581873859935
        seconds_per_batch_max: 0.0002727508544921875
        seconds_per_batch_mean: 0.00014481544494628906
        seconds_per_batch_min: 0.0001366138458251953
        seconds_per_batch_std: 1.6102982159292097e-05
    gpu_to_cpu:
      human_readable:
        batch_latency: 106.168 µs +/- 17.829 µs [53.167 µs, 248.909 µs]
        batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]
      metrics:
        batches_per_second_max: 18808.538116591928
        batches_per_second_mean: 9639.942102368092
        batches_per_second_min: 4017.532567049808
        batches_per_second_std: 1595.7983033708472
        seconds_per_batch_max: 0.00024890899658203125
        seconds_per_batch_mean: 0.00010616779327392578
        seconds_per_batch_min: 5.316734313964844e-05
        seconds_per_batch_std: 1.7829135190772566e-05
    on_device_inference:
      human_readable:
        batch_latency: 15.567 ms +/- 546.154 µs [15.311 ms, 19.261 ms]
        batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]
      metrics:
        batches_per_second_max: 65.31149174711928
        batches_per_second_mean: 64.30692850265713
        batches_per_second_min: 51.918698784442846
        batches_per_second_std: 1.9599322351815833
        seconds_per_batch_max: 0.019260883331298828
        seconds_per_batch_mean: 0.015567030906677246
        seconds_per_batch_min: 0.015311241149902344
        seconds_per_batch_std: 0.0005461537255227954
    total:
      human_readable:
        batch_latency: 15.818 ms +/- 549.873 µs [15.561 ms, 19.461 ms]
        batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]
      metrics:
        batches_per_second_max: 64.26476266356143
        batches_per_second_mean: 63.28565696640637
        batches_per_second_min: 51.38378232692614
        batches_per_second_std: 1.9198343850767468
        seconds_per_batch_max: 0.019461393356323242
        seconds_per_batch_mean: 0.01581801414489746
        seconds_per_batch_min: 0.015560626983642578
        seconds_per_batch_std: 0.0005498731526138171
  batch_size_8:
    cpu_to_gpu:
      human_readable:
        batch_latency: 805.674 µs +/- 157.254 µs [773.191 µs, 2.303 ms]
        batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]
      metrics:
        batches_per_second_max: 1293.3407338883749
        batches_per_second_mean: 1259.5653105357776
        batches_per_second_min: 434.23791282741485
        batches_per_second_std: 97.51424036939879
        seconds_per_batch_max: 0.002302885055541992
        seconds_per_batch_mean: 0.000805673599243164
        seconds_per_batch_min: 0.0007731914520263672
        seconds_per_batch_std: 0.0001572538140613121
    gpu_to_cpu:
      human_readable:
        batch_latency: 104.215 µs +/- 12.658 µs [59.605 µs, 128.031 µs]
        batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]
      metrics:
        batches_per_second_max: 16777.216
        batches_per_second_mean: 9806.840626578907
        batches_per_second_min: 7810.621973929236
        batches_per_second_std: 1761.6008872740726
        seconds_per_batch_max: 0.00012803077697753906
        seconds_per_batch_mean: 0.00010421514511108399
        seconds_per_batch_min: 5.9604644775390625e-05
        seconds_per_batch_std: 1.2658293070174213e-05
    on_device_inference:
      human_readable:
        batch_latency: 16.623 ms +/- 759.017 µs [16.301 ms, 22.584 ms]
        batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]
      metrics:
        batches_per_second_max: 61.346243290283894
        batches_per_second_mean: 60.25881046175457
        batches_per_second_min: 44.27827629162004
        batches_per_second_std: 2.2193085956672296
        seconds_per_batch_max: 0.02258443832397461
        seconds_per_batch_mean: 0.01662288188934326
        seconds_per_batch_min: 0.01630091667175293
        seconds_per_batch_std: 0.0007590167680596548
    total:
      human_readable:
        batch_latency: 17.533 ms +/- 836.015 µs [17.193 ms, 23.896 ms]
        batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]
      metrics:
        batches_per_second_max: 58.16374528511205
        batches_per_second_mean: 57.140338855126565
        batches_per_second_min: 41.84762740950632
        batches_per_second_std: 2.1985066663972677
        seconds_per_batch_max: 0.023896217346191406
        seconds_per_batch_mean: 0.01753277063369751
        seconds_per_batch_min: 0.017192840576171875
        seconds_per_batch_std: 0.0008360147274630088
```
Your turn
How we benchmark
The overall flow can be summarized with the diagram shown below (best viewed on GitHub):
```mermaid
flowchart TB;
A([Start]) --> B
B(prepare_samples)
B --> C[get_machine_info]
C --> D[measure_params]
D --> E[warm_up, batch_size=1]
E --> F[measure_flops]
subgraph SG[Repeat for batch_size 1 and x]
direction TB
G[measure_allocated_memory]
G --> H[warm_up, given batch_size]
H --> I[measure_detailed_inference_timing]
I --> J[measure_repeated_inference_timing]
J --> K[measure_energy]
end
F --> SG
SG --> END([End])
```
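To make the timing step concrete, here is a minimal sketch of the warm-up and repeated-timing idea the diagram describes. The function name and structure are hypothetical, not the library's internals; it only illustrates why a warm-up precedes timing (caches, lazy CUDA initialization) and why CUDA work must be synchronized before reading the clock:

```python
import time
import torch

def repeated_inference_timing(model, sample, num_runs=100, warm_up=10):
    """Hypothetical sketch of a warm-up + repeated-timing loop.

    Illustrative only; pytorch-benchmark's actual implementation may differ.
    """
    device = next(model.parameters()).device
    sample = sample.to(device)
    with torch.no_grad():
        for _ in range(warm_up):          # untimed warm-up runs
            model(sample)
        timings = []
        for _ in range(num_runs):
            if device.type == "cuda":
                torch.cuda.synchronize()  # flush queued kernels before timing
            start = time.perf_counter()
            model(sample)
            if device.type == "cuda":
                torch.cuda.synchronize()  # wait for the forward pass to finish
            timings.append(time.perf_counter() - start)
    return timings
```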
Usually, the sample and model don't initially reside on the same device (e.g., the model sits on a GPU while the sample is on the CPU after being loaded from disk or collected as live data). Accordingly, we measure timing in three parts, cpu_to_gpu, on_device_inference, and gpu_to_cpu, as well as their sum, total. Note that the device holding the model's parameters determines the execution device. The inference flow is shown below:
```mermaid
flowchart LR;
A([sample])
A --> B[cpu -> gpu]
B --> C[model __call__]
C --> D[gpu -> cpu]
D --> E([result])
```
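A hedged sketch of this three-part split, reusing `model` and `sample` from the Usage section (illustrative only, not the library's internals; the `timed` helper is hypothetical):

```python
import time
import torch

def timed(fn, device):
    """Time one call, synchronizing around it on CUDA so queued kernels
    don't leak into or out of the measured interval."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

device = torch.device("cuda")
model = model.to(device)  # `model` and `sample` from the Usage section

on_gpu, t_cpu_to_gpu = timed(lambda: sample.to(device), device)
out, t_on_device = timed(lambda: model(on_gpu), device)
back, t_gpu_to_cpu = timed(lambda: out.to("cpu"), device)
t_total = t_cpu_to_gpu + t_on_device + t_gpu_to_cpu  # the `total` figure
```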
Advanced use
Trying to benchmark a custom class that is not a torch.nn.Module?
You can pass custom functions to benchmark as seen in this example.
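The linked example is not reproduced here. As a hedged alternative sketch (not the library's documented custom-function API), any tensor-in/tensor-out callable can be wrapped in a torch.nn.Module so the standard benchmarking path applies:

```python
import torch

class CallableWrapper(torch.nn.Module):
    """Hypothetical adapter: wraps an arbitrary callable so it can be
    benchmarked like a regular torch.nn.Module. Illustrative only; see
    the linked example for the library's custom-function mechanism."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)

wrapped = CallableWrapper(lambda x: x * 2)  # any tensor-in/tensor-out callable
```

Note that FLOPs and parameter counts will not be meaningful for such a wrapper, since it owns no parameters (see the limitations below).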
Limitations
- Allocated memory measurements are only available on CUDA devices (a measurement sketch follows this list).
- Energy consumption can currently only be measured on NVIDIA Jetson platforms.
- FLOPs and parameter counts are not supported for custom classes.
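The pre_inference_memory, post_inference_memory, and max_inference_memory fields in the sample results correspond to CUDA allocator statistics. A minimal sketch of how such numbers can be read, using PyTorch's standard CUDA memory APIs (illustrative; not necessarily how pytorch-benchmark measures them):

```python
import torch

# Hedged sketch: reading CUDA allocator statistics around an inference call.
torch.cuda.reset_peak_memory_stats()
pre = torch.cuda.memory_allocated()        # bytes allocated before inference
with torch.no_grad():
    model(sample.to("cuda"))
post = torch.cuda.memory_allocated()       # bytes allocated after inference
peak = torch.cuda.max_memory_allocated()   # peak bytes during inference
```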
Acknowledgement
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). It was developed for benchmarking tools in OpenDR, a non-proprietary toolkit for deep learning based functionalities for robotics and vision.
Citation
If you like the tool and use it in research, please consider citing it:
```bibtex
@software{hedegaard2022pytorchbenchmark,
  author = {Hedegaard, Lukas},
  doi = {10.5281/zenodo.7223585},
  month = {10},
  title = {{PyTorch-Benchmark}},
  version = {0.3.5},
  year = {2022}
}
```
Owner
- Name: Lukas Hedegaard
- Login: LukasHedegaard
- Kind: user
- Location: Aarhus, Denmark
- Company: Aarhus University
- Repositories: 42
- Profile: https://github.com/LukasHedegaard
Deep Learning Researcher
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Hedegaard
    given-names: Lukas
    orcid: https://orcid.org/0000-0002-2841-864X
title: "PyTorch-Benchmark"
version: 0.3.5
doi: 10.5281/zenodo.7223585
date-released: 2022-10-19
```
GitHub Events
Total
- Watch event: 22
- Fork event: 2
Last Year
- Watch event: 22
- Fork event: 2
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 51
- Total Committers: 2
- Avg Commits per committer: 25.5
- Development Distribution Score (DDS): 0.176
Top Committers
| Name | Email | Commits |
|---|---|---|
| LukasHedegaard | lh@e****k | 42 |
| Lukas Hedegaard | l****d@g****m | 9 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 7
- Total pull requests: 12
- Average time to close issues: 2 months
- Average time to close pull requests: about 1 hour
- Total issue authors: 6
- Total pull request authors: 1
- Average comments per issue: 1.43
- Average comments per pull request: 0.0
- Merged pull requests: 12
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- joepareti54 (2)
- mikasenghaas (1)
- jizongFox (1)
- rohitdavas (1)
- Bleach665 (1)
- alfonsocv12 (1)
Pull Request Authors
- LukasHedegaard (12)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 709 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 3
- Total versions: 12
- Total maintainers: 1
pypi.org: pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption in one go.
- Homepage: https://github.com/LukasHedegaard/pytorch-benchmark
- Documentation: https://pytorch-benchmark.readthedocs.io/
- License: Apache Software License
- Latest release: 0.3.6 (published over 2 years ago)
Rankings
Maintainers (1)
Dependencies
- gputil >=1.4
- numpy *
- psutil >=5.9
- ptflops *
- py-cpuinfo >=7.0
- pyyaml >=6.0
- torch >=1.6
- tqdm *
- setuptools *
- twine *
- wheel *
- black * development
- flake8 * development
- flake8-black * development
- isort >=5.7 development
- numpy * development
- ptflops >=0.6 development
- pytest * development
- pytest-cov * development
- torchvision * development
- docutils >=0.16
- m2r2 >=0.2
- nbsphinx >=0.8
- pandoc >=1.0
- ride-sphinx-theme *
- sphinx >=3.0
- sphinx-autoapi >=1.7
- sphinx-autodoc-typehints >=1.0
- sphinx-copybutton >=0.3
- sphinx-paramlinks >=0.4.0
- sphinx-togglebutton >=0.2
- sphinxcontrib-fulltoc >=1.0
- sphinxcontrib-mockautodoc *
- actions/checkout master composite
- actions/setup-python master composite
- actions/checkout master composite
- actions/setup-python v1 composite
- pypa/gh-action-pypi-publish v1.1.0 composite
- actions/checkout master composite
- actions/setup-python v1 composite
- pypa/gh-action-pypi-publish v1.1.0 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite