pytorch-benchmark

Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption

https://github.com/lukashedegaard/pytorch-benchmark

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

benchmark deep-learning flops gpu jetson python pytorch timing-analysis
Last synced: 6 months ago

Repository

Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption

Basic Info
  • Host: GitHub
  • Owner: LukasHedegaard
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 85.9 KB
Statistics
  • Stars: 103
  • Watchers: 3
  • Forks: 11
  • Open Issues: 3
  • Releases: 12
Topics
benchmark deep-learning flops gpu jetson python pytorch timing-analysis
Created about 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

⏱ pytorch-benchmark

Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption

Note: actual test coverage is higher than reported, as GPU-related code is skipped by Codecov.

Install

```bash
pip install pytorch-benchmark
```

Usage

```python
import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark

model = efficientnet_b0().to("cpu")  # Model device sets benchmarking device
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=100)
```
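Judging from the sample outputs below, `results` is a nested dict mirroring the YAML structure. A minimal sketch for inspecting it, assuming `pyyaml` (a declared dependency of the package) is available:

```python
import yaml  # pyyaml is among the package's declared dependencies

# Dump the nested results dict in the same YAML form as the samples below.
print(yaml.dump(results, sort_keys=False))
```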

Sample results 💻

Macbook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7

```
device: cpu
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 6
      total: 12
    frequency: 2.60 GHz
    model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  gpus: null
  memory:
    available: 5.86 GB
    total: 16.00 GB
    used: 7.29 GB
  system:
    node: d40049
    release: 21.2.0
    system: Darwin
params: 5288548
timing:
  batch_size_1:
    on_device_inference:
      human_readable:
        batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]
        batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]
      metrics:
        batches_per_second_max: 15.478907181264278
        batches_per_second_mean: 13.528026359855625
        batches_per_second_min: 10.343281300091244
        batches_per_second_std: 1.0922382209314958
        seconds_per_batch_max: 0.09668111801147461
        seconds_per_batch_mean: 0.07443853378295899
        seconds_per_batch_min: 0.06460404396057129
        seconds_per_batch_std: 0.006458734193132054
  batch_size_8:
    on_device_inference:
      human_readable:
        batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]
        batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]
      metrics:
        batches_per_second_max: 2.4673319862230025
        batches_per_second_mean: 1.9696935126370148
        batches_per_second_min: 1.6083039834656554
        batches_per_second_std: 0.11341204895590185
        seconds_per_batch_max: 0.6217730045318604
        seconds_per_batch_mean: 0.509410228729248
        seconds_per_batch_min: 0.40529608726501465
        seconds_per_batch_std: 0.030031445467788704
```
Server with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU

```
device: cuda
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 16
      total: 32
    frequency: 3.00 GHz
    model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  gpus:
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  memory:
    available: 119.98 GB
    total: 125.78 GB
    used: 4.78 GB
  system:
    node: monster
    release: 4.15.0-167-generic
    system: Linux
max_inference_memory: 736250368
params: 5288548
post_inference_memory: 21402112
pre_inference_memory: 21402112
timing:
  batch_size_1:
    cpu_to_gpu:
      human_readable:
        batch_latency: 144.815 µs +/- 16.103 µs [136.614 µs, 272.751 µs]
        batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]
      metrics:
        batches_per_second_max: 7319.902268760908
        batches_per_second_mean: 6962.865857677197
        batches_per_second_min: 3666.3496503496503
        batches_per_second_std: 535.0581873859935
        seconds_per_batch_max: 0.0002727508544921875
        seconds_per_batch_mean: 0.00014481544494628906
        seconds_per_batch_min: 0.0001366138458251953
        seconds_per_batch_std: 1.6102982159292097e-05
    gpu_to_cpu:
      human_readable:
        batch_latency: 106.168 µs +/- 17.829 µs [53.167 µs, 248.909 µs]
        batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]
      metrics:
        batches_per_second_max: 18808.538116591928
        batches_per_second_mean: 9639.942102368092
        batches_per_second_min: 4017.532567049808
        batches_per_second_std: 1595.7983033708472
        seconds_per_batch_max: 0.00024890899658203125
        seconds_per_batch_mean: 0.00010616779327392578
        seconds_per_batch_min: 5.316734313964844e-05
        seconds_per_batch_std: 1.7829135190772566e-05
    on_device_inference:
      human_readable:
        batch_latency: 15.567 ms +/- 546.154 µs [15.311 ms, 19.261 ms]
        batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]
      metrics:
        batches_per_second_max: 65.31149174711928
        batches_per_second_mean: 64.30692850265713
        batches_per_second_min: 51.918698784442846
        batches_per_second_std: 1.9599322351815833
        seconds_per_batch_max: 0.019260883331298828
        seconds_per_batch_mean: 0.015567030906677246
        seconds_per_batch_min: 0.015311241149902344
        seconds_per_batch_std: 0.0005461537255227954
    total:
      human_readable:
        batch_latency: 15.818 ms +/- 549.873 µs [15.561 ms, 19.461 ms]
        batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]
      metrics:
        batches_per_second_max: 64.26476266356143
        batches_per_second_mean: 63.28565696640637
        batches_per_second_min: 51.38378232692614
        batches_per_second_std: 1.9198343850767468
        seconds_per_batch_max: 0.019461393356323242
        seconds_per_batch_mean: 0.01581801414489746
        seconds_per_batch_min: 0.015560626983642578
        seconds_per_batch_std: 0.0005498731526138171
  batch_size_8:
    cpu_to_gpu:
      human_readable:
        batch_latency: 805.674 µs +/- 157.254 µs [773.191 µs, 2.303 ms]
        batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]
      metrics:
        batches_per_second_max: 1293.3407338883749
        batches_per_second_mean: 1259.5653105357776
        batches_per_second_min: 434.23791282741485
        batches_per_second_std: 97.51424036939879
        seconds_per_batch_max: 0.002302885055541992
        seconds_per_batch_mean: 0.000805673599243164
        seconds_per_batch_min: 0.0007731914520263672
        seconds_per_batch_std: 0.0001572538140613121
    gpu_to_cpu:
      human_readable:
        batch_latency: 104.215 µs +/- 12.658 µs [59.605 µs, 128.031 µs]
        batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]
      metrics:
        batches_per_second_max: 16777.216
        batches_per_second_mean: 9806.840626578907
        batches_per_second_min: 7810.621973929236
        batches_per_second_std: 1761.6008872740726
        seconds_per_batch_max: 0.00012803077697753906
        seconds_per_batch_mean: 0.00010421514511108399
        seconds_per_batch_min: 5.9604644775390625e-05
        seconds_per_batch_std: 1.2658293070174213e-05
    on_device_inference:
      human_readable:
        batch_latency: 16.623 ms +/- 759.017 µs [16.301 ms, 22.584 ms]
        batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]
      metrics:
        batches_per_second_max: 61.346243290283894
        batches_per_second_mean: 60.25881046175457
        batches_per_second_min: 44.27827629162004
        batches_per_second_std: 2.2193085956672296
        seconds_per_batch_max: 0.02258443832397461
        seconds_per_batch_mean: 0.01662288188934326
        seconds_per_batch_min: 0.01630091667175293
        seconds_per_batch_std: 0.0007590167680596548
    total:
      human_readable:
        batch_latency: 17.533 ms +/- 836.015 µs [17.193 ms, 23.896 ms]
        batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]
      metrics:
        batches_per_second_max: 58.16374528511205
        batches_per_second_mean: 57.140338855126565
        batches_per_second_min: 41.84762740950632
        batches_per_second_std: 2.1985066663972677
        seconds_per_batch_max: 0.023896217346191406
        seconds_per_batch_mean: 0.01753277063369751
        seconds_per_batch_min: 0.017192840576171875
        seconds_per_batch_std: 0.0008360147274630088
```

... Your turn

How we benchmark

The overall flow can be summarized with the diagram shown below (best viewed on GitHub):

```mermaid
flowchart TB;
    A([Start]) --> B
    B(prepare_samples)
    B --> C[get_machine_info]
    C --> D[measure_params]
    D --> E[warm_up, batch_size=1]
    E --> F[measure_flops]

    subgraph SG[Repeat for batch_size 1 and x]
        direction TB
        G[measure_allocated_memory]
        G --> H[warm_up, given batch_size]
        H --> I[measure_detailed_inference_timing]
        I --> J[measure_repeated_inference_timing]
        J --> K[measure_energy]
    end

    F --> SG
    SG --> END([End])
```
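The `measure_*` and `warm_up` steps above are library internals. As a rough, standalone illustration of the warm-up-then-measure pattern (a sketch of the general technique, not the library's actual implementation):

```python
import time
import torch

@torch.no_grad()
def time_inference(model, sample, num_runs=100, warm_up_runs=10):
    """Illustrative warm-up + repeated-timing loop: run a few untimed
    batches first so lazy initialization and caching don't skew stats."""
    for _ in range(warm_up_runs):
        model(sample)
    latencies = []
    for _ in range(num_runs):
        if sample.is_cuda:
            torch.cuda.synchronize()  # flush queued kernels before timing
        start = time.perf_counter()
        model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()  # include async GPU work in the timing
        latencies.append(time.perf_counter() - start)
    return latencies
```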

Usually, the sample and model don't initially reside on the same device (e.g., the GPU holds the model, while the sample sits on the CPU after being loaded from disk or collected as live data). Accordingly, timing is measured in three parts, cpu_to_gpu, on_device_inference, and gpu_to_cpu, along with their sum, total. Note that the model's device determines the execution device. The inference flow is shown below:

```mermaid
flowchart LR;
    A([sample])
    A --> B[cpu -> gpu]
    B --> C[model __call__]
    C --> D[gpu -> cpu]
    D --> E([result])
```
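For intuition, here is a hedged sketch of timing the three legs separately. It assumes `model` already resides on a CUDA device and `sample` on the CPU, and synchronizes because CUDA execution is asynchronous (an illustration, not the library's code):

```python
import time
import torch

def timed(fn, *args):
    """Time one call, synchronizing so queued CUDA work is included."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

gpu_sample, t_cpu_to_gpu = timed(sample.to, "cuda")   # cpu -> gpu
output, t_on_device = timed(model, gpu_sample)        # model __call__
result, t_gpu_to_cpu = timed(output.to, "cpu")        # gpu -> cpu
t_total = t_cpu_to_gpu + t_on_device + t_gpu_to_cpu   # total
```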

Advanced use

Trying to benchmark a custom class that is not a torch.nn.Module? You can pass custom functions to benchmark, as seen in this example.
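The linked example is not reproduced here; one generic workaround (our assumption, not the library's documented API) is to wrap the custom callable in a `torch.nn.Module` so it can pass through the standard path. Note that the FLOPs/parameter limitation below still applies to the wrapped internals:

```python
import torch
from pytorch_benchmark import benchmark

class CallableWrapper(torch.nn.Module):
    """Hypothetical adapter exposing an arbitrary callable as an nn.Module."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)

wrapped = CallableWrapper(lambda x: x * 2)  # stand-in for a custom pipeline
results = benchmark(wrapped, torch.randn(8, 3, 224, 224), num_runs=10)
```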

Limitations

  • Allocated memory measurements are only available on CUDA devices (see the sketch below).
  • Energy consumption can currently only be measured on NVIDIA Jetson platforms.
  • FLOPs and parameter counts are not supported for custom classes.
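As context for the first limitation above, here is a minimal sketch of reading peak memory from PyTorch's CUDA allocator counters; the CPU allocator exposes no equivalent, hence the CUDA-only restriction. `model` and `sample` are assumed to be defined as in the usage example:

```python
import torch

# PyTorch exposes allocator statistics only for CUDA devices.
torch.cuda.reset_peak_memory_stats()
pre = torch.cuda.memory_allocated()       # cf. pre_inference_memory above

with torch.no_grad():
    model(sample.to("cuda"))              # assumes model is on a CUDA device

peak = torch.cuda.max_memory_allocated()  # cf. max_inference_memory above
post = torch.cuda.memory_allocated()      # cf. post_inference_memory above
```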

Acknowledgement

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). It was developed for benchmarking tools in OpenDR, a non-proprietary toolkit for deep learning based functionalities for robotics and vision.

Citation

If you like the tool and use it in research, please consider citing it:

```bibtex
@software{hedegaard2022pytorchbenchmark,
  author = {Hedegaard, Lukas},
  doi = {10.5281/zenodo.7223585},
  month = {10},
  title = {{PyTorch-Benchmark}},
  version = {0.3.5},
  year = {2022}
}
```

Owner

  • Name: Lukas Hedegaard
  • Login: LukasHedegaard
  • Kind: user
  • Location: Aarhus, Denmark
  • Company: Aarhus University

Deep Learning Researcher

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Hedegaard
    given-names: Lukas
    orcid: https://orcid.org/0000-0002-2841-864X
title: "PyTorch-Benchmark"
version: 0.3.5
doi: 10.5281/zenodo.7223585
date-released: 2022-10-19

GitHub Events

Total
  • Watch event: 22
  • Fork event: 2
Last Year
  • Watch event: 22
  • Fork event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 51
  • Total Committers: 2
  • Avg Commits per committer: 25.5
  • Development Distribution Score (DDS): 0.176
Top Committers
Name Email Commits
LukasHedegaard lh@e****k 42
Lukas Hedegaard l****d@g****m 9

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 7
  • Total pull requests: 12
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 1.43
  • Average comments per pull request: 0.0
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • joepareti54 (2)
  • mikasenghaas (1)
  • jizongFox (1)
  • rohitdavas (1)
  • Bleach665 (1)
  • alfonsocv12 (1)
Pull Request Authors
  • LukasHedegaard (12)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 709 last month
  • Total dependent packages: 0
  • Total dependent repositories: 3
  • Total versions: 12
  • Total maintainers: 1
pypi.org: pytorch-benchmark

Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption in one go.

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 709 last month
Rankings
Downloads: 8.9%
Dependent repos count: 9.0%
Stargazers count: 9.1%
Dependent packages count: 10.1%
Average: 10.1%
Forks count: 13.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • gputil >=1.4
  • numpy *
  • psutil >=5.9
  • ptflops *
  • py-cpuinfo >=7.0
  • pyyaml >=6.0
  • torch >=1.6
  • tqdm *
requirements/build.txt pypi
  • setuptools *
  • twine *
  • wheel *
requirements/dev.txt pypi
  • black * development
  • flake8 * development
  • flake8-black * development
  • isort >=5.7 development
  • numpy * development
  • ptflops >=0.6 development
  • pytest * development
  • pytest-cov * development
  • torchvision * development
requirements/docs.txt pypi
  • docutils >=0.16
  • m2r2 >=0.2
  • nbsphinx >=0.8
  • pandoc >=1.0
  • ride-sphinx-theme *
  • sphinx >=3.0
  • sphinx-autoapi >=1.7
  • sphinx-autodoc-typehints >=1.0
  • sphinx-copybutton >=0.3
  • sphinx-paramlinks >=0.4.0
  • sphinx-togglebutton >=0.2
  • sphinxcontrib-fulltoc >=1.0
  • sphinxcontrib-mockautodoc *
.github/workflows/codecov.yml actions
  • actions/checkout master composite
  • actions/setup-python master composite
.github/workflows/publish.yml actions
  • actions/checkout master composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish v1.1.0 composite
.github/workflows/publishtest.yml actions
  • actions/checkout master composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish v1.1.0 composite
.github/workflows/pythonpackage.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
setup.py pypi