pytorch-benchmark

Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption

https://github.com/lukashedegaard/pytorch-benchmark

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

benchmark deep-learning flops gpu jetson python pytorch timing-analysis
Last synced: 6 months ago

Repository

Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption

Basic Info
  • Host: GitHub
  • Owner: LukasHedegaard
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 85.9 KB
Statistics
  • Stars: 103
  • Watchers: 3
  • Forks: 11
  • Open Issues: 3
  • Releases: 12
Topics
benchmark deep-learning flops gpu jetson python pytorch timing-analysis
Created about 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog License Citation

README.md

⏱ pytorch-benchmark

Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption

Note: actual test coverage is higher than reported, as GPU-related code is skipped by Codecov.

Install

```bash
pip install pytorch-benchmark
```

Usage

```python
import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark

model = efficientnet_b0().to("cpu")  # Model device sets benchmarking device
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=100)
```
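Judging from the sample outputs below, `results` is a nested dict mirroring the YAML structure. A minimal sketch for inspecting it, assuming `pyyaml` (a declared dependency of the package) is available:

```python
import yaml  # pyyaml is among the package's declared dependencies

# Dump the nested results dict in the same YAML form as the samples below.
print(yaml.dump(results, sort_keys=False))
```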

Sample results 💻

Macbook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7

```
device: cpu
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 6
      total: 12
    frequency: 2.60 GHz
    model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  gpus: null
  memory:
    available: 5.86 GB
    total: 16.00 GB
    used: 7.29 GB
  system:
    node: d40049
    release: 21.2.0
    system: Darwin
params: 5288548
timing:
  batch_size_1:
    on_device_inference:
      human_readable:
        batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]
        batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]
      metrics:
        batches_per_second_max: 15.478907181264278
        batches_per_second_mean: 13.528026359855625
        batches_per_second_min: 10.343281300091244
        batches_per_second_std: 1.0922382209314958
        seconds_per_batch_max: 0.09668111801147461
        seconds_per_batch_mean: 0.07443853378295899
        seconds_per_batch_min: 0.06460404396057129
        seconds_per_batch_std: 0.006458734193132054
  batch_size_8:
    on_device_inference:
      human_readable:
        batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]
        batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]
      metrics:
        batches_per_second_max: 2.4673319862230025
        batches_per_second_mean: 1.9696935126370148
        batches_per_second_min: 1.6083039834656554
        batches_per_second_std: 0.11341204895590185
        seconds_per_batch_max: 0.6217730045318604
        seconds_per_batch_mean: 0.509410228729248
        seconds_per_batch_min: 0.40529608726501465
        seconds_per_batch_std: 0.030031445467788704
```
Server with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU

```
device: cuda
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 16
      total: 32
    frequency: 3.00 GHz
    model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  gpus:
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  memory:
    available: 119.98 GB
    total: 125.78 GB
    used: 4.78 GB
  system:
    node: monster
    release: 4.15.0-167-generic
    system: Linux
max_inference_memory: 736250368
params: 5288548
post_inference_memory: 21402112
pre_inference_memory: 21402112
timing:
  batch_size_1:
    cpu_to_gpu:
      human_readable:
        batch_latency: 144.815 µs +/- 16.103 µs [136.614 µs, 272.751 µs]
        batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]
      metrics:
        batches_per_second_max: 7319.902268760908
        batches_per_second_mean: 6962.865857677197
        batches_per_second_min: 3666.3496503496503
        batches_per_second_std: 535.0581873859935
        seconds_per_batch_max: 0.0002727508544921875
        seconds_per_batch_mean: 0.00014481544494628906
        seconds_per_batch_min: 0.0001366138458251953
        seconds_per_batch_std: 1.6102982159292097e-05
    gpu_to_cpu:
      human_readable:
        batch_latency: 106.168 µs +/- 17.829 µs [53.167 µs, 248.909 µs]
        batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]
      metrics:
        batches_per_second_max: 18808.538116591928
        batches_per_second_mean: 9639.942102368092
        batches_per_second_min: 4017.532567049808
        batches_per_second_std: 1595.7983033708472
        seconds_per_batch_max: 0.00024890899658203125
        seconds_per_batch_mean: 0.00010616779327392578
        seconds_per_batch_min: 5.316734313964844e-05
        seconds_per_batch_std: 1.7829135190772566e-05
    on_device_inference:
      human_readable:
        batch_latency: 15.567 ms +/- 546.154 µs [15.311 ms, 19.261 ms]
        batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]
      metrics:
        batches_per_second_max: 65.31149174711928
        batches_per_second_mean: 64.30692850265713
        batches_per_second_min: 51.918698784442846
        batches_per_second_std: 1.9599322351815833
        seconds_per_batch_max: 0.019260883331298828
        seconds_per_batch_mean: 0.015567030906677246
        seconds_per_batch_min: 0.015311241149902344
        seconds_per_batch_std: 0.0005461537255227954
    total:
      human_readable:
        batch_latency: 15.818 ms +/- 549.873 µs [15.561 ms, 19.461 ms]
        batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]
      metrics:
        batches_per_second_max: 64.26476266356143
        batches_per_second_mean: 63.28565696640637
        batches_per_second_min: 51.38378232692614
        batches_per_second_std: 1.9198343850767468
        seconds_per_batch_max: 0.019461393356323242
        seconds_per_batch_mean: 0.01581801414489746
        seconds_per_batch_min: 0.015560626983642578
        seconds_per_batch_std: 0.0005498731526138171
  batch_size_8:
    cpu_to_gpu:
      human_readable:
        batch_latency: 805.674 µs +/- 157.254 µs [773.191 µs, 2.303 ms]
        batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]
      metrics:
        batches_per_second_max: 1293.3407338883749
        batches_per_second_mean: 1259.5653105357776
        batches_per_second_min: 434.23791282741485
        batches_per_second_std: 97.51424036939879
        seconds_per_batch_max: 0.002302885055541992
        seconds_per_batch_mean: 0.000805673599243164
        seconds_per_batch_min: 0.0007731914520263672
        seconds_per_batch_std: 0.0001572538140613121
    gpu_to_cpu:
      human_readable:
        batch_latency: 104.215 µs +/- 12.658 µs [59.605 µs, 128.031 µs]
        batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]
      metrics:
        batches_per_second_max: 16777.216
        batches_per_second_mean: 9806.840626578907
        batches_per_second_min: 7810.621973929236
        batches_per_second_std: 1761.6008872740726
        seconds_per_batch_max: 0.00012803077697753906
        seconds_per_batch_mean: 0.00010421514511108399
        seconds_per_batch_min: 5.9604644775390625e-05
        seconds_per_batch_std: 1.2658293070174213e-05
    on_device_inference:
      human_readable:
        batch_latency: 16.623 ms +/- 759.017 µs [16.301 ms, 22.584 ms]
        batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]
      metrics:
        batches_per_second_max: 61.346243290283894
        batches_per_second_mean: 60.25881046175457
        batches_per_second_min: 44.27827629162004
        batches_per_second_std: 2.2193085956672296
        seconds_per_batch_max: 0.02258443832397461
        seconds_per_batch_mean: 0.01662288188934326
        seconds_per_batch_min: 0.01630091667175293
        seconds_per_batch_std: 0.0007590167680596548
    total:
      human_readable:
        batch_latency: 17.533 ms +/- 836.015 µs [17.193 ms, 23.896 ms]
        batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]
      metrics:
        batches_per_second_max: 58.16374528511205
        batches_per_second_mean: 57.140338855126565
        batches_per_second_min: 41.84762740950632
        batches_per_second_std: 2.1985066663972677
        seconds_per_batch_max: 0.023896217346191406
        seconds_per_batch_mean: 0.01753277063369751
        seconds_per_batch_min: 0.017192840576171875
        seconds_per_batch_std: 0.0008360147274630088
```

... Your turn

How we benchmark

The overall flow can be summarized with the diagram shown below (best viewed on GitHub):

```mermaid
flowchart TB;
    A([Start]) --> B
    B(prepare_samples)
    B --> C[get_machine_info]
    C --> D[measure_params]
    D --> E[warm_up, batch_size=1]
    E --> F[measure_flops]

    subgraph SG[Repeat for batch_size 1 and x]
        direction TB
        G[measure_allocated_memory]
        G --> H[warm_up, given batch_size]
        H --> I[measure_detailed_inference_timing]
        I --> J[measure_repeated_inference_timing]
        J --> K[measure_energy]
    end

    F --> SG
    SG --> END([End])
```
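The `measure_*` and `warm_up` steps above are library internals. As a rough, standalone illustration of the warm-up-then-measure pattern (a sketch of the general technique, not the library's actual implementation):

```python
import time
import torch

@torch.no_grad()
def time_inference(model, sample, num_runs=100, warm_up_runs=10):
    """Illustrative warm-up + repeated-timing loop: run a few untimed
    batches first so lazy initialization and caching don't skew stats."""
    for _ in range(warm_up_runs):
        model(sample)
    latencies = []
    for _ in range(num_runs):
        if sample.is_cuda:
            torch.cuda.synchronize()  # flush queued kernels before timing
        start = time.perf_counter()
        model(sample)
        if sample.is_cuda:
            torch.cuda.synchronize()  # include async GPU work in the timing
        latencies.append(time.perf_counter() - start)
    return latencies
```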

Usually, the sample and model don't initially reside on the same device (e.g., the GPU holds the model, while the sample sits on the CPU after being loaded from disk or collected as live data). Accordingly, timing is measured in three parts, cpu_to_gpu, on_device_inference, and gpu_to_cpu, along with their sum, total. Note that the model's device determines the execution device. The inference flow is shown below:

```mermaid
flowchart LR;
    A([sample])
    A --> B[cpu -> gpu]
    B --> C[model __call__]
    C --> D[gpu -> cpu]
    D --> E([result])
```
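For intuition, here is a hedged sketch of timing the three legs separately. It assumes `model` already resides on a CUDA device and `sample` on the CPU, and synchronizes because CUDA execution is asynchronous (an illustration, not the library's code):

```python
import time
import torch

def timed(fn, *args):
    """Time one call, synchronizing so queued CUDA work is included."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

gpu_sample, t_cpu_to_gpu = timed(sample.to, "cuda")   # cpu -> gpu
output, t_on_device = timed(model, gpu_sample)        # model __call__
result, t_gpu_to_cpu = timed(output.to, "cpu")        # gpu -> cpu
t_total = t_cpu_to_gpu + t_on_device + t_gpu_to_cpu   # total
```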

Advanced use

Trying to benchmark a custom class that is not a torch.nn.Module? You can pass custom functions to benchmark, as seen in this example.
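The linked example is not reproduced here; one generic workaround (our assumption, not the library's documented API) is to wrap the custom callable in a `torch.nn.Module` so it can pass through the standard path. Note that the FLOPs/parameter limitation below still applies to the wrapped internals:

```python
import torch
from pytorch_benchmark import benchmark

class CallableWrapper(torch.nn.Module):
    """Hypothetical adapter exposing an arbitrary callable as an nn.Module."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)

wrapped = CallableWrapper(lambda x: x * 2)  # stand-in for a custom pipeline
results = benchmark(wrapped, torch.randn(8, 3, 224, 224), num_runs=10)
```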

Limitations

  • Allocated memory measurements are only available on CUDA devices (see the sketch below).
  • Energy consumption can currently only be measured on NVIDIA Jetson platforms.
  • FLOPs and parameter counts are not supported for custom classes.
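As context for the first limitation above, here is a minimal sketch of reading peak memory from PyTorch's CUDA allocator counters; the CPU allocator exposes no equivalent, hence the CUDA-only restriction. `model` and `sample` are assumed to be defined as in the usage example:

```python
import torch

# PyTorch exposes allocator statistics only for CUDA devices.
torch.cuda.reset_peak_memory_stats()
pre = torch.cuda.memory_allocated()       # cf. pre_inference_memory above

with torch.no_grad():
    model(sample.to("cuda"))              # assumes model is on a CUDA device

peak = torch.cuda.max_memory_allocated()  # cf. max_inference_memory above
post = torch.cuda.memory_allocated()      # cf. post_inference_memory above
```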

Acknowledgement

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR). It was developed for benchmarking tools in OpenDR, a non-proprietary toolkit for deep learning based functionalities for robotics and vision.

Citation

If you like the tool and use it in research, please consider citing it:

```bibtex
@software{hedegaard2022pytorchbenchmark,
  author = {Hedegaard, Lukas},
  doi = {10.5281/zenodo.7223585},
  month = {10},
  title = {{PyTorch-Benchmark}},
  version = {0.3.5},
  year = {2022}
}
```

Owner

  • Name: Lukas Hedegaard
  • Login: LukasHedegaard
  • Kind: user
  • Location: Aarhus, Denmark
  • Company: Aarhus University

Deep Learning Researcher

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Hedegaard
    given-names: Lukas
    orcid: https://orcid.org/0000-0002-2841-864X
title: "PyTorch-Benchmark"
version: 0.3.5
doi: 10.5281/zenodo.7223585
date-released: 2022-10-19

GitHub Events

Total
  • Watch event: 22
  • Fork event: 2
Last Year
  • Watch event: 22
  • Fork event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 51
  • Total Committers: 2
  • Avg Commits per committer: 25.5
  • Development Distribution Score (DDS): 0.176
Top Committers
Name Email Commits
LukasHedegaard lh@e****k 42
Lukas Hedegaard l****d@g****m 9

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 7
  • Total pull requests: 12
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 1.43
  • Average comments per pull request: 0.0
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • joepareti54 (2)
  • mikasenghaas (1)
  • jizongFox (1)
  • rohitdavas (1)
  • Bleach665 (1)
  • alfonsocv12 (1)
Pull Request Authors
  • LukasHedegaard (12)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 709 last month
  • Total dependent packages: 0
  • Total dependent repositories: 3
  • Total versions: 12
  • Total maintainers: 1
pypi.org: pytorch-benchmark

Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption in one go.

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 709 last month
Rankings
Downloads: 8.9%
Dependent repos count: 9.0%
Stargazers count: 9.1%
Dependent packages count: 10.1%
Average: 10.1%
Forks count: 13.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • gputil >=1.4
  • numpy *
  • psutil >=5.9
  • ptflops *
  • py-cpuinfo >=7.0
  • pyyaml >=6.0
  • torch >=1.6
  • tqdm *
requirements/build.txt pypi
  • setuptools *
  • twine *
  • wheel *
requirements/dev.txt pypi
  • black * development
  • flake8 * development
  • flake8-black * development
  • isort >=5.7 development
  • numpy * development
  • ptflops >=0.6 development
  • pytest * development
  • pytest-cov * development
  • torchvision * development
requirements/docs.txt pypi
  • docutils >=0.16
  • m2r2 >=0.2
  • nbsphinx >=0.8
  • pandoc >=1.0
  • ride-sphinx-theme *
  • sphinx >=3.0
  • sphinx-autoapi >=1.7
  • sphinx-autodoc-typehints >=1.0
  • sphinx-copybutton >=0.3
  • sphinx-paramlinks >=0.4.0
  • sphinx-togglebutton >=0.2
  • sphinxcontrib-fulltoc >=1.0
  • sphinxcontrib-mockautodoc *
.github/workflows/codecov.yml actions
  • actions/checkout master composite
  • actions/setup-python master composite
.github/workflows/publish.yml actions
  • actions/checkout master composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish v1.1.0 composite
.github/workflows/publishtest.yml actions
  • actions/checkout master composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish v1.1.0 composite
.github/workflows/pythonpackage.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
setup.py pypi