Easily benchmark PyTorch model FLOPs, latency, throughput, max allocated memory and energy consumption

Last update: Dec 22, 2022

Overview

⏱ pytorch-benchmark

Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption

Install

pip install pytorch-benchmark

Usage

import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark


model = efficientnet_b0()
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=100)

Sample results 💻

MacBook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7

device: cpu
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 6
      total: 12
    frequency: 2.60 GHz
    model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  gpus: null
  memory:
    available: 5.86 GB
    total: 16.00 GB
    used: 7.29 GB
  system:
    node: d40049
    release: 21.2.0
    system: Darwin
params: 5288548
timing:
  batch_size_1:
    on_device_inference:
      human_readable:
        batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]
        batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]
      metrics:
        batches_per_second_max: 15.478907181264278
        batches_per_second_mean: 13.528026359855625
        batches_per_second_min: 10.343281300091244
        batches_per_second_std: 1.0922382209314958
        seconds_per_batch_max: 0.09668111801147461
        seconds_per_batch_mean: 0.07443853378295899
        seconds_per_batch_min: 0.06460404396057129
        seconds_per_batch_std: 0.006458734193132054
  batch_size_8:
    on_device_inference:
      human_readable:
        batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]
        batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]
      metrics:
        batches_per_second_max: 2.4673319862230025
        batches_per_second_mean: 1.9696935126370148
        batches_per_second_min: 1.6083039834656554
        batches_per_second_std: 0.11341204895590185
        seconds_per_batch_max: 0.6217730045318604
        seconds_per_batch_mean: 0.509410228729248
        seconds_per_batch_min: 0.40529608726501465
        seconds_per_batch_std: 0.030031445467788704

Server with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU

device: cuda
flops: 401669732
machine_info:
  cpu:
    architecture: x86_64
    cores:
      physical: 16
      total: 32
    frequency: 3.00 GHz
    model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  gpus:
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  - memory: 8192.0 MB
    name: NVIDIA GeForce RTX 2080
  memory:
    available: 119.98 GB
    total: 125.78 GB
    used: 4.78 GB
  system:
    node: monster
    release: 4.15.0-167-generic
    system: Linux
max_inference_memory: 736250368
params: 5288548
post_inference_memory: 21402112
pre_inference_memory: 21402112
timing:
  batch_size_1:
    cpu_to_gpu:
      human_readable:
        batch_latency: 144.815 µs +/- 16.103 µs [136.614 µs, 272.751 µs]
        batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]
      metrics:
        batches_per_second_max: 7319.902268760908
        batches_per_second_mean: 6962.865857677197
        batches_per_second_min: 3666.3496503496503
        batches_per_second_std: 535.0581873859935
        seconds_per_batch_max: 0.0002727508544921875
        seconds_per_batch_mean: 0.00014481544494628906
        seconds_per_batch_min: 0.0001366138458251953
        seconds_per_batch_std: 1.6102982159292097e-05
    gpu_to_cpu:
      human_readable:
        batch_latency: 106.168 µs +/- 17.829 µs [53.167 µs, 248.909 µs]
        batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]
      metrics:
        batches_per_second_max: 18808.538116591928
        batches_per_second_mean: 9639.942102368092
        batches_per_second_min: 4017.532567049808
        batches_per_second_std: 1595.7983033708472
        seconds_per_batch_max: 0.00024890899658203125
        seconds_per_batch_mean: 0.00010616779327392578
        seconds_per_batch_min: 5.316734313964844e-05
        seconds_per_batch_std: 1.7829135190772566e-05
    on_device_inference:
      human_readable:
        batch_latency: 15.567 ms +/- 546.154 µs [15.311 ms, 19.261 ms]
        batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]
      metrics:
        batches_per_second_max: 65.31149174711928
        batches_per_second_mean: 64.30692850265713
        batches_per_second_min: 51.918698784442846
        batches_per_second_std: 1.9599322351815833
        seconds_per_batch_max: 0.019260883331298828
        seconds_per_batch_mean: 0.015567030906677246
        seconds_per_batch_min: 0.015311241149902344
        seconds_per_batch_std: 0.0005461537255227954
    total:
      human_readable:
        batch_latency: 15.818 ms +/- 549.873 µs [15.561 ms, 19.461 ms]
        batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]
      metrics:
        batches_per_second_max: 64.26476266356143
        batches_per_second_mean: 63.28565696640637
        batches_per_second_min: 51.38378232692614
        batches_per_second_std: 1.9198343850767468
        seconds_per_batch_max: 0.019461393356323242
        seconds_per_batch_mean: 0.01581801414489746
        seconds_per_batch_min: 0.015560626983642578
        seconds_per_batch_std: 0.0005498731526138171
  batch_size_8:
    cpu_to_gpu:
      human_readable:
        batch_latency: 805.674 µs +/- 157.254 µs [773.191 µs, 2.303 ms]
        batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]
      metrics:
        batches_per_second_max: 1293.3407338883749
        batches_per_second_mean: 1259.5653105357776
        batches_per_second_min: 434.23791282741485
        batches_per_second_std: 97.51424036939879
        seconds_per_batch_max: 0.002302885055541992
        seconds_per_batch_mean: 0.000805673599243164
        seconds_per_batch_min: 0.0007731914520263672
        seconds_per_batch_std: 0.0001572538140613121
    gpu_to_cpu:
      human_readable:
        batch_latency: 104.215 µs +/- 12.658 µs [59.605 µs, 128.031 µs]
        batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]
      metrics:
        batches_per_second_max: 16777.216
        batches_per_second_mean: 9806.840626578907
        batches_per_second_min: 7810.621973929236
        batches_per_second_std: 1761.6008872740726
        seconds_per_batch_max: 0.00012803077697753906
        seconds_per_batch_mean: 0.00010421514511108399
        seconds_per_batch_min: 5.9604644775390625e-05
        seconds_per_batch_std: 1.2658293070174213e-05
    on_device_inference:
      human_readable:
        batch_latency: 16.623 ms +/- 759.017 µs [16.301 ms, 22.584 ms]
        batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]
      metrics:
        batches_per_second_max: 61.346243290283894
        batches_per_second_mean: 60.25881046175457
        batches_per_second_min: 44.27827629162004
        batches_per_second_std: 2.2193085956672296
        seconds_per_batch_max: 0.02258443832397461
        seconds_per_batch_mean: 0.01662288188934326
        seconds_per_batch_min: 0.01630091667175293
        seconds_per_batch_std: 0.0007590167680596548
    total:
      human_readable:
        batch_latency: 17.533 ms +/- 836.015 µs [17.193 ms, 23.896 ms]
        batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]
      metrics:
        batches_per_second_max: 58.16374528511205
        batches_per_second_mean: 57.140338855126565
        batches_per_second_min: 41.84762740950632
        batches_per_second_std: 2.1985066663972677
        seconds_per_batch_max: 0.023896217346191406
        seconds_per_batch_mean: 0.01753277063369751
        seconds_per_batch_min: 0.017192840576171875
        seconds_per_batch_std: 0.0008360147274630088
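The timing entries report batches per second, so per-sample throughput is that figure multiplied by the batch size. A minimal sketch, assuming the results form a nested dict shaped like the output above; the `samples_per_second` helper and the excerpted `cpu_results` dict are illustrative and not part of the library:

```python
def samples_per_second(results, batch_size, stage="on_device_inference"):
    """Mean samples/s = batch size x mean batches/s for one timing stage."""
    timing = results["timing"][f"batch_size_{batch_size}"][stage]
    return batch_size * timing["metrics"]["batches_per_second_mean"]


# Excerpt of the MacBook Pro (CPU) results above, values copied verbatim.
cpu_results = {
    "timing": {
        "batch_size_1": {"on_device_inference": {"metrics": {
            "batches_per_second_mean": 13.528026359855625}}},
        "batch_size_8": {"on_device_inference": {"metrics": {
            "batches_per_second_mean": 1.9696935126370148}}},
    }
}

for bs in (1, 8):
    print(f"batch size {bs}: {samples_per_second(cpu_results, bs):.2f} samples/s")
```

On this CPU, going from batch size 1 to 8 raises throughput only from about 13.5 to 15.8 samples/s. Applied to the GPU results with `stage="total"`, the same arithmetic gives roughly 63 samples/s at batch size 1 and 457 samples/s at batch size 8.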

Limitations

Usage assumptions:

The model has a __call__ method that takes the sample, i.e. model(sample).
The model also works when the sample has a batch size of 1 (first dimension).
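Both assumptions can be checked before benchmarking. A minimal sketch with a hypothetical helper, `check_usage_assumptions`, which is not part of pytorch-benchmark. It only requires that the sample supports slicing along its first dimension, as a `torch.Tensor` does, and is demonstrated with nested lists so that it runs without PyTorch:

```python
def check_usage_assumptions(model, sample):
    """Raise if `model` violates the usage assumptions listed above."""
    if not callable(model):
        raise TypeError("model must be callable, i.e. model(sample) must work")
    model(sample)      # full batch
    model(sample[:1])  # batch size 1: slicing keeps the batch dimension


class RowSums:
    """Toy stand-in for a model: sums each row of a (B, N) batch."""

    def __call__(self, batch):
        return [sum(row) for row in batch]


sample = [[1.0, 2.0], [3.0, 4.0]]  # (B=2, N=2)
check_usage_assumptions(RowSums(), sample)
```

Any exception raised by the model on either call propagates to the caller, which is usually the quickest way to spot a model that hard-codes its batch size.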

Feature limitations:

Allocated memory uses torch.cuda.max_memory_allocated, which is only available if the model resides on a CUDA device.
Energy consumption can only be measured on NVIDIA Jetson platforms at the moment.
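For reference, peak allocated memory can be read from PyTorch's CUDA memory statistics. This is a hedged sketch of the general pattern, not the library's actual implementation; `peak_cuda_memory` is a hypothetical helper that returns `None` when PyTorch or a CUDA device is unavailable, mirroring the limitation above:

```python
try:
    import torch
except ImportError:  # PyTorch not installed
    torch = None


def peak_cuda_memory(fn):
    """Run fn() and return peak allocated CUDA memory in bytes, or None."""
    if torch is None or not torch.cuda.is_available():
        fn()
        return None
    torch.cuda.init()  # make sure the CUDA context exists
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()  # wait for queued kernels to finish
    return torch.cuda.max_memory_allocated()


print(peak_cuda_memory(lambda: None))
```

Here `fn` would typically be a closure such as `lambda: model(sample)` with both the model and the sample already moved to the CUDA device.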

Citation

If you like the tool and use it in your research, please consider citing it:

@article{hedegaard2022torchbenchmark,
  title={PyTorch Benchmark},
  author={Lukas Hedegaard},
  journal={GitHub. Note: https://github.com/LukasHedegaard/pytorch-benchmark},
  year={2022}
}

Comments

torch cuda synchronize on GPUs?

Hello,

Very happy to see your repo.

I have tested the code and found that the GPU tests may lack a torch.cuda.synchronize call when measuring on-device time. I am not sure how much this affects the results, but I think it would make a difference.

What do you think?

Best,

opened by jizongFox 1
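To illustrate the concern raised here: CUDA kernels launch asynchronously, so a wall-clock timer can stop before the GPU has finished unless the device is synchronized first. A minimal sketch of synchronized timing, assuming nothing about how pytorch-benchmark measures time internally; `time_call` is a hypothetical helper, and the synchronization is skipped when PyTorch or CUDA is unavailable:

```python
import time

try:
    import torch
except ImportError:  # PyTorch not installed
    torch = None


def _sync():
    """Block until pending CUDA work is done; no-op without CUDA."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def time_call(fn, num_runs=10):
    """Return per-run wall-clock durations of fn() in seconds."""
    durations = []
    for _ in range(num_runs):
        _sync()  # do not bill earlier queued work to this run
        start = time.perf_counter()
        fn()
        _sync()  # wait for this run's kernels before stopping the clock
        durations.append(time.perf_counter() - start)
    return durations


durations = time_call(lambda: sum(range(10_000)), num_runs=5)
print(f"mean: {sum(durations) / len(durations):.6f} s")
```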

Releases(0.3.5)

0.3.5(Oct 19, 2022)

This release adds a DOI to the codebase.

0.3.4(Feb 22, 2022)
Fixed

Add missing memory to results.

0.3.3(Feb 22, 2022)
Fixed

Memory measurement for bs=1

0.3.2(Feb 18, 2022)
Fixed

Warm up batch size

0.3.1(Feb 17, 2022)
Removed

try_custom_warmup.

Added

warm_up_fn overload option.

Support for FLOPs count in torch.nn.Module with input other than Tensor.

0.3.0(Feb 15, 2022)
Added

Memory measurement for each batch size.

Repeated energy measurement.

Changed

Number formatting to use u instead of µ.

0.2.2(Feb 14, 2022)
Added

Option to redirect info prints.

0.2.1(Feb 14, 2022)
Fixed

Added missing with torch.no_grad()

0.2.0(Feb 11, 2022)
Added

Overloads for benchmark parameters and functions to allow benchmark of custom classes.

0.1.2(Feb 10, 2022)
Fixed

GPU compatibility.

Removed

Carbon-tracker energy measurement. Library is still too immature at this point.

0.1.1(Feb 10, 2022)
Added

Initial version.


Owner

Lukas Hedegaard
