Ppq - A powerful offline neural network quantization tool with customized IR

Related tags

Deep Learning, ppq
Overview

PPL Quantization Tool (PPL 量化工具)

PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool with customized IR, executor, dispatcher and optimization passes.

Features

  • Quantable graph, a quantization-oriented network representation.
  • Quantize with CUDA: quantization simulation runs 3x ~ 50x faster than with PyTorch.
  • Hardware-friendly: simulated calculations closely match what the target hardware actually computes.
  • Multi-platform support.

Installation

To unleash the full power of this quantization tool, at least one CUDA computing device is required. Install CUDA from the CUDA Toolkit; PPL Quantization Tool uses the CUDA compiler to build its CUDA kernels at runtime.

ATTENTION: For PyTorch users, PyTorch may ship with a minimized set of CUDA libraries that does not satisfy the requirements of this tool; you have to install CUDA from NVIDIA manually.

ATTENTION: Make sure your Python version is >= 3.6.0. PPL Quantization Tool uses language features that require Python >= 3.6.0.

  • Install from source:
  1. Run the following commands in your terminal (for Windows users, use the command line instead):
git clone https://github.com/openppl-public/ppq.git
cd ppq
python setup.py install
  2. Wait for Python to finish the installation and pray for a bug-free build.
  • Install from wheel:
  1. Download a compiled Python wheel from the following link: PPL Quantization Tool
  2. Run the following command in your terminal or command line (Windows): "pip install ppq.wheel", and pray for a bug-free build. A quick import check is shown below.
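
You can sanity-check the installation before moving on. This is a minimal sketch that only imports names already used in the quick-start example further down; it should run without an ImportError:

# check_install.py -- minimal import check (hypothetical file name)
from ppq import TargetPlatform
from ppq.api import export_ppq_graph, quantize_torch_model

print(TargetPlatform.PPL_CUDA_INT8)  # prints the enum member if PPQ was installed correctly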

Tutorials and Examples

  1. The user guide and system design docs can be found at /doc/pages/instructions of this repository; PPL Quantization Tool documents are written in pure HTML5.
  2. Examples can be found at /ppq/samples.
  3. Let's quantize your network with the following code (a sketch of preparing its inputs follows the snippet):
from ppq.api import export_ppq_graph, quantize_torch_model
from ppq import TargetPlatform

# quantize your model within one single line:
quantized = quantize_torch_model(
    model=model, calib_dataloader=calibration_dataloader,
    calib_steps=32, input_shape=(1, 3, 224, 224),
    setting=quant_setting, collate_fn=collate_fn,
    platform=TargetPlatform.PPL_CUDA_INT8,
    device=DEVICE, verbose=0)

# export quantized graph with another line:
export_ppq_graph(
    graph=quantized, platform=TargetPlatform.PPL_CUDA_INT8,
    graph_save_to='Output/quantized(onnx).onnx',
    config_save_to='Output/quantized(onnx).json')
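
The snippet above assumes that model, calibration_dataloader, quant_setting, collate_fn and DEVICE have already been prepared by you. A hedged sketch of one way to set them up (the torchvision model and the random calibration tensors are placeholders for illustration; QuantizationSettingFactory is the helper used in PPQ's bundled sample scripts):

import torch
import torchvision
from ppq import QuantizationSettingFactory  # hedged: the sample scripts pull this in via wildcard imports from ppq / ppq.api

DEVICE = 'cuda'                                        # PPQ simulates quantization on a CUDA device
model = torchvision.models.resnet18(pretrained=True)  # placeholder: any torch.nn.Module exportable to ONNX
model = model.to(DEVICE)

# Calibration data: random tensors for illustration only, use real samples in practice.
calibration_dataloader = [torch.rand(1, 3, 224, 224) for _ in range(32)]
collate_fn = lambda batch: batch.to(DEVICE)            # moves each calibration batch onto DEVICE

# Start from the default quantization setting, then tune individual passes if needed.
quant_setting = QuantizationSettingFactory.default_setting()

Wrapping the quantize_torch_model call in with ENABLE_CUDA_KERNEL(): (imported from ppq.api) lets PPQ compile its CUDA kernels for the faster simulation path, as the sample script quoted in the comments below does.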

Contact Us

WeChat Official Account: OpenPPL
QQ Group: 627853444

Email: [email protected]

Other Resources

Contributions

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a different direction than you might be aware of.

Benchmark

PPQ is tested with models from mmlab-classification, mmlab-detection, mmlab-segmentation and mmlab-editing; here we list part of our testing results.

  • No quantization optimization procedures were applied to the following models (a sketch of enabling the optional passes follows the notes below).
| Model         | Type           | Calibration | Dispatcher   | Metric      | PPQ(sim)        | PPLCUDA         | FP32            |
| ------------- | -------------- | ----------- | ------------ | ----------- | --------------- | --------------- | --------------- |
| Resnet-18     | Classification | 512 imgs    | conservative | Acc-Top-1   | 69.50%          | 69.42%          | 69.88%          |
| ResNeXt-101   | Classification | 512 imgs    | conservative | Acc-Top-1   | 78.46%          | 78.37%          | 78.66%          |
| SE-ResNet-50  | Classification | 512 imgs    | conservative | Acc-Top-1   | 77.24%          | 77.26%          | 77.76%          |
| ShuffleNetV2  | Classification | 512 imgs    | conservative | Acc-Top-1   | 69.13%          | 68.85%          | 69.55%          |
| MobileNetV2   | Classification | 512 imgs    | conservative | Acc-Top-1   | 70.99%          | 71.1%           | 71.88%          |
| retinanet     | Detection      | 32 imgs     | pplnn        | bbox_mAP    | 36.1%           | 36.1%           | 36.4%           |
| faster_rcnn   | Detection      | 32 imgs     | pplnn        | bbox_mAP    | 36.6%           | 36.7%           | 37.0%           |
| fsaf          | Detection      | 32 imgs     | pplnn        | bbox_mAP    | 36.5%           | 36.6%           | 37.4%           |
| mask_rcnn     | Detection      | 32 imgs     | pplnn        | bbox_mAP    | 37.7%           | 37.6%           | 37.9%           |
| deeplabv3     | Segmentation   | 32 imgs     | conservative | aAcc / mIoU | 96.13% / 78.81% | 96.14% / 78.89% | 96.17% / 79.12% |
| deeplabv3plus | Segmentation   | 32 imgs     | conservative | aAcc / mIoU | 96.27% / 79.39% | 96.26% / 79.29% | 96.29% / 79.60% |
| fcn           | Segmentation   | 32 imgs     | conservative | aAcc / mIoU | 95.75% / 74.56% | 95.62% / 73.96% | 95.68% / 72.35% |
| pspnet        | Segmentation   | 32 imgs     | conservative | aAcc / mIoU | 95.79% / 77.40% | 95.79% / 77.41% | 95.83% / 77.74% |
| srcnn         | Editing        | 32 imgs     | conservative | PSNR / SSIM | 27.88% / 79.70% | 27.88% / 79.07% | 28.41% / 81.06% |
| esrgan        | Editing        | 32 imgs     | conservative | PSNR / SSIM | 27.84% / 75.20% | 27.49% / 72.90% | 27.51% / 72.84% |
  • PPQ(sim) stands for PPQ quantization simulator's result.
  • Dispatcher stands for the dispatching policy of PPQ.
  • Classification models are evaluated on ImageNet; detection and segmentation models are evaluated on the COCO dataset; editing models are evaluated on the DIV2K dataset.
  • All calibration datasets are randomly picked from training data.
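
The numbers above were produced with default settings and no extra optimization passes. If you want to try the optional passes, the quantization setting object controls them; a hedged sketch using only fields that appear in the sample script quoted in the comments below:

from ppq import QuantizationSettingFactory, TargetPlatform

setting = QuantizationSettingFactory.default_setting()
setting.lsq_optimization = True                # re-train the network to reduce quantization error
setting.lsq_optimization_setting.steps = 500   # re-training steps; 500 steps take a few minutes
setting.equalization = True                    # per-tensor platforms usually benefit from equalization
# keep a specific operation in FP32 on platforms that support mixed precision:
setting.dispatching_table.append(operation='OP NAME', platform=TargetPlatform.FP32)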

License

This project is distributed under the Apache License, Version 2.0.

Comments
  • PPQ can not complie cuda extensions, please check your compiler and system environment, PPQ will disable CUDA KERNEL for now.

    RTX2080Ti, Python 3.8.13, ninja 1.5.1, ppq 0.6.4, PyTorch 1.12.0, tensorrt 8.4.1.5; export PATH=/usr/local/cuda-11.1/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH

    When importing ppq, it raised this prompt message. Could you please give some advice? @zchrissirhcz @ouonline

    opened by songkq 26
  • Error when executing on CPU

    It's me again. I tried running the efficientnet-lite4-11.onnx model from the official ONNX model zoo on CPU and got an error. With the kl or mse calibration strategy, the assert at line 582 of quantization/optim/refine.py is triggered, saying that some operator was not quantized correctly; with the minmax strategy the problem does not occur. All of the above was done on CPU (I have no GPU here).

    Can I fix this error by changing some code, or am I limited to the minmax strategy when running on CPU?

    opened by Menace-Dragon 18
  • AttributeError: 'Operation' object has no attribute 'config'

    Traceback (most recent call last): File "ProgramEntrance.py", line 200, in export_ppq_graph( File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/api/interface.py", line 628, in export_ppq_graph exporter.export(file_path=graph_save_to, config_path=config_save_to, graph=graph, **kwargs) File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/parser/trt_exporter.py", line 53, in export self.export_quantization_config(config_path, graph) File "/root/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/ppq-0.6.5.1-py3.8.egg/ppq/parser/trt_exporter.py", line 29, in export_quantization_config input_cfg = op.config.input_quantization_config[0] AttributeError: 'Operation' object has no attribute 'config'. This error occurs when calling trt_int8 in ProgramEntrance.py, but calling PPL_CUDA_INT8 does not have this problem.

    opened by kaizhong2021 14
  • About a bug at line 125 of scheduler/dispatcher.py

    Great project! But I got an error when running the efficientnet-lite4-11.onnx model from the official ONNX model zoo. The error occurs at line 125 of scheduler/dispatcher.py. My analysis of the cause:

    1. The model's graph contains a flow like ···-->Conv-->BN-->Clip-->···. PPQ fuses Conv and BN by default, but the fused operation is appended to the end of graph.operations.
    2. When binding a platform to Clip, the statement at line 125 of scheduler/dispatcher.py is executed.

    Putting 1 and 2 together: at that point the dispatching_table contains no information about the fused ConvBN operation, which causes the error. It is an ordering problem; I leave it to you to decide how best to fix it.

    opened by Menace-Dragon 11
  • RuntimeError: Error happens when dealing with operation ConstantOfShape_1246(TargetPlatform.SOI)

    I ran ppq/samples/Tutorial/quantize.py with a swin-transformer model and the target platform TRT_INT8, and got the following error: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 541, in __forward outputs = operation_forward_func(operation, inputs, self._executing_context) File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/op/torch/default.py", line 1197, in ConstantOfShape_forward output = torch.Tensor().new_full( TypeError: new_full(): argument 'size' must be tuple of ints, not list

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last): File "quantize_test.py", line 71, in quantized = quantize_onnx_model( File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/api/interface.py", line 259, in quantize_onnx_model quantizer.quantize( File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/quantization/quantizer/base.py", line 61, in quantize executor.tracing_operation_meta(inputs=inputs) File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/core/defs.py", line 54, in _wrapper return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 603, in tracing_operation_meta self.__forward( File "/usr/local/lib/python3.8/dist-packages/ppq-0.6.6-py3.8.egg/ppq/executor/torch.py", line 568, in __forward raise RuntimeError(f'Error happens when dealing with operation {str(operation)}') from _ RuntimeError: Error happens when dealing with operation ConstantOfShape_1246(TargetPlatform.SOI) - inputs:['onnx::ConstantOfShape_4472'], outputs:['onnx::Concat_4473']

    opened by shhn1 9
  • A possible reason why PPQ INT8 exported to the onnx platform performs poorly

    I mentioned this under another issue, #329, but it seems better to open a new issue for it.

    The reason ppq currently performs poorly at int8 may well be the export format. Compared with the unquantized model, the model exported by ppq not only introduces extra quantize and dequantize operations, it also performs the actual computation in fp32. The official onnx int8 models are not exported this way; instead they use operators dedicated to quantized data, such as QLinearConv, for the computation. Take mobilenet as an example (the download repository is here); the attached images show the unquantized model and the officially quantized model.

    The next image shows the model produced by ppq with the export platform set to ONNXRUNTIME.

    As you can see, because the official implementation uses operators such as QLinearConv that operate on quantized data, it does not insert quantize/dequantize operators everywhere, and it also applies graph optimizations (for example, the clip is optimized away; a further possible optimization is that a QConv followed by another QConv might also be fusable). ppq, by contrast, not only inserts a large number of quantize and dequantize operators but also does the actual computation in fp32. This may be why the onnx model exported by ppq is inefficient.

    (This is only a guess and may not be correct.)
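
    For reference, a hedged sketch of how the official QLinearConv-style (QOperator format) models can be produced with onnxruntime's own quantization tooling; this is onnxruntime's API, not PPQ's, and the model paths and input name are placeholders:

    import numpy as np
    from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                          QuantType, quantize_static)

    class RandomReader(CalibrationDataReader):
        # feeds a few random batches; real calibration should use real samples
        def __init__(self, input_name='input', count=8):
            self.batches = iter([{input_name: np.random.rand(1, 3, 224, 224).astype('float32')}
                                 for _ in range(count)])
        def get_next(self):
            return next(self.batches, None)

    quantize_static(
        model_input='mobilenet_fp32.onnx',    # placeholder path
        model_output='mobilenet_int8.onnx',   # placeholder path
        calibration_data_reader=RandomReader(),
        quant_format=QuantFormat.QOperator,   # emits QLinearConv instead of QuantizeLinear/DequantizeLinear pairs
        activation_type=QuantType.QUInt8,
        weight_type=QuantType.QInt8)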

    opened by Pzzzzz5142 8
  • Export problem after PPL_DSP_INT8 quantization

    In GPU mode, running RetinaFace (with a ResNet50 backbone), the quantization process completes successfully, but export fails with TypeError: Cannot convert Resize_133 to caffe op. Debugging shows this is because the check at line 439 of ppq/parser/caffe/caffe_export_utils.py is not satisfied.

    opened by Menace-Dragon 8
  • Quantized model converted to a TensorRT int8 engine: inference results do not match

    Hello, I am trying to use PPQ quantization to obtain a TensorRT int8 model. I found that when the model is relatively large, converting the QDQ onnx model to a TRT int8 engine seems to have an accuracy problem (the results do not match). Specifically, a small model such as mnist matches (errors around 1e-7), but a slightly larger model such as resnet50 shows a large error. I am not sure whether I am doing something wrong; for now I lean towards the TensorRT conversion step introducing the error, so I opened an issue in the TensorRT repo, see https://github.com/NVIDIA/TensorRT/issues/2103. Have you run into a similar problem? Thanks!

    opened by FreemanHsu 7
  • The Upsample operator does not seem to support quantization; ConvTranspose seems unable to complete BN folding

    a. When running an onnx test model, I get the error that the Upsample operator has no backend on the target platform. It seems PPQ does not yet support quantizing the Upsample operator; is support planned?

    b. When running another onnx test model, the graph contains ...-->ConvTranspose-->BatchNorm-->ReLU-->..., and an error is raised saying the ConvTranspose operator cannot be folded with BN.

    c. One more question: if the graph is ...-->BatchNorm-->Conv-->ReLU-->..., can folding be performed?

    opened by Menace-Dragon 7
  • Cannot correctly get the output of bn

    The error is as follows:

      File "/workspace/ppq/ppq/IR/morph.py", line 275, in format_sng_bn
        bn_out_var = bn_op.outputs[0]
    IndexError: list index out of range
    

    The model is one of the official onnx models.

    The code is the example code; I only changed the device part, because I don't have an NVIDIA card right now.

    # ---------------------------------------------------------------
    # This script shows you how to run inference with onnxruntime on a model exported by PPQ.
    # Note that onnxruntime can run all kinds of quantization schemes, but quantization brings
    # almost no speedup to onnxruntime itself.
    # You can use onnxruntime to verify the quantization scheme and the correctness of ppq's
    # quantization, but it is not a reasonable deployment platform.
    # Change QUANT_PLATFROM to use a different quantization scheme.
    
    # This Script export ppq internal graph to onnxruntime,
    # you should notice that onnx is designed as an Open Neural Network Exchange format.
    # It has the capability to describe most of ppq's quantization policies, including combinations of:
    #   Symmetrical, Asymmetrical, POT, Per-channel, Per-Layer
    # However onnxruntime can not accelerate quantized model in most cases,
    # you are supposed to use onnxruntime for verifying your network quantization result only.
    # ---------------------------------------------------------------
    
    # For this onnx inference test, all test data is randomly picked.
    # If you want to use real data, just rewrite the definition of SAMPLES
    import onnxruntime
    import torch
    from ppq import *
    from ppq.api import *
    from tqdm import tqdm
    
    QUANT_PLATFROM = TargetPlatform.TRT_INT8
    MODEL = "converted.onnx"
    INPUT_SHAPE = [1, 3, 480, 640]
    SAMPLES = [
        torch.rand(size=INPUT_SHAPE) for _ in range(256)
    ]  # rewrite this to use real data.
    DEVICE = "cpu"
    FINETUNE = True
    QS = QuantizationSettingFactory.default_setting()
    EXECUTING_DEVICE = "cpu"
    REQUIRE_ANALYSE = True
    
    # -------------------------------------------------------------------
    # Commonly tuned parameters are shown below:
    # -------------------------------------------------------------------
    QS.lsq_optimization = FINETUNE  # start network re-training to reduce quantization error
    QS.lsq_optimization_setting.steps = 500  # re-training steps; affects training time, 500 steps take a few minutes
    QS.lsq_optimization_setting.collecting_device = (
        "cpu"  # where cached data is kept; 'cuda' keeps it on the gpu, switch to 'cpu' if you run out of GPU memory
    )
    
    if QUANT_PLATFROM in {
        TargetPlatform.PPL_DSP_INT8,  # these platforms use per-tensor quantization
        TargetPlatform.HEXAGON_INT8,
        TargetPlatform.SNPE_INT8,
        TargetPlatform.METAX_INT8_T,
        TargetPlatform.FPGA_INT8,
    }:
        QS.equalization = True  # per-tensor quantization platforms need equalization
    
    if QUANT_PLATFROM in {
        TargetPlatform.PPL_CUDA_INT8,  # note: before doing this, make sure your execution framework supports mixed-precision and floating-point execution
        TargetPlatform.TRT_INT8,
    }:
        QS.dispatching_table.append(operation="OP NAME", platform=TargetPlatform.FP32)
    
    print("正准备量化你的网络,检查下列设置:")
    print(f"TARGET PLATFORM      : {QUANT_PLATFROM.name}")
    print(f"NETWORK INPUTSHAPE   : {INPUT_SHAPE}")
    
    # ENABLE CUDA KERNEL speeds quantization up by 3x ~ 10x, but it cannot be compiled unless the
    # corresponding build environment is installed.
    # You can try installing the build environment, or finish quantization without the CUDA KERNEL:
    # just remove the `with ENABLE_CUDA_KERNEL():` line.
    # with ENABLE_CUDA_KERNEL():
    with open("a", "w") as fl:
        qir = quantize_onnx_model(
            onnx_import_file=MODEL,
            calib_dataloader=SAMPLES,
            calib_steps=128,
            setting=QS,
            input_shape=INPUT_SHAPE,
            collate_fn=lambda x: x.to(EXECUTING_DEVICE),
            platform=QUANT_PLATFROM,
            do_quantize=True,
        )
    
        # -------------------------------------------------------------------
        # When PPQ computes quantization error, it uses the inverse of the signal-to-noise ratio
        # as the metric, i.e. noise energy / signal energy.
        # A quantization error of 0.1 means the quantization noise carries roughly 10% of the
        # energy of the overall signal.
        # Note that graphwise_error_analyse measures the accumulated error: the last layer of a
        # network usually shows a large accumulated error, caused jointly by all the layers before it.
        # Use layerwise_error_analyse to trace the source of the error layer by layer.
        # -------------------------------------------------------------------
        print("Computing network quantization error (SNR); the error of the last layer should be below 0.1 to preserve quantization accuracy:")
        reports = graphwise_error_analyse(
            graph=qir,
            running_device=EXECUTING_DEVICE,
            steps=32,
            dataloader=SAMPLES,
            collate_fn=lambda x: x.to(EXECUTING_DEVICE),
        )
        for op, snr in reports.items():
            if snr > 0.1:
                ppq_warning(f"层 {op} 的累计量化误差显著,请考虑进行优化")
    
        if REQUIRE_ANALYSE:
            print("正计算逐层量化误差(SNR),每一层的独立量化误差应小于 0.1 以保证量化精度:")
            layerwise_error_analyse(
                graph=qir,
                running_device=EXECUTING_DEVICE,
                interested_outputs=None,
                dataloader=SAMPLES,
                collate_fn=lambda x: x.to(EXECUTING_DEVICE),
            )
    
        print("网络量化结束,正在生成目标文件:")
        export_ppq_graph(
            graph=qir, platform=QUANT_PLATFROM, graph_save_to="model_int8.onnx"
        )
    
        exit(0)
    
        # -------------------------------------------------------------------
        # Record the input and output names; onnxruntime needs them at run time.
        # This handles only the single-input, single-output case; adapt it yourself for multiple inputs/outputs.
        # -------------------------------------------------------------------
        int8_input_names = [name for name, _ in qir.inputs.items()]
        int8_output_names = [name for name, _ in qir.outputs.items()]
    
        # -------------------------------------------------------------------
        # Run inference with onnxruntime.
        # As of 2022.05, onnxruntime runs int8 quite slowly, so don't expect it to be fast.
        # If you know how to make it faster, or onnxruntime gets updated, feel free to contact me.
        # -------------------------------------------------------------------
        session = onnxruntime.InferenceSession(
            "model_int8.onnx", providers=["CUDAExecutionProvider"]
        )
        onnxruntime_results = []
        for sample in tqdm(
            SAMPLES, desc="ONNXRUNTIME GENERATEING OUTPUTS", total=len(SAMPLES)
        ):
            result = session.run(None, {int8_input_names[0]: convert_any_to_numpy(sample)})
            onnxruntime_results.append(result)
    
    

    I also converted the opset (to 12), but still cannot read the output of the bn layer. Besides, I would like to quantize an onnx model and export it back to onnx format; how should I do that? From other issues it looks like QUANT_PLATFROM should be set to TargetPlatform.ONNXRUNTIME, but the current version does not seem to support that platform.

    opened by Pzzzzz5142 6
  • RuntimeError of Shape op during Calibration dataset progress and finetune progress

    [screenshot of the error]

    Configuration:

    TARGET_PLATFORM = TargetPlatform.NXP_INT8  # choose your target platform
    MODEL_TYPE = NetworkFramework.ONNX         # or NetworkFramework.CAFFE
    INPUT_LAYOUT = 'chw'                       # input data layout, chw or hwc
    NETWORK_INPUTSHAPE = [16, 1, 40, 61]       # input shape of your network
    CALIBRATION_BATCHSIZE = 16                 # batchsize of calibration dataset
    EXECUTING_DEVICE = 'cuda'                  # 'cuda' or 'cpu'.
    REQUIRE_ANALYSE = True
    DUMP_RESULT = False

    SETTING = UnbelievableUserFriendlyQuantizationSetting(
        platform = TARGET_PLATFORM, finetune_steps = 2500, finetune_lr = 1e-3,
        calibration = 'percentile', equalization = True, non_quantable_op = None)
    dataloader = DataLoader(dataset=calibration_dataset, batch_size=32, shuffle=True)
    quantized = quantize(
        working_directory=WORKING_DIRECTORY, setting=SETTING,
        model_type=MODEL_TYPE, executing_device=EXECUTING_DEVICE,
        input_shape=NETWORK_INPUTSHAPE, target_platform=TARGET_PLATFORM,
        dataloader=dataloader, calib_steps=250)

    Problem description:

    At iteration 213 the Shape operator reports the error above. Working it out, that iteration has batch size = 19; printing logs inside the dataloader iterator confirmed that this finetune batch really did yield only 19 samples. It turned out that the dataset is exhausted exactly at iteration 213. After I changed both the finetune step and calib_step to 100 and enlarged the calibration dataset to 32*100 samples, everything runs normally. Here is the model file: model.zip
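
    A hedged workaround sketch (not from this thread): dropping the incomplete final batch keeps every calibration/finetune iteration at a full batch; drop_last is a standard torch.utils.data.DataLoader argument.

    from torch.utils.data import DataLoader

    # drop_last discards the final smaller batch, so every iteration sees a full batch of 32
    dataloader = DataLoader(dataset=calibration_dataset, batch_size=32,
                            shuffle=True, drop_last=True)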

    opened by lycfly 6
  • parse onnx model failed

    problem:

    [email protected]:~/workspace$ ~/libraries/TensorRT-8.4.1.5/bin/trtexec --onnx=./unet-q.onnx --saveEngine=unet-int8.engine
    &&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /home/kls/libraries/TensorRT-8.4.1.5/bin/trtexec --onnx=./unet-q.onnx --saveEngine=unet-int8.engine
    [01/06/2023-17:32:20] [I] === Model Options ===
    [01/06/2023-17:32:20] [I] Format: ONNX
    [01/06/2023-17:32:20] [I] Model: ./unet-q.onnx
    [01/06/2023-17:32:20] [I] Output:
    [01/06/2023-17:32:20] [I] === Build Options ===
    [01/06/2023-17:32:20] [I] Max batch: explicit batch
    [01/06/2023-17:32:20] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
    [01/06/2023-17:32:20] [I] minTiming: 1
    [01/06/2023-17:32:20] [I] avgTiming: 8
    [01/06/2023-17:32:20] [I] Precision: FP32
    [01/06/2023-17:32:20] [I] LayerPrecisions: 
    [01/06/2023-17:32:20] [I] Calibration: 
    [01/06/2023-17:32:20] [I] Refit: Disabled
    [01/06/2023-17:32:20] [I] Sparsity: Disabled
    [01/06/2023-17:32:20] [I] Safe mode: Disabled
    [01/06/2023-17:32:20] [I] DirectIO mode: Disabled
    [01/06/2023-17:32:20] [I] Restricted mode: Disabled
    [01/06/2023-17:32:20] [I] Build only: Disabled
    [01/06/2023-17:32:20] [I] Save engine: unet-int8.engine
    [01/06/2023-17:32:20] [I] Load engine: 
    [01/06/2023-17:32:20] [I] Profiling verbosity: 0
    [01/06/2023-17:32:20] [I] Tactic sources: Using default tactic sources
    [01/06/2023-17:32:20] [I] timingCacheMode: local
    [01/06/2023-17:32:20] [I] timingCacheFile: 
    [01/06/2023-17:32:20] [I] Input(s)s format: fp32:CHW
    [01/06/2023-17:32:20] [I] Output(s)s format: fp32:CHW
    [01/06/2023-17:32:20] [I] Input build shapes: model
    [01/06/2023-17:32:20] [I] Input calibration shapes: model
    [01/06/2023-17:32:20] [I] === System Options ===
    [01/06/2023-17:32:20] [I] Device: 0
    [01/06/2023-17:32:20] [I] DLACore: 
    [01/06/2023-17:32:20] [I] Plugins:
    [01/06/2023-17:32:20] [I] === Inference Options ===
    [01/06/2023-17:32:20] [I] Batch: Explicit
    [01/06/2023-17:32:20] [I] Input inference shapes: model
    [01/06/2023-17:32:20] [I] Iterations: 10
    [01/06/2023-17:32:20] [I] Duration: 3s (+ 200ms warm up)
    [01/06/2023-17:32:20] [I] Sleep time: 0ms
    [01/06/2023-17:32:20] [I] Idle time: 0ms
    [01/06/2023-17:32:20] [I] Streams: 1
    [01/06/2023-17:32:20] [I] ExposeDMA: Disabled
    [01/06/2023-17:32:20] [I] Data transfers: Enabled
    [01/06/2023-17:32:20] [I] Spin-wait: Disabled
    [01/06/2023-17:32:20] [I] Multithreading: Disabled
    [01/06/2023-17:32:20] [I] CUDA Graph: Disabled
    [01/06/2023-17:32:20] [I] Separate profiling: Disabled
    [01/06/2023-17:32:20] [I] Time Deserialize: Disabled
    [01/06/2023-17:32:20] [I] Time Refit: Disabled
    [01/06/2023-17:32:20] [I] Inputs:
    [01/06/2023-17:32:20] [I] === Reporting Options ===
    [01/06/2023-17:32:20] [I] Verbose: Disabled
    [01/06/2023-17:32:20] [I] Averages: 10 inferences
    [01/06/2023-17:32:20] [I] Percentile: 99
    [01/06/2023-17:32:20] [I] Dump refittable layers:Disabled
    [01/06/2023-17:32:20] [I] Dump output: Disabled
    [01/06/2023-17:32:20] [I] Profile: Disabled
    [01/06/2023-17:32:20] [I] Export timing to JSON file: 
    [01/06/2023-17:32:20] [I] Export output to JSON file: 
    [01/06/2023-17:32:20] [I] Export profile to JSON file: 
    [01/06/2023-17:32:20] [I] 
    [01/06/2023-17:32:20] [I] === Device Information ===
    [01/06/2023-17:32:20] [I] Selected Device: NVIDIA A10
    [01/06/2023-17:32:20] [I] Compute Capability: 8.6
    [01/06/2023-17:32:20] [I] SMs: 72
    [01/06/2023-17:32:20] [I] Compute Clock Rate: 1.695 GHz
    [01/06/2023-17:32:20] [I] Device Global Memory: 22731 MiB
    [01/06/2023-17:32:20] [I] Shared Memory per SM: 100 KiB
    [01/06/2023-17:32:20] [I] Memory Bus Width: 384 bits (ECC enabled)
    [01/06/2023-17:32:20] [I] Memory Clock Rate: 6.251 GHz
    [01/06/2023-17:32:20] [I] 
    [01/06/2023-17:32:20] [I] TensorRT version: 8.4.1
    [01/06/2023-17:32:21] [I] [TRT] [MemUsageChange] Init CUDA: CPU +535, GPU +0, now: CPU 542, GPU 499 (MiB)
    [01/06/2023-17:32:21] [I] Start parsing network model
    [01/06/2023-17:32:21] [I] [TRT] ----------------------------------------------------------------
    [01/06/2023-17:32:21] [I] [TRT] Input filename:   ./unet-q.onnx
    [01/06/2023-17:32:21] [I] [TRT] ONNX IR version:  0.0.7
    [01/06/2023-17:32:21] [I] [TRT] Opset version:    13
    [01/06/2023-17:32:21] [I] [TRT] Producer name:    PPL Quantization Tool
    [01/06/2023-17:32:21] [I] [TRT] Producer version: 
    [01/06/2023-17:32:21] [I] [TRT] Domain:           
    [01/06/2023-17:32:21] [I] [TRT] Model version:    0
    [01/06/2023-17:32:21] [I] [TRT] Doc string:       
    [01/06/2023-17:32:21] [I] [TRT] ----------------------------------------------------------------
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:720: While parsing node number 23 [QuantizeLinear -> "PPQ_Variable_297"]:
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:722: input: "outc.conv.weight"
    input: "PPQ_Variable_295"
    input: "PPQ_Variable_296"
    output: "PPQ_Variable_297"
    name: "PPQ_Operation_98"
    op_type: "QuantizeLinear"
    attribute {
      name: "axis"
      i: 0
      type: INT
    }
    
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:723: --- End node ---
    [01/06/2023-17:32:21] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:1096 In function QuantDequantLinearHelper:
    [6] Assertion failed: axis == INVALID_AXIS && "Quantization axis attribute is not valid with a single quantization scale"
    [01/06/2023-17:32:21] [E] Failed to parse onnx file
    [01/06/2023-17:32:21] [I] Finish parsing network model
    [01/06/2023-17:32:21] [E] Parsing model failed
    [01/06/2023-17:32:21] [E] Failed to create engine from model or file.
    [01/06/2023-17:32:21] [E] Engine set up failed
    
    opened by nanmi 1
  • evaluation_with_imagenet.py fails

    Running the official example with Resnet50 reports an error: ppq/ppq/samples/Imagenet/evaluation_with_imagenet.py Test: [700 / 781] [email protected] 75.843 (75.843) [email protected] 92.812 (92.812) Evaluating Model...: 100%|██████████| 781/781 [00:34<00:00, 22.68it/s]

    • [email protected] 75.804 [email protected] 92.808 [Warning] File Output/resnet50.onnx is already existed, Exporter will overwrite it. /opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:53: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider' warnings.warn("Specified provider '{}' is not in available provider names." Traceback (most recent call last): File "evaluation_with_imagenet.py", line 84, in evaluate_onnx_module_with_imagenet( File "/home/li.sun/github/ppq/ppq/samples/Imagenet/Utilities/Imagenet/imagenet_util.py", line 103, in evaluate_onnx_module_with_imagenet sess = onnxruntime.InferenceSession(path_or_bytes=onnxruntime_model_path, providers=['CUDAExecutionProvider']) File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in init self._create_inference_session(providers, provider_options, disabled_optimizers) File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 310, in _create_inference_session sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model) onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from Output/resnet50.onnx failed:Type Error: Type (tensor(float)) of output arg (PPQ_Variable_2) of node (PPQ_Operation_0) does not match expected type (tensor(int8)).

    Environment:

    1. ppq git commit id: commit 76e03261bad580e7c52e6f0856034fa9313f69b5 (HEAD -> master, origin/master, origin/HEAD) Author: AwesomeCodingBoy [email protected] Date: Tue Dec 13 14:03:47 2022 +0800

      Update inference_with_ncnn.md (#324)

    onnxruntime 1.13.1 onnxruntime-gpu 1.13.1 or onnxruntime 1.8.1 onnxruntime-gpu 1.8.1

    opened by yuyinsl 2
  • How to export the quantized onnx file and save weight in int8 format?

    Hi, I really appreciate the wonderful tutorials on bilibili. As a beginner in quantization, I learned a lot from the series of classes. To deploy an onnx model to a specified TPU, the current solution offered by the vendor is to store the output of the ONNX file in int8 format. However, the current default storage format is float32. Could you please tell me how to change this setting?

    For more details you can check this issue: https://github.com/sophgo/tpu-mlir/issues/51. They suggested that we export onnx in the first format.

    opened by Jackycheng0808 1
  • Convert Yolov4

    Hi, I have a yolov4 model that I want to run on TensorRT INT8. I read the documentation but am having a hard time following it as an English speaker. Can you please guide me on how to convert the model and prepare the dataset for the ProgramEntrance.py script? I have a dataset in Yolo format.

    Thanks

    opened by Sayyam-Jain 1
  • How can I tell whether a model is suitable for quantization and will benefit from it?

    Hi, I learned about this project from the bilibili videos; they explain things very clearly, and I have already liked, coined and favorited them. The videos mention that some networks do not benefit from quantization and may even regress, but they don't explain in detail how to judge this, and I'm afraid of going through the whole process only to get a poor result. Does this depend on the platform and inference framework? For example, for android / arm64 / ncnn, are there good criteria for making this call? Thanks!

    opened by zuowanbushiwo 3
Releases(v0.6.5)
  • v0.6.5(Sep 2, 2022)

    • Analyzer
      • Added a new analysis method statistical_analyse
      • Allowed analysing operators with multiple outputs
      • Redesigned how cosine similarity is computed
    • API
      • Added a new api: load_native_graph
      • Added a new api: register_network_quantizer
      • Added a new api: register_network_parser
      • Added a new api: register_network_exporter
      • Allowed calling api functions with setting=None
    • Executor
      • Support for 1d and 3d convolution
      • Support for 1d and 3d pooling
      • Support for 1d and 3d deconvolution
      • Support for lstm and gru
      • Support for sin and cos
      • Support for abs
      • Support for sum
      • Support for Erf, Elu, Reciprocal
      • Rewrote the resize, slice and scatterND implementations
      • Removed the restrictions on operator registration; you can now override ppq's built-in operator implementations
      • Fixed padding issues in Conv, Pooling, ConvTranspose and Pad, adapting to onnx 1d, 2d and 3d padding and running with higher performance
    • Dispatcher
      • Added new dispatchers purseus and allin
      • Added a new data-type abstraction opsocket; this abstraction will move into ppq.IR in the next version
      • The default subgraph-splitting method is changed to purseus
      • Added dispatching warning messages
    • Observer
      • Added a new calibration method OrderPreserving; order-preserving quantization will be applied to classification networks to improve classification performance
      • Added an asymmetric implementation of mse, plus a c++ implementation of mse
    • Graph
      • Support bn fusion for 1d and 3d convolution and deconvolution
      • Variable gains a shape attribute; you can modify shape directly to set a dynamic shape
      • The graph-matching engine allows pattern matching with ep_expr = None
      • Support graph copying
    • Optim
      • The LSQ algorithm was rewritten with a large performance gain; Advanced optimization was merged with LSQ and is now called CuLSQ
      • The Brecq algorithm was moved to legacy and is no longer recommended
      • The layerSplit and BiasCorrection algorithms were rewritten with better performance
      • The Layerwise Equalization algorithm was rewritten; it now supports 1d, 2d and 3d convolution and deconvolution, and supports include act
      • Fixed an error in average pooling operator alignment
      • Fixed issues related to bias and pad quantization
      • Removed RuntimePerlayerCalibrationPass; its parameters no longer take effect
      • Removed ConstantBakingPass; its parameters no longer take effect
      • Removed InplaceQuantizationSettingPass; its parameters no longer take effect
      • Removed the fuse_conv_add setting option; the related optimization pass was moved into the legacy file and must now be invoked manually
    • Doc
      • Added documentation for the common optimization passes
      • Added a yolo quantization example
      • Added new getting-started example code
    • Cuda
      • Rewrote the core quantization functions for better performance
      • Rewrote the quantization gradient-propagation functions for better performance
      • The compilation lock is now removed automatically when compilation starts
    • Core
      • Added a new attribute Visibility to TQC, which will be used to control a TQC's export visibility
      • Renamed some attributes and moved them into ppq.common.py
      • The function __is_revisable of TensorQuantizationConfig is now public and has been renamed is_revisable
    • Other
      • The Import TensorRT warning is now only emitted at export time
      • Added support for snpe 1.6.3
      • Added support for tengine
      • Fixed a series of bugs
    Source code(tar.gz)
    Source code(zip)
  • v0.6.4(Jun 1, 2022)

    Reworked the graph-editing interfaces, adding the functions remove_operation, remove_variable, insert_op_on_var, insert_op_between_var, create_link_with_var, create_link_with_op, truncate_on_var

    Reworked the onnxruntime export logic and the onnx oos export logic

    Updated the lsq algorithm for faster execution

    Updated the ssd algorithm for faster execution

    Updated core.ffi; it now reports an error if compilation fails.

    Added several api functions, including manop and quantize_native_model, which let you control the optimization logic manually.

    Added a second kind of pattern matching

    Added the logic for gru decomposition

    Added test classes for the graph api

    Added QNN export logic

    Added fusion logic for the swish and mish activation functions

    Added FPGAQuantizer

    Added support for the mod, softplus and gru operators

    Removed the misc folder; the code in it was no longer used.

    Fixed an issue with incorrect pad ordering

    Fixed an issue where newly created variables could end up with duplicate names

    Fixed an issue in path_matching where intermediate results were not copied, which could lead to incorrect results

    Fixed some inconspicuous bugs in the matex gemm split pass

    Fixed some errors in the delete_isolated function

    Fixed PPL_DSP_TI_INT8 being mistakenly named PPL_DSP_TI_IN8

    Source code(tar.gz)
    Source code(zip)
  • v0.6.3(Mar 30, 2022)

  • v0.6.2(Mar 18, 2022)

    • Scale and offset are now always torch.Tensor with dtype=fp32 for training your network.
    • PPQ will display a network snapshot when quantizing your network.
    • Added brecq & lsq algorithms
    • Cuda kernels have been refined; more cuda kernels are introduced into ppq.
    • Added an exporter for dumping onnx quantized models.
    • Test cases have been introduced since ppq 0.6.2
    Source code(tar.gz)
    Source code(zip)