3 Optimizing Quantization Modes in Kneron-Flow Toolchain: Balancing SNR and FPS

This manual explains how to configure quantization bitwidth modes in the Kneron-Flow Toolchain (versions 0.24.0 and later) to achieve optimal trade-offs between model accuracy (SNR) and inference speed (FPS) for edge AI deployment on Kneron NPUs (e.g., KDP730).

3.1 Key Bitwidth Configuration Parameters
3.2 Recommended Configuration Workflow
3.3 SNR vs. FPS Trade-off
3.4 Debugging mixbw Mode

3.1 Key Bitwidth Configuration Parameters

  1. datapath_bitwidth_mode: Controls data quantization for activations and intermediate tensors.

    • Options: "int8", "int16", "mix balance", "mix light", "mixbw"
    • Notes:
      • "int16" is not supported on KDP520.
      • "mixbw" (new in Toolchain 0.29.0) automatically selects bitwidth for Conv layers based on quantization sensitivity analysis.
  2. weight_bitwidth_mode: Controls weight quantization for layers like Conv/Gemm.

    • Options: "int8", "int16", "int4", "mix balance", "mix light", "mixbw"
    • Notes:
      • "int16" is not supported on KDP520; "int4" is not supported on KDP720.
      • "mixbw" (new in Toolchain 0.29.0) automatically selects the bitwidth for weight nodes based on quantization sensitivity analysis.
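The option lists and platform notes above can be checked before a long analysis run is started. The following is a minimal sketch of such a pre-flight check. The helper `check_bitwidth_modes` and its tables are hypothetical and not part of the toolchain; they encode only the restrictions stated in the notes above, so a platform or mode not mentioned there is simply not restricted by this check.

```python
# Hypothetical pre-flight check; NOT part of the Kneron toolchain API.
# Option names are copied from the parameter lists above.
DATAPATH_MODES = {"int8", "int16", "mix balance", "mix light", "mixbw"}
WEIGHT_MODES = DATAPATH_MODES | {"int4"}

# Platform restrictions taken only from the notes above.
UNSUPPORTED = {
    "KDP520": {"datapath": {"int16"}, "weight": {"int16"}},
    "KDP720": {"datapath": set(), "weight": {"int4"}},
}


def check_bitwidth_modes(platform, datapath_mode, weight_mode):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if datapath_mode not in DATAPATH_MODES:
        problems.append(f"unknown datapath_bitwidth_mode: {datapath_mode!r}")
    if weight_mode not in WEIGHT_MODES:
        problems.append(f"unknown weight_bitwidth_mode: {weight_mode!r}")

    # Platforms without an entry are treated as unrestricted by this sketch.
    limits = UNSUPPORTED.get(platform, {})
    if datapath_mode in limits.get("datapath", set()):
        problems.append(f"datapath mode {datapath_mode!r} is not supported on {platform}")
    if weight_mode in limits.get("weight", set()):
        problems.append(f"weight mode {weight_mode!r} is not supported on {platform}")
    return problems
```

An empty list means that none of the documented restrictions was violated. It does not guarantee that the toolchain accepts the configuration.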

3.2 Recommended Configuration Workflow

Follow these steps to optimize your model efficiently:

3.2.1 Start with int8 for Maximum FPS

Use this if SNR ≥ 20 dB is acceptable. All high-compute ops (Conv, BatchNorm, Gemm, etc.) use 8-bit quantization.

# int8 mode (fastest FPS)  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="int8",  
    weight_bitwidth_mode="int8",  
    model_in_bitwidth_mode="int8",  
    model_out_bitwidth_mode="int8",  
    cpu_node_bitwidth_mode="int8"  
)  
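To compare a result against the 20 dB threshold mentioned above, an SNR value is needed. This manual does not define how SNR is computed, so the sketch below assumes the common definition: signal power of the floating-point reference output divided by the power of the quantization error, expressed in dB. The function name `snr_db` and its flat-list inputs are illustrative only.

```python
import math


def snr_db(reference, quantized):
    """SNR in dB between a float reference output and a quantized output.

    Assumed definition: 10 * log10(sum(ref^2) / sum((ref - quant)^2)).
    Both arguments are flat sequences of numbers with the same length.
    """
    if len(reference) != len(quantized):
        raise ValueError("reference and quantized must have the same length")
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf   # outputs are identical
    if signal == 0:
        return -math.inf  # all-zero reference with a nonzero error
    return 10.0 * math.log10(signal / noise)
```

Under this definition, an error whose power is 1% of the signal power corresponds to exactly 20 dB.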

3.2.2 Try mix light for Better SNR

If int8 does not reach the required SNR, use mix light, which quantizes smaller-channel Conv layers and BatchNorm in 16-bit. This mode generally achieves a good balance between FPS and SNR. For mix light and the modes that follow, it is recommended to set model_in_bitwidth_mode, model_out_bitwidth_mode, and cpu_node_bitwidth_mode to "int16".

# mix light mode  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="mix light",  
    weight_bitwidth_mode="mix light",  
    model_in_bitwidth_mode="int16",  
    model_out_bitwidth_mode="int16",  
    cpu_node_bitwidth_mode="int16"  
)

3.2.3 Use mixbw for Sensitivity-Guided Quantization

If mix light precision is insufficient, use mixbw. This mode analyzes Conv node sensitivity and automatically prioritizes 16-bit quantization for sensitive Conv layers. Control the compute overhead with flops_ratio (default 0.2). mixbw may need more time and disk space to evaluate quantization sensitivity, but its FPS is still higher than with all-int16 quantization. When using mixbw, model_in_bitwidth_mode, model_out_bitwidth_mode, and cpu_node_bitwidth_mode are always int16 and cannot be changed.

# mixbw mode
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="mixbw",  
    weight_bitwidth_mode="mixbw",  
    flops_ratio=0.2  
)

When using "mixbw", only the following combinations are valid:

  1. datapath_bitwidth_mode="mixbw", weight_bitwidth_mode="int16": Only data bitwidth is included in sensitivity ranking.
  2. datapath_bitwidth_mode="int16", weight_bitwidth_mode="mixbw": Only weight bitwidth is included in sensitivity ranking.
  3. datapath_bitwidth_mode="mixbw", weight_bitwidth_mode="mixbw": Both data and weight bitwidths are included.
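The three valid combinations above can be expressed as a small lookup. This is a sketch only: `mixbw_ranking_scope` is a hypothetical helper, not a toolchain function, and it rejects any pairing with "mixbw" that is not in the list above.

```python
# Hypothetical helper; NOT part of the Kneron toolchain API.
# Keys are (datapath_bitwidth_mode, weight_bitwidth_mode) pairs from the list above.
VALID_MIXBW_COMBINATIONS = {
    ("mixbw", "int16"): "data only",
    ("int16", "mixbw"): "weight only",
    ("mixbw", "mixbw"): "data and weight",
}


def mixbw_ranking_scope(datapath_mode, weight_mode):
    """Return which bitwidths enter the sensitivity ranking.

    Returns None when neither parameter uses "mixbw", and raises
    ValueError for a mixbw pairing that is not listed as valid.
    """
    if "mixbw" not in (datapath_mode, weight_mode):
        return None
    try:
        return VALID_MIXBW_COMBINATIONS[(datapath_mode, weight_mode)]
    except KeyError:
        raise ValueError(
            f"invalid mixbw combination: datapath={datapath_mode!r}, "
            f"weight={weight_mode!r}"
        ) from None
```

For example, pairing datapath "mixbw" with weight "int8" raises ValueError, because "mixbw" may only be combined with "int16" or with itself.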

3.2.4 Use int16 for Maximum SNR

If the other settings fail, use "int16", where all ops are quantized to int16 (slowest FPS).

# int16 mode (highest SNR)  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="int16",  
    weight_bitwidth_mode="int16",  
    model_in_bitwidth_mode="int16",  
    model_out_bitwidth_mode="int16",  
    cpu_node_bitwidth_mode="int16"  
)
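The four steps of this workflow can be sketched as a loop that tries each configuration in order, fastest first, and stops at the first one that meets the SNR target. The keyword arguments are copied from the examples in steps 3.2.1 to 3.2.4. Everything else is an assumption: `first_config_meeting_snr` is a hypothetical helper, `measure_snr_db` is a callable you supply (for example, one that runs km.analysis with the given arguments and evaluates the result), and the 20 dB default is only an example target.

```python
# Hypothetical workflow driver; NOT part of the Kneron toolchain API.
INT16_IO = dict(
    model_in_bitwidth_mode="int16",
    model_out_bitwidth_mode="int16",
    cpu_node_bitwidth_mode="int16",
)

# Ordered from fastest FPS to highest SNR, as in steps 3.2.1 to 3.2.4.
CANDIDATE_CONFIGS = [
    ("int8", dict(
        datapath_bitwidth_mode="int8",
        weight_bitwidth_mode="int8",
        model_in_bitwidth_mode="int8",
        model_out_bitwidth_mode="int8",
        cpu_node_bitwidth_mode="int8",
    )),
    ("mix light", dict(
        datapath_bitwidth_mode="mix light",
        weight_bitwidth_mode="mix light",
        **INT16_IO,
    )),
    # In mixbw mode the in/out/CPU bitwidths are fixed to int16, so they are omitted.
    ("mixbw", dict(
        datapath_bitwidth_mode="mixbw",
        weight_bitwidth_mode="mixbw",
        flops_ratio=0.2,
    )),
    ("int16", dict(
        datapath_bitwidth_mode="int16",
        weight_bitwidth_mode="int16",
        **INT16_IO,
    )),
]


def first_config_meeting_snr(measure_snr_db, target_db=20.0):
    """Return (name, kwargs, snr) for the first config reaching target_db, else None.

    measure_snr_db is a user-supplied callable that accepts the keyword
    arguments of one configuration and returns the measured SNR in dB.
    """
    for name, kwargs in CANDIDATE_CONFIGS:
        snr = measure_snr_db(**kwargs)
        if snr >= target_db:
            return name, kwargs, snr
    return None
```

Because the list is ordered by speed, the first match is also the fastest configuration that satisfies the target under this ordering.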

3.3 SNR vs. FPS Trade-off

Figure 1. SNR-FPS Chart

3.4 Debugging mixbw Mode

Set the environment variable MIXBW_DEBUG=True to inspect the predicted model-output SNR in {output_dir}/data and {output_dir}/weight. A lower SNR indicates higher sensitivity and therefore a stronger need for 16-bit quantization.

To analyze Conv layer sensitivity, enable debug mode:

export MIXBW_DEBUG=True  
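When the analysis is launched from a Python script or notebook rather than from a shell, the same variable can be set from Python. This sketch assumes that the toolchain reads MIXBW_DEBUG from the environment of the running process when the analysis starts; the source only shows the shell form, so verify this for your setup. `enable_mixbw_debug` is a hypothetical convenience function.

```python
import os


def enable_mixbw_debug(env=os.environ):
    """Set MIXBW_DEBUG=True in the given environment mapping and return it.

    Call this before starting the analysis. The assumption that the
    toolchain picks the variable up at that point is not confirmed here.
    """
    env["MIXBW_DEBUG"] = "True"
    return env
```

Passing a plain dict instead of os.environ makes the function easy to test without touching the real environment.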

Figure 2. Sensitivity Analysis

By following this workflow, developers can systematically optimize models for deployment on Kneron NPUs while balancing accuracy and performance.