3 Optimizing Quantization Modes in Kneron-Flow Toolchain: Balancing SNR and FPS

This manual explains how to configure quantization bitwidth modes in the Kneron-Flow Toolchain (versions 0.24.0 and later) to achieve optimal trade-offs between model accuracy (SNR) and inference speed (FPS) for edge AI deployment on Kneron NPUs (e.g., KDP730).

3.1 Key Bitwidth Configuration Parameters
3.2 Recommended Configuration Workflow
3.3 SNR vs. FPS Trade-off
3.4 Debugging mixbw Mode

3.1 Key Bitwidth Configuration Parameters

datapath_bitwidth_mode: Controls data quantization for activations and intermediate tensors.
- Options: "int8", "int16", "mix balance", "mix light", "mixbw"
- Notes:
  - "int16" is not supported on KDP520.
  - "mixbw" (new in Toolchain 0.29.0) automatically selects bitwidth for Conv layers based on quantization sensitivity analysis.
  - If the model contains a ReduceL2 operator, we suggest using "mix light" or "int16" for datapath_bitwidth_mode.
weight_bitwidth_mode: Controls weight quantization for layers like Conv/Gemm.
- Options: "int8", "int16", "int4", "mix balance", "mix light", "mixbw"
- Notes:
  - "int16" is not supported on KDP520; "int4" is not supported on KDP720.
  - "mixbw" (new in Toolchain 0.29.0) automatically selects the bitwidth for weight nodes based on quantization sensitivity analysis.
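
The platform restrictions in the notes above can be checked before starting a long analysis run. The following is a minimal sketch only: check_bitwidth_config and the UNSUPPORTED table are hypothetical names, not part of the toolchain, and they encode only the restrictions stated in this section.

```python
# Hypothetical pre-flight check; NOT part of the Kneron toolchain API.
# It encodes only the options and restrictions listed in section 3.1.
DATAPATH_MODES = {"int8", "int16", "mix balance", "mix light", "mixbw"}
WEIGHT_MODES = {"int8", "int16", "int4", "mix balance", "mix light", "mixbw"}

# (platform, parameter) -> modes the notes mark as unsupported
UNSUPPORTED = {
    ("KDP520", "datapath_bitwidth_mode"): {"int16"},
    ("KDP520", "weight_bitwidth_mode"): {"int16"},
    ("KDP720", "weight_bitwidth_mode"): {"int4"},
}


def check_bitwidth_config(platform, datapath_mode, weight_mode):
    """Return a list of problems; an empty list means no known conflict."""
    problems = []
    if datapath_mode not in DATAPATH_MODES:
        problems.append(f"unknown datapath_bitwidth_mode: {datapath_mode!r}")
    if weight_mode not in WEIGHT_MODES:
        problems.append(f"unknown weight_bitwidth_mode: {weight_mode!r}")
    chosen = (
        ("datapath_bitwidth_mode", datapath_mode),
        ("weight_bitwidth_mode", weight_mode),
    )
    for param, mode in chosen:
        if mode in UNSUPPORTED.get((platform, param), set()):
            problems.append(f"{param}={mode!r} is not supported on {platform}")
    return problems
```

An empty list only means that none of the restrictions listed here apply; the toolchain may enforce further constraints.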

3.2 Recommended Configuration Workflow

Follow these steps to efficiently optimize your model:

3.2.1 Start with `int8` for Maximum FPS

Use this if SNR ≥ 20 dB is acceptable. All high-compute ops (Conv, BatchNorm, Gemm, etc.) use 8-bit quantization.

# int8 mode (fastest FPS)  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="int8",  
    weight_bitwidth_mode="int8",  
    model_in_bitwidth_mode="int8",  
    model_out_bitwidth_mode="int8",  
    cpu_node_bitwidth_mode="int8"  
)
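
To judge whether int8 meets an SNR target, one option is to compare the floating-point model output with the quantized model output. The sketch below uses the common definition of SNR in dB (signal power over error power). It is an assumption that this matches the SNR the toolchain reports, and snr_db is a hypothetical helper, not a toolchain function.

```python
import math


def snr_db(reference, quantized):
    """SNR in dB between a float reference output and a quantized output.

    Both arguments are flat sequences of numbers of equal length.
    Uses 10 * log10(sum(ref^2) / sum((ref - quant)^2)); the toolchain's
    own SNR metric may be defined differently (assumption).
    """
    if len(reference) != len(quantized):
        raise ValueError("reference and quantized must have the same length")
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf  # outputs are identical
    if signal == 0:
        return -math.inf  # all-zero reference with nonzero error
    return 10.0 * math.log10(signal / noise)
```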

3.2.2 Try `mix light` for Better SNR

If int8 does not meet your SNR target, use mix light, which quantizes smaller-channel Conv layers and BatchNorm in 16-bit. This mode generally achieves a good balance between FPS and SNR. In mix light mode and the modes that follow, it is recommended to set model_in_bitwidth_mode, model_out_bitwidth_mode, and cpu_node_bitwidth_mode to "int16".

# mix light mode  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="mix light",  
    weight_bitwidth_mode="mix light",  
    model_in_bitwidth_mode="int16",  
    model_out_bitwidth_mode="int16",  
    cpu_node_bitwidth_mode="int16"  
)

3.2.3 Use `mixbw` for Sensitivity-Guided Quantization

If mix light precision is insufficient, use mixbw. This mode analyzes Conv node sensitivity and automatically prioritizes 16-bit quantization for sensitive Conv layers. Control compute overhead with flops_ratio (default=0.2). mixbw mode may need more time and disk space to evaluate quantization sensitivity, but its FPS is still higher than that of full int16. When using mixbw, model_in_bitwidth_mode, model_out_bitwidth_mode, and cpu_node_bitwidth_mode are always int16 and cannot be changed.

# mixbw mode
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="mixbw",  
    weight_bitwidth_mode="mixbw",  
    flops_ratio=0.2  
)

When using "mixbw", only the following combinations are valid:

datapath_bitwidth_mode="mixbw", weight_bitwidth_mode="int16": Only data bitwidth is included in sensitivity ranking.
datapath_bitwidth_mode="int16", weight_bitwidth_mode="mixbw": Only weight bitwidth is included in sensitivity ranking.
datapath_bitwidth_mode="mixbw", weight_bitwidth_mode="mixbw": Both data and weight bitwidths are included.
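
The three valid combinations can be expressed as a small lookup. This is an illustrative sketch; mixbw_scope is a hypothetical helper and not a toolchain function.

```python
# Hypothetical helper; the table mirrors the three valid combinations above.
VALID_MIXBW_COMBOS = {
    ("mixbw", "int16"): "data only",
    ("int16", "mixbw"): "weight only",
    ("mixbw", "mixbw"): "data and weight",
}


def mixbw_scope(datapath_mode, weight_mode):
    """Return what the sensitivity ranking covers.

    Returns None when neither parameter uses "mixbw", and raises
    ValueError when "mixbw" is paired with an unlisted mode.
    """
    if "mixbw" not in (datapath_mode, weight_mode):
        return None
    try:
        return VALID_MIXBW_COMBOS[(datapath_mode, weight_mode)]
    except KeyError:
        raise ValueError(
            f"invalid mixbw combination: datapath={datapath_mode!r}, "
            f"weight={weight_mode!r}"
        ) from None
```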

3.2.4 Use `int16` for Maximum SNR

If the other settings fail, use "int16", where all ops are quantized to int16 (slowest FPS).

# int16 mode (highest SNR)  
bie_path = km.analysis(  
    input_mapping,  
    output_dir="/data1/kneron_flow",  
    datapath_bitwidth_mode="int16",  
    weight_bitwidth_mode="int16",  
    model_in_bitwidth_mode="int16",  
    model_out_bitwidth_mode="int16",  
    cpu_node_bitwidth_mode="int16"  
)

3.3 SNR vs. FPS Trade-off

SNR : int16 > mixbw > mix light > int8
FPS : int8 > mix light > mixbw > int16
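
Because the FPS ordering is the reverse of the SNR ordering, the workflow in 3.2 amounts to walking the modes from fastest to slowest and stopping at the first one that meets the SNR target. A minimal sketch follows, assuming a caller-supplied evaluate_snr callback (hypothetical) that runs the analysis for a given mode and returns the measured SNR in dB. The 20 dB default is taken from the int8 guidance above and should be adjusted per application.

```python
# Modes ordered from highest FPS to highest SNR, as listed above.
MODE_LADDER = ["int8", "mix light", "mixbw", "int16"]


def pick_mode(evaluate_snr, target_snr_db=20.0):
    """Return (mode, snr) for the fastest mode that meets the target.

    evaluate_snr(mode) is a hypothetical callback supplied by the caller;
    it should run the analysis with that mode and return the SNR in dB.
    If no mode reaches the target, the last mode ("int16") is returned
    together with its measured SNR.
    """
    snr = None
    for mode in MODE_LADDER:
        snr = evaluate_snr(mode)
        if snr >= target_snr_db:
            return mode, snr
    return MODE_LADDER[-1], snr
```

Stopping at the first passing mode avoids running the slower mixbw sensitivity analysis when a cheaper mode already suffices.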

Figure 1. SNR-FPS Chart

both_0.2: Both data and weight are included in sensitivity analysis, with flops_ratio=0.2.
data_0.2: Only data is analyzed; weight is set to int16.
weight_0.2: Only weight is analyzed; data is set to int16.

3.4 Debugging mixbw Mode

To analyze Conv layer sensitivity, enable debug mode by setting MIXBW_DEBUG=True before running the analysis:

export MIXBW_DEBUG=True

The predicted model-output SNR can then be checked in {output_dir}/data and {output_dir}/weight. A lower SNR indicates higher sensitivity, meaning the layer requires 16-bit quantization.

Figure 2. Sensitivity Analysis

By following this workflow, developers can systematically optimize models for deployment on Kneron NPUs while balancing accuracy and performance.
