3 Optimizing Quantization Modes in Kneron-Flow Toolchain: Balancing SNR and FPS
This manual explains how to configure quantization bitwidth modes in the Kneron-Flow Toolchain (versions 0.24.0 and later) to achieve optimal trade-offs between model accuracy (SNR) and inference speed (FPS) for edge AI deployment on Kneron NPUs (e.g., KDP730).
3.1 Key Bitwidth Configuration Parameters
3.2 Recommended Configuration Workflow
3.3 SNR vs. FPS Trade-off
3.4 Debugging mixbw Mode
3.1 Key Bitwidth Configuration Parameters
-
datapath_bitwidth_mode: Controls data quantization for activations and intermediate tensors.- Options:
"int8","int16","mix balance","mix light","mixbw" - Notes:
"int16"is not supported on KDP520."mixbw"(new in Toolchain 0.29.0) automatically selects bitwidth for Conv layers based on quantization sensitivity analysis.
- Options:
-
weight_bitwidth_mode: Controls weight quantization for layers like Conv/Gemm.- Options:
"int8","int16","int4","mix balance","mix light","mixbw" - Notes:
"int16"is not supported on KDP520;"int4"is not supported on KDP720."mixbw"(new in Toolchain 0.29.0) automatically selects bitwidth for Weight node based on quantization sensitivity analysis.
- Options:
3.2 Recommended Configuration Workflow
Follow these steps to efficiently optimize your model:
3.2.1 Start with int8 for Maximum FPS
Use this if SNR ≥ 20 dB is acceptable. All high-compute ops (Conv, BatchNorm, Gemm, etc.) use 8-bit quantization.
# int8 mode (fastest FPS)
bie_path = km.analysis(
input_mapping,
output_dir="/data1/kneron_flow",
datapath_bitwidth_mode="int8",
weight_bitwidth_mode="int8",
model_in_bitwidth_mode="int8",
model_out_bitwidth_mode="int8",
cpu_node_bitwidth_mode="int8"
)
3.2.2 Try mix light for Better SNR
If int8 fails, use mix light to quantize smaller-channel Conv layers and BatchNorm in 16-bit. This mode generally achieves a good balance between FPS and SNR.
In mix light mode and the following modes, it is recommended to set the model_input, model_out, and CPU bitwidth to 16-bit.
# mix light mode
bie_path = km.analysis(
input_mapping,
output_dir="/data1/kneron_flow",
datapath_bitwidth_mode="mix light",
weight_bitwidth_mode="mix light",
model_in_bitwidth_mode="int16",
model_out_bitwidth_mode="int16",
cpu_node_bitwidth_mode="int16"
)
3.2.3 Use mixbw for Sensitivity-Guided Quantization
If mix light precision is insufficient, use mixbw. This mode analyzes Conv node sensitivity and automatically prioritizes 16-bit quantization for sensitive Conv layers. Control compute overhead with flops_ratio (default=0.2). mixbw mode may need more time and disk space to evaluate quant sensitivity, but its fps is still faster than all int16. When using mixbw, the model_in_bitwidth_mode, model_out_bitwidth_mode, and cpu_node_bitwidth_mode are always int16 and are not changable.
# mixbw mode
bie_path = km.analysis(
input_mapping,
output_dir="/data1/kneron_flow",
datapath_bitwidth_mode="mixbw",
weight_bitwidth_mode="mixbw",
flops_ratio=0.2
)
When using "mixbw", only the following combinations are valid:
datapath_bitwidth_mode="mixbw",weight_bitwidth_mode="int16": Only data bitwidth is included in sensitivity ranking.datapath_bitwidth_mode="int16",weight_bitwidth_mode="mixbw": Only weight bitwidth is included in sensitivity ranking.datapath_bitwidth_mode="mixbw",weight_bitwidth_mode="mixbw": Both data and weight bitwidths are included.
3.2.4 Useint16 for Maximum SNR
If other settings fail, use "int16", where all ops are quantized to int16.(slowest FPS).
# int16 mode (highest SNR)
bie_path = km.analysis(
input_mapping,
output_dir="/data1/kneron_flow",
datapath_bitwidth_mode="int16",
weight_bitwidth_mode="int16",
model_in_bitwidth_mode="int16",
model_out_bitwidth_mode="int16",
cpu_node_bitwidth_mode="int16"
)
3.3 SNR vs. FPS Trade-off
- SNR : int16 > mixbw > mix light > int8
- FPS : int8 > mix light > mixbw > int16
Figure 1. SNR-FPS Chart
- both_0.2: Both data and weight are included in sensitivity analysis, with flops_ratio=0.2.
- data_0.2: Only data is analyzed; weight is set to int16.
- weight_0.2: Only weight is analyzed; data is set to int16.
3.4 Debugging mixbw Mode
Set export MIXBW_DEBUG=True to check predicted modelout SNR in {output_dir}/data and {output_dir}/weight. Lower SNR indicates higher sensitivity, requiring 16-bit quantization.
To analyze Conv layer sensitivity, enable debug mode:
export MIXBW_DEBUG=True
Figure 2. Sentivity Analysis
By following this workflow, developers can systematically optimize models for deployment on Kneron NPUs while balancing accuracy and performance.