2 Post-Training Quantization (PTQ) Flow and Steps

The Kneron toolchain delivers the whole flow to prepare a floating-point model, quantize it into a fixed-point model, and compile it into executable binaries. Please check the Workflow Overview for Kneron toolchain details. Post-training quantization is one of the key procedures in the toolchain workflow. In this section, we introduce Knerex, our Fixed-Point Model Quantization and Generation Tool. Quantization is the step where the floating-point weights are converted into fixed-point values to reduce model size and computation complexity. Currently, the quantization method of Knerex is based on Post-Training Quantization (PTQ). In the near future, Quantization-Aware Training (QAT) will be added to Knerex as an alternative to PTQ. The rest of this section explains the principles and steps of Knerex, covering PTQ principles, model preparation, model verification, model quantization and compilation, performance analysis and tuning, precision analysis and tuning, etc.

2.1 Introduction to PTQ Flow
2.2 Hardware Supported Operators List
2.3 Floating-Point Model Preparation
2.4 Kneron End to End Simulator
2.5 Model Quantization and Compilation
2.6 Precision Analysis for Model Quantization
2.7 FAQ

2.1 Introduction to PTQ Flow

Model conversion refers to the process of converting the original floating-point model to a Regularized Piano-ONNX model.

The original floating-point model (also simply referred to as the floating-point model in parts of this text) is a model that you have trained using DL frameworks such as TensorFlow/PyTorch, with a computation precision of float32. The Regularized Piano-ONNX model is a model format suitable for running on Kneron AI accelerator chips.

The complete model development process with the Kneron toolchain involves five important stages: floating-point model preparation, model checking, performance evaluation with the Kneron End to End Simulator, model conversion, and precision evaluation.

The floating-point model preparation stage prepares the floating-point model for the model conversion tool. These models are usually obtained from public DL training frameworks. It is important to note that the models need to be exported in a format supported by the Kneron toolchain. For specific requirements and recommendations, please refer to the "Floating-Point Model Preparation" section.

The model checking stage ensures that the algorithm model meets the chip's requirements. Kneron provides designated tools to complete this check. For models that do not meet the requirements, the verification tool will clearly report the specific operators that violate them, making it easier for you to adjust the model according to the operator constraints. For specific usage, please refer to the "Floating-Point Model Check" section.

The performance evaluation stage provides a series of tools to evaluate model performance. Before application deployment, you can use these tools to verify if the model's performance meets the application requirements. For cases where the performance falls short of expectations, you can also refer to the performance optimization suggestions provided by Kneron for tuning. For specific evaluation details, please refer to the Model Performance Evaluation section.

The model conversion stage converts the floating-point model to a fixed-point model. In order for the model to run efficiently on Kneron chips, the Kneron conversion tool completes key steps such as model optimization, quantization, and compilation. Kneron's quantization method has been verified through long-term technology and production testing and can guarantee precision loss of less than 0.01 for most typical models. Please refer to Model Quantization for details.

The accuracy evaluation stage provides the E2E Simulator for evaluating the accuracy of the model. In most cases, the Kneron-converted model can maintain a level of accuracy similar to the original floating-point model. Before deploying the application, you can use the Kneron tools to verify whether the model's accuracy meets expectations. For cases where the accuracy is lower than expected, you can also refer to the accuracy tuning recommendations provided by Kneron. Please refer to the Model Accuracy Analysis and Tuning section for specific evaluations.

2.2 Hardware Supported Operators List

2.3 Floating-Point Model Preparation

2.3.1 How To Use the ONNX Converter

The ONNX converter converts the original floating-point model into a Regularized Piano-ONNX model. ONNX_Convertor is an open-source project on GitHub. If there are any bugs in the ONNX_Convertor project inside the docker, don't hesitate to run git pull under the project folder to get the latest update. If the problem persists, you can raise an issue there. We also welcome contributions to the project.

The general process for model conversion is as follows:

1. Convert the model from other platforms to ONNX using the specific converters. See sections 1 - 5.
2. Optimize the ONNX for the Kneron toolchain using onnx2onnx.py. See section 6.
3. Check the model and do further customization using editor.py (optional). See section 7.

TIPS:

ONNX exported by PyTorch cannot skip step 1 and go directly to step 2. Please check section 2 for details.

If you're still confused reading the manual, please try our examples from https://github.com/kneron/ConvertorExamples
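
For PyTorch models, the starting point of step 1 is usually an ONNX file exported from your trained network with torch.onnx.export. Below is a minimal sketch of that export; the network (a torchvision MobileNetV2 stand-in), input shape, file name, and opset version are placeholders you should replace with your own:

import torch
import torchvision

# Placeholder network; replace with your own trained torch.nn.Module.
model = torchvision.models.mobilenet_v2()
model.eval()

# Dummy input matching the model's expected input shape (NCHW).
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX so the Kneron PyTorch converter (step 1) can take it as input.
torch.onnx.export(
    model,
    dummy_input,
    "mobilenetv2.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)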

2.3.2 IP Evaluation (Model Evaluation)

Before we start quantizing and simulating the model, we need to test whether the model can be handled by the toolchain and estimate its performance. The IP evaluator is such a tool: it estimates the performance of your model and checks whether there is any operator or structure not supported by our toolchain.
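
In recent toolchain releases the IP evaluator can be driven from Python through ktc; the exact API may differ between versions, so treat the following as a sketch under that assumption rather than a reference (the model ID, version string, platform, and file path are placeholders):

import onnx
import ktc

# Load the optimized ONNX produced by the converter step (placeholder path).
onnx_model = onnx.load("/data1/mynet.opt.onnx")

# The numeric model ID and version string are toolchain bookkeeping placeholders.
km = ktc.ModelConfig(32769, "0001", "720", onnx_model=onnx_model)

# Run the IP evaluator: it estimates performance and reports unsupported operators/structures.
eval_result = km.evaluate()
print(eval_result)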

2.4 Kneron End to End Simulator

This project allows users to perform image inference using Kneron's built-in simulator.

2.5 Model Quantization and Compilation

In this stage, you will complete the conversion from a floating-point model to a fixed-point model. After this stage, you will have a model that can run efficiently on the Kneron chip. The ONNX converter tools are used for model conversion, and during the conversion process, important processes such as model optimization and calibration quantization are completed. Calibration requires preparation of calibration data in accordance with the model's pre-processing requirements. You can refer to the Kneron End to End Simulator section to prepare the calibration data in advance. To help you fully understand the model conversion, this section will introduce the use of conversion tools, the interpretation of internal conversion processes, the interpretation of conversion results, and the interpretation of conversion outputs in turn.

2.5.1 How To Set Up Param Values

No. Parameter name Parameter Configuration Description Must/Optional
00 p_onnx Parameter Usage: Path to ONNX file.
Range: N/A.
Default value: N/A.
Description: It should have passed through the ONNX converter.
Must
01 np_txt Parameter Usage: A dictionary of list of images in numpy format.
Range: N/A.
Default value: N/A.
Description: The keys should be the names of input nodes of the model.
e.g., {"input1": [img1, img2]}, here img1/img2 are two images -> preprocess -> numpy 3D array (HWC)
Must
02 platform Parameter Usage: Choose the platform architecture to generate fix models.
Range: "520" / "720" / "530" / "630".
Default value: N/A.
Description: Correspond to our chip on your board. For example, “520” for the KL520 chip.
Must
03 optimize Parameter Usage: Level of optimization. The larger the number, the better the model performance, but the longer it takes.
Range: "0" / "1" / "2" / "3" / "4"
Default value: "0"
Description:
* "0": generate the quantized fix model only.
* "1": bias adjustment in parallel, without feature map cut improvement.
* "2": bias adjustment in parallel, with feature map cut improvement.
Optional
04 datapath_range_method Parameter Usage: Method used to analyze the data range of the input image list.
Range: "percentage" / "mmse"
Default value: "percentage"
Description:
* "mmse": use the SNR-based range method.
* "percentage": use the percentage specified by data_analysis_pct.
Optional
05 data_analysis_pct Parameter Usage: Applicable when datapath_range_method is set to "percentage". Truncates the data range.
Range: 0.0 ~ 1.0
Default value: 0.999; set to 1.0 for detection models
Description: It is used to exclude extreme values. For example, with the default setting of 0.999, the top 0.1% of values by absolute magnitude will be removed before the min/max calculation.
Optional
06 data_analysis_threads Parameter Usage: Multi-thread setting.
Range: 1 ~ number of CPU cores (subject to available memory)
Default value: 4
Description: Number of threads to use for data analysis for quantization.
Optional
07 datapath_bitwidth_mode Parameter Usage: Specify the data flow of the generated fix model in "int8" or "int16".
Range: "int8" / "int16"
Default value: "int8"
Description: The input/output data flows of most operator nodes are set to int8 by default. Through this parameter, the data flows can be adjusted to int16 under the operator node constraints.
Optional
08 weight_bitwidth_mode Parameter Usage: Specify the weight flow of the generated fix model in "int4", "int8" or "int16".
Range: "int4" / "int8" / "int16"
Default value: "int8"
Description: The weight flows of most operator nodes are set to int8 by default. Through this parameter, the weight flows can be adjusted to int4 or int16 under the operator node constraints.
Optional
09 model_in_bitwidth_mode Parameter Usage: Specify the generated fix model input in "int8" or "int16".
Range: "int8" / "int16"
Default value: "int8"
Description: The model input is set to int8 by default. Through this parameter, the model input can be adjusted to int16 under the operator node constraints.
Optional
10 model_out_bitwidth_mode Parameter Usage: Specify the generated fix model output in "int8" or "int16".
Range: "int8" / "int16"
Default value: "int8"
Description: The model output is set to int8 by default. Through this parameter, the model output can be adjusted to int16 under the operator node constraints.
Optional
11 percentile Parameter Usage: The range to search.
Range: 0.0 ~ 1.0
Default value: 0.001
Description: It is used under "mmse" mode. The larger the value, the larger the search range; a larger range gives better quantization performance but a longer analysis time.
Optional
12 outlier_factor Parameter Usage: Used under "mmse" mode. The factor applied to outliers.
Range: 1.0 ~ 2.0 or higher.
Default value: 1.0
Description: For example, if your model is sensitive to data clamping, set outlier_factor to 2 or higher. A higher outlier_factor reduces outlier removal by enlarging the data range.
Optional
13 quantize_mode Parameter Usage: Need extra tuning or not.
Range: "default" / "post_sigmoid"
Default value: "default"
Description: If all of the model's output nodes were sigmoids and they have been removed, choose "post_sigmoid" for better performance.
Optional
14 quan_config Parameter Usage: User can pass in a dictionary to set constraints for quantization.
Range: N/A
Default value: None
Description: Set scale, radix, min/max value for specific operator node.
Optional
15 fm_cut Parameter Usage: Methods to search for best feature map cut.
Range: "default" / "deep_search" / "performance" (not available yet)
Default value: "default"
Description: To improve the efficiency of NPU.
Optional
16 p_output Parameter Usage: Location to save the generated fix models.
Range: N/A
Default value: "/data1/kneron_flow"
Description:
Optional
17 weight_bandwidth Parameter Usage: Set the weight bandwidth.
Range:
Default value: None
Description:
Optional
18 dma_bandwidth Parameter Usage: Set the dma bandwidth.
Range:
Default value: None
Description:
Optional
19 mode Parameter Usage: Running mode for the analysis.
Range: 0 / 1 / 2 / 3
Default value: 2
Description:
* 0: run ip_evaluator only.
* 1: run knerex (for quantization) + compiler only.
* 2: run knerex + dynasty + compiler + csim + bit-true-match check. dynasty will run inference on only 1 image and only check the quantization accuracy of the output layers.
* 3: run knerex + dynasty + compiler + csim + bit-true-match check. dynasty will run inference on all images and dump the results of all layers. It provides the most detailed analysis but takes much longer.
Optional
20 export_dynasty_dump Parameter Usage: It will provide a dynasty dump for further analysis.
Range: True / False
Default value: False
Description: The dump will be put in a zip file and saved to p_output.
Optional
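
As an illustration of the required parameters above, the sketch below builds the np_txt mapping from a folder of calibration images. The input node name "input1", image size, folder path, and the preprocessing (resize plus a simple normalization) are placeholders that must match your own model's preprocessing:

import glob
import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224)):
    # Placeholder preprocessing; replace with your model's own pipeline.
    img = Image.open(path).convert("RGB").resize(size)
    arr = np.asarray(img, dtype=np.float32)
    return arr / 127.5 - 1.0  # HWC float array, as required for np_txt

# At least 100 representative images are recommended for a good range analysis.
image_paths = sorted(glob.glob("/data1/calibration_images/*.jpg"))
np_txt = {"input1": [preprocess(p) for p in image_paths]}

The resulting dictionary is what you pass as the np_txt parameter, together with p_onnx and platform, when running the quantization.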

2.5.2 Interpretation of Conversion Internal Process

The model optimization stage implements some operator optimization strategies that are suitable for the Kneron platform. The output of this stage is a decomposed.onnx file, whose computation precision is still float32; the optimization does not affect the computation results of the model. The input data requirements of the model are still the same as for the origin.onnx from the previous stage.

2.5.3 Data Path Distribution Analyzer

The data path analyzer evaluates the dynamic range of each data node based on the given model and a set of representative model inputs. It needs at least 100 inputs for a good dynamic range analysis; the more data, the better. It uses dynasty floating-point inference to run the model and collect the data path distribution. Specifically, it records the per-layer min/max and per-channel min/max of the data at each node, with a certain outlier removal. The "data_analysis_pct" parameter in the Knerex input configuration is used to exclude extreme values. For example, with the default setting of 0.999, the top 0.1% of values by absolute magnitude are removed, and the remaining data are used in the min/max calculation.
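
Conceptually, the percentage-based range analysis behaves like the following numpy sketch. This is a simplified illustration of the idea, not the actual Knerex implementation, applied to the activations collected for a single node:

import numpy as np

def percentage_range(values, pct=0.999):
    # values: all activations collected for one node over the calibration inputs.
    flat = np.abs(np.asarray(values).reshape(-1))
    # Values whose magnitude exceeds the pct quantile are treated as outliers.
    limit = np.quantile(flat, pct)
    kept = np.asarray(values).reshape(-1)
    kept = kept[np.abs(kept) <= limit]
    return kept.min(), kept.max()

# Example: a node whose activations contain a few extreme spikes.
acts = np.concatenate([np.random.randn(100000), np.array([50.0, -80.0])])
print(percentage_range(acts, pct=0.999))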

In PTQ, the data path analyzer results are used to calculate the per-channel/per-layer ranges. These ranges are important figures for data quantization (calculating the scale and data radix). One additional min/max record is kept with a different outlier setting: the original record is used for int8 and the additional one for int16.

We will also provide "Fine Tune Range using SNR analysis", which decides the per-channel data range by finding k-max/min clusters in the data. This method should provide a more robust data range and will be released soon.

2.5.4 INT16 (Bitwidth Mode) Configuration Instructions

During fix model quantization, most of the operator nodes in the model are quantized to int8. By configuring "datapath_bitwidth_mode" and "weight_bitwidth_mode", the data flow and weight flow of operator nodes can be calculated and quantized as int16 under the operator node constraints. In addition, by configuring "model_in_bitwidth_mode" and "model_out_bitwidth_mode", the input and output of the fix model will be treated as int16. This is handy for finding quantization settings that give better performance for the model.

In the future, "quan_config" will allow one or more particular operator nodes to be specified for calculation and quantization in int16. For example, when an op node named "conv1" is set to int16, the output of "conv1" and the inputs of all its children will be treated as int16. For unsupported cases, the operator node ignores the int16 request and is calculated in int8.

2.5.5 Per Channel Quantization

Compared to Per Layer Quantization, Per Channel Quantization calculates the scale and radix for the data path and weights of every individual channel under the operator node constraints, and it gives better quantization performance. The quantization info provided by Per Channel Quantization should be protected by the Per Layer Quantization results, since particular channels could contain extreme values and ranges.
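
The numpy sketch below shows the difference between a per-layer scale and per-channel scales for a convolution weight tensor, using plain symmetric int8 quantization. It is a generic illustration of the idea only, not Knerex's actual scale/radix computation:

import numpy as np

def per_layer_scale(w):
    # One scale shared by the whole weight tensor.
    return np.abs(w).max() / 127.0

def per_channel_scales(w):
    # One scale per output channel (axis 0 for OIHW convolution weights).
    return np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0

# Example: one output channel has a much larger range than the others,
# so a single per-layer scale wastes precision on the remaining channels.
w = np.random.randn(8, 3, 3, 3).astype(np.float32)
w[0] *= 20.0
print("per-layer scale   :", per_layer_scale(w))
print("per-channel scales:", per_channel_scales(w))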

2.5.6 Bias Adjustment

By configuring "optimize", the bias adjustment algorithm can be activated. The purpose of the floating-point-based bias adjustment algorithm is to improve bias quantization performance. First, it scans the graph of nodes and builds a list of all nodes with a bias. So far only Conv, BN and Gemm are counted as biased nodes. The bias is adjusted and optimized based on floating-point inference, once per operator node. This option takes extra running time.

2.5.7 Conversion Output Interpretation

The previous section mentioned that a successful model conversion produces four outputs; this section introduces the purpose of each:

***.origin.onnx
***.decomposed.onnx
***.scaled.onnx
***.scaled.wqbi.onnx

Stage 0: .origin.onnx

The process of producing ".origin.onnx" can be referred to in the section "ONNX Converter". The computation accuracy of this model is exactly the same as the original float model used as input for the conversion, but some data preprocessing computations have been added to adapt to the Kneron platform. In general, you do not need to use this model. However, if there is an abnormality in the conversion result, providing this model to Kneron's technical support team can help you quickly resolve the issue.

Stage 1: .decomposed.onnx

The process of producing ".decomposed.onnx" model can be found before Model Quantization stage, during which some graph optimizations at the operator level are performed, such as operator fusion. Through visual comparison with the original float model, you can clearly see some changes in the operator structure level, but these do not affect the accuracy of the model's computations. In general, you do not need to use this model, but if the conversion result is abnormal, providing this model to Kneron's technical support can help you solve the problem quickly.

Stage 2 & 3: .scaled.onnx & .scaled.wqbi.onnx

".scaled.onnx" and ".scaled.wqbi.onnx" are the results of the Model Quantization stage. These models have completed the calibration and quantization processes, and the accuracy loss after quantization can be examined here. ".scaled.onnx" is a must-use model in the accuracy verification process. ".scaled.wqbi.onnx" is an optional model: bias adjustment is applied to it, using the calculated quantization info to improve bias quantization performance.

2.6 Precision Analysis for Model Quantization

2.6.1 E2E Simulator Check (Fixed Point)

Before going into the next section on compilation, the E2E Simulator helps ensure that the quantized model does not lose too much precision.

We use ktc.kneron_inference here, too, but this time with the generated bie file as the input.

The python code would be like:

fixed_results = ktc.kneron_inference(input_data, bie_file=bie_path, input_names=["data_out"], platform=720)

The usage is almost the same as with onnx. In the code above, fixed_results is a list of result data. bie_file is the path to the input bie. input_data is a list of input data after preprocessing. input_names is a list of model input names. The requirement is the same as in section 3.3. If your platform is not 520, you need the extra parameter platform, e.g. platform=720 or platform=530.
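
To quantify the precision drop, a common pattern is to run the same preprocessed input through the floating-point onnx and the quantized bie and compare the outputs. The sketch below assumes the onnx_file variant of ktc.kneron_inference is available in your toolchain version (the onnx path is a placeholder):

import numpy as np
import ktc

# Float baseline from the optimized onnx (placeholder path), fixed-point result from the bie.
float_results = ktc.kneron_inference(input_data, onnx_file="/data1/mynet.opt.onnx", input_names=["data_out"])
fixed_results = ktc.kneron_inference(input_data, bie_file=bie_path, input_names=["data_out"], platform=720)

# Compare the first output of both runs.
diff = np.abs(np.asarray(float_results[0]) - np.asarray(fixed_results[0]))
print("max abs diff:", diff.max(), "mean abs diff:", diff.mean())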

As mentioned above, we do not provide any postprocess. In reality, you may want to have your own postprocess function in Python, too.
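
For example, for a classification model whose single output is a logits array, a postprocess function might look like the sketch below (softmax plus argmax is only an illustration; use whatever postprocessing your model actually needs):

import numpy as np

def postprocess(results):
    # results: the list of output arrays returned by ktc.kneron_inference.
    logits = np.asarray(results[0], dtype=np.float32).reshape(-1)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), float(probs.max())

# Usage: class_id, confidence = postprocess(fixed_results)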

2.6.2 Dynasty Inference Dump on a Single Image

We provide a dynasty inference dump on a single image by turning on "export_dynasty_dump" in the Parameter Configuration. It dumps the dynasty inference result for every operator node of the fixed-point model. You can then manually analyze the differences between the float values and the dynasty inference results.
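
One simple way to compare a dumped float result against the corresponding fixed-point inference result for a node is a signal-to-noise-ratio check. The sketch below assumes you have already loaded the two results for one node as numpy arrays; how you load them depends on the dump format inside the exported zip:

import numpy as np

def snr_db(float_ref, fixed_out):
    # SNR in dB between the float reference and the fixed-point result; higher is better.
    ref = np.asarray(float_ref, dtype=np.float64).reshape(-1)
    out = np.asarray(fixed_out, dtype=np.float64).reshape(-1)
    noise = np.sum((ref - out) ** 2) + 1e-12
    signal = np.sum(ref ** 2) + 1e-12
    return 10.0 * np.log10(signal / noise)

# Example with synthetic data: a mildly perturbed copy of the reference.
ref = np.random.randn(1000)
print(snr_db(ref, ref + 0.01 * np.random.randn(1000)))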