To run neural networks efficiently at the edge on mobile, IoT, and other embedded devices, developers strive to optimize their machine learning (ML) models' size and complexity while taking advantage of hardware acceleration for inference. For these types of devices, long battery life and thermal control are essential, and performance is often measured on a per-watt basis. Optimized ML models can help achieve these goals by reducing computations, memory traffic, latency, and storage requirements while making more efficient use of the hardware.

In this blog post, we take a closer look at ML model optimization techniques and how solutions from Qualcomm Technologies and Qualcomm Innovation Center can help developers implement them.

Beyond the effort that goes into a model's design, developers can employ the following optimization techniques to reduce a model's size and complexity:

Quantization: reduces the number of bits used to represent a model's weights and activations (e.g., reducing weights from 32-bit floating point values to 8-bit integers).

Compression: removes redundant parameters or computations with little or no influence on predictions.
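To make the quantization idea concrete, here is a minimal sketch in plain Python of reducing floating-point weights to signed 8-bit integers. It assumes a symmetric, per-tensor scheme in which a single scale factor maps the largest absolute weight to 127; the function names and the sample weights are hypothetical and are not part of any particular toolkit.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumption: one scale for the whole tensor, chosen so that the largest
    absolute weight maps to 127.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [v * scale for v in quantized]


# Hypothetical weights, for illustration only.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error bounded by half of `scale`. Whether that error is acceptable depends on the model, which is why predictive performance has to be re-checked after quantization.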

The key to success with these optimization techniques is applying them without significantly degrading the model's predictive performance. In practice, this is often done by hand through trial and error: developers apply an optimization, measure the model's predictive and runtime performance, compare the results against earlier runs, and then repeat the process.
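The manual loop described above can be sketched as follows. This is an illustrative toy, not a real workflow: the "model" is a single hypothetical linear layer, the optimization being varied is the quantization bit width, predictive performance is approximated by the mean deviation from the unquantized outputs, and `tolerance` is an arbitrary threshold chosen for the example.

```python
def quantize_dequantize(weights, bits):
    """Simulate symmetric quantization at a given bit width (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def predict(weights, x):
    """Toy model: a single dot product."""
    return sum(w * xi for w, xi in zip(weights, x))


def evaluate(candidate, reference, inputs):
    """Mean absolute deviation from the reference model's outputs."""
    total = sum(abs(predict(candidate, x) - predict(reference, x)) for x in inputs)
    return total / len(inputs)


# Hypothetical reference weights and evaluation inputs.
reference = [0.8, -0.35, 0.12, 0.05]
inputs = [[1.0, 2.0, -1.0, 0.5], [0.2, -0.4, 3.0, 1.0], [-1.5, 0.7, 0.3, 2.0]]
tolerance = 0.05  # hypothetical acceptance threshold

# Try each setting, record the result, and compare against past runs.
history = []
for bits in (16, 8, 4, 2):
    candidate = quantize_dequantize(reference, bits)
    error = evaluate(candidate, reference, inputs)
    history.append({"bits": bits, "error": error, "size_bits": bits * len(reference)})

# Keep the smallest configuration that stays within tolerance.
acceptable = [run for run in history if run["error"] <= tolerance]
best = min(acceptable, key=lambda run: run["size_bits"])
```

Even in this toy form, every candidate requires a full evaluation pass and the bookkeeping grows with each added technique, which hints at why doing this by hand for a real network is tedious.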

Given its importance on mobile, ML model optimization is an area where we continue to do extensive research. Traditionally, we've shared our breakthroughs via conference papers and workshops, but for these optimization techniques, we decided to increase accessibility by releasing our AI Model Efficiency Toolkit (AIMET). AIMET provides a collection of advanced model compression and quantization techniques for trained neural network models.

AIMET supports many features, such as Adaptive Rounding (AdaRound) and Channel Pruning. For example, AIMET's data-free quantization (DFQ) algorithm quantizes 32-bit weights to 8 bits with negligible loss in accuracy. AIMET's AdaRound provides state-of-the-art post-training quantization for 8-bit and 4-bit models, with accuracy very close to that of the original FP32 model. As a compression example, AIMET's spatial SVD combined with channel pruning achieves a 50% reduction in multiply-accumulate (MAC) operations while retaining accuracy within 1% of the original uncompressed model.
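To give a feel for where a MAC reduction can come from, the sketch below works through the arithmetic for one convolution layer. It assumes the common formulation of spatial SVD in which a k x k convolution is replaced by a k x 1 convolution into `rank` intermediate channels followed by a 1 x k convolution, with stride 1 and unchanged output size. The layer dimensions are hypothetical, the functions are not AIMET's API, and the 50% figure quoted above refers to a whole model with channel pruning applied as well, not to this single layer.

```python
def conv_macs(height, width, c_in, c_out, k_h, k_w):
    """MACs for one convolution producing a height x width output map."""
    return height * width * c_in * c_out * k_h * k_w


def spatial_svd_macs(height, width, c_in, c_out, k, rank):
    """MACs after splitting a k x k conv into a k x 1 and a 1 x k conv."""
    first = conv_macs(height, width, c_in, rank, k, 1)
    second = conv_macs(height, width, rank, c_out, 1, k)
    return first + second


def rank_for_target(c_in, c_out, k, target_ratio):
    """Largest rank whose MAC ratio does not exceed target_ratio.

    Ratio = rank * (c_in + c_out) / (c_in * c_out * k), from dividing the
    two MAC counts above.
    """
    return int(target_ratio * c_in * c_out * k / (c_in + c_out))


# Hypothetical layer: 56 x 56 output, 64 -> 64 channels, 3 x 3 kernel.
original = conv_macs(56, 56, 64, 64, 3, 3)
rank = rank_for_target(64, 64, 3, target_ratio=0.5)
compressed = spatial_svd_macs(56, 56, 64, 64, 3, rank)
ratio = compressed / original
```

A lower rank saves more computation but discards more of the original kernel, so the rank for each layer has to be balanced against the accuracy the model retains.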

In May 2020, our Qualcomm Innovation Center (QuIC) open-sourced AIMET. This allows for collaboration with other ML researchers to continually improve model efficiency techniques that benefit the ML community.

Figure 1 shows how AIMET fits into a typical ML model optimization pipeline: