Quantization In Neural Networks

Imran Bangash
3 min readNov 25, 2021

Deep Learning | Quantization | Number of parameters | Neural Network | Performance


Neural networks have seen exponential growth on mobile/embedded platforms recently because of the benefits they bring there: reduced cost, low latency, better security, and lower power consumption. While computational resources and memory are rarely a bottleneck on desktop and cloud machines, mobile/embedded platforms have tight limits on both compute capability and memory availability.

Running a neural network on hardware means millions of arithmetic operations: multiplications and additions. Performing these operations at standard 32-bit floating-point precision results in significantly higher memory traffic and compute cost on embedded platforms. To address this challenge, a number of methods exist for efficient inference of neural network models [8]. Some of these methods include

  • Pruning Neural Networks
  • Deep Compression
  • Data Quantization
  • Low-Rank Approximation
  • Trained Ternary Quantization

In this article, we will address the following questions:

  • What is quantization?
  • Why is quantization needed?
  • What is the memory requirement of a typical neural network?
  • What is the arithmetic complexity (number of parameters and operations) of a simple neural network?
  • What are different quantization methods?
  • Which frameworks support quantization?

Quantization in Neural Networks

In quantization, floating-point operations are replaced with 8-bit integer operations. This significantly reduces the memory needed to store the network’s parameters, because a typical neural network has millions of parameters and performs millions of arithmetic operations. Here are some examples of the resource utilization of different models.
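Before looking at those numbers, here is a minimal NumPy sketch of the affine (scale and zero-point) mapping commonly used to turn a float32 tensor into int8; the toy weight matrix and function names are made up for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization: map a float32 array onto the int8 grid [-128, 127]."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)        # real-valued step per int8 level
    zero_point = int(round(qmin - x.min() / scale))    # int8 value that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(8, 10).astype(np.float32)     # toy weight matrix
q, scale, zp = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())  # worst-case rounding error
```

Each int8 value takes 1 byte instead of the 4 bytes of a float32, which is where the 4× storage saving discussed below comes from.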

Number of parameters and number of operations

  • VGG-16 has 138 million parameters and performs about 15.3 billion mult-add operations to classify a single image, with 71.5% accuracy.
  • AlexNet has 60 million parameters and performs 720 million mult-add operations, with 57.2% accuracy.
  • YOLOv3 performs 39 billion operations to process one image.
  • The MobileNet model has only 13 million parameters (roughly 3 million for the body and 10 million for the final layer) and 0.58 million mult-adds.

Memory requirement

  • VGG-16 requires over 500 MB of storage.
  • AlexNet requires over 200 MB of storage.

These memory requirements are typically too high for embedded platforms, so it makes sense to quantize models to 8-bit integers. With suitable quantization schemes, the usual multiplications can even be realized using shift and addition operations.
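As a rough back-of-the-envelope check, here is a quick sketch using the parameter counts listed above (storage for activations and graph overhead is ignored):

```python
# Parameter storage only: parameters × bytes per parameter.
for name, params in [("VGG-16", 138_000_000), ("AlexNet", 60_000_000)]:
    fp32_mb = params * 4 / 1e6   # 32-bit floats: 4 bytes each
    int8_mb = params * 1 / 1e6   # 8-bit integers: 1 byte each
    print(f"{name}: {fp32_mb:.0f} MB in float32 -> {int8_mb:.0f} MB in int8")
```

This matches the "over 500 MB" and "over 200 MB" figures above and shows the 4× reduction from quantizing to 8 bits.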

Going below 8 bits typically degrades accuracy to an unacceptable level, although sub-8-bit quantization is an active area of research.

Benefits of Quantization (32-bit floating-point to 8-bit integer operations)

  • Memory transfer speed improves by roughly 4 times.
  • Storage requirements of the network graph shrink, because the memory used to store all weights and biases is reduced by 4 times.
  • Power consumption drops because of fewer memory accesses and higher compute efficiency. The general rule of thumb is: the larger the model, the more memory references it makes, and the more energy it consumes.
  • Compute performance gains

Two different approaches to performing quantization

  1. Post-training Quantization
    A model trained in 32-bit floating point is converted to an 8-bit quantized model after training (a sketch follows below).
  2. Quantization-Aware Training
    Quantization effects are simulated during training so the weights adapt to the reduced precision; quantization-aware training is generally considered to produce better model accuracy [ link]
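Here is a minimal sketch of post-training quantization with the TensorFlow Lite converter, assuming you already have a trained model exported as a SavedModel at the placeholder path saved_model_dir. With only Optimize.DEFAULT set, this performs dynamic-range (weight-only) quantization; full integer quantization would additionally require a representative dataset.

```python
import tensorflow as tf

# Convert a trained float32 SavedModel to a TFLite model with quantized weights.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```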

Number of Parameters of a Simple Neural Network

Let's say we have 10 inputs, two hidden layers with 8 nodes each, and 10 nodes in the output layer.

The number of parameters can be calculated as

  • Weights: 10×8 + 8×8 + 8×10 = 224
  • Biases: 16 + 10 = 26 (there are 16 neurons in the hidden layers and 10 in the output layer)
  • Total parameters: 224 + 26 = 250

The number of connections in the above neural network can be computed as (10×8) + (8×8) + (8×10) = 224, which equals the number of weights.
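A quick way to sanity-check this count is to build the same toy network in a framework and count its parameters; a minimal PyTorch sketch (layer sizes as above, the activation choice is arbitrary):

```python
import torch.nn as nn

# 10 inputs -> two hidden layers of 8 -> 10 outputs
model = nn.Sequential(
    nn.Linear(10, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 10),
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 250 = 224 weights + 26 biases
```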

Frameworks Supporting Quantization

  1. TensorFlow Lite converter [ link ]
  2. PyTorch Quantization [ link ]
  3. ONNX quantization [ link ]
  4. OpenVINO quantization [ link ]
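For illustration, here is a minimal sketch of post-training dynamic quantization in PyTorch, applied to the toy network from the previous section (assuming a reasonably recent PyTorch version); the Linear layers get int8 weights, while activations are quantized on the fly at inference time.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 10),
)

# Replace the Linear layers with dynamically quantized int8 versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 10)
print(quantized(x).shape)  # torch.Size([1, 10])
```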

Resources:

  1. Speeding up Deep Learning with Quantization, link
  2. Why is 8-bit quantization, link
  3. Neural Network Quantization, link
  4. Pruning and Quantization, link
  5. Parameters calculation, link link
  6. Quantization in AI, link
  7. Quantization and deployment of neural networks, link
  8. The 5 Algorithms for Efficient Deep Learning Inference on Small Devices, link


Imran Bangash

Imran is a computer vision and AI enthusiast with a PhD in computer vision. Imran loves to share his experience with self-improvement and technology.