Quantization In Neural Networks

Imran Bangash
3 min readNov 25, 2021

Deep Learning | Quantization | Number of parameters | Neural Network | Performance


Neural networks have seen exponential growth on mobile/embedded platforms recently because of the benefits they bring there: reduced cost, low latency, better security, and lower power consumption. While computational resources and memory are rarely a bottleneck on desktop and cloud machines, mobile/embedded platforms have tight limits on both compute capability and memory availability.

Running a neural network on hardware means millions of arithmetic operations: multiplications and additions. Performing these operations at standard 32-bit floating-point precision results in significantly higher memory traffic and compute cost on embedded platforms. To address this challenge, a number of methods exist for efficient inference of neural network models [8]. Some of these methods include

  • Pruning Neural Networks
  • Deep Compression
  • Data Quantization
  • Low-Rank Approximation
  • Trained Ternary Quantization

In this article, we will address the following questions:

  • What is quantization?
  • Why is quantization needed?
  • What is the memory requirement of a typical neural network?
  • What is the arithmetic complexity (number of parameters and operations) of a simple neural network?
  • What are different quantization methods?
  • Which frameworks support quantization?

Quantization in Neural Networks

In quantization, floating-point operations are replaced with 8-bit integer operations. This significantly reduces the memory needed to store the network’s parameters, because a typical neural network has millions of parameters and performs millions of arithmetic operations. Here are some examples of the resource utilization of different models.
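Before looking at those numbers, here is a minimal NumPy sketch of the affine (scale and zero-point) mapping commonly used to turn a float32 tensor into int8; the toy weight matrix and function names are made up for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization: map a float32 array onto the int8 grid [-128, 127]."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)        # real-valued step per int8 level
    zero_point = int(round(qmin - x.min() / scale))    # int8 value that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(8, 10).astype(np.float32)     # toy weight matrix
q, scale, zp = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())  # worst-case rounding error
```

Each int8 value takes 1 byte instead of the 4 bytes of a float32, which is where the 4× storage saving discussed below comes from.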

Number of parameters and number of operations

  • VGG-16 has 138 million parameters and performs about 15.3 billion mult-add operations to classify a single image, with 71.5% accuracy.
  • AlexNet has 60 million parameters and performs 720 million mult-add operations, with 57.2% accuracy.
  • YOLOv3 performs 39 billion operations to process one image.
  • The MobileNet model has only 13 million parameters (roughly 3 million for the body and 10 million for the final layer) and 0.58 million mult-adds.

Memory requirement

  • VGG-16 requires over 500 MB of storage.
  • AlexNet requires over 200 MB of storage.

These memory requirements are typically too high for embedded platforms, so it makes sense to quantize models to 8-bit integers. With suitable quantization schemes, the usual multiplications can even be realized using shift and addition operations.
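As a rough back-of-the-envelope check, here is a quick sketch using the parameter counts listed above (storage for activations and graph overhead is ignored):

```python
# Parameter storage only: parameters × bytes per parameter.
for name, params in [("VGG-16", 138_000_000), ("AlexNet", 60_000_000)]:
    fp32_mb = params * 4 / 1e6   # 32-bit floats: 4 bytes each
    int8_mb = params * 1 / 1e6   # 8-bit integers: 1 byte each
    print(f"{name}: {fp32_mb:.0f} MB in float32 -> {int8_mb:.0f} MB in int8")
```

This matches the "over 500 MB" and "over 200 MB" figures above and shows the 4× reduction from quantizing to 8 bits.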

Going below 8 bits typically degrades accuracy to an unacceptable level, although sub-8-bit quantization is an active area of research.

Benefits of Quantization (32-bit floating-point to 8-bit integer operations)

  • Memory transfer speed improves by roughly 4 times.
  • Storage requirements of the network graph shrink, because the memory used to store all weights and biases is reduced by 4 times.
  • Power consumption drops because of fewer memory accesses and higher compute efficiency. The general rule of thumb is: the larger the model, the more memory references it makes, and the more energy it consumes.
  • Compute performance gains

Two different approaches to performing quantization

  1. Post-training Quantization
    A model trained in 32-bit floating point is converted to an 8-bit quantized model after training (a sketch follows below).
  2. Quantization-Aware Training
    Quantization effects are simulated during training so the weights adapt to the reduced precision; quantization-aware training is generally considered to produce better model accuracy [ link]
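Here is a minimal sketch of post-training quantization with the TensorFlow Lite converter, assuming you already have a trained model exported as a SavedModel at the placeholder path saved_model_dir. With only Optimize.DEFAULT set, this performs dynamic-range (weight-only) quantization; full integer quantization would additionally require a representative dataset.

```python
import tensorflow as tf

# Convert a trained float32 SavedModel to a TFLite model with quantized weights.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```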

Number of Parameters of a Simple Neural Network

Let's say we have 10 inputs, two hidden layers with 8 nodes each, and 10 nodes in the output layer.

The number of parameters can be calculated as

  • Weights: 10×8 + 8×8 + 8×10 = 224
  • Biases: 16 + 10 = 26 (there are 16 neurons in the hidden layers and 10 in the output layer)
  • Total parameters: 224 + 26 = 250

The number of connections in the above neural network can be computed as (10×8) + (8×8) + (8×10) = 224, which equals the number of weights.
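A quick way to sanity-check this count is to build the same toy network in a framework and count its parameters; a minimal PyTorch sketch (layer sizes as above, the activation choice is arbitrary):

```python
import torch.nn as nn

# 10 inputs -> two hidden layers of 8 -> 10 outputs
model = nn.Sequential(
    nn.Linear(10, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 10),
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 250 = 224 weights + 26 biases
```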

Frameworks Supporting Quantization

  1. TensorFlow Lite converter [ link ]
  2. PyTorch Quantization [ link ]
  3. ONNX quantization [ link ]
  4. OpenVINO quantization [ link ]
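For illustration, here is a minimal sketch of post-training dynamic quantization in PyTorch, applied to the toy network from the previous section (assuming a reasonably recent PyTorch version); the Linear layers get int8 weights, while activations are quantized on the fly at inference time.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 10),
)

# Replace the Linear layers with dynamically quantized int8 versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 10)
print(quantized(x).shape)  # torch.Size([1, 10])
```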

Resources:

  1. Speeding up Deep Learning with Quantization, link
  2. Why is 8-bit quantization, link
  3. Neural Network Quantization, link
  4. Pruning and Quantization, link
  5. Parameters calculation, link link
  6. Quantization in AI, link
  7. Quantization and deployment of neural networks, link
  8. The 5 Algorithms for Efficient Deep Learning Inference on Small Devices, link


Imran Bangash

Imran is a computer vision and AI enthusiast with a PhD in computer vision. Imran loves to share his experience with self-improvement and technology.