Understanding Mixed Precision Training
Mixed precision training of neural networks can reduce training time and memory requirements without affecting model performance.
As deep learning methodologies have developed, it has been generally agreed that increasing the size of a neural network improves performance. However, this comes at the cost of increased memory and compute requirements for training the model.
This can be put into perspective by comparing the performance of Google’s pre-trained language model, BERT, at different architecture sizes. In the original paper, Google’s researchers reported an average GLUE score of 79.6 for BERT-Base and 82.1 for BERT-Large. This small increase of 2.5 came with an extra 230M parameters (110M vs. 340M)!
As a rough calculation, if each parameter is stored in single precision (more detail below), which uses 32 bits (4 bytes) of information, then 230M parameters is equivalent to 0.92 GB in memory. This may not seem too large in and of itself, but consider that during each training iteration these parameters go through a series of matrix arithmetic operations, producing further values such as gradients. All these extra values can quickly become unmanageable.
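As a quick sanity check on that figure, the arithmetic is straightforward (a small illustrative snippet; the helper name is my own, not from any library):

```python
def param_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory (in GB) needed to store the parameters, at 4 bytes each for FP32."""
    return num_params * bytes_per_param / 1e9

extra_params = 340e6 - 110e6            # BERT-Large vs. BERT-Base
print(param_memory_gb(extra_params))    # ~0.92 GB for the extra parameters alone
```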
In 2017, a group of researchers from NVIDIA released a paper detailing how to reduce the memory requirements of training neural networks, using a technique called Mixed Precision Training:
We introduce methodology for training deep neural networks using half-precision floating point numbers, without losing model accuracy or having to modify hyperparameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic.
In this article, we will explore mixed-precision training, understanding both how it fits into the standard algorithmic framework of deep learning and how it is able to reduce computational demand without affecting model performance.
Floating Point Precision
The technical standard used for representing floating-point numbers in binary formats is IEEE 754, established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE).
As set out in IEEE 754, there are various levels of floating-point precision, ranging from binary16 (half-precision) to binary256 (octuple-precision), where the number after “binary” equals the number of bits available for representing the floating-point value.
Unlike integer values, where the bits simply represent the binary form of the number, perhaps with a single bit reserved for the sign, floating-point values also need to consider an exponent. Therefore, the representation of these numbers in binary form is more nuanced and can significantly affect precision.
Historically, deep learning has used single-precision (binary32, or FP32) to represent parameters. In this format, one bit is reserved for the sign, 8 bits for the exponent (covering exponents from -126 to +127) and 23 bits for the significand. Half-precision, or FP16, on the other hand, reserves one bit for the sign, 5 bits for the exponent (-14 to +15) and 10 bits for the significand.
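As a quick way to inspect these layouts, NumPy's finfo reports the bit allocation for each type (illustrative only):

```python
import numpy as np

# Report the total width, exponent bits and significand bits of each format.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{info.dtype}: {info.bits} bits total, "
          f"{info.nexp} exponent bits, {info.nmant} significand bits")
# float32: 32 bits total, 8 exponent bits, 23 significand bits
# float16: 16 bits total, 5 exponent bits, 10 significand bits
```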

However, this comes at a cost. The smallest and largest positive normal values for each format are as follows:
- FP32: from approximately 1.18 × 10^(-38) (that is, 2^(-126)) up to approximately 3.40 × 10^38
- FP16: from approximately 6.10 × 10^(-5) (that is, 2^(-14)) up to 65,504
In addition, smaller denormalized numbers can be represented, where all the exponent bits are set to zero. For FP16, the absolute limit is 2^(-24). However, as denormalized numbers get smaller, their precision decreases.
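The practical effect of these limits is easy to demonstrate, again using NumPy:

```python
import numpy as np

print(np.finfo(np.float16).max)    # 65504.0, the largest FP16 value
print(np.finfo(np.float16).tiny)   # ~6.10e-05, the smallest positive normal FP16 value (2^-14)

print(np.float16(2.0 ** -24))      # ~5.96e-08, the smallest denormalized FP16 value
print(np.float16(2.0 ** -26))      # 0.0: values below the denormal limit underflow to zero
```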
We will not go into any further depth on the quantitative limitations of the different floating-point precisions in this article, but the IEEE provides comprehensive documentation for further investigation.
Mixed Precision Training
During standard training of neural networks, FP32 is used to represent model parameters, at the cost of increased memory requirements. In mixed-precision training, FP16 is used instead to store the weights, activations and gradients during training iterations.
However, as we saw above, this creates a problem: the range of values that can be stored in FP16 is smaller than in FP32, and precision decreases as numbers become very small. The result would be a decrease in the accuracy of the model, in line with the reduced precision of the floating-point values being calculated.
To combat this, a master copy of the weights is stored in FP32. This master copy is converted into FP16 at the start of each training iteration (one forward pass, back-propagation and weight update), and at the end of the iteration the resulting weight gradients are used to update the master weights during the optimizer step.

Here, we can see the benefit of keeping the FP32 copy of the weights. As the learning rate is often small, multiplying it by the weight gradients frequently produces tiny update values. In FP16, any number with a magnitude smaller than 2^(-24) is equated to zero, as it cannot be represented (this is the denormalized limit for FP16). Therefore, by completing the updates in FP32, these small update values can be preserved.
The use of both FP16 and FP32 is the reason this technique is called mixed-precision training.
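To make this concrete, below is a heavily simplified PyTorch sketch of the loop described above. The structure and names (model, fp16_model, train_step) are my own, chosen for illustration rather than taken from the paper, and real implementations handle these steps far more efficiently.

```python
import copy
import torch

# FP32 "master" weights live in `model`; `fp16_model` is the working copy used for compute.
model = torch.nn.Linear(1024, 1024).cuda()
fp16_model = copy.deepcopy(model).half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(x: torch.Tensor) -> torch.Tensor:
    # 1. Refresh the FP16 working copy from the FP32 master weights.
    with torch.no_grad():
        for p32, p16 in zip(model.parameters(), fp16_model.parameters()):
            p16.copy_(p32)  # implicit FP32 -> FP16 cast

    # 2. Forward and backward passes run in FP16 (dummy loss, purely for illustration).
    loss = fp16_model(x.half()).float().pow(2).mean()
    loss.backward()

    # 3. Cast the FP16 gradients up to FP32 and update the master weights.
    with torch.no_grad():
        for p32, p16 in zip(model.parameters(), fp16_model.parameters()):
            p32.grad = p16.grad.float()
            p16.grad = None
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss
```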
Loss Scaling
Although mixed-precision training solved, for the most part, the issue of preserving accuracy, experiments showed that there were cases where small gradient values occurred even before being multiplied by the learning rate.
The NVIDIA team showed that, although values below 2^(-27) were mostly irrelevant to training, there were values in the range [2^(-27), 2^(-24)) that were important to preserve but fell outside the representable range of FP16, so they were equated to zero during the training iteration. This problem, where gradients are equated to zero due to precision limits, is known as underflow.
Therefore, they suggested loss scaling, a process by which the loss value is multiplied by a scale factor after the forward pass is completed and before back-propagation. The chain rule dictates that all the gradients are subsequently scaled by the same factor, which moves them into the representable range of FP16.
Once the gradients have been calculated, they can then be divided by the same scale factor, before being used to update the master weights in FP32, as described in the previous section.
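In code, static loss scaling adds only a couple of lines to the sketch above (the scale factor of 1024 is just an example value, not a recommendation):

```python
import torch

SCALE = 1024.0  # example static loss scale factor

def scaled_backward(loss: torch.Tensor, fp16_params, fp32_params) -> None:
    # Multiply the loss by the scale factor so small gradients stay representable in FP16.
    (loss * SCALE).backward()

    # Divide the gradients back down (in FP32) before the optimizer updates the master weights.
    for p32, p16 in zip(fp32_params, fp16_params):
        p32.grad = p16.grad.float() / SCALE
        p16.grad = None
```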

In the NVIDIA “Deep Learning Performance” documentation, the choice of scaling factor is discussed. In theory, there is no downside to choosing a large scaling factor, unless it is large enough to lead to overflow.
Overflow occurs when the gradients, multiplied by the scaling factor, exceed the maximum value representable in FP16. When this occurs, the gradients become infinite or NaN. It is relatively common to see the following message appear in the early epochs of neural network training:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to…
In this case, the step is skipped, as the weight update cannot be calculated using an infinite gradient, and the loss scale is reduced for future iterations.
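Put together, the dynamic adjustment described above looks roughly like the following sketch (illustrative only; the starting scale of 2^16 mirrors common defaults rather than any specific library):

```python
import torch

loss_scale = 2.0 ** 16  # start high and reduce whenever an overflow is detected

def step_if_finite(optimizer, fp32_params) -> bool:
    """Apply the weight update only if no unscaled gradient overflowed to inf/NaN."""
    global loss_scale
    overflow = any(
        p.grad is not None and not torch.isfinite(p.grad).all()
        for p in fp32_params
    )
    if overflow:
        # "Gradient overflow. Skipping step, loss scaler reducing loss scale to..."
        loss_scale /= 2.0
        optimizer.zero_grad(set_to_none=True)
        return False
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```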
Automatic Mixed Precision
In 2018, NVIDIA released an extension for PyTorch called Apex, which contained AMP (Automatic Mixed Precision) capability. This provided a streamlined solution for using mixed-precision training in PyTorch.
In only a few lines of code, training could be moved from FP32 to mixed precision on the GPU. This had two key benefits:
- Reduced training time — training time was shown to be reduced by anywhere between 1.5x and 5.5x, with no significant reduction in model performance.
- Reduced memory requirements — this freed up memory to increase other model elements, such as architecture size, batch size and input data size.
As of PyTorch 1.6, NVIDIA and Facebook (the creators of PyTorch) moved this functionality into the core PyTorch code, as torch.cuda.amp. This fixed several pain points surrounding the Apex package, such as version compatibility and difficulties in building the extension.
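For illustration, the typical torch.cuda.amp usage pattern looks roughly like the following minimal sketch, which assumes a model, optimizer, loss_fn and dataloader already exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # handles dynamic loss scaling automatically

for inputs, targets in dataloader:
    optimizer.zero_grad(set_to_none=True)

    # Run the forward pass (and loss computation) under autocast, which chooses
    # FP16 or FP32 per operation.
    with torch.cuda.amp.autocast():
        outputs = model(inputs.cuda())
        loss = loss_fn(outputs, targets.cuda())

    # Scale the loss, back-propagate, then unscale, step and update the scale
    # (the step is skipped automatically if an overflow is detected).
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```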
Although this article will not go any further into the code implementation of AMP, fuller examples can be seen in the PyTorch documentation.
Conclusion
Although floating-point precision is often overlooked, it plays a key role in the training of deep learning models, where small gradients and learning rates multiply to create gradient updates that require more bits to be precisely represented.
However, as state-of-the-art deep learning models push the boundaries in terms of task performance, architectures grow and precision has to be balanced against training time, memory requirements and available compute.
Therefore, the ability of mixed precision training to maintain performance whilst essentially halving the memory usage is a significant advance in deep learning!