Free _verified_ - Loss Scaling
return scaled_loss
While the large gradient values are clipped (handled by gradient clipping), the small gradient values pose a different threat. During backpropagation, many gradients are tiny. In FP16, if a gradient value falls below that $\approx 6 \times 10^-5$ threshold, it doesn't just get rounded—it becomes . loss scaling free
The key enabler is :
Many have observed when switching from FP16 + loss scaling to BF16 loss scaling free, while gaining: return scaled_loss While the large gradient values are
Loss scaling is a widely used technique in deep learning to stabilize and accelerate the training process of neural networks. In this guide, we will explore the concept of loss scaling, its benefits, and how to implement it in your own projects. many gradients are tiny. In FP16