Scalable thermodynamic second-order optimization
Normal Computing
We build AI systems that natively reason about the real world #semis #mfg #energy.
New preprint from Normal Computing! Our team (Kaelan Donatella, Sam Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zachary Belateche, Gavin Crooks, Antonio Martinez, and Patrick Coles) has posted "Scalable thermodynamic second-order optimization", introducing a novel approach to accelerating AI training using physics-based hardware. This work was supported by the Advanced Research + Invention Agency (ARIA)'s Scaling Compute Programme, which aims to drastically reduce the hardware and energy costs of training AI models by rethinking current computing paradigms.
While second-order methods like K-FAC can train neural networks more efficiently per iteration than first-order methods like SGD, they are held back by their computational overhead. In the figure below we break down the time contributions to the K-FAC update on a multi-layer perceptron (MLP) and a transformer (GPT). Matrix inversion dominates the update time, followed by the other dense matrix operations.
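For a concrete picture of where that time goes, here is a minimal sketch of a single-layer K-FAC update using the standard Kronecker-factored natural-gradient step. The dimensions, damping value, and timing harness are illustrative choices, not taken from the paper.

```python
# Illustrative single-layer K-FAC update, timed in three stages.
import time
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 1024, 4096, 4096

a = rng.standard_normal((batch, d_in))    # layer inputs (activations)
g = rng.standard_normal((batch, d_out))   # backpropagated output gradients
grad_W = rng.standard_normal((d_out, d_in))

t0 = time.perf_counter()
A = a.T @ a / batch   # input Kronecker factor,  A ~ E[a a^T]
G = g.T @ g / batch   # output Kronecker factor, G ~ E[g g^T]
t1 = time.perf_counter()

damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(d_in))    # cubic-cost inversions:
G_inv = np.linalg.inv(G + damping * np.eye(d_out))   # the expensive stage
t2 = time.perf_counter()

# Natural-gradient step: vec(dW) = (A kron G)^-1 vec(grad_W),
# which factorizes as dW = G^-1 grad_W A^-1
delta_W = G_inv @ grad_W @ A_inv
t3 = time.perf_counter()

print(f"factor construction: {t1 - t0:.3f} s")
print(f"inversion:           {t2 - t1:.3f} s")
print(f"update matmuls:      {t3 - t2:.3f} s")
```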
Our algorithm accelerates the matrix operations at the heart of the K-FAC optimizer. We first construct the Kronecker factors (which approximate the curvature matrix of the loss landscape) and then send them to a thermodynamic solver, which computes the weight updates.
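The preprint targets dedicated analog hardware, but the principle can be emulated digitally. The sketch below estimates a linear solve x = B^-1 b by time-averaging overdamped Langevin dynamics whose stationary mean is the solution, in the spirit of thermodynamic linear algebra; the function name, step size, noise scale, and averaging window are all illustrative assumptions rather than the paper's protocol.

```python
# Digital emulation of a thermodynamic linear solve: simulate
# dx = -(Bx - b) dt + sqrt(2 dt / beta) * noise and time-average x.
import numpy as np

def thermo_solve(B, b, dt=1e-3, n_steps=20_000, burn_in=5_000, beta=1e4, seed=0):
    """Estimate B^-1 b as the time average of overdamped Langevin dynamics."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b)
    mean = np.zeros_like(b)
    for step in range(n_steps):
        noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(b.shape[0])
        x = x - dt * (B @ x - b) + noise   # Euler-Maruyama step
        if step >= burn_in:                # discard the transient
            mean += x
    return mean / (n_steps - burn_in)

# Toy check on a damped, symmetric positive-definite system,
# like a regularized Kronecker factor.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
B = M @ M.T / 50 + np.eye(50)
b = rng.standard_normal(50)
x_est = thermo_solve(B, b)
print(np.linalg.norm(x_est - np.linalg.solve(B, b)))  # small residual
```

On hardware, the analogous dynamics run in physical time as a circuit relaxes to equilibrium, which is where the speedup over a digital cubic-cost inversion is expected to come from.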
This reduces the computational overhead of K-FAC, bringing it close to the per-iteration cost of first-order methods like SGD and adaptive variants such as Adam, as shown in the table below.
Naturally, thermodynamic hardware is inherently lower-precision than standard digital hardware. We show that, with proper quantization of the matrices involved, the benefits of K-FAC over Adam can largely be preserved. The figure below shows this for both input and output quantization of the K-FAC update.
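As a rough illustration of input quantization, the sketch below uniformly quantizes a damped Kronecker factor to a given bit width, solves the system, and compares against the full-precision solution. The quantization scheme and bit widths are illustrative assumptions, not the paper's exact protocol.

```python
# Effect of quantizing a K-FAC factor before the linear solve.
import numpy as np

def quantize_uniform(M, bits):
    """Symmetric uniform quantization of a matrix to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / levels
    return np.round(M / scale) * scale

rng = np.random.default_rng(2)
a = rng.standard_normal((1024, 256))
A = a.T @ a / 1024 + 1e-3 * np.eye(256)   # damped input factor
v = rng.standard_normal(256)              # one column of the gradient

x_full = np.linalg.solve(A, v)
for bits in (8, 6, 4):
    x_q = np.linalg.solve(quantize_uniform(A, bits), v)
    rel_err = np.linalg.norm(x_q - x_full) / np.linalg.norm(x_full)
    print(f"{bits}-bit factor: relative error {rel_err:.3e}")
```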
We also benchmarked the thermodynamic K-FAC optimizer on AlgoPerf, which includes workloads such as training a vision transformer on ImageNet and a graph neural network on ogbg-molpcba. We estimate a substantial advantage over standard K-FAC, as well as over first-order baselines, in terms of validation metrics per unit of wall-clock time.
To sum up, our approach uses thermodynamic computing to accelerate the matrix operations at K-FAC's core, making it competitive with first-order methods in wall-clock time while preserving its convergence benefits. Our contributions are:
- a thermodynamic algorithm that offloads the dominant matrix operations of the K-FAC update to physics-based hardware;
- an analysis showing that, with appropriate quantization, K-FAC's benefits over Adam survive the lower precision of the hardware; and
- estimated benchmarks on AlgoPerf workloads showing an advantage over standard K-FAC and first-order baselines in validation metrics per wall-clock time.
Follow Normal Computing to stay informed on our thermodynamic computing research!
arXiv: https://arxiv.org/abs/2502.08603