If your pricing models do not run fast enough
Thomas Obitz, MSc, FRM
Market and Model Risk Transformation - Hands-on FRTB and AI Expert
Approximating Pricing Functions by Neural Networks vs RBFs
One day in 2017, I attended a talk at an industry conference on approximating XVA pricing with neural networks. Having some background in approximation theory, I wondered how well this would work, and whether it was a good idea at all. Surprisingly, there were loads of theoretical papers on what neural networks can approximate (in short, "everything"), but I did not come across a single paper examining the approximation quality on a real-world pricing function.
This question expanded into a thesis comparing the approximation behaviour of artificial neural networks ("ANNs") to that of radial basis functions ("RBFs"), the leading multi-dimensional approximation method from functional analysis.
And the result was, as always: you can make it work either way. But with a bit of mathematical insight, RBFs may yield significantly better results at a fraction of the effort. And no, throwing a random neural net with an arbitrary topology and activation function at a problem is usually not the best solution.
Why approximation of pricing functions is “en vogue”
XVA calculation – which is notoriously computation intensive – is not the only reason why approximation is a rather "hot" topic at the moment. FRTB has caused an explosion of computational demand in market risk calculation. Furthermore, for a given compute capacity, it can be more accurate to calculate prices at higher precision at a limited number of support points and interpolate between them, rather than running the pricer at lower precision for every point where a price is needed. Put differently: a bit of approximation theory can save many millions in hardware investment or cloud CPU cost.
Approximation – the “classical” way
Most of us have come across polynomial approximation; many of us know that it behaves in a fairly unimpressive way unless we use Chebyshev points as supports. Extending this approach to multiple dimensions does not look very compelling, among other reasons because the grid needed to support the approximation quickly becomes so large that it is faster to price the derivative directly than to populate every grid point. A few theoretical issues (such as the Mairhuber–Curtis theorem) get in the way as well. So multi-dimensional approximation requires a more sophisticated approach.
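As a quick one-dimensional illustration of why the support points matter so much (a sketch in NumPy, not taken from the thesis): interpolating Runge's classic example 1/(1 + 25x²) on equispaced supports blows up near the interval boundaries, while the same degree on Chebyshev supports behaves nicely.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Runge's function: the textbook example where equispaced polynomial
# interpolation misbehaves near the interval boundaries.
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)

n = 21                                  # number of support points
x_eq = np.linspace(-1.0, 1.0, n)        # equispaced supports
x_ch = C.chebpts2(n)                    # Chebyshev points of the second kind

# Degree n-1 interpolants, fitted in the Chebyshev basis for stability.
p_eq = C.chebfit(x_eq, f(x_eq), n - 1)
p_ch = C.chebfit(x_ch, f(x_ch), n - 1)

x_test = np.linspace(-1.0, 1.0, 1001)
err_eq = np.max(np.abs(C.chebval(x_test, p_eq) - f(x_test)))
err_ch = np.max(np.abs(C.chebval(x_test, p_ch) - f(x_test)))
print(f"max error, equispaced supports: {err_eq:.3f}")
print(f"max error, Chebyshev supports:  {err_ch:.5f}")
```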
Figure 1: Approximation by radial basis functions
Radial basis functions are the "pocket knife" that functional analysis provides for this problem. Introduced in the 1930s and booming since the 1970s and 1980s, they are a powerful tool for approximation in multi-dimensional spaces. The idea is as simple as it is intuitively compelling: as shown in the picture, the approximation is supported by a number of smooth "bumps" which collectively approximate the surface (it is obviously a bit more complicated than that, and how to optimise this approximation is still an area of active research). Small modifications, such as the small moves of the support points which I describe in my thesis, can improve the precision of the method by orders of magnitude.
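To make the idea concrete, here is a minimal hand-rolled sketch (not the thesis code) of a Gaussian RBF interpolation of a toy two-dimensional surface. The shape parameter lam plays the role of the "lambda" discussed in the results below, and the tiny ridge term is only there to keep the linear system well-behaved.

```python
import numpy as np

def gaussian_kernel(X, C, lam):
    """Matrix of Gaussian "bumps" exp(-||x - c||^2 / (2 lam^2)) between the
    evaluation points X and the bump centres C."""
    d2 = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lam ** 2))

def fit_rbf(X_sup, y_sup, lam, ridge=1e-10):
    """Interpolation weights: solve (K + ridge*I) w = y at the support points.
    The tiny ridge only guards against an ill-conditioned kernel matrix."""
    K = gaussian_kernel(X_sup, X_sup, lam)
    return np.linalg.solve(K + ridge * np.eye(len(K)), y_sup)

def eval_rbf(X, X_sup, w, lam):
    return gaussian_kernel(X, X_sup, lam) @ w

# Toy 2-D surface standing in for a pricing function.
target = lambda X: np.sin(3.0 * X[:, 0]) * np.exp(-X[:, 1])

rng = np.random.default_rng(0)
X_sup = rng.uniform(0.0, 1.0, size=(200, 2))   # scattered support points
w = fit_rbf(X_sup, target(X_sup), lam=0.1)

X_new = rng.uniform(0.0, 1.0, size=(1000, 2))
mse = np.mean((eval_rbf(X_new, X_sup, w, lam=0.1) - target(X_new)) ** 2)
print(f"out-of-sample MSE: {mse:.2e}")
```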
Approximation by neural networks
We train a neural network either on the outputs of a pricing model, or even on market prices themselves, and we obtain a highly efficient pricing engine producing “good enough” prices. Too good to be true? Indeed. Let’s unpick the complexities, and how to deal with them.
First of all – the activation function. Sigmoid is bad, ReLU is great, but in the end they all do the same thing, so pick one by trial and error (forgive me, "hyper-parameter grid search" sounds much better), and all is good, right? No. Nothing could be further from the truth. The activation function carries the interpolation, and if it is not smooth, neither will the interpolation be. Specifically, the ReLU function (i.e. max(0, x)) is not differentiable at 0, and the neural network will produce nothing but a linear spline. Is that all? And why, then, is everybody so excited about it? Well, if you are doing a grid search, you will use a limited number of iterations, so you will find the activation function which converges fastest, not the one which results in the best fit. The pattern is consistent: ReLU converges quickly, but the final result is not that good. Sigmoid converges slowly, but given enough time, it works much better. You want both? Try Swish.
Figure 2: After 1000 iterations: Swish far ahead
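For reference, this is what the three activation functions compared above look like in plain NumPy. The point is that Swish, x·sigmoid(x), is smooth everywhere yet behaves like ReLU for large inputs (recent TensorFlow/Keras versions ship it as the built-in activation "swish").

```python
import numpy as np

def sigmoid(x):
    # Smooth, but saturates: gradients vanish for large |x|, hence slow training.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Piecewise linear with a kink at 0: a ReLU network produces a linear spline.
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): smooth everywhere, yet close to ReLU away from 0.
    return x * sigmoid(beta * x)
```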
Second – network complexity. A fully connected ("dense") neural network with 10 by 10 nodes in three layers has more than 30,000 individual parameters you need to calibrate. You only have 500 data points? Good luck. However, following a fairly recent result of Mhaskar and Poggio, a neural network whose depth is aligned with the calculation tree of the pricing function can provide surprisingly good out-of-sample performance. Magic? No, maths.
Technical challenges add to the complexity. It is always helpful to sanity-check your results: there were constellations in which TensorFlow seemed to degrade to 16-bit precision. And if you are trying to cram a massive tensor operation into the memory of your GPU, a bit of basic linear algebra will come in quite handy.
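A minimal sanity check along these lines (a sketch, assuming TensorFlow 2.x / Keras): pin the default float type to double precision before building the model, then verify what the weights and outputs are actually computed in.

```python
import numpy as np
import tensorflow as tf

# Pin the default dtype for Keras layers to double precision; pricing-surface
# errors at the 1e-4 level are easily swamped by low-precision arithmetic.
tf.keras.backend.set_floatx("float64")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                    # e.g. (spot, vol) inputs
    tf.keras.layers.Dense(64, activation="swish"),
    tf.keras.layers.Dense(64, activation="swish"),
    tf.keras.layers.Dense(1),
])

# Sanity checks: which precision do the weights and the outputs really use?
print(model.layers[0].dtype)                             # expect 'float64'
print(model(np.zeros((1, 2), dtype=np.float64)).dtype)   # expect float64
```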
If you have made it this far, I hope I have built up enough credibility with you to take a swipe at my pet peeve, the universal approximation theorem. It states that a one-layer neural network with sigmoid activation can approximate any continuous function to any level of accuracy, and it is quoted in every presentation on machine learning for pricing at least once (without being used any further). The result is as intuitive as it is meaningless: a sigmoid takes values between zero and one, and what Cybenko's proof does is (more or less) show that if you string enough of these sigmoids together, you can follow the ups and downs of any arbitrary output function to arbitrary precision. That is not that impressive.
However, there are much more interesting results establishing much better bounds (some of them even independent of the dimension of the problem!) for a broad range of activation functions. Very recent results explore the role of the depth of the network in the out-of-sample performance of its predictions. For me, one of the most exciting findings is the link mentioned above between the structure of the calculation tree of a pricing function and the optimal depth of the network approximating it.
Approximating derivatives pricing functions – the proof of the pudding…
My thesis uses our old friend Black-Scholes as a toy example. It is rather smooth, apart from a kink at the strike as the vola goes to zero. This kink is what actually makes it interesting, and both RBFs and neural networks run into difficulty in that region, as the red spots in Figure 3 show.
Let's start with the most interpretable results: approximating Black-Scholes over a two-dimensional price/volatility grid gives an MSE of 0.004 for the Gaussian RBF approximation vs 0.013 for the best neural network configuration. The maximum squared error is 0.77 vs 0.58. That is a win for RBFs on MSE, and almost a draw on maximum error (with some advantage for the NNs).
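For completeness, this is the kind of target surface in question (a sketch with assumed parameter choices, not the exact grid from the thesis):

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call: the surface to be approximated,
    here viewed as a function of spot S and volatility sigma."""
    sqrt_T = np.sqrt(T)
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt_T)
    d2 = d1 - sigma * sqrt_T
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

# 2-D price/volatility grid, with strike, rate and maturity held fixed.
S = np.linspace(50.0, 150.0, 50)
sigma = np.linspace(0.01, 0.5, 50)
SS, VV = np.meshgrid(S, sigma)
surface = bs_call(SS, K=100.0, T=1.0, r=0.02, sigma=VV)

# As sigma * sqrt(T) -> 0 the surface collapses onto the kinked payoff
# max(S - K, 0): exactly the region where both RBFs and ANNs struggle.
```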
Figure 3: Approximating Black Scholes
Looking at compute performance, though, RBFs win hands down: finding the optimal lambda (the only configuration parameter of the RBF approximation) takes about two seconds on a standard GPU, and fitting an individual approximation takes milliseconds (it is basically a matrix inversion). In contrast, optimizing the hyper-parameters of the neural network takes several hours, plus a few minutes to train each individual instance.
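A hedged sketch of what such a lambda search amounts to, using SciPy's RBFInterpolator (where the shape parameter is called epsilon) on a toy surface: fit once per candidate value, time it, score on a validation set and keep the best.

```python
import time
import numpy as np
from scipy.interpolate import RBFInterpolator

# Toy stand-in for pricer outputs on scattered support points.
rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 1.0, size=(400, 2))
X_val = rng.uniform(0.0, 1.0, size=(200, 2))
f = lambda X: np.sin(3.0 * X[:, 0]) * np.exp(-X[:, 1])
y_train, y_val = f(X_train), f(X_val)

best = None
for eps in [2.0, 4.0, 8.0, 16.0, 32.0]:        # candidate shape parameters
    t0 = time.perf_counter()
    rbf = RBFInterpolator(X_train, y_train, kernel="gaussian",
                          epsilon=eps, smoothing=1e-12)  # tiny ridge for stability
    fit_ms = 1e3 * (time.perf_counter() - t0)
    mse = np.mean((rbf(X_val) - y_val) ** 2)
    print(f"epsilon={eps:4.1f}  fit={fit_ms:6.1f} ms  val MSE={mse:.2e}")
    if best is None or mse < best[1]:
        best = (eps, mse)

print("chosen shape parameter:", best[0])
```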
In higher dimensions, RBF approximation still produces (somewhat) reasonable results, with an MSE of 0.03 on a strike/vola/time grid, and a (not quite practical) MSE of 0.21 in four dimensions (with the interest rate added). The neural networks were so slow that they were not practical on the hardware available.
In a nutshell: In terms of precision, RBFs have an advantage. In terms of speed, they are light years ahead of neural networks.
It is worth mentioning that RBF networks combine both approaches. Their approximation quality is probably a topic for the next project…
Optimizing results
In terms of approximation quality, there are a number of approaches for improving the precision of the RBF approximation by about two orders of magnitude; that is something to try once I find a bit of time alongside my day job.
But more importantly: the role of the GPU cannot be over-estimated. A mid-range GPU (NVIDIA 2070S) accelerates the training of a neural net by a factor of two to four (compared to an Intel i7 processor). The RBF approximation, however, which relies heavily on matrix operations, gains a factor of 50 to 60. Porting the Python code of the RBF approximation to the GPU (using cuPy) took about one afternoon, including all memory optimizations and the slicing of the matrices to fit into the GPU memory. That was probably less than what it took me to convince TensorFlow to cooperate with my GPU…
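A hedged sketch of what that port amounts to, applied to the hand-rolled Gaussian RBF fit shown earlier (the memory-motivated slicing of the kernel matrix is omitted here): swap NumPy for cuPy around the kernel build and the solve, and copy data on and off the device explicitly.

```python
import numpy as np
import cupy as cp  # drop-in replacement for the parts of the NumPy API used here

def fit_rbf_gpu(X_support, y_support, lam):
    """Gaussian RBF fit with the kernel build and the solve running on the GPU."""
    X = cp.asarray(X_support)                 # host -> device copies
    y = cp.asarray(y_support)
    d2 = cp.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = cp.exp(-d2 / (2.0 * lam ** 2))        # kernel matrix built on the GPU
    w = cp.linalg.solve(K, y)                 # dense solve on the GPU
    return cp.asnumpy(w)                      # device -> host copy of the weights

# Usage: same call pattern as the NumPy version shown earlier.
rng = np.random.default_rng(0)
X_sup = rng.uniform(0.0, 1.0, size=(2000, 2))
y_sup = np.sin(3.0 * X_sup[:, 0]) * np.exp(-X_sup[:, 1])
w = fit_rbf_gpu(X_sup, y_sup, lam=0.05)
```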
In summary
Yes, neural networks are hot. And indeed, they can be used as approximators without much knowledge of approximation techniques. However, classical methods from functional analysis will often perform better and be far more compute-efficient in accelerating a pricing or risk management platform. So it is a good idea to try these methods out before investing in more hardware. Or to talk to an expert. Looking forward to hearing from you...
Further Reading
Thomas Obitz, Multivariate approximation using radial basis functions vs using artificial neural networks with specific attention to derivatives pricing, 2020
Mhaskar and Poggio, Deep vs. shallow networks: An approximation theory perspective, 2016