The Approximation Power of Neural Networks (with Python codes)

Picture by sdecoret/shutterstock.com

Why neural networks can predict the outcomes of (almost) any process

Introduction

It is a well-known fact that neural networks can approximate any continuous mathematical function to arbitrary accuracy, no matter how complicated it might be. Take, for instance, the function below:

[Figure: plot of a continuous function f(x) with a complicated, oscillating shape.]

Though it has a pretty complicated shape, the theorems we will discuss shortly guarantee that one can build some neural network that can approximate f(x) as accurately as we want. Neural networks, therefore, display a type of universal behavior.

One of the reasons neural networks have received so much attention is that, in addition to these rather remarkable universal properties, there exist many powerful algorithms for learning the functions they represent from data.

Universality and the underlying mathematics

This piece will give an informal overview of some of the fundamental mathematical results (theorems) underlying these approximation capabilities of artificial neural networks.

“Almost any process you can imagine can be thought of as function computation… [Examples include] naming a piece of music based on a short sample of the piece […], translating a Chinese text into English […], taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting.”
— Michael Nielsen

A motivation for using neural nets as approximators: Kolmogorov’s Theorem

In 1957, the Russian mathematician Andrey Kolmogorov, known for his contributions to a wide range of mathematical topics (such as probability theory, topology, turbulence, computational complexity, and many others), proved an important theorem about the representation of real functions of many variables. According to Kolmogorov’s theorem, multivariate functions can be expressed via a combination of sums and compositions of (a finite number of) univariate functions.

The Russian mathematician Andrey Kolmogorov (Wikipedia)

A little more formally, the theorem states that a continuous function f of n real variables defined on the n-dimensional hypercube (with n ≥ 2) can be represented as follows:

f(x_1, \dots, x_n) = \sum_{q=0}^{2n} g_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)

Kolmogorov's Theorem (1957)

In this expression, the g_q are continuous univariate functions and the \phi_{q,p} are continuous, monotonically increasing univariate functions (as shown in the figure below) that do not depend on the choice of f.

Example of a monotonically increasing function.
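To get a feel for what "sums and compositions of univariate functions" means, here is a toy illustration (it is not part of Kolmogorov's construction, whose inner functions are fixed and do not depend on f): for positive x and y,

xy = \exp(\ln x + \ln y),

so a function of two variables (the product) is computed by feeding a sum of univariate functions (the logarithms) into another univariate function (the exponential).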

Universal Approximation Theorem (UAT)

The UAT states that a feed-forward neural network containing a single hidden layer with a finite number of nodes can approximate any continuous function on a compact subset of n-dimensional real space, provided rather mild assumptions about the form of the activation function are satisfied. Now, since almost any process we can imagine can be described by some mathematical function, neural networks can, at least in principle, predict the outcome of nearly every process.

There are several rigorous proofs of the universality of feed-forward artificial neural nets using different activation functions. Let us, for the sake of brevity, restrict ourselves to the sigmoid function. Sigmoids are “S”-shaped and include as special cases the logistic function, the Gompertz curve, and the ogee curve.

Python code for the sigmoid

A quick Python code to build and plot a sigmoid function is:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

upper, lower = 6, -6   # saturation thresholds for the sigmoid
num = 100

def sigmoid_activation(x):
    # return exactly 1 or 0 outside [lower, upper] to avoid problematic exponentials
    if x > upper:
        return 1
    elif x < lower:
        return 0
    return 1 / (1 + np.exp(-x))

xs = np.linspace(lower, upper, num=num)
vals = [sigmoid_activation(x) for x in xs]
plt.plot(xs, vals)
plt.title('Sigmoid');


A plot of the sigmoid function. In the code, the conditionals inside the function are included to avoid problematic computations at large values of |x|.

George Cybenko’s Proof

The proof given by Cybenko (1989) is known for its elegance, simplicity, and conciseness. In his article, he proves the following statement. Let \sigma be any continuous function of the sigmoid type (see the discussion above). Given any multivariate continuous function f defined on a compact subset of the N-dimensional real space, and any positive \varepsilon, there are vectors

w_1, \dots, w_M \in \mathbb{R}^N

(the weights), constants

b_1, \dots, b_M

(the bias terms) and

\alpha_1, \dots, \alpha_M

such that

|G(x) - f(x)| < \varepsilon

for any x (the NN inputs) inside the compact subset, where the function G is given by:

G(x) = \sum_{j=1}^{M} \alpha_j \, \sigma\left( w_j^{T} x + b_j \right)

Choosing the appropriate parameters, neural nets can be used to represent functions with a wide range of different forms.
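To make the form of G(x) concrete, here is a minimal Python sketch. It is not Cybenko's construction, only the parametric family his theorem refers to; the helper name G and the parameter values are arbitrary illustrations.

import numpy as np

def sigmoid(z):
    # logistic sigmoid, a standard choice of sigmoidal function
    return 1.0 / (1.0 + np.exp(-z))

def G(x, weights, biases, alphas):
    # single-hidden-layer form appearing in Cybenko's theorem:
    #   G(x) = sum_j alpha_j * sigma(w_j . x + b_j)
    # x: length-N input; weights: (M, N); biases, alphas: length M
    z = weights @ x + biases            # pre-activations of the M hidden units
    return np.sum(alphas * sigmoid(z))  # weighted sum of hidden activations

# illustrative parameters: N = 2 inputs, M = 3 hidden units
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))
biases = rng.normal(size=3)
alphas = rng.normal(size=3)
print(G(np.array([0.5, -0.2]), weights, biases, alphas))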

An example using Python

To make these statements less abstract, let us build a simple Python code to illustrate what has been discussed until now. The following analysis is based on Michael Nielsen’s great online book and on the exceptional lectures by Matt Brems and Justin Pounders at the General Assembly Data Science Immersive (DSI).

We will build a neural network to approximate the following simple function:

f(x) = 0.2 + 0.4 x^2 + 0.3 x \sin(15x) + 0.05 \cos(50x)

A few remarks are needed before diving into the code:

  • To make the analysis more straightforward, I will work with a simple limiting case of the sigmoid function. When the weight is extremely large, the sigmoid approaches the Heaviside step function. Since we will need to add contributions coming from several neurons, it is much more convenient to work with step functions instead of general sigmoids.
  • In the limit where the sigmoid approaches the step function, we need only one parameter to describe it, namely, the point s where the step occurs. Since the sigmoid's argument wx + b vanishes at x = -b/w, the step position is s = -b/w, where b and w are the neuron's bias and weight respectively.
  • The neural network I will build will be a very simple one, with one input, one output, and one hidden layer. If the output weights corresponding to two hidden neurons are equal in absolute value and have opposite signs, their combined output becomes a "bump" with height equal to the absolute value of the weights and width equal to the difference between the values of s of the two neurons (see the sketch after this list).
  • Since the absolute value of these output weights is the height of the bump, we denote it by h.
  • Since we want to approximate f, the weighted output of the hidden layer must involve the inverse of the sigmoid. In fact, it must be equal to:

\sigma^{-1}(f(x))
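To see how a pair of step-like neurons produces such a bump, here is a minimal sketch. It assumes the sigmoid_activation function defined earlier; the step positions (0.2, 0.6), the height 0.8, and the large weight value are arbitrary illustrative choices.

# Minimal sketch: two step-like sigmoid neurons with opposite output weights form a bump.
w_big = 500                      # large weight -> sigmoid is nearly a Heaviside step
s1, s2, h = 0.2, 0.6, 0.8        # illustrative step positions and bump height

def bump(x):
    # each neuron steps at s_i = -b_i / w_big, so b_i = -w_big * s_i
    n1 = sigmoid_activation(w_big * x - w_big * s1)   # turns on at x = s1
    n2 = sigmoid_activation(w_big * x - w_big * s2)   # turns on at x = s2
    return h * n1 - h * n2                            # equals h only on (s1, s2)

x_grid = np.linspace(0, 1, 500)
plt.plot(x_grid, [bump(x) for x in x_grid])
plt.title('Bump from a pair of step neurons');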

To reproduce this function, we choose the values of the bump heights h to be (following the corresponding figure in Michael Nielsen's online book):

h = -1.2 on [0, 0.2], h = -1.6 on [0.2, 0.4], h = -0.3 on [0.4, 0.6], h = -1.0 on [0.6, 0.8] and h = 1.0 on [0.8, 1]

The code

The code starts with the following definitions:

  • We first import inversefunc from the pynverse package, which we will need to build the inverse of the sigmoid function.
  • We then choose a very large weight for the sigmoid so that it is close to the Heaviside step function (as just discussed).
  • We choose the output activation to be the identity function identity_activation.
  • The role of solve_for_bias is to recover the original (w, b) parametrization from s and w (recall that s = -b/w is the step position).
  • The architecture is set, and all the ws and bs are chosen. The elements of the array weights_output are obtained from the bump heights h given in the previous section.
from pynverse import inversefunc   # used below to invert the sigmoid

w = 500   # very large weight: the sigmoid becomes nearly a step function

def identity_activation(x):
    # identity activation for the output node
    return x

def solve_for_bias(s, w=w):
    # recover the bias from the step position s, using s = -b/w
    return -w * s

# step positions of the 10 hidden neurons (5 bumps, 2 neurons per bump)
steps = [0, .2, .2, .4, .4, .6, .6, .8, .8, 1]
bias_hl = np.array([solve_for_bias(s) for s in steps])
weights_hl = np.array([w] * len(steps))

bias_output = 0
# output weights: each pair (h, -h) produces a bump of height h on its sub-interval
weights_output = np.array([-1.2, 1.2, -1.6, 1.6,
                           -.3, .3, -1, 1, 1, -1])

The final steps are:

  • Writing a Python function, which I called nn, that builds and runs the network.
  • Printing out the comparison between the approximation and the actual function.
def nn(input_value):
    # hidden layer: 10 step-like sigmoid neurons
    Z_hl = input_value * weights_hl + bias_hl
    activation_hl = np.array([sigmoid_activation(Z) for Z in Z_hl])
    # output layer: weighted sum of the bumps, identity activation
    Z_output = np.sum(activation_hl * weights_output) + bias_output
    activation_output = identity_activation(Z_output)
    return activation_output

def f(x):
    # the target function we want to approximate
    return 0.2 + 0.4 * (x ** 2) + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

x_values = np.linspace(0, 1, 1000)
y = [f(x) for x in x_values]

inv_sigmoid = inversefunc(sigmoid_activation)   # numerical inverse of the sigmoid
y_hat = [nn(x) for x in x_values]               # network approximation
y_invsig = [inv_sigmoid(i) for i in y]          # target: inverse sigmoid of f

_ = plt.plot(x_values, y_invsig)
_ = plt.plot(x_values, y_hat)
_ = plt.xlim((0, 1))

Comparison between the network output nn(x) and the target (the inverse sigmoid applied to f) on the interval [0, 1].

This approximation is far from ideal. However, it is straightforward to improve it, for example, by increasing the number of hidden nodes or the number of layers (while avoiding overfitting); a sketch of this idea follows below.
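As a minimal sketch of this idea (not part of the construction above), one can use more, narrower bumps and read each bump height directly off the target, as the average of the inverse sigmoid of f over the corresponding sub-interval. The snippet reuses sigmoid_activation, solve_for_bias, f, inv_sigmoid, x_values and y_invsig defined above; the choice of 20 bumps is arbitrary.

# Sketch: refine the approximation with more, narrower bumps
n_bumps = 20
edges = np.linspace(0, 1, n_bumps + 1)
steps_fine = np.repeat(edges, 2)[1:-1]          # pairs of step positions, one pair per bump
bias_fine = np.array([solve_for_bias(s) for s in steps_fine])
weights_fine = np.array([w] * len(steps_fine))

# height of each bump = average of the inverse sigmoid of f over that sub-interval
heights = [np.mean([inv_sigmoid(f(x)) for x in np.linspace(a, b, 20)])
           for a, b in zip(edges[:-1], edges[1:])]
weights_output_fine = np.array([[h, -h] for h in heights]).ravel()

def nn_fine(x):
    z = x * weights_fine + bias_fine
    a = np.array([sigmoid_activation(zi) for zi in z])
    return np.sum(a * weights_output_fine)

_ = plt.plot(x_values, y_invsig)
_ = plt.plot(x_values, [nn_fine(x) for x in x_values])
_ = plt.xlim((0, 1))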

Conclusion

In this article, I described some basic mathematics underlying the universal properties of neural networks, and I showed a simple Python code that implements an approximation of a simple function.

Though the full code is already included in the article, my Github and my personal website www.marcotavora.me have (hopefully) some other interesting stuff both about data science and about physics.

Thanks for reading and see you soon!

By the way, constructive criticism and feedback are always welcome!

This article was originally published on Towards Data Science.
