Activation functions in Neural Networks

Activation functions are one of the most important building blocks in Machine Learning. In this article, we will review the concepts behind activation functions, how they are used, and the differences between the most common ones.

Introduction

Before diving into activation functions, let's define the model behind a Neural Network and its use in Machine Learning.

The main building block of a Neural Network is the Neuron, a model inspired by the biological neurons of the human body. As you can see in Graphic 1, the neuron model has five basic components, including the inputs and the output. Internally, the neuron applies a linear model to its inputs: it multiplies each input value by its weight, sums all the weighted inputs, and adds a bias value.

The next component is the Activation Function, which I am going to explain in the next paragraphs.

The last component is the Output, which is the result of passing all the inputs through the Neuron's linear model and the activation function.

Graphic 1: model of an artificial Neuron (diagram thanks to https://tex.stackexchange.com/questions/132444/diagram-of-an-artificial-neural-network)
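
To make the neuron model concrete, here is a minimal sketch in Python with NumPy; the function name and the toy values are illustrative, not taken from any particular library:

import numpy as np

def neuron(inputs, weights, bias, activation):
    # Linear model: multiply each input by its weight, sum, and add the bias
    z = np.dot(inputs, weights) + bias
    # The activation function turns the linear result into the neuron output
    return activation(z)

# Toy example: three inputs, illustrative weights and bias, tanh activation
output = neuron(np.array([0.5, -1.0, 2.0]),
                np.array([0.1, 0.4, -0.2]),
                bias=0.3,
                activation=np.tanh)
print(output)  # ~ -0.42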

Neural Networks

Using the Neuron as the main component, we can build a Neural Network. With this kind of architecture it is possible for the output layers to compute complex functions, although this depends on the interactions of the Hidden Layers.

Graphic 2: an Artificial Neural Network (thanks to VIASAT)

Why use Neural Networks?

To simulate the behavior of the human brain and develop complex systems based on this kind of Artificial Intelligence, the implementation of Artificial Neural Networks has become the cornerstone of complex systems such as Natural Language Processing, Autonomous Vehicles, and other trending technologies.

Using a model similar to the human neural network, it has been possible to build systems with multiple inputs and specialized algorithms that, taking advantage of this architecture, can make decisions like a human does, with the advantages and disadvantages that a machine has.

What is an activation function?

The main goal of an activation function is to establish an adequate Neuron Output based on the Neuron Inputs, like an on/off switch whose final value changes depending on the input values. By itself, however, this binary on/off decision is not enough to model the real world.

As you can see in Graphic 2, Neural Networks are built out of Neurons, and as we defined in the first paragraphs, each Neuron contains a linear model of inputs, weights, and bias. If the Neurons of our Neural Network were purely 'linear', the output could only be a linear combination of the inputs: the whole system would behave like a single neuron.

Unfortunately, real events are not only linear, and if we intend to simulate human brain behavior we need to include a non-linear component in our model, so that we obtain outputs and systems that match reality.

That is the main objective of the activation function: to introduce non-linearities into the Neuron and in this way build a neural model closer to the human one.

Therefore, an activation function is a mathematical function that is part of the model of an Artificial Neuron and that adds a non-linear component to the relationship between the inputs and outputs of a Neural Network.
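
As a quick illustration of why this matters, here is a small NumPy sketch (the dimensions and random values are arbitrary) showing that two stacked linear layers collapse into one, while a non-linear activation between them breaks the collapse:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # batch of 4 examples with 3 features each

# Two stacked purely linear layers
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
two_linear_layers = (x @ W1 + b1) @ W2 + b2

# They are equivalent to a single linear layer with W = W1 @ W2
W, b = W1 @ W2, b1 @ W2 + b2
one_linear_layer = x @ W + b
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# With a non-linearity (tanh) between the layers, the collapse no longer holds
non_linear = np.tanh(x @ W1 + b1) @ W2 + b2
print(np.allclose(non_linear, one_linear_layer))  # False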

Activation Function features

Although these features do not apply to all known activation functions, they give us an idea of how activation functions are used and what kinds of features are desirable.

Non-linearity

This is perhaps the main feature, although according to the literature some linear activation functions are used in specific cases (e.g., the Identity Function).

Finite range

When the activation function has a limited range (e.g., from -1 to 1 or from 0 to 1), the gradient-descent method used to train Neural Networks is more stable, although less efficient.

Continuously-differentiable

For the gradient-descent method used to optimize the cost function, the activation function should be continuously differentiable, so that its derivative can be computed everywhere without mathematical complications.

Monotonic

When the function is monotonic, its derivative never changes sign, and the error surface associated with a single-layer model is guaranteed to be convex, which enhances the convergence of the associated Neural Network.

Close to zero

When the activation function crosses the origin (the point (0, 0) of the Cartesian plane), the weights and biases can be initialized with small values; in the opposite case, special care should be taken with the initialization values.

Activation functions

Among the activation functions most used in Machine Learning systems we have the following:

  • Binary

Defined as:

y = 0 for x < 0
y = 1 for x >= 0

The Binary Step Function is an activation function used as an on/off switch based on the input values. Unfortunately, it is useful only for a single binary output, and it also has derivative problems: its derivative is zero everywhere and undefined at x = 0.
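
A minimal sketch of this definition in NumPy:

import numpy as np

def binary_step(x):
    # y = 1 for x >= 0, y = 0 for x < 0
    return np.where(x >= 0, 1.0, 0.0)

print(binary_step(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
# The derivative is 0 everywhere (and undefined at x = 0), so gradient
# descent gets no useful signal through this activation.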

  • Linear

Defined as:

y = cx

A Linear Activation Function could be used in a neural network, and it allows multiple outputs because its values depend on the input values and weights; this kind of function is computationally efficient and easy to use. On the other hand, the derivative of the Linear Function is a constant, which is not desirable for Backpropagation, since the gradient carries no information about the inputs.
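
A minimal sketch of this definition; the constant c is the slope of the line:

import numpy as np

def linear(x, c=1.0):
    # y = c * x; the derivative with respect to x is simply the constant c
    return c * x

print(linear(np.array([-1.0, 0.0, 2.0]), c=0.5))  # [-0.5  0.   1. ]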

  • Sigmoid

Defined as:

f(x) = 1 / (1 + exp(-x)), the function ranges between (0, 1)

The sigmoid function is a non-linear activation function used in binary classification. It is the special case of the Softmax activation function for just two classes. As we can see in the sigmoid graph, its range goes from 0 to 1, and it does not cross the origin, since sigmoid(0) = 0.5.
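
A minimal sketch of the sigmoid in NumPy:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067  0.5  0.9933]
# Note sigmoid(0) = 0.5, so the curve does not cross the origin.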

  • Tanh

Defined as:

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), the function ranges between (-1, 1)

It is a function similar to the sigmoid, but it has some desirable features the sigmoid lacks, such as crossing the origin and a limited range from -1 to 1.

Due to its wider range in comparison with the sigmoid function, its derivative is steeper than the sigmoid's derivative, which is valuable because it lets gradient-descent optimization register larger changes.
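
A minimal sketch, using NumPy's built-in implementation:

import numpy as np

def tanh(x):
    # Same as (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
    return np.tanh(x)

print(tanh(np.array([-5.0, 0.0, 5.0])))  # ~[-0.9999  0.  0.9999]
# tanh(0) = 0 (it crosses the origin) and its derivative, 1 - tanh(x)**2,
# peaks at 1.0 at the origin, versus 0.25 for the sigmoid's derivative.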

  • ReLU

Defined as:

f(x) = max(0, x)

ReLU is a non-linear function that is computationally cheaper than the sigmoid or tanh functions, which enhances the efficiency of Machine Learning algorithms. Its range goes from 0 to infinity, which helps avoid the vanishing gradient problem that affects the Sigmoid and Tanh functions due to their asymptotic behavior.
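
A minimal sketch of ReLU in NumPy:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 2.0])))  # [0. 0. 2.]
# For x > 0 the derivative is exactly 1, so the gradient does not shrink
# the way it does in the saturated regions of sigmoid and tanh.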

  • Softmax

Defined as:

softmax(x_i) = exp(x_i) / sum_j exp(x_j), for each component i of the input vector (thanks to https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d)

The Softmax function is a generalization of the logistic function and is used for multiclass classification problems. As we can see in the graph, the Softmax function takes a set of inputs and assigns each one a probability based on its original value; concretely, the output of this function is a vector of probabilities.

This function has a range between 0 and 1 for each member of the vector, with all the members summing to 1, and it is commonly used in the last layer of a Neural Network so that its output can be compared with an expected output in one-hot format.
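
A minimal sketch of the Softmax in NumPy; subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # stability trick: shift by the maximum value
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0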

Sources:

https://en.wikipedia.org/wiki/Activation_function

https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/

https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
