Setting Up Neural Network Hyperparameters for Best Results

1. Number of Hidden Layers

The first hyperparameter to consider is the number of hidden layers. For many problems, you can begin with a single hidden layer and get reasonable results: in theory, a neural network with one hidden layer can model even very complex functions, provided it has enough neurons. However, for more complex tasks, deep neural networks are much more efficient than shallow ones.

This is because deep neural networks exploit a hierarchy of features: the lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), the intermediate hidden layers combine these low-level structures into intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures into high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, but it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kickstart the training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way, the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles). This is called transfer learning.
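As an illustration, here is a minimal sketch of this idea in Keras (the framework is an assumption, and the file name, layer counts, and number of hairstyle classes are placeholders):

```python
# A minimal sketch of transfer learning in Keras (assumed framework).
# "face_model.h5" is a hypothetical saved face-recognition model whose
# lower layers are reused for a new hairstyle classifier.
import tensorflow as tf

face_model = tf.keras.models.load_model("face_model.h5")

# Reuse everything except the top layers, and freeze the reused layers
# so their weights are not disturbed at the start of the new training.
reused_layers = face_model.layers[:-2]
for layer in reused_layers:
    layer.trainable = False

hairstyle_model = tf.keras.Sequential(reused_layers + [
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 hairstyle classes
])
hairstyle_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
```

Once the new top layers have learned reasonable weights, you can unfreeze some of the reused layers and fine-tune the whole network with a lower learning rate.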

For more complex problems, you can ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks, such as large-scale image classification or speech recognition, typically require networks with dozens or even hundreds of layers (though not fully connected ones), and they need a huge amount of training data. You will rarely have to train such networks from scratch: it is much more common to reuse parts of a pre-trained state-of-the-art network that performs a similar task. Training will then be a lot faster and require much less data.
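For example, here is a hedged sketch of reusing a published pre-trained image network through Keras applications; the specific model (MobileNetV2), input size, and number of target classes are assumptions made for illustration:

```python
# A minimal sketch (assumed Keras) of reusing a pre-trained image network
# instead of training dozens of layers from scratch.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet",       # reuse weights learned on ImageNet
    include_top=False,        # drop the original classification head
    input_shape=(224, 224, 3),
    pooling="avg",
)
base.trainable = False        # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g., 5 classes in the new task
])
```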

2. Number of Neurons per Hidden Layer

The number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 × 28 = 784 input neurons and 10 output neurons, since it has ten classes. If the task is regression or binary classification, the output layer has just one neuron.

As for the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer; the rationale was that many low-level features can coalesce into far fewer high-level features.
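As a concrete illustration, here is a minimal Keras sketch (framework assumed) of such a pyramid for MNIST, with 784 inputs, shrinking hidden layers, and 10 output neurons; the hidden-layer sizes are arbitrary placeholders:

```python
# A minimal sketch (assumed Keras) of a pyramid-shaped network for MNIST:
# 784 inputs, hidden layers that shrink, and 10 output classes.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),               # 28 x 28 pixels, flattened
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),    # one neuron per class
])
```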

That said, depending on the dataset, it can sometimes help to make the first hidden layer bigger than the others. Just like the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In practice, though, it is often simpler and more effective to pick a model with more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting.

In general, you will get better results by increasing the number of layers instead of the number of neurons per layer.

3. Learning Rate

One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10^-5) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration (e.g., by exp(log(10^6)/500) ≈ 1.028 to go from 10^-5 to 10 in 500 iterations).

If you plot the loss as a function of the learning rate (using a log scale for the learning-rate axis), you should see it drop at first. But after a while, the learning rate will become too large, so the loss will shoot back up: the optimal learning rate will be a bit lower than the point at which the loss starts to climb (typically about 10 times lower than that turning point). You can then reinitialize the model and train it normally using this good learning rate.
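A hedged sketch of this learning-rate range test in Keras follows; it assumes an already compiled model named model and training arrays X_train and y_train, which are placeholders:

```python
# A minimal sketch (assumed Keras) of the learning-rate range test: multiply
# the learning rate by a constant factor after every batch, record the loss,
# and later pick a rate roughly 10x below where the loss starts to climb.
import math
import tensorflow as tf

init_lr, max_lr, n_iters = 1e-5, 10.0, 500
factor = math.exp(math.log(max_lr / init_lr) / n_iters)  # ~1.028 per iteration

class LRFinder(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self.lrs, self.losses = [], []

    def on_train_batch_end(self, batch, logs=None):
        lr = float(self.model.optimizer.learning_rate.numpy())
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        self.model.optimizer.learning_rate.assign(lr * factor)

finder = LRFinder()
model.optimizer.learning_rate.assign(init_lr)
model.fit(X_train[:32 * n_iters], y_train[:32 * n_iters],   # ~500 batches of 32
          batch_size=32, epochs=1, callbacks=[finder])
# Plot finder.losses against finder.lrs with a log-scale x-axis.
```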

Finally, it is important to remember that the optimal learning rate depends on the other hyperparameters, especially the batch size, so if you modify any of the hyperparameters, remember to update the learning rate as well.

4. Activation Function

An activation function defines how the weighted sum of a node's inputs is transformed into that node's output. The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

A good starting point is to use ReLU as the activation function for the hidden layers.

The activation function of the output layer depends mainly on the task. For regression tasks, you can use a linear activation, since you want the output of the final fully connected layer to pass through unchanged.

If your problem is a classification problem, there are three main types to consider, and each uses a different output activation function.

  • If there are two mutually exclusive classes (binary classification), your output layer will have one node, and a sigmoid activation function should be used.
  • If there are more than two mutually exclusive classes (multiclass classification), your output layer will have one node per class, and a softmax activation should be used.
  • If there are two or more mutually inclusive classes (multilabel classification), your output layer will have one node for each class, and a sigmoid activation function should be used.

To summarize:

  • Regression: One node with linear activation.
  • Binary Classification: One node, sigmoid activation.
  • Multiclass Classification: One node per class, softmax activation.
  • Multilabel Classification: One node per class, sigmoid activation.
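These four cases map directly to output layers like the following Keras sketch (framework assumed; n_classes is a placeholder for the number of classes in your task):

```python
# A minimal sketch (assumed Keras) of the output layers summarized above.
import tensorflow as tf

n_classes = 10  # placeholder

regression_output = tf.keras.layers.Dense(1)                                # linear activation
binary_output     = tf.keras.layers.Dense(1, activation="sigmoid")          # binary classification
multiclass_output = tf.keras.layers.Dense(n_classes, activation="softmax")  # mutually exclusive classes
multilabel_output = tf.keras.layers.Dense(n_classes, activation="sigmoid")  # mutually inclusive classes
```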

5. Batch Size

The batch size can have a significant impact on your model's performance and on training time. You can choose a small batch size such as 32 or 64, or you can use the largest batch size that fits in your accelerator's memory; so how do you decide which to choose?

The main benefit of using large batch sizes is that hardware accelerators such as GPUs can process them efficiently, so the training algorithm sees more instances per second. Therefore, many researchers and practitioners recommend using the largest batch size that can fit in GPU RAM. In practice, however, large batch sizes often lead to training instabilities, especially at the beginning of training, and the resulting model may not generalize as well as one trained with a small batch size.
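In Keras-style APIs (an assumption here), the batch size is simply an argument to the training call, so it is easy to experiment with; X_train and y_train are placeholders:

```python
# A minimal sketch (assumed Keras): the batch size is passed to fit().
# In a real comparison you would re-initialize the model between runs.
model.fit(X_train, y_train, epochs=10, batch_size=32)      # small batch: usually stable
# model.fit(X_train, y_train, epochs=10, batch_size=1024)  # large batch: faster per epoch, needs more GPU RAM
```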

6. Optimizers

An optimizer is an algorithm that adjusts the attributes of the neural network, such as its weights and the learning rate, in order to reduce the overall loss and improve accuracy. Choosing the right weights is a daunting problem, since a deep learning model generally has millions of parameters, which makes it important to pick a suitable optimization algorithm for your application.

You can use different optimizers to adjust your weights and learning rate, but the best choice depends on the application. One possible approach is to try several optimizers and keep the one that gives the best results. This might be a good solution if your dataset is small, but when dealing with hundreds of gigabytes of data, even a single epoch can take a considerable amount of time.

That being said, if you want a single default, choose the Adam optimizer. Adam has several benefits that explain why it is so widely used: it serves as a benchmark in many deep learning papers and is often recommended as the default optimization algorithm. Moreover, it is straightforward to implement, runs fast, has low memory requirements, and typically requires less tuning than other optimization algorithms.
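A minimal sketch of selecting Adam in Keras (framework assumed), with an explicit learning rate so it can be tuned alongside the other hyperparameters:

```python
# A minimal sketch (assumed Keras) of choosing Adam as the optimizer.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # 1e-3 is Adam's common default
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```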


7. Loss Function

The purpose of a loss function is to compute the quantity that the model should seek to minimize during training. Importantly, the choice of loss function is directly tied to the activation function used in the output layer, the format of the expected output, and the learning task: these elements must be chosen together.

We use the sparse_categorical_crossentropy loss when we have sparse labels, i.e., each instance is labeled with a class index (for example, with three classes the labels are 0, 1, 2). We use the categorical_crossentropy loss when we have one target probability per class for each instance (for example, [0, 0, 1] for class 3). We use the binary_crossentropy loss for binary classification tasks.

For regression, we can use the mean_squared_error or the mean_absolute_error loss function.
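This mapping translates directly into the loss argument of compile() in Keras (framework assumed); only one of these calls would be used for a given model:

```python
# A minimal sketch (assumed Keras): pick exactly one loss, matching your labels and task.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")  # integer class labels: 0, 1, 2, ...
model.compile(optimizer="adam", loss="categorical_crossentropy")         # one-hot labels, e.g., [0, 0, 1]
model.compile(optimizer="adam", loss="binary_crossentropy")              # binary (or multilabel) targets
model.compile(optimizer="adam", loss="mean_squared_error")               # regression
```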


8. Number of Epochs

In most cases, you will not need to tune the number of epochs or iterations. Simply use early stopping to halt training when the model's performance stops improving.
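A hedged Keras sketch of this approach follows; the validation arrays and the patience value are placeholders:

```python
# A minimal sketch (assumed Keras) of early stopping: set a large epoch budget
# and let the callback halt training when validation loss stops improving.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,                 # epochs to wait without improvement
    restore_best_weights=True,   # roll back to the best weights seen
)

model.fit(X_train, y_train,
          validation_data=(X_valid, y_valid),
          epochs=1000,           # an upper bound, rarely reached
          callbacks=[early_stop])
```

With this in place, you can set the epoch budget generously and let the validation performance decide when training stops.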
