Reasoning behind the iteration count in machine learning training
Deepak Kumar
Why read this?
We need iterations because the training data is usually too big, which happens all the time in machine learning, to pass to the computer at once.
To work around this, we divide the data into smaller batches, feed them to the computer one by one, and update the weights of the neural network after every step so the model fits the data it has seen.
But is memory the only reason to divide the training data, or is there something more interesting and scientific behind it? This document addresses that question.
Technical explanation
Let's take the two extremes. On one side, each gradient descent step uses the entire dataset: you compute the gradients for every sample. In this case you know exactly the best direction towards a local minimum, and you don't waste time going the wrong way. So in terms of the number of gradient descent steps, you'll get there in the fewest.
Of course, computing the gradient over the entire dataset is expensive. So now consider the other extreme: a batch size of just one sample. In this case the gradient of that single sample may point in completely the wrong direction. But the cost of computing that one gradient was trivial. As you take steps with respect to just one sample you "wander" around a bit, but on average you head towards an equally reasonable local minimum as in full-batch gradient descent.
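These two extremes can be sketched side by side. Below is a minimal NumPy illustration on a toy one-parameter linear-regression problem; the data, learning rate, and step counts are all assumptions chosen for illustration, not taken from any real training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): y = 3x + small noise, so the best weight is w ≈ 3
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + rng.normal(0, 0.1, size=200)

def grad(w, xb, yb):
    # Gradient of mean squared error 0.5*(w*x - y)^2 with respect to w
    return np.mean((w * xb - yb) * xb)

# Extreme 1: full-batch gradient descent -- exact direction, costly per step
w_full = 0.0
for _ in range(100):
    w_full -= 0.5 * grad(w_full, X, y)

# Extreme 2: batch size 1 (pure SGD) -- noisy direction, cheap per step
w_sgd = 0.0
for _ in range(100):
    i = rng.integers(len(X))
    w_sgd -= 0.5 * grad(w_sgd, X[i:i + 1], y[i:i + 1])

# Both estimates wander toward w ≈ 3; the batch-size-1 run is noisier
print(w_full, w_sgd)
```

Each full-batch step here touches all 200 samples, while each batch-size-1 step touches a single sample; that cost difference per step is exactly the trade-off described above.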
Both of these extremes have drawbacks, which we examine below.
Speed tradeoffs in machine learning
These definitions matter for the rest of the discussion.
- Computational speed: CPU cycles used per update
- Convergence speed: time taken to train the model to good prediction accuracy
Items affecting iteration count
- The higher the batch size, the more memory you need. Since you usually can't fit all the training data into a single batch, you need more than one iteration.
- Large batches tend to generalise more poorly; in other words, prediction accuracy drops (see the paper on large-batch training in the references below).
- Smaller batches mean more iterations per epoch, and the per-iteration overhead adds up, so training takes more wall-clock time.
- With a smaller batch size, the error estimate is noisier than with a larger batch. However, this noise can help the algorithm jump out of a bad local minimum, giving it a better chance of finding a better local minimum or, hopefully, the global minimum.
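The noise point in the last bullet can be made concrete: the spread of a mini-batch gradient estimate shrinks roughly as 1/sqrt(batch size). Here is a small simulation sketch; the per-sample "gradients" are simulated numbers (true gradient 2.0 plus unit Gaussian noise), an assumption for illustration rather than output from a real model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-sample gradients: true gradient 2.0 plus unit Gaussian noise
true_grad = 2.0
sample_grads = true_grad + rng.normal(0.0, 1.0, size=100_000)

noise = {}
for batch in (1, 32, 1024):
    # Average random mini-batches and measure the spread of the estimates
    estimates = [sample_grads[rng.integers(0, len(sample_grads), batch)].mean()
                 for _ in range(2000)]
    noise[batch] = float(np.std(estimates))
    print(batch, noise[batch])  # spread shrinks roughly as 1/sqrt(batch)
```

A batch size of 1 keeps nearly all of the per-sample noise, while a batch of 1024 averages most of it away; the in-between sizes are where the "useful noise" argument applies.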
Extreme choices to avoid
- Batch size = training set size. Only a single iteration per epoch is needed, but this is full-batch gradient descent: it demands a lot of memory and tends to generalise poorly.
- Batch size = 1 (so iteration count = training set size). Far too many iterations are needed per epoch, costing many CPU cycles, and every step is very noisy.
What is right choice?
Considering the observations above, choosing a good batch size matters. A relatively small batch size tends to be better when weighing convergence speed against accuracy.
In general, a batch size of 32 is a good starting point, and you should also try 64, 128, and 256. Other values (lower or higher) may work for some datasets, but this range is generally the best place to start experimenting.
And, in the end, make sure the batch fits in CPU/GPU memory.
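A batch-size sweep over that suggested range can be sketched with plain NumPy mini-batch SGD on a toy one-parameter linear model; the dataset, learning rate, and epoch count below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (assumed): 2048 samples of y = 3x + small noise
X = rng.uniform(-1, 1, size=(2048, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=2048)

def train(batch_size, epochs=20, lr=0.1):
    """Mini-batch SGD on a one-parameter linear model (illustrative sketch)."""
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx, 0], y[idx]
            w -= lr * np.mean((w * xb - yb) * xb)  # one iteration = one update
    return w

# Try the starting range suggested above and compare the learned weights
results = {bs: train(bs) for bs in (32, 64, 128, 256)}
for bs, w in results.items():
    print(bs, w)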
Point to remember
An EPOCH is different from an iteration.
One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly once. One iteration is a single weight update on one batch, so the number of iterations per epoch equals the dataset size divided by the batch size (rounded up).
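As a worked example with assumed numbers: a dataset of 2,000 samples and a batch size of 32 gives 63 iterations per epoch (the last batch is smaller than 32):

```python
import math

# Assumed example numbers: 2,000 training samples, batch size of 32
dataset_size = 2000
batch_size = 32

iterations_per_epoch = math.ceil(dataset_size / batch_size)
print(iterations_per_epoch)        # -> 63
print(iterations_per_epoch * 10)   # iterations needed for 10 epochs -> 630
```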
Reference
Thanks to these helping hands:
- https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
- https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks
- https://arxiv.org/abs/1609.04836
- https://stats.stackexchange.com/questions/164876/what-is-the-trade-off-between-batch-size-and-number-of-iterations-to-train-a-neu
- https://ai.stackexchange.com/questions/8560/how-do-i-choose-the-optimal-batch-size