This article is a continuation of a series that focuses on the basic building blocks of Deep Learning. In case you need to catch up, some of the previous articles are:
Want to Optimize your Model? Use Learning Rate Decay!
Adapting your Learning Rate Parameter with time can make a huge difference! Let’s see how.
Want your model to converge faster? Use RMSProp!
This is another technique used to speed up Training.
Deep Learning is the branch of Artificial Intelligence where we let the model learn features on its own to get to a result. We don’t hard-code any logic or algorithm; the model automatically tries different relationships between features and chooses the set of relationships that best supports the right prediction.
Now, we don’t let the model try blindly; we guide it with a few hyperparameters. To explain hyperparameters, take the example of a child playing on an iPad. You cannot control what he or she plays or watches on it, but you sure can control how much time the kid spends on it. In a similar way, we can tune these hyperparameters to influence how the model weighs the features on which it bases its predictions.
Now, when working with Deep Learning, and Machine Learning in general, you need to take care of a lot of hyperparameters, and tuning them can be a huge hassle.
This post applies to any hyperparameter, for example:
- Learning rate: α
- Momentum: β
- Adam Optimizer (β1, β2, ε)
- number of layers
- number of hidden units
- Learning rate decay
- mini-batch size
Alpha, the learning rate, is the most important among these. After Alpha, the next most important are Beta (the momentum), the number of hidden units, and the mini-batch size.
How do you choose these?
In the early days of Machine Learning, people used a grid of values to choose and try different combinations of these hyperparameters.
This approach is okay as long as we have a small number of hyperparameters to tune, but with a larger number of hyperparameters, it takes far more time than necessary.
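To get a feel for how quickly a grid grows, here is a small sketch. The function name `grid_size` and the choice of 5 candidate values per hyperparameter are only illustrative assumptions, not something from a particular library.

```python
def grid_size(values_per_hyperparameter, n_hyperparameters):
    """Number of combinations a full grid search has to train and evaluate."""
    return values_per_hyperparameter ** n_hyperparameters


# Assumption for illustration: 5 candidate values for every hyperparameter
for n in (2, 3, 5, 7):
    print(f"{n} hyperparameters -> {grid_size(5, n)} training runs")
```

Every extra hyperparameter multiplies the number of training runs by the number of values tried for it, which is why a full grid stops being practical so quickly.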
Instead, we try randomly chosen values! There’s a basic reason to do so: you don’t know in advance which value of which hyperparameter will work better for the type of problem you’re trying to solve.
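A rough way to see the benefit is to compare a grid and a random draw with the same budget of 25 trials. This is only a sketch: the two hyperparameters, their ranges, and the grid values are assumptions picked for illustration, and it uses Python's standard `random` module so it runs without extra dependencies.

```python
import itertools
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Grid search: 5 learning rates x 5 momentum values = 25 trials
lr_grid = [0.0001, 0.001, 0.01, 0.1, 1.0]
momentum_grid = [0.5, 0.8, 0.9, 0.95, 0.99]
grid_trials = list(itertools.product(lr_grid, momentum_grid))

# Random search: also 25 trials, but every trial draws fresh values
random_trials = [
    (10 ** rng.uniform(-4, 0), rng.uniform(0.5, 0.99))
    for _ in range(25)
]

# How many different learning rates did each strategy actually try?
distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print("grid:", distinct_lr_grid, "random:", distinct_lr_random)
```

With the grid, 25 training runs only ever test 5 different learning rates. With random values, nearly every run tests a new one, so if the learning rate turns out to be the hyperparameter that matters most, you have explored it far more thoroughly for the same cost.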
Coarse to fine
Another good approach is the “coarse to fine” scheme.
The idea is this: suppose we have a randomly sampled set of values and, after getting some results, we find that values in a particular region perform better than the others. We then zoom into that region and sample it more densely to eventually get the best set of hyperparameters!
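Here is a minimal sketch of the coarse to fine idea for a single hyperparameter, the learning rate. Everything in it is an assumption for illustration: `toy_validation_loss` is a made-up stand-in for actually training a model, and the zoom width of half a decade on each side is an arbitrary choice.

```python
import math
import random


def toy_validation_loss(learning_rate):
    # Hypothetical stand-in for "train the model and measure validation loss".
    # By construction it is lowest near learning_rate = 0.01.
    return (math.log10(learning_rate) + 2) ** 2


def search(low_exp, high_exp, n_trials, rng):
    """Try learning rates 10**r with r drawn uniformly from [low_exp, high_exp].

    Returns the best exponent r found and its loss.
    """
    best_r, best_loss = None, float("inf")
    for _ in range(n_trials):
        r = rng.uniform(low_exp, high_exp)
        loss = toy_validation_loss(10 ** r)
        if loss < best_loss:
            best_r, best_loss = r, loss
    return best_r, best_loss


rng = random.Random(0)

# Coarse pass: the whole range from 1e-4 to 1
coarse_r, coarse_loss = search(-4.0, 0.0, 10, rng)

# Fine pass: zoom into the region around the best coarse value
fine_r, fine_loss = search(coarse_r - 0.5, coarse_r + 0.5, 10, rng)

# Keep whichever pass produced the lower loss
best_r, best_loss = min(
    (coarse_r, coarse_loss), (fine_r, fine_loss), key=lambda pair: pair[1]
)
print("best learning rate:", 10 ** best_r, "loss:", best_loss)
```

In a real project the coarse pass would cover several hyperparameters at once, and the zoomed region would be a smaller box around the best-performing points rather than an interval on one axis.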
But on what scale are you tuning your hyper parameters?
While randomly sampling values in the tuning process, it’s important to search on the right scale.
Consider two scales for a learning rate between 0.0001 and 1. The upper one is linear: about 90% of the sampled values fall between 0.1 and 1. The lower one is logarithmic: each power of ten gets an equal share of the samples.
r = -4 * np.random.rand()
alpha = 10 ** r
Here ‘np’ is numpy. r’s value will be anywhere between -4 and 0. Hence, alpha (α) will be anywhere between 10^-4 and 1.
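To check the difference between the two scales, the sketch below draws 10,000 learning rates each way and counts how many land between 0.1 and 1. It uses Python's standard `random` module in place of numpy (`rng.random()` plays the role of `np.random.rand()`); the sample size and the helper name `fraction_in` are assumptions for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Linear scale: draw the learning rate uniformly between 0.0001 and 1
linear = [rng.uniform(0.0001, 1.0) for _ in range(n)]

# Log scale: draw the exponent r uniformly between -4 and 0, then use 10 ** r
log_scale = [10 ** (-4 * rng.random()) for _ in range(n)]


def fraction_in(samples, low, high):
    """Share of samples that fall in the interval [low, high)."""
    return sum(low <= s < high for s in samples) / len(samples)


linear_top = fraction_in(linear, 0.1, 1.0)    # roughly 0.90
log_top = fraction_in(log_scale, 0.1, 1.0)    # roughly 0.25
print("linear:", linear_top, "log:", log_top)
```

On the linear scale, about 90% of the trials are spent between 0.1 and 1, leaving very few for the small learning rates. On the log scale, each of the four decades between 0.0001 and 1 gets roughly a quarter of the trials.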
In this article, we discussed why tuning hyperparameters can be a hassle, and we looked at techniques widely used by Machine Learning practitioners to tackle this problem: trying randomly chosen values instead of a grid, zooming in from coarse to fine, and sampling values on the right scale.