Deep Learning

Insights into How to Tune Your Hyperparameters

A guide to tuning your hyperparameters to get the best accuracy.

This article is a continuation of a series that focuses on a basic understanding of the building blocks of Deep Learning. In case you need to catch up, some of the previous articles are:

Deep Learning is the branch of Artificial Intelligence where we let the model learn features on its own to get to a result. We don’t hard-code any logic or algorithm; the model automatically tries different relationships between features and chooses the set of relationships that best supports the right prediction.

Now, we don’t actually let the model try blindly; we have a handful of hyperparameters. To explain hyperparameters, take the example of a child playing on an iPad. You cannot control what he or she plays or watches on it, but you certainly can control how much time the kid spends on it. In a similar way, we tune these hyperparameters to control how the model weighs the features on the basis of which it produces its predictions.

When working with Deep Learning, and Machine Learning in general, you need to take care of a lot of hyperparameters, and tuning them can be a huge hassle.

Hyperparameters

This post applies to any hyperparameter. The most common ones are (a minimal configuration sketch follows this list):

  • Learning rate: α
  • Momentum: β
  • Adam optimizer parameters: β1, β2, ε
  • Number of layers
  • Number of hidden units
  • Learning rate decay
  • Mini-batch size
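
To make these concrete, here is a minimal sketch of how such hyperparameters might be gathered into a single configuration before training; the names and values are purely illustrative, not recommendations:

# Illustrative hyperparameter configuration (example values only)
hyperparams = {
    "learning_rate": 0.001,        # α
    "momentum": 0.9,               # β
    "adam_betas": (0.9, 0.999),    # β1, β2
    "adam_epsilon": 1e-8,          # ε
    "num_layers": 4,
    "hidden_units": 128,
    "learning_rate_decay": 0.95,
    "mini_batch_size": 64,
}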

Learning Rate

Alpha, the learning rate, is the most important among these. Next in importance come beta (momentum), the number of hidden units, and the mini-batch size.

How do you choose these?

In the early days of Machine Learning, people used a grid of values to choose and try different combinations of these hyperparameters.

[Image by Author: a grid of hyperparameter values]

This approach is fine as long as we only have a small number of hyperparameters to tune, but with a larger number of hyperparameters, an exhaustive grid takes far more time than necessary.
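
As a rough sketch of why the grid blows up, assuming just two hyperparameters with a handful of candidate values each (the names and ranges here are illustrative assumptions, not recommendations):

import itertools

# Candidate values for two hyperparameters (illustrative only)
learning_rates = [0.1, 0.01, 0.001]
mini_batch_sizes = [32, 64, 128]

# Grid search: every combination would be trained and evaluated
for lr, batch_size in itertools.product(learning_rates, mini_batch_sizes):
    print(f"would train with learning_rate={lr}, mini_batch_size={batch_size}")

With k candidate values per hyperparameter and n hyperparameters, that is k^n training runs, which is why the grid quickly becomes impractical.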

What then?

We try randomly chosen values! There’s a basic reason to do so: you don’t know in advance which hyperparameter, or which value of it, will work better for the type of problem you’re trying to solve, and random sampling lets you explore many more distinct values of each one.
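
A minimal sketch of random search, reusing the illustrative hyperparameters from the grid example above (the ranges and trial count are assumptions for demonstration):

import numpy as np

rng = np.random.default_rng(0)

# Random search: sample each hyperparameter independently for every trial
for trial in range(10):
    lr = rng.uniform(0.0001, 1.0)                     # we revisit this scale choice below
    batch_size = int(rng.choice([32, 64, 128, 256]))
    print(f"trial {trial}: learning_rate={lr:.5f}, mini_batch_size={batch_size}")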

Coarse to fine

Another good approach is the “coarse to fine” scheme.

The idea is this: suppose we have a randomly sampled grid of values and, after getting some results, we find that values in a particular region perform better than the others. We then zoom into that region and sample it more densely, eventually arriving at the best set of hyperparameters!

[Images by Author: the sampled region before and after zooming in]
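
A minimal sketch of the coarse-to-fine idea, assuming a hypothetical score() function that stands in for training the model and returning its validation accuracy:

import numpy as np

rng = np.random.default_rng(0)

def score(lr):
    # Hypothetical stand-in for "train with this learning rate, return validation accuracy"
    return -(np.log10(lr) + 2.5) ** 2  # pretends a learning rate near 3e-3 works best

# Coarse pass: sample learning rates broadly and keep the best few
coarse = [10 ** rng.uniform(-4, 0) for _ in range(20)]
best_coarse = sorted(coarse, key=score, reverse=True)[:3]

# Fine pass: zoom into the region spanned by the best coarse values
lo, hi = min(best_coarse), max(best_coarse)
fine = [rng.uniform(lo, hi) for _ in range(20)]
print(f"best learning rate found: {max(fine, key=score):.5f}")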

But on what scale are you tuning your hyperparameters?

While randomly sampling values in the tuning process, it’s important to search on the right scale.

[Image by Author: sampling the learning rate on a linear scale (upper) vs. a logarithmic scale (lower)]

In the upper one, sampling uniformly on a linear scale places about 90% of the values between 0.1 and 1; in the lower one, sampling on a logarithmic scale gives each order of magnitude its fair share.
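
To see this numerically, here is a small sketch (the sample count is arbitrary) comparing where uniformly drawn and log-uniformly drawn learning rates land:

import numpy as np

rng = np.random.default_rng(0)
linear = rng.uniform(1e-4, 1, size=10_000)            # linear scale
logarithmic = 10 ** rng.uniform(-4, 0, size=10_000)   # logarithmic scale

# Fraction of samples falling between 0.1 and 1 on each scale
print((linear >= 0.1).mean())       # roughly 0.90
print((logarithmic >= 0.1).mean())  # roughly 0.25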

Implementing

import numpy as np

r = -4 * np.random.rand()  # r is uniform between -4 and 0
alpha = 10 ** r            # alpha is between 10^-4 and 1

Here ‘np’ is NumPy. r will take a value anywhere between -4 and 0, so alpha will land anywhere between 10^-4 and 1, spread evenly across the orders of magnitude.
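
Putting it together, a few trials of random search on this logarithmic scale might look like the following sketch (the loop and trial count are just illustrative):

import numpy as np

for trial in range(5):
    r = -4 * np.random.rand()  # uniform between -4 and 0
    alpha = 10 ** r            # log-uniform between 10^-4 and 1
    print(f"trial {trial}: alpha = {alpha:.5f}")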

Conclusion

In this article, we discussed why tuning hyperparameters over an exhaustive grid can waste a lot of effort, and we covered the techniques widely used by Machine Learning practitioners to tackle this problem: random search, the coarse-to-fine scheme, and sampling values on an appropriate (logarithmic) scale.

Contacts

If you want to keep up to date with my latest articles and projects, follow me on Medium. These are some of my contact details:

Happy Learning. :)

