Hello everyone! This is my fourth post since I started my Nanodegree, and I have successfully completed the second module, building a neural network from scratch, but more on that later. First, let’s discuss today’s learnings.
Today’s main topic was overfitting vs. underfitting. Overfitting is when a model fits the training data set well but fails to carry that accuracy over to test data, i.e. data it has not seen. Underfitting happens when a model is too simple to capture the complexity of the data set, so it does poorly even on the training data. The sad part is that there is no straightforward way to find the best model, the one that avoids both of these phenomena.
To give an example of underfitting: it’s like trying to kill a bear with a mosquito killer. See? The solution is not enough. Overfitting, on the other hand, is when you try to kill a mosquito with a bazooka. See what happened there? You used a complex solution for a rather simple problem.
Regularization is a technique used to avoid overfitting. In this process, we change the error function by adding either the sum of the absolute values of the weights times a constant, or the sum of the squared values of the weights times a constant. The constant is called lambda.
What is the difference between adding squared weights and absolute values, you might ask. We call the process L1 regularization when we add the absolute values of the weights. This is useful when we are aiming to select features, because it tends to push small weights all the way to zero, which effectively drops the features attached to them. A sparser model like this can save us a ton of computing power. The other type is L2 regularization, in which we sum the squares of the weights. This type is useful when we mainly care about the accuracy of the model, because it tends to keep all the weights small rather than setting some of them to zero.
The effect of lambda is that if we set it to a large value, we penalize large weights heavily, and if we keep lambda small, we penalize them only lightly.
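To make the changed error function concrete, here is a minimal sketch in plain Python. The function names (`l1_penalty`, `l2_penalty`, `regularized_error`), the parameter name `lam`, and the sample numbers are my own assumptions for illustration; they are not taken from the course material.

```python
def l1_penalty(weights, lam):
    """Lambda times the sum of the absolute values of the weights (L1)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Lambda times the sum of the squared values of the weights (L2)."""
    return lam * sum(w * w for w in weights)


def regularized_error(base_error, weights, lam, kind="l2"):
    """Add the chosen penalty to an error value that was already computed."""
    if kind == "l1":
        return base_error + l1_penalty(weights, lam)
    return base_error + l2_penalty(weights, lam)


# Hypothetical weights and base error, chosen only for illustration.
weights = [18.0, 0.5, -0.25, 1.0]
for lam in (0.001, 0.01, 0.1):
    print(lam,
          regularized_error(0.3, weights, lam, kind="l1"),
          regularized_error(0.3, weights, lam, kind="l2"))
```

Running it with a few values of `lam` shows the penalty growing with lambda, and the one large weight dominating the L2 penalty because it gets squared.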
Dropout is another way to help our model generalize. The dropout value is the probability that each node is switched off for a period of time. Let’s suppose we set it to 0.2; this means that every node has a 20% chance of being ignored in one epoch. The selection of the nodes that are ignored in an epoch is completely random, to avoid a biased model.
To understand its benefit, suppose we have a set of input nodes. The first node has a weight of 18 and the other three have weights between 0 and 1. The next layer will then be influenced almost entirely by the node with the largest weight, because an input of 1 on that node adds 18 to the result, while the same input on the others adds at most 1. As a result, the model fails to learn a solution that takes all of the data into account. This is where dropout comes in. Once it is set, nodes randomly switch off during training, including the dominant one, so the other nodes get a chance to have an effect.
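Here is a rough sketch of the random switching off, using only the standard library. The function name `dropout`, the sample values, and the fixed seed are assumptions of mine. The rescaling of the surviving values (often called "inverted dropout") is also something I added; the description above only covers the random switching off.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each value independently with probability drop_prob.

    Surviving values are divided by (1 - drop_prob) so the expected total
    stays the same. This rescaling is an added assumption, not something
    the description above requires.
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [18.0, 0.4, 0.9, 0.1]  # hypothetical node outputs
for pass_number in range(3):
    print(pass_number, dropout(values, drop_prob=0.2, rng=rng))
```

Each call picks a fresh random set of nodes to ignore, so over many passes every node, including the dominant one, gets switched off some of the time.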
The Vanishing Gradient
This is another problem that arises when using gradient descent. Let’s recap what gradient descent is: we repeatedly take small steps in the direction that reduces the error the most, until we get close to the desired output. To understand this problem, let’s also recap the sigmoid activation function.
Looking at the graph of the sigmoid function, it is evident that when the values of ‘x’ are very large, i.e. above 4, or very small, i.e. below -4, the gradient of the curve becomes almost 0.
So remember that the gradient approaches zero there. Also recall that the steps we take to minimize the error are proportional to the gradient of the curve at a particular x. Now, if we have either very big or very small values, the steps we take become smaller and smaller, and the model needs more time to train. This is the problem: with extreme values we can only move in tiny steps, and this happens pretty often.
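A quick way to see this is to compute the gradient of the sigmoid at a few points. This is a small sketch with function names I picked myself (`sigmoid`, `sigmoid_gradient`); it relies on the standard identity that the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)).

```python
import math


def sigmoid(x):
    """The sigmoid activation: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    """Derivative of the sigmoid, using s * (1 - s) where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as x moves away from it.
for x in (-6, -4, 0, 4, 6):
    print(x, sigmoid_gradient(x))
```

The value at 0 is the largest the gradient ever gets, and the values at 4 and -4 are already many times smaller, which is exactly why the steps shrink.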
There are a number of ways to tackle this problem, but the easiest is to change the activation function to ReLU.
Rectified Linear Unit (ReLU)
This activation function is pretty simple. For negative values it returns 0, and for positive values it returns the value itself. Believe it or not, this subtle change can bring a drastic change in the outcome of the model. Why is it useful, you might ask. It is useful because its derivative is 1 for every positive input (and 0 for negative inputs), so the gradient does not shrink for large values the way it does with sigmoid. Hence, training stays pretty balanced with this activation function.
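In code, the function and its derivative are only a couple of lines. The names `relu` and `relu_gradient` are my own, and treating the derivative at exactly 0 as 0 is a convention I am assuming here.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return x if x > 0 else 0.0


def relu_gradient(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise.

    The value at exactly x = 0 is a convention; 0 is assumed here.
    """
    return 1.0 if x > 0 else 0.0


for x in (-4.0, -0.5, 0.5, 4.0, 100.0):
    print(x, relu(x), relu_gradient(x))
```

Notice that the gradient is the same for 0.5 and for 100.0, so large positive inputs do not slow the steps down.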
Stochastic Gradient Descent
This is a simple technique in which the training set is divided into batches, and each batch goes through the neural network in turn: the feedforward pass, the backpropagation, and the weight update. Each step is less accurate than one computed on all the data at once, but after all the batches we have taken many steps and usually end up in a similarly good place. The advantage is that this saves computing power, since processing small batches is easier for the computer than training on one huge set.
Let’s discuss another problem that can come up while using gradient descent: local minima. The algorithm keeps decreasing the error, but in doing so it can get trapped in a local minimum without knowing it. There, the model is okay but does not reach the best accuracy it could.
To get over this problem, we use momentum. It is essentially the process of using the previously taken steps to determine the next move. We multiply the previous steps by increasing powers of a constant between 0 and 1, so that older steps matter less and less. The steps taken while falling into a local minimum are usually long leaps. Because of these long steps, the next step is bigger too, and that can carry us over the hump of the local minimum.
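The weighting of older steps can be written out directly. In this sketch the constant is called `beta`, and the function names and step sizes are hypothetical. The second function is the usual running form, which gives the same result without storing every past step.

```python
def momentum_step(steps_newest_first, beta):
    """Weighted sum of steps: the k-th older step is multiplied by beta**k."""
    return sum(step * beta ** k for k, step in enumerate(steps_newest_first))


def running_momentum(steps_oldest_first, beta):
    """Same quantity, built up one step at a time."""
    velocity = 0.0
    for step in steps_oldest_first:
        velocity = beta * velocity + step
    return velocity


# Hypothetical step sizes: long leaps on the way down, a tiny step at the bottom.
steps_oldest_first = [1.0, 0.8, 0.5, 0.1]
steps_newest_first = list(reversed(steps_oldest_first))

print(momentum_step(steps_newest_first, beta=0.5))
print(running_momentum(steps_oldest_first, beta=0.5))
```

Even though the latest step on its own is tiny, the combined step is several times bigger because the earlier long leaps still contribute.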
This is what I could cover today. I also finished the first project, Predicting Bike Sharing Patterns, and it wasn’t that difficult. I followed the notes I made while taking the lectures, and the project went smoothly. That’s it for this post. See you in the next one.