Everything you need to know about activation functions in deep learning!
What is an Activation Function?
Activation functions play an important role in a neural network. The activation function determines the output of each neuron, and the choice affects the network's accuracy and the computational efficiency of training.
Many activation functions normalize a neuron's output to a range such as 0 to 1 or -1 to 1. Because they are differentiable (at least almost everywhere), they make backpropagation possible. During backpropagation, the weights are updated using the gradient of the loss function, and the derivative of the activation function is part of that gradient, which helps gradient descent move toward a local minimum of the loss.
Neural networks use non-linear activation functions, which help the network learn complex patterns in data, approximate almost any function mapping inputs to outputs, and provide accurate predictions.
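To see why the non-linearity matters, here is a minimal sketch in plain Python. The one-input "layers" and their weights are made up for illustration. It shows that two stacked linear layers collapse into a single linear layer, while putting a ReLU between them does not.

```python
def linear(weight: float, bias: float):
    """Return a one-input linear layer x -> weight * x + bias."""
    return lambda x: weight * x + bias


def relu(x: float) -> float:
    return x if x > 0 else 0.0


# Hypothetical weights, chosen only for this illustration.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)


def stacked(x: float) -> float:
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))


# -3 * (2x + 1) + 0.5 simplifies to -6x - 2.5, a single linear layer.
collapsed = linear(-6.0, -2.5)


def stacked_with_relu(x: float) -> float:
    """The same two layers with a ReLU in between."""
    return layer2(relu(layer1(x)))


for x in (-2.0, 0.0, 3.5):
    print(x, stacked(x), collapsed(x), stacked_with_relu(x))
```

The first two printed columns always agree, so depth adds nothing without an activation. The third column differs for inputs where the first layer's output is negative.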
In this article, I’ll discuss the various types of activation functions present in a neural network.
1. Sigmoid function
Sigmoid, also known as the logistic function, is a non-linear activation function with the formula y = 1 / (1 + exp(-x)). It is continuous and monotonic. The output is normalized to the range 0 to 1. It is differentiable and gives a smooth gradient curve. Sigmoid is mostly used in the output layer for binary classification.
Advantages of the sigmoid function:
- Smooth gradient, preventing “jumps” in output values.
- Output values bound between 0 and 1, normalizing the output of each neuron.
Major disadvantages of the sigmoid function:
- Prone to vanishing gradients, because the derivative approaches 0 for inputs of large magnitude
- Function output is not zero-centered
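As a rough sketch of the points above, the snippet below computes the sigmoid and its derivative using only Python's standard `math` module. The function names are my own, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for x = 0 and is close to zero at x = ±10, which is the vanishing gradient behaviour listed among the disadvantages.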
2. tanh function
The hyperbolic tangent activation function has outputs ranging from -1 to 1, and its derivative lies between 0 and 1. It is zero-centered, which often makes it perform better than sigmoid in hidden layers. It is commonly used in the hidden layers of binary classifiers.
The advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped near zero in the tanh graph.
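Here is a small sketch of tanh and its derivative using the standard library's `math.tanh`. The helper name `tanh_grad` is my own.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which is at most 1."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-2.0, 0.0, 2.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):+.4f}  gradient={tanh_grad(x):.4f}")
```

The outputs are symmetric around zero, and the gradient is largest (exactly 1) at x = 0 and shrinks as the input moves away from zero.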
3. ReLU function
The Rectified Linear Unit (ReLU) is the most widely used activation function in the hidden layers of deep learning models. The formula is simple: if the input is positive, that value is returned; otherwise 0 is returned, i.e. y = max(0, x). The derivative is also simple: 1 for positive values and 0 otherwise (the function is constant at 0 there, so its derivative is 0). Because the gradient does not shrink for positive inputs, ReLU mitigates the vanishing gradient problem. The range is 0 to infinity.
Advantages of the ReLU function:
- When the input is positive, there is no gradient saturation problem.
- The calculation is much faster than sigmoid or tanh, since no exponential has to be evaluated.
- The ReLU function is piecewise linear: linear for positive inputs and constant at 0 otherwise.
Disadvantages of the ReLU function:
- ReLU is not a zero-centered function.
- When the input is negative, ReLU is completely inactive. In forward propagation this is not a problem, but in backpropagation the gradient for a negative input is exactly zero, so the weights feeding that neuron receive no update. A neuron whose input stays negative for every example stops learning (the "dying ReLU" problem), which resembles the saturation problem of the sigmoid and tanh functions.
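A minimal sketch of ReLU and its derivative in plain Python follows. The derivative at exactly x = 0 is undefined, so returning 0 there is a convention I am assuming.

```python
def relu(x: float) -> float:
    """Return x for positive inputs and 0 otherwise."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """1 for positive inputs, 0 otherwise (0 at x == 0 by assumed convention)."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, -0.5, 0.0, 2.0):
    print(f"x={x:5.1f}  relu={relu(x):.1f}  gradient={relu_grad(x):.1f}")
```

For every negative input both the output and the gradient are zero, so no learning signal flows back through that neuron for that example.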
4. Leaky ReLU function
Leaky ReLU is a slight variation of ReLU. For positive values it is the same as ReLU and returns the input unchanged; for other values it returns the input multiplied by a small constant, typically 0.01. This is done to solve the dying ReLU problem. The derivative is 1 for positive values and 0.01 otherwise.
Leaky ReLU keeps the advantages of ReLU, and because its gradient is never exactly zero, it avoids the dead ReLU problem.
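The same idea as a sketch, with the slope for negative inputs exposed as a parameter. The name `negative_slope` is my own choice, and 0.01 is the value mentioned above.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Return x for positive inputs and negative_slope * x otherwise."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """1 for positive inputs, negative_slope otherwise."""
    return 1.0 if x > 0 else negative_slope


for x in (-3.0, 0.0, 2.0):
    print(f"x={x:5.1f}  leaky_relu={leaky_relu(x):+.3f}  gradient={leaky_relu_grad(x):.2f}")
```

Unlike plain ReLU, the gradient for a negative input is small but non-zero, so the neuron can still recover.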
5. ELU (Exponential Linear Units) function
The Exponential Linear Unit (ELU) overcomes the dying ReLU problem. It is quite similar to ReLU except for negative values. The function returns the input if it is positive; otherwise it returns alpha * (exp(x) - 1), where alpha is a positive constant. The derivative is 1 for positive values and alpha * exp(x) for negative values. The range is -alpha to infinity. Its outputs are closer to zero-centered than those of ReLU.
- No Dead ReLU issues
- The mean of the output is close to 0, zero-centered
Problem with ELU: it is slightly more computationally intensive, because an exponential has to be evaluated for negative inputs.
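A sketch of ELU and its derivative follows, assuming alpha = 1.0 as the default (an illustrative choice, not something fixed by the definition). It uses `math.expm1`, which computes exp(x) - 1 accurately for small x.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Return x for positive inputs and alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """1 for positive inputs, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-10.0, -1.0, 0.0, 2.0):
    print(f"x={x:6.1f}  elu={elu(x):+.4f}  gradient={elu_grad(x):.4f}")
```

For large negative inputs the output flattens out near -alpha instead of being cut off at 0, which is why the outputs can average closer to zero than with ReLU.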
6. PReLU (Parametric ReLU)
The Parametric Rectified Linear Unit is another variation of ReLU and Leaky ReLU, with negative values computed as alpha * input. Unlike Leaky ReLU, where alpha is fixed at 0.01, in PReLU the value of alpha is learned through backpropagation along with the other weights, so the network can find the slope that fits the data best.
- if aᵢ=0, f becomes ReLU
- if aᵢ is a small fixed positive constant, f becomes leaky ReLU
- if aᵢ is a learnable parameter, f becomes PReLU
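To make "alpha is learnt through backpropagation" concrete, here is a toy sketch with a single input and a squared-error loss. The input, target, starting alpha, and learning rate are all made-up values for illustration; a real network would learn alpha together with its weights.

```python
def prelu(x: float, alpha: float) -> float:
    """Return x for positive inputs and alpha * x otherwise."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x: float) -> float:
    """Derivative of prelu with respect to alpha: 0 for positive x, x otherwise."""
    return 0.0 if x > 0 else x


# Hypothetical toy problem: we want prelu(-2.0, alpha) to equal -0.5,
# which is the case when alpha == 0.25.
x, target = -2.0, -0.5
alpha = 0.01          # start from the Leaky ReLU value
learning_rate = 0.1

for _ in range(50):
    error = prelu(x, alpha) - target
    # Chain rule for loss = error^2: d(loss)/d(alpha) = 2 * error * d(prelu)/d(alpha)
    alpha -= learning_rate * 2.0 * error * prelu_grad_alpha(x)

print(f"learned alpha = {alpha:.4f}")
```

Gradient descent moves alpha from 0.01 toward 0.25, the slope that matches the target. For a positive input the gradient with respect to alpha is 0, so only negative inputs update it.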
7. Swish (A Self-Gated) Function
Swish is a smooth activation function related to ReLU. It is self-gated, since it requires only the input and no other parameter: the input is multiplied by its own sigmoid. The formula is y = x * sigmoid(x). The gating idea is borrowed from the sigmoid gates used in LSTMs. It is closer to zero-centered than ReLU and avoids the dead activation problem. Its smoothness helps generalisation and optimisation.
y = x * sigmoid (x)
Disadvantage: it is more computationally expensive than ReLU, and its reported gains over ReLU show up mainly in deep networks (roughly 40 layers or more).
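A sketch of Swish in plain Python, with a small overflow-safe sigmoid helper included so the block stands on its own. The function names are my own.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Self-gated activation: the input multiplied by its own sigmoid."""
    return x * sigmoid(x)


for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:5.1f}  swish={swish(x):+.4f}")
```

For large positive inputs Swish behaves almost like the identity, and for large negative inputs it approaches 0. In between it dips slightly below zero, so unlike ReLU it is not monotonic.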
8. Softmax function
The softmax activation function converts a vector of raw scores (logits) into probabilities, one per class. These probabilities are used to pick the target class: the final output is the class with the highest probability. The probabilities always sum to 1. Softmax is mostly used in the output layer for classification problems, especially multiclass classification.
Note that a softmax layer applied directly to the inputs, with no hidden layers, can only learn linear decision boundaries, so on its own it will not work for data that is not linearly separable.
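Here is a sketch of softmax over a plain Python list. Subtracting the maximum score before exponentiating is a common trick I am adding to avoid overflow; it does not change the result. The example scores are made up.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Shifting by the maximum keeps exp() from overflowing.
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class index:", predicted)
```

The class with the largest raw score also gets the largest probability, so the predicted class is simply the index of the maximum.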
9. Softplus function
The derivative of ReLU is not defined at 0, because the function has a kink there, and other piecewise functions such as Leaky ReLU share this issue. The softplus activation function overcomes it by being smooth everywhere. The formula is y = ln(1 + exp(x)). It is similar to ReLU but smoother in nature. It ranges from 0 to infinity.
Disadvantage: because it is smooth and unbounded above, softplus can let activations grow very large.
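A sketch of softplus follows, using the rearranged form max(x, 0) + ln(1 + exp(-|x|)). This equals ln(1 + exp(x)) mathematically, and I use it here because it does not overflow for large inputs. The derivative of softplus is the sigmoid function.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + exp(x)), rearranged so exp() never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    """Derivative of softplus, which is the logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-5.0, 0.0, 5.0, 20.0):
    print(f"x={x:5.1f}  softplus={softplus(x):.5f}  gradient={softplus_grad(x):.5f}")
```

At x = 0, where ReLU has its kink, softplus has a well-defined gradient of 0.5. For large positive inputs it is almost identical to ReLU.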
Figure: Activation functions through dance moves.
This blog is just to help you get started with the various activation functions. If you have any queries, feel free to leave them in the comments.