What is an activation function? What are commonly used activation functions?


Activation functions are the non-linearities introduced into a neural network architecture to enable it to learn complex patterns. Without them, a neural network is essentially a linear model, no matter how many layers it has. Many non-linear activation functions already exist, and researchers continue to look for better ones that help networks converge faster or require fewer layers. Here, we list the widely used activation functions with their properties and problems.

Fig 1: Activation Functions Graph

1. Sigmoid Activation Function: The sigmoid is one of the classic activation functions in deep learning, largely because its derivative takes a very simple form during backpropagation. As Fig 1 shows, it has the following properties:

(a.) Its output lies in the range (0, 1).

(b.) It is not zero-centered.

(c.) It involves an exponential operation, which is computationally expensive.

The main problem is saturated gradients: since the function squashes its input into (0, 1), the output becomes nearly constant for inputs of large magnitude, and the gradient flowing through the layer shrinks toward zero. In simpler words, once a sigmoid layer saturates and produces an almost constant output, its parameters barely change during gradient descent.
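For reference, the sigmoid and its derivative are shown below; the derivative is at most 1/4 and decays to zero for large |x|, which is exactly the saturation problem described above.

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4} $$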

2. Hyperbolic Tangent Activation Function (tanh): The hyperbolic tangent has the following properties:

(a.) Its output lies in the range (-1, 1).

(b.) It is zero-centered.

Because tanh is zero-centered, it avoids a problem that sigmoid creates: when a layer's inputs are all positive (as they are after a sigmoid), the gradients on that layer's weights end up either all positive or all negative, which leads to inefficient, zig-zagging weight updates. This makes tanh preferable to sigmoid in many cases, but it still suffers from the saturated-gradients problem.
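For reference, tanh can be written as a scaled and shifted sigmoid, and its derivative again goes to zero as |x| grows, which is why it still saturates:

$$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1, \qquad \frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x) $$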

3. Rectified Linear Unit Activation Function (ReLU): ReLU is the most commonly used activation function because of its simple backward pass and low computational cost. It has the following properties:

(a.) It does not saturate in the positive region.

(b.) In practice, it converges faster than saturating activations such as sigmoid and tanh.

However, ReLU can suffer from the dead ReLU problem. For example, if
$$ w > 0,\; x < 0 \implies \mathrm{ReLU}(w \cdot x) = \max(0,\, w \cdot x) = 0. $$
In this case the forward pass outputs 0, and the gradient flowing back through the unit is also 0, so its weights stop being updated and no learning occurs.
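The sketch below is a minimal NumPy illustration of this case; the scalar values chosen for w and x are arbitrary, picked only to satisfy w > 0 and x < 0:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # (Sub)gradient of ReLU: 1 where z > 0, otherwise 0
    return np.where(z > 0, 1.0, 0.0)

w, x = 1.5, -2.0                 # w > 0 and x < 0, so w * x < 0
z = w * x

out = relu(z)                    # forward pass: 0.0
dloss_dw = relu_grad(z) * x      # chain rule for d ReLU(w * x) / dw: also zero
print(out, dloss_dw)             # both are zero, so this unit neither fires nor learns
```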

4. Leaky ReLU: Leaky ReLU is an improvement over the ReLU activation function. It keeps the properties of ReLU while avoiding the dead ReLU problem, because negative inputs still produce a small, non-zero output and gradient.

Choosing a different slope for the negative region yields different variants of Leaky ReLU (for example, Parametric ReLU learns this slope during training).
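A common formulation, with a small fixed negative slope (a typical default is α = 0.01), is:

$$ \mathrm{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} $$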

5. ELU (Exponential Linear Unit): ELU is another variation of ReLU that replaces the hard zero for x < 0 with a smooth exponential curve. It shares the benefits of ReLU, along with:

(a.) No dead ReLU problem.

(b.) Outputs whose mean is closer to zero than with Leaky ReLU.

(c.) More computation, because of the exponential function.
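The standard form, where α is usually set to 1, is:

$$ \mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases} $$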

6. Maxout: Maxout was introduced in 2013. It takes the maximum of several linear functions, so it is piecewise linear: it never saturates and never dies, but it is expensive because it doubles the number of parameters per neuron.
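With two linear pieces (which is what doubles the parameter count), the maxout unit computes:

$$ \mathrm{Maxout}(x) = \max\bigl(w_{1}^{\top}x + b_{1},\; w_{2}^{\top}x + b_{2}\bigr) $$

ReLU is recovered as the special case $w_{1} = 0,\; b_{1} = 0$.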

In practice, most neural networks use ReLU or one of its variants because it is simple and cheap to compute in both the forward and backward passes. In some cases, though, other activation functions give better results: sigmoid is used as the last layer when outputs should be squashed into (0, 1), and tanh is commonly used inside RNNs and LSTMs.
