Why don't we use the sigmoid activation function for all layers?


The sigmoid activation function is an exponential form of non-linearity that produces strictly positive outputs. As shown in the figure below, its values lie in the open interval (0, 1).

Fig 1: Sigmoid graph

The sigmoid activation function is widely used in the last layer of a neural network because its output range of (0, 1) can be interpreted as a probability. However, the sigmoid suffers from saturation. Mathematically, the function is

$$ \sigma(wx+b) = \frac{1}{1+e^{-(wx+b)}} $$

The sigmoid also has a convenient closed-form derivative:
$$ \sigma'(x) = \sigma(x)\,(1-\sigma(x)) $$
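
As a quick check of this property, here is a minimal NumPy sketch (the function names `sigmoid` and `sigmoid_grad` are my own, chosen for illustration) that evaluates $\sigma$ and $\sigma'$ at a few points:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid, via sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(z))       # values squashed into (0, 1)
print(sigmoid_grad(z))  # peaks at 0.25 when z = 0, nearly 0 for large |z|
```

The derivative peaks at 0.25 at $z = 0$ and falls off rapidly as $|z|$ grows, which is exactly the behaviour analysed in the two cases below.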

This leads to two limiting cases:

  • If $ (wx+b) \to \infty $, the sigmoid saturates at its upper bound:
    $$ \sigma(wx+b) = \frac{1}{1+e^{-\infty}} = 1 $$
    and, using the derivative above,
    $$ \sigma'(wx+b) = 1 \cdot (1-1) = 0 $$

  • If $ (wx+b) \to -\infty $, the sigmoid saturates at its lower bound:
    $$ \sigma(wx+b) = \frac{1}{1+e^{\infty}} = 0 $$
    and similarly,
    $$ \sigma'(wx+b) = 0 \cdot (1-0) = 0 $$

In both cases the output saturates and the gradient shrinks toward zero. Such small gradients lead to very slow convergence toward the optimum when training with the gradient descent algorithm.
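
To make the saturation concrete, here is a small sketch of a single sigmoid unit $\sigma(wx+b)$ (the weight and bias values are arbitrary, chosen only for illustration), printing $\sigma'(wx+b)$, the factor every backpropagated gradient gets multiplied by:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit sigma(w*x + b); w and b are arbitrary illustrative values.
w, b = 5.0, 1.0

for x in [-20.0, -2.0, 0.0, 2.0, 20.0]:
    z = w * x + b
    s = sigmoid(z)
    grad = s * (1.0 - s)  # sigma'(z): the factor applied to the backpropagated gradient
    print(f"x={x:+6.1f}  z={z:+7.1f}  sigma(z)={s:.6f}  sigma'(z)={grad:.2e}")
```

Once a unit saturates, each sigmoid layer multiplies the backpropagated signal by a near-zero factor, and stacking sigmoids in every layer compounds this shrinkage. That is why the sigmoid is usually reserved for the output layer, where its probabilistic interpretation is useful, rather than used throughout the network.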