What is Cross Entropy Loss?

Last updated on May 7, 2022, 1:12 a.m.

The Cross-Entropy Loss function is used as a classification loss function. It is also known as Log Loss. It measures the performance of a model whose output is a probability value in [0, 1]. One of the main reasons to use this loss is that cross-entropy is the negative log-likelihood of a Bernoulli (or categorical) model, a member of the exponential family, so for a linear model such as logistic regression it is convex in the weights.

Still, for a multilayer neural network with inputs x, weights w, and output y, the cross-entropy loss L is not convex in the weights, because of the non-linearities (activation functions) added at each layer.

Fig 1: Cross Entropy Loss Function graph for binary classification setting

Cross Entropy Loss Equation

Mathematically, for a binary classification setting, cross-entropy is defined by the following equation:

$$ \text{CE Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \, y_i \log(p_i) + (1-y_i) \log(1-p_i) \, \right] $$

Here $y_i$ is the binary indicator (0 or 1) denoting the true class of sample $i$, and $p_i$ is the predicted probability (between 0 and 1) that sample $i$ belongs to the positive class.
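As a quick sanity check of this formula, here is a minimal NumPy sketch (the function name binary_cross_entropy and the sample values are purely illustrative) that computes the mean loss over a handful of predictions:

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip probabilities so that log(0) is never evaluated.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])          # true labels y_i
p = np.array([0.9, 0.1, 0.6, 0.4])  # predicted probabilities p_i
print(binary_cross_entropy(y, p))   # ~0.308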

Cross Entropy Derivative

Differentiating the cross-entropy loss with respect to the logit $z_t$ (the pre-activation input to the output unit) and applying the chain rule gives
$$ \frac{\partial\, \text{CE Loss}}{\partial z_t} = -\sum_{i} y_i \frac{\partial \log p_i}{\partial z_t} = -\sum_{i} \frac{y_i}{p_i} \frac{\partial p_i}{\partial z_t} $$

To ensure the output lies in the range [0, 1], the last layer applies the sigmoid function in the binary case (softmax in the multi-class case), so that

$$ p = \sigma(z) = \frac{1}{1+e^{-z}}, \qquad z = wx + b $$ and,
$$ \frac{\partial p}{\partial z} = p(1-p) $$

Substituting this derivative and simplifying, we obtain
$$ \frac{\partial\, \text{CE Loss}}{\partial z_t} = p_t - y_t $$
(one more application of the chain rule gives the gradient with respect to the weights, $\frac{\partial\, \text{CE Loss}}{\partial w} = (p_t - y_t)\,x$).
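This result is easy to verify numerically with finite differences in the binary (sigmoid) case; the sketch below is illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(z, y):
    # Binary cross-entropy expressed as a function of the logit z.
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6
numerical = (ce_loss(z + eps, y) - ce_loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y           # p - y
print(numerical, analytic)          # both ~ -0.332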

Other properties of Cross-Entropy Loss:

  1. Cross-entropy loss increases as the predicted probability diverges from the actual label. For example, if the model assigns a probability of 0.01 to the true class, that is a bad prediction and results in a high loss value. A perfect model would have a loss of 0.

  2. The graph above shows the range of possible loss values given a true observation. As the predicted probability approaches 1, the log loss slowly decreases; as the predicted probability decreases, however, the log loss increases rapidly. Log loss penalizes both types of errors, but especially predictions that are confident and wrong, as the short check below illustrates.
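To make the second point concrete, the snippet below (with purely illustrative probability values) prints the loss for a confident correct prediction versus a confident wrong one, both with a true label of 1:

from math import log

confident_right = -log(0.95)   # ~0.05
confident_wrong = -log(0.01)   # ~4.61
print(confident_right, confident_wrong)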

Cross-entropy and log loss are defined slightly differently depending on the context, but in machine learning, when calculating error rates between 0 and 1, they resolve to the same thing.
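As a sketch of that equivalence (assuming scikit-learn is available; log_loss is its name for this metric), the manual formula and the library function give the same number:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.6, 0.4])

# Manual binary cross-entropy, as in the equation above.
manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(manual, log_loss(y_true, y_pred))  # both ~0.308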

Python Implementation

from math import log

def CrossEntropy(yHat, y):
    # Binary cross-entropy for a single prediction yHat, given the true label y (0 or 1).
    if y == 1:
        return -log(yHat)
    else:
        return -log(1 - yHat)
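A short usage example, averaging the per-sample loss over a small batch (the values are illustrative):

preds  = [0.9, 0.1, 0.6, 0.4]
labels = [1, 0, 1, 0]

losses = [CrossEntropy(p, y) for p, y in zip(preds, labels)]
print(sum(losses) / len(losses))  # ~0.308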