What loss functions can be used for regression? Which one is better for outliers?

Medium Last updated on May 7, 2022, 1:22 a.m.

Regression modeling is a form of predictive modeling used to estimate the relationship between two or more variables. To train a regression model, we need a loss function that measures how far predictions deviate from the actual values. Some commonly used regression loss functions are:
1. Mean Squared Error (MSE).
2. Mean Absolute Error (MAE).
3. Root Mean Squared Error (RMSE).
4. Mean Absolute Percentage Error (MAPE).
5. Huber Loss.
6. Log-Cosh Loss.
7. Poisson Loss.

Fig 1: Plot of Mean Squared Error. Source: https://arxiv.org/abs/2101.10427

1. Mean Squared Error (MSE): Mean Squared Error is the most commonly used regression loss. Here, we calculate the square of the error and then take its mean. As MSE is quadratic, the penalty is proportional to the square of the error; it therefore gives higher weight to outliers while smoothing the gradient for smaller errors.

$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i-f(x_i))^2$$

MSE helps converge to the minimum efficiently for smaller errors, as the gradient gradually shrinks as the error approaches zero, but its quadratic penalty lets a few large errors dominate the loss.
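As a minimal sketch, MSE can be computed in a few lines of NumPy (the function name and sample values here are illustrative, not from the original):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared residuals; squaring gives outliers a heavy weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# A single large error dominates the average:
# mse([1, 2, 3], [1, 2, 9]) -> (0 + 0 + 36) / 3 = 12.0
```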


2. Mean Absolute Error (MAE)/L1 Loss: MAE is the most straightforward error function: it takes the absolute difference between the actual and predicted values and averages it.

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i-f(x_i)|$$

MAE is computationally inexpensive due to its simplicity. However, because it is linear, as shown in fig 1, its gradient has constant magnitude and it is not differentiable where the error is 0, which can cause problems in reaching the minimum.
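A corresponding NumPy sketch (again, names and sample values are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of absolute residuals; every unit of error is weighted equally."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# The same outlier as before contributes only linearly:
# mae([1, 2, 3], [1, 2, 9]) -> (0 + 0 + 6) / 3 = 2.0
```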

3. Root Mean Squared Error (RMSE): RMSE is an extension of MSE that reduces its sensitivity to outliers. Mathematically, RMSE is just the square root of MSE; the square root brings the error back to the same units as the target and shrinks the penalty on large errors relative to MSE.

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i-f(x_i))^2}$$

The main advantage of RMSE is that it is less sensitive to outliers than MSE while still penalizing large errors more than MAE. RMSE is generally used in scenarios where more outliers are expected in the data.

Note: Like MAE, RMSE is non-differentiable when the error is zero; therefore, reaching the global minimum can be problematic.
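Since RMSE is just the square root of MSE, the sketch is a one-line extension (illustrative names):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of MSE; error is reported in the target's own units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# rmse([1, 2, 3], [1, 2, 9]) -> sqrt(12) ~= 3.464
```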

4. Mean Absolute Percentage Error (MAPE): MAPE is a variation of MAE; it expresses the error as a percentage by dividing by the actual value. This makes MAPE independent of the scale of the features/variables in the data.

$$ MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i-f(x_i)|}{|y_i|} \times 100$$

MAPE provides a more comprehensible metric by normalizing all errors onto a standard percentage scale. However, it is undefined whenever the actual value is zero.
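A sketch of MAPE (illustrative names; note the guard comment for zero targets):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute error as a percentage of the actual value.

    Undefined when any y_true is zero; callers must handle that case
    (e.g. by filtering zeros or adding a small epsilon).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Errors of 10 on 100 and 20 on 200 are both 10%:
# mape([100, 200], [110, 180]) -> 10.0
```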


5. Huber Loss: Huber loss is a piecewise function combining linear and quadratic scoring. To make Huber loss flexible as needed, a hyperparameter delta (δ) modifies where the behavior switches: the loss is linear for errors above delta and quadratic below delta, as shown in the equation below. The linearity above delta ensures fair weighting of outliers, and the quadratic nature below delta makes the function continuous and differentiable, and therefore easy to converge.

$$ L_{\delta}(y, f(x)) = \begin{cases} \frac{1}{2}(y-f(x))^2 & \text{for } |y-f(x)| \le \delta \\ \delta\,|y-f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} $$

Though Huber loss combines the best properties required in a loss function, it is computationally more expensive, especially on large datasets. Additionally, it requires tuning of the hyperparameter delta (δ) to reach the optimal minimum.
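The piecewise definition above can be sketched directly with `np.where` (illustrative names; δ defaults to 1.0 here purely as an example):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    quadratic = 0.5 * err ** 2                     # small-error branch
    linear = delta * err - 0.5 * delta ** 2        # outlier branch
    return np.mean(np.where(err <= delta, quadratic, linear))

# err = 0.5 (below delta): 0.5 * 0.5**2 = 0.125
# err = 3.0 (above delta): 1.0 * 3.0 - 0.5 = 2.5
```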


6. Log-Cosh Loss: Log-cosh is quite similar to Huber loss, as it also behaves quadratically for small errors and linearly for large ones. Log-cosh calculates the log of the hyperbolic cosine of the error. Its advantage over Huber loss is smoothness: it is continuous and twice differentiable everywhere.

$$ LogCosh = \sum_{i=1}^{N} \log(\cosh(y_i-f(x_i)))$$

As Log-Cosh is not a piecewise function, it has comparatively less computational complexity. However, it is less adaptive than Huber loss due to the absence of any hyper-parameter.
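A direct sketch of the formula (illustrative names; note the caveat about large errors):

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Sum of log(cosh(error)); ~err**2/2 for small errors, ~|err| for large.

    Note: np.cosh overflows for very large errors; production code often
    uses the stable form |e| + log1p(exp(-2|e|)) - log(2).
    """
    err = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return np.sum(np.log(np.cosh(err)))

# log(cosh(1)) ~= 0.4338, close to the quadratic approximation 0.5
```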

7. Poisson Loss: Poisson loss is used with the baseline assumption that the target values come from a Poisson distribution, as when we want to predict a rate parameter from multiple input variables. For example, predicting the number of customers in the next hour, the number of emails in a day, etc. In these cases, the target variable is a count assumed to follow a Poisson distribution.

The Poisson loss can be defined as:

$$ PL(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N}({\hat{y}}_i - y_i\log{\hat{y}}_i) $$
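The formula above translates to a short sketch (illustrative names; predictions must be positive rates for the log to be defined):

```python
import numpy as np

def poisson_loss(y_true, y_pred):
    """Mean of (y_pred - y_true * log(y_pred)); y_pred must be > 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(y_pred - y_true * np.log(y_pred))

# For a count of 2 predicted at rate 2: 2 - 2*log(2) ~= 0.6137
```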

Which one is better for outliers?

If there are only a few outliers and we want the loss function to penalize them strongly, Mean Squared Error is the best method; if we want a more balanced treatment of outliers, Huber loss is the more favorable approach.
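The contrast can be seen numerically on hypothetical data with one outlier (all values here are made up for illustration):

```python
import numpy as np

# Hypothetical targets with one outlier (50.0) the model misses badly.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.1, 5.0])

err = y_true - y_pred
mse = np.mean(err ** 2)                 # outlier contributes 45**2 = 2025
mae = np.mean(np.abs(err))              # outlier contributes only 45
delta = 1.0
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,
                         delta * np.abs(err) - 0.5 * delta ** 2))
# MSE explodes on the outlier; MAE and Huber grow only linearly with it.
```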