Medium Last updated on May 7, 2022, 1:25 a.m.
Linear regression is a parametric regression method that assumes the relationship between y (the observed output) and x (the observed input) is a linear function with parameters w and b. The equation of linear regression can be defined as:
$$ f_{Lin}(x)= \sum_{d=1}^{D}w_{d}x_{d} +b $$
Here, d indexes the feature dimensions of the input.
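The prediction above is just a weighted sum of the features plus a bias. A minimal sketch in Python (the weights, bias, and input below are hypothetical values chosen for illustration):

```python
import numpy as np

# Hypothetical parameters for a D = 3 dimensional input.
w = np.array([0.5, -1.2, 2.0])   # one weight w_d per feature dimension
b = 0.7                          # bias term
x = np.array([1.0, 2.0, 3.0])    # a single input vector

# f_Lin(x) = sum_d w_d * x_d + b
y_hat = np.dot(w, x) + b
print(y_hat)  # 0.5*1.0 - 1.2*2.0 + 2.0*3.0 + 0.7 = 4.8
```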
Note: Since linear regression assumes a linear relationship, it is a high-bias, low-variance modeling approach.
Similar to classification, regression models also require capacity control to avoid overfitting and numerical stability problems in high dimensions. This can be accomplished by:
Basis Expansion: Linear regression models can be extended to capture non-linear relationships by using basis function expansions. Here, the input x in the equation y = wx + b is replaced by a non-linear transformation of x (an exponential function, a polynomial function, etc.), while the model remains linear in the parameters w.
Regularization: Regularizing the weight parameters during learning also helps in tuning model capacity. To learn more, read our blog on What is Regularization? When do we need it?
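As a concrete sketch of basis expansion, a one-dimensional input can be mapped to a polynomial basis; the degree and the data below are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical 1-D inputs expanded into a degree-3 polynomial basis:
# x -> [x, x^2, x^3]. The model stays linear in w even though the
# fitted function is non-linear in the original x.
x = np.array([1.0, 2.0, 3.0, 4.0])          # N = 4 scalar inputs
X_poly = np.stack([x, x**2, x**3], axis=1)  # design matrix, shape (4, 3)
print(X_poly.shape)  # (4, 3)
```

Ordinary linear regression can then be run on `X_poly` exactly as on raw features.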
How do we find the optimal parameters w and b? Ordinary least squares (OLS) selects the linear regression parameters that minimize the mean squared error (MSE) on the training data set. The optimization problem can be written as:
$$ w^\ast , b^\ast = argmin_{w, b} \frac{1}{N} \sum_{i=1}^{N}(y_{i}-(x_{i}w+b))^2 $$
To solve this equation mathematically, assume X is a data matrix with one data case per row (a constant feature of 1 can be appended to each row to absorb the bias b into w), and Y is a column vector containing the corresponding outputs. The above optimization equation can then be written as:
$$ w^\ast = argmin_{w} \frac{1}{N} (Y-Xw)^{T}(Y-Xw)$$
By taking the first-order derivative with respect to w and setting it to zero (dropping the constant factor $-2/N$ along the way), we get:
$$ 0 = \frac{\partial }{\partial w} \frac{1}{N} (Y-Xw)^{T}(Y-Xw) $$
$$ X^{T} (Y-Xw) = 0 $$
$$ X^{T}Xw = X^{T}Y$$
Therefore, the optimal w can be defined as:
$$ w^\ast = (X^{T}X)^{-1} X^{T}Y $$
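The closed-form solution above can be sketched in a few lines of NumPy. The synthetic data and the true weights below are assumptions made so the recovered solution can be checked; in practice `np.linalg.solve` is preferred over forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: N = 100 points, D = 2 features,
# generated noise-free from known weights so recovery is exact.
N, D = 100, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -3.0])
Y = X @ w_true

# w* = (X^T X)^{-1} X^T Y, computed by solving the normal equations
# X^T X w = X^T Y rather than inverting X^T X explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_star)  # ~ [ 2. -3.]
```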
1. Number of features <= number of data points: Linear regression needs at least D data cases to learn a model with a D-dimensional feature vector; otherwise $X^T X$ is rank-deficient and its inverse in the optimal w equation is not defined.
2. Sensitive to collinear features: collinear features can be mathematically defined as $feature_1 = a \cdot feature_2 + b$. With collinear features, $X^T X$ is (nearly) singular, so computing its inverse becomes numerically unstable.
3. Computation is cubic in the data dimension D: inverting (or solving a linear system with) the D x D matrix $X^T X$ costs $O(D^3)$.
4. Very sensitive to noise and outliers: the MSE objective (equivalently, the assumption of normally distributed residuals) squares each error, so a few large residuals can dominate the fit.
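Limitation 2 can be illustrated numerically: when one feature is almost a linear function of another, the condition number of $X^T X$ explodes. The data below is a hypothetical construction for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly collinear features: feature2 = 3 * feature1 + tiny noise,
# so X^T X is close to singular and ill-conditioned.
f1 = rng.normal(size=100)
f2 = 3.0 * f1 + 1e-8 * rng.normal(size=100)
X_bad = np.stack([f1, f2], axis=1)

# Independent features for comparison: X^T X stays well-conditioned.
X_ok = rng.normal(size=(100, 2))

print(np.linalg.cond(X_bad.T @ X_bad))  # astronomically large
print(np.linalg.cond(X_ok.T @ X_ok))    # small
```

A very large condition number means tiny perturbations in the data produce huge swings in $w^\ast$, which is exactly where regularization (mentioned above) helps.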