Ridge regression, also commonly known as L2 regularization, is a statistical regularization technique that compensates for overfitting on training data in machine learning models. It is one of several regularization methods for linear regression models. Ridge regression also corrects for multicollinearity in regression analysis, which is important when creating machine learning models with a large number of parameters, especially if those parameters carry high weights. While this article focuses on regularization of linear regression models, ridge regression can also be used in logistic regression.

**The problem: multicollinearity**

A standard, multiple-variable linear regression equation is:

$$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Here, Ŷ is the predicted value (dependent variable), each X is a predictor (independent variable), each β is the regression coefficient associated with that predictor, and β₀ is the value of the dependent variable when every predictor equals zero (also known as the y-intercept). Take note of how the coefficients represent the relationship between the dependent variable and a given independent variable.

Multicollinearity occurs when two or more predictors have a nearly linear relationship. Montgomery et al. provide an apt example: assume we are analyzing a supply-chain delivery dataset in which long-distance deliveries consistently contain a large number of items and short-distance deliveries always contain less inventory. Figure 1 depicts a linear correlation between delivery distance and item quantity. This causes issues when both are used as independent variables in a single prediction model.

This is just one example of multicollinearity, and here the solution is straightforward: acquire more diverse data (such as data for short-distance deliveries with large stockpiles). Collecting more data, however, is not always realistic, especially when multicollinearity is inherent in the data being studied. Other options for addressing multicollinearity include increasing the sample size, reducing the number of independent variables, or simply adopting a different model. Such adjustments do not always succeed in eliminating multicollinearity, so ridge regression provides another means of regularizing a model to handle it.^{1}

**How ridge regression works: the regularization algorithm**

When first creating predictive models, we frequently need to compute coefficients, because they are not directly indicated in the training data. To estimate coefficients, we can use the standard ordinary least squares (OLS) matrix coefficient estimator:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$

To understand the operations of this formula, you must be familiar with matrix notation. Simply put, this formula seeks to determine the best-fitting line for a given dataset by calculating coefficients for each independent variable that result in the minimum residual sum of squares (also known as the sum of squared errors).^{2}
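This estimator is straightforward to compute directly. The sketch below applies it to a tiny made-up dataset (all values are illustrative, not from the article):

```python
import numpy as np

# Toy dataset (illustrative values): 5 observations, 2 predictors.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Prepend an intercept column of ones so beta_hat[0] is the y-intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# OLS estimator: beta_hat = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X1.T @ X1) @ X1.T @ y
```

In practice a least-squares solver such as `np.linalg.lstsq` is preferred over an explicit matrix inverse for numerical stability; the inverse is written out here only to mirror the estimator's formula.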

Residual sum of squares (RSS) measures how well a linear regression model matches its training data. It is represented by the formulation:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This formula measures model prediction accuracy against the ground-truth values in the training data. If RSS equals zero, the model predicts the dependent variable perfectly. A score of zero, however, may suggest overfitting, especially with limited training datasets, and one possible explanation for that overfitting is multicollinearity.
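As a concrete check, RSS can be computed directly from predictions and ground-truth values (the numbers here are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # ground-truth values from training data
y_pred = np.array([2.8, 5.1, 7.3])   # model predictions
rss = np.sum((y_true - y_pred) ** 2)  # 0.2^2 + 0.1^2 + 0.3^2 = 0.14
```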

High coefficient estimates are generally indicative of overfitting.^{3} If two or more variables have a strong linear association, OLS may produce erroneously high coefficients. When one or more coefficients are too high, the model’s output is sensitive to small changes in the input data. In other words, the model has overfitted a certain training set and is unable to accurately generalize to new test sets. Such a model is considered unstable.^{4}

Ridge regression extends OLS by computing coefficients that account for potentially correlated predictors. Specifically, ridge regression corrects for high-value coefficients by incorporating a regularization term (also known as the penalty term) into the RSS function. The penalty term equals the sum of the squared model coefficients:^{5}

$$\lambda \sum_{j=1}^{p} \beta_j^2$$

The L2 penalty term is added to the end of the RSS function, resulting in a new formulation, the ridge regression estimator, in which its effect on the model is controlled by the hyperparameter lambda (λ):

$$\hat{\beta}_{ridge} = \underset{\beta}{\mathrm{argmin}} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$

Keep in mind that coefficients indicate the effect of a certain predictor (i.e. independent variable) on the expected value (i.e. dependent variable). When the L2 penalty component is added to the RSS formula, it reduces all coefficient values to balance out extremely high coefficients. In statistics, this is known as coefficient shrinkage. The ridge estimator described above thus computes new regression coefficients that reduce the RSS of a given model. This reduces overfitting on training data by minimizing the effect of each predictor.^{6}

It is important to note that ridge regression does not decrease all coefficients by the same amount. Rather, coefficients shrink in proportion to their original size. As λ grows, high-value coefficients shrink faster than low-value coefficients.^{7} High-value coefficients are thus punished more than low-value coefficients.
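The shrinkage effect can be sketched with the standard closed-form solution of the ridge estimator, (XᵀX + λI)⁻¹Xᵀy, on synthetic, nearly collinear data (all names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors: x2 is x1 plus tiny noise.
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X^T X + lambda*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)   # lambda = 0 reduces to plain OLS
beta_l2 = ridge(X, y, 1.0)    # the L2 penalty shrinks the coefficients
# On collinear data the OLS coefficients can become erratically large;
# the ridge coefficients have a smaller overall magnitude.
```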

**Ridge regression versus lasso regression**

Note that the L2 penalty shrinks coefficients toward zero but never to exactly zero; although model feature weights may become negligibly small in ridge regression, they never equal zero. Reducing a coefficient to zero effectively removes the paired predictor from the model. This is known as feature selection, which is another method of correcting multicollinearity.^{8} Because ridge regression does not reduce regression coefficients to zero, it does not perform feature selection.^{9} This is frequently mentioned as a downside of ridge regression. Another frequently noted shortcoming is ridge regression's inability to disentangle predictor effects in the presence of extreme multicollinearity.^{10}

Lasso regression, commonly known as L1 regularization, is one of various regularization methods for linear regression. L1 regularization works by decreasing coefficients to zero, effectively removing the independent variables from the model. Both lasso regression and ridge regression reduce model complexity, albeit in distinct ways. Lasso regression reduces the number of independent variables influencing the outcome. Ridge regression decreases the impact that each independent variable has on the result.
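This contrast can be sketched under the simplifying assumption of orthonormal predictors, where both estimators have well-known closed forms: ridge rescales every OLS coefficient, while lasso soft-thresholds them, setting small ones exactly to zero (the coefficient values below are illustrative):

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    # Orthonormal design: ridge divides every OLS coefficient by (1 + lambda),
    # so coefficients shrink but never reach exactly zero.
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # Orthonormal design: lasso soft-thresholds, zeroing out any coefficient
    # whose magnitude falls below lambda (feature selection).
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta = np.array([4.0, 0.5, -0.2])     # illustrative OLS coefficients
ridge_out = ridge_shrink(beta, 1.0)   # [2.0, 0.25, -0.1]: smaller, all nonzero
lasso_out = lasso_shrink(beta, 1.0)   # [3.0, 0.0, -0.0]: small ones removed
```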

**Other regression regularization techniques**

Elastic net is an additional regularization technique. Whereas ridge regression penalizes the sum of squared coefficients (the L2 penalty) and lasso penalizes the sum of their absolute values (the L1 penalty), elastic net incorporates both regularization terms into the RSS cost function.^{11}

Principal component regression (PCR) can also serve as a regularization method. While PCR can address multicollinearity, it does so without penalizing the RSS function like ridge and lasso regression do. Instead, PCR generates linear combinations of associated predictors from which to build a new least squares model.^{12}
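A rough sketch of PCR using NumPy's SVD, on synthetic data with an arbitrarily chosen number of components (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic predictors with one strongly correlated pair.
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.05, size=50)
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.2, size=50)

# Step 1: center the predictors and extract their principal components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 2: keep the top k components and regress y on those scores with OLS.
k = 3                              # chosen arbitrarily for this sketch
Z = Xc @ Vt[:k].T                  # component scores (linear combinations)
gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
```

Because the component scores are uncorrelated by construction, the resulting least squares problem no longer suffers from multicollinearity.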

**Ridge regression in machine learning**

**Model complexity**

Ridge regression is a machine learning technique that helps prevent overfitting caused by model complexity. Model complexity can be attributed to:

**A model possessing too many features.** In machine learning, features are the model's predictors and are often referred to as "parameters". Online tutorials frequently recommend keeping the number of features in a training dataset below the number of instances. However, this is not always practicable.

**Features possessing too much weight.** The effect of a certain predictor on the model output is referred to as its feature weight. A high feature weight corresponds to a high-value coefficient.

Simpler models do not necessarily perform better than sophisticated models. Nonetheless, a high level of model complexity can limit a model’s capacity to generalize to new data outside of its training set.

Ridge regression does not perform feature selection, so it cannot reduce model complexity by removing features. However, ridge regression is useful when one or more features exert an outsized influence on a model's output. Through the L2 penalty term, ridge regression can shrink high feature weights (i.e. coefficients) throughout the model. This decreases the model's complexity and makes its predictions less sensitive to any one feature.

**Bias-variance tradeoff**

Ridge regression is a machine learning technique that adds bias to a model in order to reduce its variance. The bias-variance tradeoff is a well-known problem in machine learning, and to understand it, we must first define "bias" and "variance" in machine learning research.

In short, bias is the average difference between predicted and true values, while variance measures the difference between predictions across different realizations of a model. As bias increases, a model predicts less accurately on its training dataset. As variance increases, a model predicts less consistently on other datasets. Bias and variance thus quantify model accuracy on training and test sets, respectively. Developers would ideally reduce both, but simultaneous reduction is not always practicable, necessitating regularization techniques such as ridge regression.

Ridge regression regularization, as previously discussed, introduces additional bias in order to reduce variance. In other words, models regularized by ridge regression produce less accurate predictions on training data (greater bias) but more accurate predictions on test data (lower variance). This is the bias-variance tradeoff. Ridge regression lets users accept a loss in training accuracy (greater bias) in order to improve a model's generalization (lower variance).^{13} In this way, increasing bias can contribute to better overall model performance.
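The variance reduction can be made concrete with a small simulation (the design, noise scale, and λ value below are all illustrative): fix a nearly collinear design, redraw the noise many times, and compare how much the fitted coefficients fluctuate with and without the L2 penalty.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit(X, y, lam):
    # Closed-form ridge estimator; lam = 0 gives plain OLS.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Fixed, nearly collinear design; only the noise is redrawn each round.
X = rng.normal(size=(30, 2))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=30)

betas_ols, betas_ridge = [], []
for _ in range(200):
    y = X[:, 0] + rng.normal(scale=1.0, size=30)   # redraw the noise
    betas_ols.append(fit(X, y, 0.0))
    betas_ridge.append(fit(X, y, 5.0))

# Total variance of the coefficient estimates across realizations:
var_ols = np.var(betas_ols, axis=0).sum()
var_ridge = np.var(betas_ridge, axis=0).sum()
# var_ridge is far smaller than var_ols on this collinear design,
# at the cost of some bias in the individual estimates.
```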

The value λ in the ridge estimator loss function equation determines the intensity of the L2 penalty, which affects the model’s bias-variance tradeoff. If λ is zero, an ordinary least squares function remains. This results in a basic linear regression model with no regularization. Higher λ values indicate more regularization. As λ grows, model bias increases and variance drops. When λ is zero, the model overfits training data; when λ is too high, the model underfits all data.^{14}

Mean squared error (MSE) can aid in determining an appropriate λ value. MSE is closely related to RSS and is a means of calculating the average difference between predicted and actual values. The smaller a model's MSE, the better its predictions. MSE generally increases as λ increases, yet it can be shown that there is always a value of λ greater than zero such that the MSE obtained through ridge regression is smaller than that obtained through OLS.^{15} To choose an acceptable λ value, find the largest value that does not increase MSE, as shown in Figure 2. Cross-validation approaches can also help users choose optimal λ values for tuning their model.^{16}
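A minimal sketch of this selection procedure, assuming synthetic data and an arbitrary λ grid: fit the ridge estimator for each candidate λ, score it on a held-out split, and keep the λ with the lowest MSE. (Full cross-validation would average over several splits rather than use a single one.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a correlated predictor pair (illustrative).
n, p = 40, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=n)
beta_true = np.array([2.0, 0.0, 1.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Hold out half the rows as a test split.
X_tr, X_te, y_tr, y_te = X[:20], X[20:], y[:20], y[20:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Sweep lambda and keep the value with the lowest held-out MSE.
lams = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
mses = [np.mean((y_te - X_te @ ridge_fit(X_tr, y_tr, lam)) ** 2)
        for lam in lams]
best_lam = lams[int(np.argmin(mses))]
```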

**Example use cases**

Ridge regression models are most effective on datasets containing two or more correlated features. Many fields also employ ridge regression for models with a large number of predictors and small training datasets.^{17} Both circumstances are prevalent across a wide range of data-driven disciplines.

**Biostatistics**

Computational biology and genetic studies frequently deal with models in which the number of predictors far exceeds the sample size of the dataset, especially when exploring gene expression. Ridge regression addresses this model complexity by decreasing the total weight of these multitudinous features, thereby compressing the model's predictive range.

**Real estate**

The eventual sale price of a house is determined by many variables, several of which are correlated, such as the number of bedrooms and bathrooms. Highly correlated features result in high regression coefficients and overfitting of training data. Ridge regression accounts for this type of model complexity by reducing the total feature weights on the final predicted value.

These are merely two examples from the broader field of data science. However, as these two examples show, ridge regression is most effective when there are more model features than data samples, or when your model contains two or more highly correlated features.

**Recent research**

Recent research investigates a modified variant of ridge regression aimed at feature selection.^{18} This variant uses a separate regularization parameter for each coefficient, which allows feature weights to be penalized independently, potentially enabling feature selection via ridge regression.^{19}

**References**

^{1 }Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, *Introduction to Linear Regression Analysis*, John Wiley & Sons, 2012.

^{2 }Max Kuhn and Kjell Johnson, *Applied Predictive Modeling*, Springer, 2016. Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, *Regression: Models, Methods and Applications*, 2^{nd} edition, Springer, 2021.

^{3 }Wessel N. van Wieringen, Lecture notes on ridge regression, 2023, https://arxiv.org/pdf/1509.09169.pdf (link resides outside ibm.com)

^{4 }A. K. Md. Ehsanes Saleh, Mohammad Arashi, and B. M. Golam Kibria, *Theory of Ridge Regression Estimation with Applications*, Wiley, 2019.

^{5 }Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, *Regression: Models, Methods and Applications*, 2^{nd} edition, Springer, 2021.

^{6 }Max Kuhn and Kjell Johnson, *Applied Predictive Modeling*, Springer, 2016.

^{7 }A. K. Md. Ehsanes Saleh, Mohammad Arashi, Resve A. Saleh, and Mina Norouzirad, *Rank-Based Methods for Shrinkage and Selection: With Application to Machine Learning*, Wiley, 2022.

^{8 }Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, *Introduction to Linear Regression Analysis*, John Wiley & Sons, 2012.

^{9 }Max Kuhn and Kjell Johnson, *Applied Predictive Modeling*, Springer, 2016.

^{10 }Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, *Regression: Models, Methods and Applications*, 2^{nd} edition, Springer, 2021.

^{11 }Hui Zou and Trevor Hastie, “Regularization and Variable Selection via the Elastic Net,” *Journal of the Royal Statistical Society*, Vol. 67, No. 2, 2005, pp. 301–320, https://academic.oup.com/jrsssb/article/67/2/301/7109482 (link resides outside ibm.com)

^{12 }Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, *Regression: Models, Methods and Applications*, 2^{nd} edition, Springer, 2021.

^{13 }Max Kuhn and Kjell Johnson, *Applied Predictive Modeling*, Springer, 2016.

^{14 }Gianluigi Pillonetto, Tianshi Chen, Alessandro Chiuso, Giuseppe De Nicolao, and Lennart Ljung, *Regularized System Identification: Learning Dynamic Models from Data*, Springer, 2022.

^{15 }Arthur E. Hoerl and Robert W. Kennard, “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” *Technometrics*, Vol. 12, No. 1, Feb. 1970, pp. 55–67.

^{16 }Wessel N. van Wieringen, Lecture notes on ridge regression, 2023, https://arxiv.org/pdf/1509.09169.pdf (link resides outside ibm.com)

^{17 }Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, *Regression: Models, Methods and Applications*, 2^{nd} edition, Springer, 2021.

^{18 }Yichao Wu, “Can’t Ridge Regression Perform Variable Selection?” *Technometrics*, Vol. 63, No. 2, 2021, pp. 263–271, https://www.tandfonline.com/doi/abs/10.1080/00401706.2020.1791254 (link resides outside ibm.com)

^{19 }Danielle C. Tucker, Yichao Wu, and Hans-Georg Müller, “Variable Selection for Global Fréchet Regression,” *Journal of the American Statistical Association*, 2021, https://www.tandfonline.com/doi/abs/10.1080/01621459.2021.1969240 (link resides outside ibm.com)