Ridge regression

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated.^[1] It has been used in many fields including econometrics, chemistry, and engineering.^[2] Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems.^[a] It is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters.^[3] In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias (see bias–variance tradeoff).^[4]

The theory was first introduced by Hoerl and Kennard in 1970 in their Technometrics papers "Ridge Regression: Biased Estimation for Nonorthogonal Problems" and "Ridge Regression: Applications to Nonorthogonal Problems".^[5]^[6]^[1] This was the result of ten years of research into the field of ridge analysis.^[7]

Ridge regression was developed as a possible remedy for the imprecision of least-squares estimators when a linear regression model has multicollinear (highly correlated) independent variables. The ridge regression (RR) estimator yields more precise parameter estimates, as its variance and mean squared error are often smaller than those of the least-squares estimator.^[8]^[2]

Overview

In the simplest case, the problem of a near-singular moment matrix $\mathbf {X} ^{\mathsf {T}}\mathbf {X}$ is alleviated by adding positive elements to the diagonals, thereby decreasing its condition number. Analogous to the ordinary least squares estimator, the simple ridge estimator is then given by ${\hat {\beta }}_{R}=\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +\lambda \mathbf {I} \right)^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {y}$ where $\mathbf {y}$ is the regressand, $\mathbf {X}$ is the design matrix, $\mathbf {I}$ is the identity matrix, and the ridge parameter $\lambda \geq 0$ serves as the constant shifting the diagonals of the moment matrix.^[9] It can be shown that this estimator is the solution to the least squares problem subject to the constraint $\beta ^{\mathsf {T}}\beta =c$ , which can be expressed as a Lagrangian: $\min _{\beta }\,\left(\mathbf {y} -\mathbf {X} \beta \right)^{\mathsf {T}}\left(\mathbf {y} -\mathbf {X} \beta \right)+\lambda \left(\beta ^{\mathsf {T}}\beta -c\right)$ which shows that $\lambda$ is nothing but the Lagrange multiplier of the constraint.^[10] Typically, $\lambda$ is chosen according to a heuristic criterion, so that the constraint will not be satisfied exactly. Specifically in the case of $\lambda =0$ , in which the constraint is non-binding, the ridge estimator reduces to ordinary least squares. A more general approach to Tikhonov regularization is discussed below.
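The closed-form estimator above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `ridge_estimator`, the synthetic nearly collinear data, and the choice $\lambda = 1$ are assumptions made for the example.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    n_features = X.shape[1]
    moment = X.T @ X
    # Solve the shifted normal equations instead of forming an explicit inverse.
    return np.linalg.solve(moment + lam * np.eye(n_features), X.T @ y)

# Synthetic, nearly collinear design (illustrative data only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_0 = ridge_estimator(X, y, 0.0)    # lam = 0 reduces to ordinary least squares
beta_1 = ridge_estimator(X, y, 1.0)    # lam > 0 shrinks the coefficient vector

# Shifting the diagonal of the moment matrix lowers its condition number.
cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))
```

In this sketch `beta_0` agrees with the least-squares fit, `beta_1` has a smaller Euclidean norm, and `cond_after` is far below `cond_before`, which is the conditioning effect described in the text.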

History

Tikhonov regularization was invented independently in many different contexts. It became widely known through its application to integral equations in the works of Andrey Tikhonov^[11]^[12]^[13]^[14]^[15] and David L. Phillips.^[16] Some authors use the term Tikhonov–Phillips regularization. The finite-dimensional case was expounded by Arthur E. Hoerl, who took a statistical approach,^[17] and by Manus Foster, who interpreted this method as a Wiener–Kolmogorov (Kriging) filter.^[18] Following Hoerl, it is known in the statistical literature as ridge regression,^[19] named after ridge analysis ("ridge" refers to the path from the constrained maximum).^[20]

Lavrentyev regularization

In some situations, one can avoid using the transpose $A^{\mathsf {T}}$ , as proposed by Mikhail Lavrentyev.^[25] For example, if $A$ is symmetric positive definite, i.e. $A=A^{\mathsf {T}}>0$ , so is its inverse $A^{-1}$ , which can thus be used to set up the weighted norm squared $\left\|\mathbf {x} \right\|_{P}^{2}=\mathbf {x} ^{\mathsf {T}}A^{-1}\mathbf {x}$ in the generalized Tikhonov regularization, leading to minimizing $\left\|A\mathbf {x} -\mathbf {b} \right\|_{A^{-1}}^{2}+\left\|\mathbf {x} -\mathbf {x} _{0}\right\|_{Q}^{2}$ or, equivalently up to a constant term, $\mathbf {x} ^{\mathsf {T}}\left(A+Q\right)\mathbf {x} -2\mathbf {x} ^{\mathsf {T}}\left(\mathbf {b} +Q\mathbf {x} _{0}\right).$

This minimization problem has an optimal solution $\mathbf {x} ^{*}$ which can be written explicitly using the formula $\mathbf {x} ^{*}=\left(A+Q\right)^{-1}\left(\mathbf {b} +Q\mathbf {x} _{0}\right),$ which is nothing but the solution of the generalized Tikhonov problem where $A=A^{\mathsf {T}}=P^{-1}.$

The Lavrentyev regularization, if applicable, is advantageous over the original Tikhonov regularization, since the Lavrentyev matrix $A+Q$ can be better conditioned, i.e., have a smaller condition number, compared to the Tikhonov matrix $A^{\mathsf {T}}A+\Gamma ^{\mathsf {T}}\Gamma .$
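As a rough numerical check of the explicit solution and the conditioning claim, the sketch below builds a hypothetical symmetric positive definite $A$, evaluates $\mathbf{x}^{*}=(A+Q)^{-1}(\mathbf{b}+Q\mathbf{x}_{0})$, and compares condition numbers. The eigenvalues, the weight $Q = 0.1\,I$, the zero prior guess, and the use of the same $Q$ in place of $\Gamma^{\mathsf{T}}\Gamma$ for the Tikhonov matrix are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical symmetric positive definite A with widely spread eigenvalues.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal basis
A = V @ np.diag([1e-2, 1.0, 10.0, 100.0]) @ V.T
A = (A + A.T) / 2                               # remove round-off asymmetry

Q = 0.1 * np.eye(4)    # assumed regularization weight
b = rng.normal(size=4)
x0 = np.zeros(4)       # assumed prior guess

# Lavrentyev solution x* = (A + Q)^{-1} (b + Q x0)
x_star = np.linalg.solve(A + Q, b + Q @ x0)

# The gradient of x^T (A + Q) x - 2 x^T (b + Q x0) should vanish at x*.
grad = 2 * (A + Q) @ x_star - 2 * (b + Q @ x0)

# Compare conditioning, taking Gamma^T Gamma = Q for the Tikhonov matrix.
cond_lavrentyev = np.linalg.cond(A + Q)
cond_tikhonov = np.linalg.cond(A.T @ A + Q)
```

With these particular numbers the Lavrentyev matrix is better conditioned by roughly two orders of magnitude, because its eigenvalues are those of $A$ shifted by the regularization weight rather than their squares.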

Regularization in Hilbert space

Typically, discrete linear ill-conditioned problems result from the discretization of integral equations, and one can formulate a Tikhonov regularization in the original infinite-dimensional context. In the above we can interpret $A$ as a compact operator on Hilbert spaces, and $x$ and $b$ as elements in the domain and range of $A$ . The operator $A^{*}A+\Gamma ^{\mathsf {T}}\Gamma$ is then a self-adjoint bounded invertible operator.

Relation to singular-value decomposition and Wiener filter

With $\Gamma =\alpha I$ , this least-squares solution can be analyzed in a special way using the singular-value decomposition. Given the singular value decomposition $A=U\Sigma V^{\mathsf {T}}$ with singular values $\sigma _{i}$ , the Tikhonov regularized solution can be expressed as ${\hat {x}}=VDU^{\mathsf {T}}b,$ where $D$ has diagonal values $D_{ii}={\frac {\sigma _{i}}{\sigma _{i}^{2}+\alpha ^{2}}}$ and is zero elsewhere. This demonstrates the effect of the Tikhonov parameter on the condition number of the regularized problem. For the generalized case, a similar representation can be derived using a generalized singular-value decomposition.^[26]

Finally, the regularized solution is related to the Wiener filter: ${\hat {x}}=\sum _{i=1}^{q}f_{i}{\frac {u_{i}^{\mathsf {T}}b}{\sigma _{i}}}v_{i},$ where the Wiener weights are $f_{i}={\frac {\sigma _{i}^{2}}{\sigma _{i}^{2}+\alpha ^{2}}}$ and $q$ is the rank of $A$ .
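The two SVD expressions can be checked against a direct solve of the regularized normal equations. The following is a minimal sketch with an arbitrary random full-column-rank matrix and $\alpha = 0.5$; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 3))     # illustrative full-column-rank matrix
b = rng.normal(size=8)
alpha = 0.5

# Thin SVD: A = U diag(s) Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# x_hat = V D U^T b with D_ii = s_i / (s_i^2 + alpha^2)
d = s / (s**2 + alpha**2)
x_svd = Vt.T @ (d * (U.T @ b))

# Same solution written with Wiener weights f_i = s_i^2 / (s_i^2 + alpha^2)
f = s**2 / (s**2 + alpha**2)
x_wiener = Vt.T @ (f * (U.T @ b) / s)

# Direct solve of (A^T A + alpha^2 I) x = A^T b for comparison
x_direct = np.linalg.solve(A.T @ A + alpha**2 * np.eye(3), A.T @ b)
```

All three vectors agree to rounding error. The weights `f` lie strictly between 0 and 1 and damp the contribution of directions with small singular values most strongly.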

Determination of the Tikhonov factor

The optimal regularization parameter $\alpha$ is usually unknown and, in practical problems, is often determined by an ad hoc method. A possible approach relies on the Bayesian interpretation described below. Other approaches include the discrepancy principle, cross-validation, the L-curve method,^[27] restricted maximum likelihood, and the unbiased predictive risk estimator. Grace Wahba proved that the optimal parameter, in the sense of leave-one-out cross-validation, minimizes^[28]^[29] $G={\frac {\operatorname {RSS} }{\tau ^{2}}}={\frac {\left\|X{\hat {\beta }}-y\right\|^{2}}{\left[\operatorname {Tr} \left(I-X\left(X^{\mathsf {T}}X+\alpha ^{2}I\right)^{-1}X^{\mathsf {T}}\right)\right]^{2}}},$ where $\operatorname {RSS}$ is the residual sum of squares, and $\tau$ is the effective number of degrees of freedom.

Using the singular-value decomposition above (with $X$ in the role of $A$ and $y$ in the role of $b$), the expression simplifies to $\operatorname {RSS} =\left\|y-\sum _{i=1}^{q}(u_{i}^{\mathsf {T}}y)u_{i}\right\|^{2}+\left\|\sum _{i=1}^{q}{\frac {\alpha ^{2}}{\sigma _{i}^{2}+\alpha ^{2}}}(u_{i}^{\mathsf {T}}y)u_{i}\right\|^{2},$ that is, $\operatorname {RSS} =\operatorname {RSS} _{0}+\left\|\sum _{i=1}^{q}{\frac {\alpha ^{2}}{\sigma _{i}^{2}+\alpha ^{2}}}(u_{i}^{\mathsf {T}}y)u_{i}\right\|^{2},$ where $\operatorname {RSS} _{0}$ denotes the first term (the residual of the unregularized fit), and $\tau =m-\sum _{i=1}^{q}{\frac {\sigma _{i}^{2}}{\sigma _{i}^{2}+\alpha ^{2}}}=m-q+\sum _{i=1}^{q}{\frac {\alpha ^{2}}{\sigma _{i}^{2}+\alpha ^{2}}},$ where $m$ is the number of observations (rows of $X$).
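The SVD form of $G$ makes a grid search over $\alpha$ cheap, since the decomposition does not depend on $\alpha$. The sketch below is one possible way to evaluate the criterion; the helper names `gcv_score` and `gcv_score_direct`, the synthetic data (with the data vector written `y`), and the logarithmic grid are assumptions for illustration, and the code recomputes the SVD on each call only to stay short.

```python
import numpy as np

def gcv_score(X, y, alpha):
    """G = RSS / tau^2 evaluated through the thin SVD of X (alpha > 0)."""
    m = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    coeffs = U.T @ y                           # u_i^T y
    shrink = alpha**2 / (s**2 + alpha**2)      # 1 - f_i
    rss0 = np.sum(y**2) - np.sum(coeffs**2)    # part of y outside span(U)
    rss = rss0 + np.sum((shrink * coeffs) ** 2)
    tau = m - np.sum(s**2 / (s**2 + alpha**2))
    return rss / tau**2

def gcv_score_direct(X, y, alpha):
    """Same quantity from the trace formula, used only as a cross-check."""
    m, n = X.shape
    hat = X @ np.linalg.solve(X.T @ X + alpha**2 * np.eye(n), X.T)
    resid = y - hat @ y
    return (resid @ resid) / np.trace(np.eye(m) - hat) ** 2

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=30)

# Evaluate G on a logarithmic grid and keep the minimizer.
alphas = np.logspace(-3, 2, 50)
scores = np.array([gcv_score(X, y, a) for a in alphas])
best_alpha = alphas[np.argmin(scores)]
```

A grid search like this only locates the minimizer up to the grid resolution; a one-dimensional optimizer could refine `best_alpha` if more precision were needed.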

Relation to probabilistic formulation

The probabilistic formulation of an inverse problem introduces (when all uncertainties are Gaussian) a covariance matrix $C_{M}$ representing the a priori uncertainties on the model parameters, and a covariance matrix $C_{D}$ representing the uncertainties on the observed parameters.^[30] In the special case when these two matrices are diagonal and isotropic, $C_{M}=\sigma _{M}^{2}I$ and $C_{D}=\sigma _{D}^{2}I$ , and, in this case, the equations of inverse theory reduce to the equations above, with $\alpha ={\sigma _{D}}/{\sigma _{M}}$ .
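The reduction to $\alpha = \sigma_{D}/\sigma_{M}$ can be illustrated numerically. The sketch below assumes a zero-mean Gaussian prior, for which the posterior mean is $(A^{\mathsf{T}}C_{D}^{-1}A + C_{M}^{-1})^{-1}A^{\mathsf{T}}C_{D}^{-1}b$, and compares it with the Tikhonov solution. The matrix sizes and the two standard deviations are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 4))
b = rng.normal(size=10)

sigma_D = 0.2    # assumed standard deviation of the data noise
sigma_M = 2.0    # assumed prior standard deviation of the model parameters
C_D = sigma_D**2 * np.eye(10)
C_M = sigma_M**2 * np.eye(4)

# Posterior mean for a zero-mean Gaussian prior (an assumption of this sketch):
# x = (A^T C_D^{-1} A + C_M^{-1})^{-1} A^T C_D^{-1} b
C_D_inv = np.linalg.inv(C_D)
C_M_inv = np.linalg.inv(C_M)
x_bayes = np.linalg.solve(A.T @ C_D_inv @ A + C_M_inv, A.T @ C_D_inv @ b)

# Tikhonov solution with alpha = sigma_D / sigma_M
alpha = sigma_D / sigma_M
x_tikhonov = np.linalg.solve(A.T @ A + alpha**2 * np.eye(4), A.T @ b)
```

The two solutions coincide because multiplying the Bayesian normal equations by $\sigma_{D}^{2}$ leaves $A^{\mathsf{T}}A + (\sigma_{D}^{2}/\sigma_{M}^{2})I$ on the left-hand side.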

See also

LASSO estimator, another regularization method in statistics

Elastic net regularization

Matrix regularization

Further reading

Gruber, Marvin (1998). Improving Efficiency by Shrinkage: The James–Stein and Ridge Regression Estimators. Boca Raton: CRC Press. ISBN 0-8247-0156-9.

Kress, Rainer (1998). "Tikhonov Regularization". Numerical Analysis. New York: Springer. pp. 86–90. ISBN 0-387-98408-9.

Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. (2007). "Section 19.5. Linear Regularization Methods". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.

Saleh, A. K. Md. Ehsanes; Arashi, Mohammad; Kibria, B. M. Golam (2019). Theory of Ridge Regression Estimation with Applications. New York: John Wiley & Sons. ISBN 978-1-118-64461-4.

Taddy, Matt (2019). Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions. New York: McGraw-Hill. pp. 69–104. ISBN 978-1-260-45277-8.