Did you know that the optimal ridge penalty λ in linear regression can be *negative*? It is always strictly positive when n>p, or when cov(X)=I, or when the true β is random. But here we argue that it can be zero or even negative when p>>n: https://arxiv.org/abs/1805.10939 . HOW?! [1/n]
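For reference, here is the setup the thread assumes (standard ridge regression; the notation below is my addition, not from the paper). When p>n, the second, "Gram" form is the one that stays well-defined for λ≤0, as long as XXᵀ+λI is still invertible:

\hat\beta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top y = X^\top (X X^\top + \lambda I_n)^{-1} y,
\qquad
R(\lambda) = \mathbb{E}\big[(x^\top \hat\beta_\lambda - x^\top \beta)^2\big].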
This paper started with a question I asked on CrossValidated: https://stats.stackexchange.com/questions/328630. Two people gave great answers, and we eventually decided to write it up. The question was: how come some of these CV curves have a minimum at λ→0? It's p>>n! Why doesn't it overfit? [2/n]
We realized that this can be reproduced in a toy model where cov(X) has one large leading principal component (the "spiked covariance" model) and the true β is aligned with this PC1. When p>>n (and the SNR is high), the lowest expected risk is achieved at λ<0 (and this risk is low!). [3/n]
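A minimal numpy sketch of this toy model (my addition; the spike strength, SNR, and sample sizes are illustrative, not the paper's values). It traces the test MSE over a grid of λ that extends below zero via the Gram form:

import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 1000
spike = 50.0                                  # variance of the single large PC
v = rng.normal(size=p)
v /= np.linalg.norm(v)

def sample_X(m):
    # rows ~ N(0, Sigma) with Sigma = I + (spike - 1) * v v^T  (spiked covariance)
    return rng.normal(size=(m, p)) + np.sqrt(spike - 1.0) * rng.normal(size=(m, 1)) * v

beta = 5.0 * v                                # true beta aligned with PC1 => high SNR
X, X_test = sample_X(n), sample_X(2000)
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=2000)

G = X @ X.T                                   # n x n Gram matrix (invertible since p >> n)
lam_min = np.linalg.eigvalsh(G)[0]            # negative lambdas are valid down to -lam_min
lams = np.linspace(-0.5 * lam_min, 2.0 * lam_min, 200)
risks = []
for lam in lams:
    b = X.T @ np.linalg.solve(G + lam * np.eye(n), y)   # ridge via the Gram form
    risks.append(np.mean((y_test - X_test @ b) ** 2))
print("risk-minimizing lambda:", lams[int(np.argmin(risks))])

If the spike and the SNR are strong enough, the printed minimizer can land at or below zero, matching the claim above; weaken either and it moves back to λ>0.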
This is because augmenting any model with random uncorrelated predictors and using the minimum-norm β-hat estimator is equivalent to adding a ridge penalty (easy to show, see paper). PC1 of X predicts Y, and all the other small PCs act as an implicit ridge regularizer. [4/n]
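The equivalence is easy to check numerically. A sketch (mine, with illustrative sizes): append p_noise pure-noise predictors of variance σ², fit the minimum-norm least-squares solution on the augmented design, and compare its first q coefficients to a ridge fit on the original predictors with λ = p_noise·σ², since ZZᵀ concentrates around p_noise·σ²·I_n:

import numpy as np

rng = np.random.default_rng(0)
n, q, p_noise = 50, 5, 5000
sigma = 1.0                                   # std of each appended noise predictor

X = rng.normal(size=(n, q))                   # original predictors
beta = rng.normal(size=q)
y = X @ beta + 0.1 * rng.normal(size=n)

Z = sigma * rng.normal(size=(n, p_noise))     # random uncorrelated predictors
X_aug = np.hstack([X, Z])

beta_minnorm = np.linalg.pinv(X_aug) @ y      # minimum-norm least-squares solution

lam = p_noise * sigma**2                      # implicit ridge penalty: Z Z^T ~ lam * I_n
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ y)

print(np.c_[beta_minnorm[:q], beta_ridge])    # first q min-norm coefs ~ ridge coefs

Note that with these sizes both solutions are shrunk hard toward zero, which previews the next tweet: with p>>n the implicit penalty is already large.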
Here's the main intuition: if p>>n, this implicit regularization can become *too strong*, so any additional explicit λ>0 only hurts the expected risk (i.e. the test mean squared error) compared to λ=0. Remarkably, in this situation λ<0 can achieve even lower risk. [5/n]
We argue that this is not a freak situation but something that can easily happen with real-life data: the true β is often aligned with the leading PCs of X. I've seen CV curves with a minimum at λ→0 many times on real data. See the preprint for an MNIST demo with random Fourier features. [6/n]
Also see the preprint for a calculation of the derivative of the risk with respect to λ at λ=0. If that derivative is positive, the risk is already increasing at zero, so the argmin must be non-positive. We show how this can happen in the spiked covariance model. [7/n]
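Reusing X, X_test, y, y_test, and G from the sketch after tweet 3, here is a crude finite-difference check of that sign (my addition, not the paper's closed-form derivative):

def risk_at(lam):
    b = X.T @ np.linalg.solve(G + lam * np.eye(n), y)
    return np.mean((y_test - X_test @ b) ** 2)

eps = 1e-3 * np.linalg.eigvalsh(G)[0]         # tiny step on the scale of the Gram spectrum
print("R'(0) approx:", (risk_at(eps) - risk_at(-eps)) / (2 * eps))
# a positive slope at 0 means small positive lambdas only increase the risk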
Some related recent/parallel work: Hastie et al. 2019; Mei & Montanari (@Andrea__M) 2019; Nakkiran (@PreetumNakkiran) et al. 2020; and Liang & Rakhlin 2018 on the ridge penalty in kernel methods. See our Intro/Discussion for many more references. END. [8/8]