Did you know that the optimal ridge penalty λ in linear regression can be *negative*? It is always strictly positive when n>p, or when cov(X)=I, or when the true β is random. But here we argue that it can be zero or even negative when p>>n: https://arxiv.org/abs/1805.10939 . HOW?! [1/n]
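For reference, here is the setup the thread assumes (standard ridge regression; the notation below is my addition, not from the paper). When p>n, the second, "Gram" form is the one that stays well-defined for λ≤0, as long as XXᵀ+λI is still invertible:

\hat\beta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top y = X^\top (X X^\top + \lambda I_n)^{-1} y,
\qquad
R(\lambda) = \mathbb{E}\big[(x^\top \hat\beta_\lambda - x^\top \beta)^2\big].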
This paper started with a question I asked on CrossValidated: https://stats.stackexchange.com/questions/328630. Two people gave great answers, and we eventually decided to write it up. The question was: how come some of these CV curves have a minimum at λ→0? It's p>>n! Why doesn't it overfit? [2/n]
We realized that this can be reproduced in a toy model where cov(X) has one large leading principal component (the "spiked covariance" model) and the true β is aligned with this PC1. When p>>n (and the SNR is high), the lowest expected risk is achieved at λ<0 (and this risk is low!). [3/n]
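A minimal numpy sketch of this toy model (my addition; the spike strength, SNR, and sample sizes are illustrative, not the paper's values). It traces the test MSE over a grid of λ that extends below zero via the Gram form:

import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 1000
spike = 50.0                                  # variance of the single large PC
v = rng.normal(size=p)
v /= np.linalg.norm(v)

def sample_X(m):
    # rows ~ N(0, Sigma) with Sigma = I + (spike - 1) * v v^T  (spiked covariance)
    return rng.normal(size=(m, p)) + np.sqrt(spike - 1.0) * rng.normal(size=(m, 1)) * v

beta = 5.0 * v                                # true beta aligned with PC1 => high SNR
X, X_test = sample_X(n), sample_X(2000)
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=2000)

G = X @ X.T                                   # n x n Gram matrix (invertible since p >> n)
lam_min = np.linalg.eigvalsh(G)[0]            # negative lambdas are valid down to -lam_min
lams = np.linspace(-0.5 * lam_min, 2.0 * lam_min, 200)
risks = []
for lam in lams:
    b = X.T @ np.linalg.solve(G + lam * np.eye(n), y)   # ridge via the Gram form
    risks.append(np.mean((y_test - X_test @ b) ** 2))
print("risk-minimizing lambda:", lams[int(np.argmin(risks))])

If the spike and the SNR are strong enough, the printed minimizer can land at or below zero, matching the claim above; weaken either and it moves back to λ>0.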
This is because augmenting any model with random uncorrelated predictors and using the minimum-norm β-hat estimator is equivalent to adding a ridge penalty (easy to show, see paper). PC1 of X predicts Y, and all the other small PCs act as an implicit ridge regularizer. [4/n]
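The equivalence is easy to check numerically. A sketch (mine, with illustrative sizes): append p_noise pure-noise predictors of variance σ², fit the minimum-norm least-squares solution on the augmented design, and compare its first q coefficients to a ridge fit on the original predictors with λ = p_noise·σ², since ZZᵀ concentrates around p_noise·σ²·I_n:

import numpy as np

rng = np.random.default_rng(0)
n, q, p_noise = 50, 5, 5000
sigma = 1.0                                   # std of each appended noise predictor

X = rng.normal(size=(n, q))                   # original predictors
beta = rng.normal(size=q)
y = X @ beta + 0.1 * rng.normal(size=n)

Z = sigma * rng.normal(size=(n, p_noise))     # random uncorrelated predictors
X_aug = np.hstack([X, Z])

beta_minnorm = np.linalg.pinv(X_aug) @ y      # minimum-norm least-squares solution

lam = p_noise * sigma**2                      # implicit ridge penalty: Z Z^T ~ lam * I_n
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ y)

print(np.c_[beta_minnorm[:q], beta_ridge])    # first q min-norm coefs ~ ridge coefs

Note that with these sizes both solutions are shrunk hard toward zero, which previews the next tweet: with p>>n the implicit penalty is already large.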
Here's the main intuition: if p>>n, this implicit regularization can become *too strong*, so any additional explicit λ>0 only hurts the expected risk (i.e. the test mean squared error) compared to λ=0. Remarkably, in this situation λ<0 can achieve even lower risk. [5/n]
We argue that this is not a freak situation but something that can easily happen with real-life data: the true β is often aligned with the leading PCs of X. I've seen CV curves with a minimum at λ→0 many times on real data. See the preprint for an MNIST demo with random Fourier features. [6/n]
Also see the preprint for a calculation of the derivative of the risk with respect to λ at λ=0. If that derivative is positive, the risk is already increasing at zero, so the argmin must be non-positive. We show how this can happen in the spiked covariance model. [7/n]
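Reusing X, X_test, y, y_test, and G from the sketch after tweet 3, here is a crude finite-difference check of that sign (my addition, not the paper's closed-form derivative):

def risk_at(lam):
    b = X.T @ np.linalg.solve(G + lam * np.eye(n), y)
    return np.mean((y_test - X_test @ b) ** 2)

eps = 1e-3 * np.linalg.eigvalsh(G)[0]         # tiny step on the scale of the Gram spectrum
print("R'(0) approx:", (risk_at(eps) - risk_at(-eps)) / (2 * eps))
# a positive slope at 0 means small positive lambdas only increase the risk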
Some related recent/parallel work: Hastie et al. 2019; Mei & Montanari (@Andrea__M) 2019; Nakkiran (@PreetumNakkiran) et al. 2020; and Liang & Rakhlin 2018 on the ridge penalty in kernel methods. See our Intro/Discussion for many more references. END. [8/8]