# Surprises in High-Dimensional Ridgeless Least Squares Interpolation

##### Posted on (Update: )

This post is based on Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2019). Surprises in High-Dimensional Ridgeless Least Squares Interpolation. 53.

## Introduction

- Zhang et al. (2016) showed that state-of-art deep neural network architectures are complex enough that they can be trained to interpolate the data even when the actual labels are replaced by entirely random ones in a thought-provoking experiment.
- Deep neural networks are frequently observed to generalize well in practice despite their enormous complexity. It seems to defy conventional statistical wisdom: interpolation (vanishing training error) is commonly taken to be a proxy for overfitting or poor generalization.
- Belkin et al. (2018bca) pointed out that these concepts are in general distinct, and interpolation does not contradict generalization.
- Estimation in RKHS is a well-understood setting in which interpolation can coexist with good generalization (Liang and Rakhlin, 2018)

### What least squares has to do with neural networks

Suppose that we have a nonlinear model $\bbE(y_i\mid z_i)=f(z_i;\theta)$ relating responses $y_i\in\bbR$ to inputs $z_i\in\bbR^d,i=1,\ldots,n$, via a parameter vector $\theta\in\bbR^p$.

In some problems, the number of parameters $p$ is so large that training effectively moves each of them just a small amount w.r.t. some random initialization $\theta_0\in\bbR^p$. It thus makes sense to linearize the model around $\theta_0$.

Further, supposing that the initialization is such that $f(z;\theta_0)\approx 0$, and letting $\theta = \theta_0+\beta$, then

\[\bbE(y_i\mid z_i) \approx \nabla_\theta f(z_i;\theta_0)^T\beta,\qquad i=1,\ldots,n.\]Therefore, we are led to consider a linear regression problem, with random features $x_i=\nabla_\theta f(z_i;\theta_0),i=1,\ldots,n$ of high-dimensionality ($p$ much greater than $n$).

In this setting, many vectors $\beta$ give rise to a model that interpolates the data. However, using gradient descent on the least squares objective for training yields a special interpolating parameter $\hat\beta$ (having implicit regularity): the least squares solution with minimum $\ell_2$ norm.

### Two different models for the feature distribution.

- a linear model, $x_i=\Sigma^{1/2}z_i\in \bbR^p$, where $z_i\in\bbR^p$ has iid entries with zero mean and unit variance and $\Sigma\in\bbR^{p\times p}$ deterministic and positive definite. This is a standard model in random matrix theory, and we can obatin a fairly complete asymptotic characterization of the out-of-sample prediction risk.
- a nonlinear model, $x_i=\phi(Wz_i)\in\bbR^p$ with $z_i\in\bbR^d, W\in \bbR^{p\times d}$, and $\phi$ an activation function acting componentwise on $Wz_i$.

The authors recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machine, including “double descent” behaviour of the prediction risk, and the potential benefits of overparametrization.

### Summary of results

Analyze the out-of-sample prediction risk of the minimum $\ell_2$ norm (min-norm, for short) least squares estimator, in an asymptotic setup where both the number of samples and features diverge, $n,p\rightarrow \infty$, and their ratio convergences to a nonzero constant, $p/n\rightarrow \gamma\in (0,\infty)$.

- $\gamma < 1$: underparametrized
- $\gamma > 1$: overparametrized

The results are:

- If $\gamma < 1$, the risk is purely variance (there is no bias), and does not depend on $\beta,\Sigma$. Moreover, the risk diverges as we approach the interpolation boundary (as $\gamma\rightarrow 1$)
*a known result that has appeared in various places in the random matrix theory literature* - If $\gamma > 1$, the risk is composed of both bias and variance, and generally depends on $\beta,\Sigma$. In the two models, the bias is decreasing with strength of correlation, and the variance is nondecreasing.
- If $\SNR \le 1$, the risk is decreasing for $\gamma\in (1,\infty)$, and approaches the null risk (from above) as $\gamma \rightarrow\infty$
- If $\SNR > 1$, the risk has a local minimum on $\gamma\in (1,\infty)$, is better than the null risk for large enough $\gamma$, and approaches the null risk (from below) as $\gamma \rightarrow \infty$.
- For a misspecified model, when $\SNR >1$, the risk can attain its global minimum on $\gamma\in(1,\infty)$ (when there is strong enough approximation bias)
- Optimally-tuned ridge regression dominates the min-norm least squares estimator in risk, across all values of $\gamma$ and SNR, in both the well-specified and misspecified settings. For a misspecified model, optimally-tuned ridge regression attains its global minimum around $\gamma=1$
- Tuning ridge regression by minimizing the leave-one-out cross-validation (CV) error is asymptotically optimal
*points 3 through 6 are formally established for isotropic features, $\Sigma=I$, but qualitatively similar behavior holds for general $\Sigma$.* - For the nonlinear model, as $n,p,d\rightarrow \infty$, with $p/n\rightarrow\gamma\in(0,\infty)$ and $d/p\rightarrow\psi\in(0,1)$, the limiting variance is increasing for $\gamma\in (0,1)$, decreasing for $\gamma\in(1,\infty)$, and diverging as $\gamma\rightarrow 1$. In fact, if $\gamma < 1$, the variance coincides exactly with one in the linear model case
- If $\gamma > 1$, the asymptotic formula for the variance in the nonlinear model depends only on the activation function $\psi$ via its “linear component” $\bbE[G\phi(G)]$.
- If $\gamma > 1$, for a “purely nonlinear” activation function $\psi$, meaning $\bbE[G\psi(G)]=0$, the asymptotic formula for the variance coincides with the one derived for the linear model case, despite the fact that the input dimension $d$ can be much smaller than the number of parameters $p$.
*the nonlinear model results are not covered by existing literature on random matrix theory*

### Intuition and Implications

#### Bias and Variance

When $p > n$, **the min-norm least squares estimate of $\beta$** is constrained to lie the row space of $X$. This is a subspace of dimension $n$ lying in a feature space of dimension $p$.

#### Double Descent

#### In-sample prediction risk

The focus throughout this paper is **out-of-sample prediction risk**.

For the two models, the in-sample prediction risk of the min-norm least squares estimator $\hat\beta$ is

\[\frac 1n\bbE[\Vert X\hat\beta - X\beta\Vert_2^2\mid X] = \sigma^2(p/n \land 1)\,.\]\[\begin{align*} \bbE[\Vert X\hat\beta - X\beta\Vert_2^2\mid X] &= \bbE[(\hat\beta-\beta)'X'X(\hat\beta-\beta)\mid X]\\ &=\tr[X'X\Cov(\hat\beta)] + (E(\hat\beta)-\beta)'X'X(E(\hat\beta) - \beta)\\ \end{align*}\]where $E(\hat\beta) = \beta$, and $\Cov(\hat\beta)=(X’X)^+\sigma^2$. If $p > n$,

\[X'X = V \begin{bmatrix} \Lambda_n & \\ & 0_{p-n} \end{bmatrix} V'\]then

\[(X'X)^+ = V \begin{bmatrix} \Lambda_n^{-1} & \\ & 0_{p-n} \end{bmatrix} V'\,.\]The following simple Julia snippet can be used to validate the above relationship.

`X = randn(10, 20) A = X' * X Λ, V = eigen(A) Λ1 = Λ[11:end] Λ2 = vcat(zeros(10), 1 ./ Λ1) norm(V * diagm(Λ2) * V' - pinv(A), 2) # 3.027733893412882e-15`

The in-sample prediction risk is not always a good proxy for the out-of-sample prediction risk. Although much of classical regression theory is based on the former, such as optimism, effective degrees of freedom, and covariance penalities, the latter is more broadly relevant to practice.

#### Interpolation versus Regularization

It is worthy noting that early stopped gradient descent is known to be closely connected to ridge regularization.

A tight coupling between the two has been recently developed for least squares problems. This coupling implies that optimally-stopped gradient descent will have risk at most 1.22 times that of optimally-tuned ridge regression, and hence will often have better risk than min-norm least squares.

In practice, we would not have access to the optimal tuning parameter for ridge, and would rely on CV. The theory shows that for ridge regression, CV tuning is asymptotically equivalent to optimal tuning.

Historically, the debate between interpolation and regularization has been alive for the last 30 or so years.

#### Virtues of nonlinearity

It is unclear whether the success in precisely analyzing the linear model setting should carry over to the nonlinear setting. Remarkably, under high-dimensional asymptotics with $n,p,d\rightarrow\infty,p/n\rightarrow \gamma$, and $d/p\rightarrow\psi$, this turns out to be the case.

There is a critical difference between neural networks and the studies in this paper: in a neural network, the feature representation and the regression function or classifier are learned simultaneously. In both linear and nonlinear model settings, the features $X$ are not learned, but observed.

## Preliminaries

### Data model and risk

- training data: $(x_i, y_i)\in \IR^p \times \IR,i=1,\ldots,n$
- test point: $x_0\sim P_x$

For an estimator $\hat\beta$ (a function of the training data $X, y$), define its out-of-sample prediction risk

\[R_X(\hat\beta,\beta) = \E[(x_0^T\hat\beta - x_0^T\beta)^2\mid X]=\E[\Vert \hat\beta - \beta\Vert_\Sigma^2\mid X]\,,\]where $\Vert x\Vert_\Sigma^2 = x^T\Sigma x$.

It has the bias-variance decomposition

\[R_X(\hat\beta;\beta) = \Vert \E(\hat\beta\mid X) - \beta\Vert_\Sigma^2 + \Tr[\Cov(\hat\beta\mid X)\Sigma] = B_X(\hat\beta;\beta) + V_X(\hat\beta;\beta)\]Generally, the risk function measures the accuracy of an estimator $\delta(\theta)$ for parameter $\theta$, $R(\theta,\delta) = E_\theta[L(\theta, \delta(X))]$. Two natural global measures of the size of the risk are the average $\int R(\theta, \delta)w(\theta)d\theta$ **(Bayes)** and the maximum $\sup_\Omega R(\theta,\delta)$ **(minimax)**.