Overparameterized Regression
tags: [ src:paper ]
src: Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 372–89. — section 3 on \(\ell_p\) regression
They show that for simple scalar linear regression with \(\ell_p\) loss (\(p\) even), i.e. the loss function
\[ \begin{align*} L(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\frac{1}{p}\left(y - \mathbf{x}^\top \mathbf{w}\right)^p\right], \end{align*} \]
if you overparameterize ever so slightly to
\[ \begin{align*} \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\frac{1}{p}\left(y - \mathbf{x}^\top \mathbf{w}_1\, w_2\right)^p\right], \end{align*} \]
then gradient descent on \((\mathbf{w}_1, w_2)\), viewed through the end-to-end parameter \(\mathbf{w} = w_2 \mathbf{w}_1\), behaves like an accelerated scheme: an adaptive step size combined with a momentum-like dependence on past gradients.
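Concretely, a quick chain-rule check (my computation, in the continuous-time limit, not quoted from the paper): gradient flow on \((\mathbf{w}_1, w_2)\) induces, on the end-to-end parameter \(\mathbf{w} = w_2 \mathbf{w}_1\), the preconditioned flow

\[ \begin{align*} \dot{\mathbf{w}} = w_2\,\dot{\mathbf{w}}_1 + \dot{w}_2\,\mathbf{w}_1 = -\left(w_2^2 I + \mathbf{w}_1 \mathbf{w}_1^\top\right)\nabla L(\mathbf{w}), \end{align*} \]

where \(L\) is the \(\ell_p\) loss in the original parameterization. The preconditioner moves with the parameters, and in the paper's discrete-time analysis this shows up as an adaptive learning rate plus a momentum-like term.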
- I remember Arora presenting on this particular result a few years back (or something like this), and finding it very intriguing.
- I think this goes nicely with the theme of [[statistics-vs-ml]], though it’s a slightly different angle.
- essentially: we statisticians are wary of overparameterization, because it destroys identifiability / leads to ambiguity
- but when it comes to these gradient methods, overparameterization can actually help optimization
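A minimal numerical sketch of the setup (my own construction, not the paper's experiment: data, \(p = 4\), step size, and initialization are all choices I made), comparing plain gradient descent on \(\mathbf{w}\) with gradient descent on the factorization \(\mathbf{w} = w_2 \mathbf{w}_1\):

```python
import numpy as np

# Synthetic data for scalar l_p regression with p = 4 (even p, as in the paper's setup).
rng = np.random.default_rng(0)
d, n, p = 3, 200, 4
X = rng.normal(size=(n, d))
w_true = 0.5 * rng.normal(size=d)
y = X @ w_true  # noiseless targets for simplicity

def loss(w):
    return np.mean((y - X @ w) ** p) / p

def grad(w):
    # d/dw of (1/p) * mean (y - x.w)^p, for even p
    r = y - X @ w
    return -(X * r[:, None] ** (p - 1)).mean(axis=0)

lr, steps = 0.02, 5000

# Plain gradient descent on w.
w = np.zeros(d)
for _ in range(steps):
    w -= lr * grad(w)

# Gradient descent on the overparameterized form w = w2 * w1
# (w1 a vector, w2 a scalar); updates follow from the chain rule.
w1, w2 = np.full(d, 0.1), 0.1  # small nonzero init so gradients flow
for _ in range(steps):
    g = grad(w2 * w1)
    w1, w2 = w1 - lr * w2 * g, w2 - lr * (g @ w1)

print("plain GD loss:         ", loss(w))
print("overparameterized loss:", loss(w2 * w1))
```

Whether the factored run converges faster here depends on conditioning and initialization; the point of the sketch is only that the two parameterizations trace genuinely different trajectories under the same plain gradient descent.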
Backlinks
- [[on-the-optimization-of-deep-networks]]
- offshoot: [[overparameterized-regression]]