Overparameterized Regression
tags: [ src:paper ]
src: Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 372–89. — section 3 on \(\ell_p\) regression
They show that for simple scalar linear regression with \(\ell_p\) loss (\(p\) even), i.e. the loss function
\[ \begin{align*} L(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\frac{1}{p}\left(y - \mathbf{x}^\top \mathbf{w}\right)^p\right], \end{align*} \]
if you overparameterize ever so slightly to
\[ \begin{align*} \mathbb{E}_{(\mathbf{x},y)\sim S}\left[\frac{1}{p}\left(y - \mathbf{x}^\top \mathbf{w}_1\, w_2\right)^p\right], \end{align*} \]
then gradient descent on \((\mathbf{w}_1, w_2)\), viewed through the end-to-end parameter \(\mathbf{w} = w_2 \mathbf{w}_1\), behaves like an accelerated scheme: an adaptive step size combined with a momentum-like dependence on past gradients.
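Concretely, a quick chain-rule check (my computation, in the continuous-time limit, not quoted from the paper): gradient flow on \((\mathbf{w}_1, w_2)\) induces, on the end-to-end parameter \(\mathbf{w} = w_2 \mathbf{w}_1\), the preconditioned flow

\[ \begin{align*} \dot{\mathbf{w}} = w_2\,\dot{\mathbf{w}}_1 + \dot{w}_2\,\mathbf{w}_1 = -\left(w_2^2 I + \mathbf{w}_1 \mathbf{w}_1^\top\right)\nabla L(\mathbf{w}), \end{align*} \]

where \(L\) is the \(\ell_p\) loss in the original parameterization. The preconditioner moves with the parameters, and in the paper's discrete-time analysis this shows up as an adaptive learning rate plus a momentum-like term.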
- I remember Arora presenting on this particular result a few years back (or something like this), and finding it very intriguing.
- I think this goes nicely with the theme of [[statistics-vs-ml]], though it’s a slightly different angle.
- essentially: we statisticians are wary of overparameterization, because it destroys identifiability / leads to ambiguity
- but when it comes to these gradient methods, overparameterization can actually help optimization
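A minimal numerical sketch of the setup (my own construction, not the paper's experiment: data, \(p = 4\), step size, and initialization are all choices I made), comparing plain gradient descent on \(\mathbf{w}\) with gradient descent on the factorization \(\mathbf{w} = w_2 \mathbf{w}_1\):

```python
import numpy as np

# Synthetic data for scalar l_p regression with p = 4 (even p, as in the paper's setup).
rng = np.random.default_rng(0)
d, n, p = 3, 200, 4
X = rng.normal(size=(n, d))
w_true = 0.5 * rng.normal(size=d)
y = X @ w_true  # noiseless targets for simplicity

def loss(w):
    return np.mean((y - X @ w) ** p) / p

def grad(w):
    # d/dw of (1/p) * mean (y - x.w)^p, for even p
    r = y - X @ w
    return -(X * r[:, None] ** (p - 1)).mean(axis=0)

lr, steps = 0.02, 5000

# Plain gradient descent on w.
w = np.zeros(d)
for _ in range(steps):
    w -= lr * grad(w)

# Gradient descent on the overparameterized form w = w2 * w1
# (w1 a vector, w2 a scalar); updates follow from the chain rule.
w1, w2 = np.full(d, 0.1), 0.1  # small nonzero init so gradients flow
for _ in range(steps):
    g = grad(w2 * w1)
    w1, w2 = w1 - lr * w2 * g, w2 - lr * (g @ w1)

print("plain GD loss:         ", loss(w))
print("overparameterized loss:", loss(w2 * w1))
```

Whether the factored run converges faster here depends on conditioning and initialization; the point of the sketch is only that the two parameterizations trace genuinely different trajectories under the same plain gradient descent.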
Backlinks
- [[on-the-optimization-of-deep-networks]]
- offshoot: [[overparameterized-regression]]