202005181436
Project Penalty
tags: [ proj:penalty ]
(Just submitted this to NeurIPS.)
- In trying to understand why deep learning works so well, what we’ve come to realize is that gradient descent, which seems to be incredibly simple, actually ends up doing a lot of things in the background.
- For instance, under certain regimes (e.g. initialization at zero), gradient descent on the least squares problem with a linear model (i.e. a depth-1 neural network) converges to the minimum Frobenius norm solution.
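A minimal numerical sketch of this fact (an illustrative toy problem, not taken from any paper): on an underdetermined least squares problem, gradient descent initialized at zero lands on the same solution as the pseudoinverse, which is exactly the minimum-norm one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                       # underdetermined: more parameters than data
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Gradient descent on 0.5 * ||A @ w - b||^2, starting from w = 0.
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    w -= lr * A.T @ (A @ w - b)

# The minimum-norm least-squares solution, via the pseudoinverse.
w_star = np.linalg.pinv(A) @ b
print(np.allclose(w, w_star, atol=1e-6))  # → True
```

The point is that nothing in the loss asks for the minimum-norm solution; the zero initialization and the gradient dynamics (which never leave the row space of `A`) supply that bias implicitly.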
- This is basically the idea of implicit regularization. In particular, we’re interested in what kinds of biases are induced by the combination of the chosen parameterization and optimizer.
- Recently, it was conjectured (Gunasekar et al. 2017, “Implicit regularization in matrix factorization,” NeurIPS) that for the problem of matrix completion, a 2-layer linear neural network trained using gradient descent converges to the minimum nuclear norm solution.
- More recently, this conjecture was disputed by Arora et al. (2019, “Implicit Regularization in Deep Matrix Factorization,” arXiv:1905.13655), who show that if you move to depth-3 (and higher; in theory, things should get better with depth, though in practice depth-3 is sufficient, and things get a little weird if you increase the depth too much), then it actually does better than the nuclear norm.
- You have to be in a data-poor regime, where the minimum nuclear norm solution and the minimum rank solution don’t coincide; otherwise you’re potentially conflating the two.
- To summarize, they find that depth is doing this really interesting thing in the linear neural network setting, whereby it’s biasing towards low-rank solutions/parameters.
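The setting above can be sketched as follows (an assumed toy setup, not the paper’s exact experiment): parameterize the completed matrix as a depth-3 product `W3 @ W2 @ W1` and fit only the observed entries of a low-rank ground truth with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
truth = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5              # which entries are observed

# Depth-3 linear network: the completed matrix is the product W3 @ W2 @ W1.
Ws = [0.1 * rng.standard_normal((n, n)) for _ in range(3)]

def masked_loss(Ws):
    X = Ws[2] @ Ws[1] @ Ws[0]
    return 0.5 * np.sum((mask * (X - truth)) ** 2)

loss_start = masked_loss(Ws)
lr = 5e-3
for _ in range(3000):
    X = Ws[2] @ Ws[1] @ Ws[0]
    E = mask * (X - truth)                   # residual on observed entries
    grads = [
        (Ws[2] @ Ws[1]).T @ E,               # d loss / d W1
        Ws[2].T @ E @ Ws[0].T,               # d loss / d W2
        E @ (Ws[1] @ Ws[0]).T,               # d loss / d W3
    ]
    for W, g in zip(Ws, grads):
        W -= lr * g                          # in-place gradient step
loss_end = masked_loss(Ws)

# In this regime one typically sees the spectrum of the learned product
# concentrate on a few singular values, i.e. a bias toward low rank.
svals = np.linalg.svd(Ws[2] @ Ws[1] @ Ws[0], compute_uv=False)
print(loss_end < loss_start)                 # → True
```

Nothing in the loss penalizes rank; any low-rank tendency comes from the depth of the parameterization and the gradient dynamics.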
- Further, they conjecture that this effect cannot be captured by a simple penalty (given that the nuclear norm doesn’t work).
- As it turns out, there is an explicit penalty. And it’s given by the ratio of the nuclear norm to the Frobenius norm (see [[effectiveness-of-normalized-quantities]]).
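A quick numerical check of this penalty (the `ratio_penalty` helper and the matrix sizes are illustrative choices): the ratio is bounded below by 1, with the minimum attained exactly at rank one, so penalizing it pushes toward low-rank solutions in a way the nuclear norm alone does not.

```python
import numpy as np

def ratio_penalty(X):
    # Nuclear norm over Frobenius norm; equals 1 exactly when X is rank one.
    s = np.linalg.svd(X, compute_uv=False)
    return s.sum() / np.sqrt(np.sum(s ** 2))

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((50, 1)) @ rng.standard_normal((1, 50))
full_rank = rng.standard_normal((50, 50))

print(round(ratio_penalty(low_rank), 6))  # → 1.0
print(ratio_penalty(full_rank) > 3)       # → True (well above the minimum)
```

Note that the ratio is scale-invariant: multiplying `X` by a constant changes both norms by the same factor, so the penalty measures only the shape of the spectrum, not its size.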
- But there’s a second piece to this puzzle. A recent result (Arora, Cohen, and Hazan 2018, “On the optimization of deep networks: Implicit acceleration by overparameterization,” ICML) showed that the effect of depth is also something akin to an accelerated gradient method. That is, it has memory of past gradients, and uses that to its advantage. They speculated, again, that this is distinct from explicit acceleration methods.
- As it turns out, we suspect it is very similar to Adam, the go-to optimizer in deep learning.
- Let’s recap: these two papers above have found really interesting implicit biases that depth (or over-parameterization) is achieving. And importantly, they think these things are unique to implicit regularization – that they can’t just be replaced with some explicit form, which explains why deep learning does so well generally.
- What we find is that, actually, in this context, you can indeed do exactly that. You can find explicit forms for the effects of depth! How can we tell?
- Let’s just start by getting rid of depth, and consider a linear model that we’re going to train with gradient methods (starting with gradient descent).
- If you add our penalty, absolutely nothing happens. It’s a little disheartening actually.
- However, if you replace the optimizer with Adam, then all of a sudden, magic happens: this combination obtains state-of-the-art performance.
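A minimal sketch of this combination (the hyperparameters `lr`, `lam`, and the `ratio_grad` helper are illustrative assumptions, not the actual experimental settings): a hand-rolled Adam update applied to a depth-1 model, i.e. optimizing the completed matrix directly, with the explicit nuclear-over-Frobenius penalty added to the loss.

```python
import numpy as np

def ratio_grad(X):
    # Gradient of ||X||_* / ||X||_F via the SVD: d||X||_*/dX = U @ Vt (for a
    # full-rank X) and d||X||_F/dX = X / ||X||_F.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    nuc, fro = s.sum(), np.sqrt(np.sum(s ** 2))
    return (U @ Vt) / fro - (nuc / fro ** 3) * X

rng = np.random.default_rng(0)
n, r = 20, 2
truth = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5

# Depth-1 model: optimize the completed matrix X directly with Adam.
X = 0.1 * rng.standard_normal((n, n))
m, v = np.zeros_like(X), np.zeros_like(X)
lr, b1, b2, eps, lam = 1e-2, 0.9, 0.999, 1e-8, 1.0

for t in range(1, 5001):
    g = mask * (X - truth) + lam * ratio_grad(X)
    m = b1 * m + (1 - b1) * g              # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g ** 2         # second-moment estimate, element-wise
    X -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

fit_err = np.linalg.norm(mask * (X - truth)) / np.linalg.norm(mask * truth)
print(fit_err < 0.3)  # → True: the observed entries are fit well
```

Note the element-wise `v` update: this is the squared-gradient term that makes Adam hard to analyze against the matrix-structured penalty gradient.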
- A debate I had with my collaborator on this is how to interpret these results once you include depth. My contention is that if you then add depth into the picture, you can’t really say that you’ve captured depth, because now these two effects are interacting. Also, just because you capture something, that capture might not be an idempotent function (in other words, the effect of depth is not like a projection: applying depth twice is not the same as applying it once; when we talk about things being orthogonal to one another, we might inadvertently assume everything is linear and therefore idempotent). In any case, what we find is that, generally speaking, if you add depth, things change, which is unsurprising.
- What this suggests to us is that actually, what depth is doing is some combination of rank-reduction that can be cleanly captured by our penalty, and something like Adam (definitely not orthogonal to it).
- Indeed, the fact that our model outperforms that of deep matrix factorization suggests that any orthogonal component is actually harmful!
- It may seem like this isn’t that strong of a result, since we haven’t proven any form of equivalence, but I think that’s the wrong way to think about it: there is most likely not going to be anything that exactly replicates the effect of depth. What we’re learning is that over-parameterization is a fairly complicated beast, and since it’s implicit, it’s very difficult to pin down. We can look at things like the singular values, and try to model their dynamics, but that’s still not enough to explain the discrepancy.
- The nice thing about matrix completion is that it’s a very simple setting, and the generalization performance varies quite drastically across methods, so you can use that to guide intuition about which biases are actually helpful.
- There is oftentimes beauty and truth in simple ideas. The fact that things only happen at the intersection of our penalty and Adam, together with the simplicity of this model, gives me confidence that there is something fundamental at play here: certainly for the problem of matrix completion, but hopefully more generally too.
- Is it the case that if we replace our regularizers with normalized versions, then all of a sudden we’re going to get even better performance?
Backlinks
- [[master-paper-list]]
- [[project-penalty]]
- [[matrix-factorization-to-linear-model]]
- One of the reasons we never tried going to depth-1 for the [[project-penalty]] was that it felt like it would be an impossible task. In particular, one of the advantages of depth 2 is that you are explicitly finding a decomposition of the matrix into two matrices. It felt like therefore you needed at least depth-2 to get something interesting, as that structure would be forced.
- [[discretization-of-gradient-flow]]
- Finally, in relation to the [[project-penalty]], this doesn’t actually work for things like Adam, which we know to be amazing in practice. So, it seems like there’s still work left to understand why on earth the adaptive learning rates work so well.
- [[the-unreasonable-effectiveness-of-adam]]
- Let’s consider a simple test-bed to hopefully elicit some insight. In the context of the [[project-penalty]]: we want to understand why Adam + our penalty is able to recover the matrix. In our paper, we analyze the gradient of the ratio penalty, and give some heuristics for why it might be better than just the nuclear norm. However, all this analysis breaks down when you move to Adam, because the squared gradient term is applied element-wise.
- [[implicit-regularization]]
- In fact, that’s exactly the hope that [[project-penalty]] provides: in a very simple example, we see that we can replace implicit by explicit, to great effect.