#optimization
Matrix Completion Optimization
Let’s try to understand the landscape of optimization in matrix completion.
Matrix Factorization
A celebrated result in this area is Ge, Lee, and Ma (2016, “Matrix completion has no spurious local minimum,” NeurIPS). They consider the matrix factorization problem (with a symmetric matrix, so that it can be written as \(XX^\top\)). Note that this problem is non-convex. (The general version of this problem, with \(UV^\top\), is interestingly at least alternatingly (marginally) convex, which is why people do alternating minimization.) Under some typical incoherence conditions (which ensure that there is enough signal in the matrix, preventing cases where, say, all the observed entries are zero), they are able to show that this optimization problem has no spurious local minima. In other words, every local minimum of the function \[ \begin{align*} \widetilde{g}(x) = \norm{ P_\Omega (M - X X^\top) }_F^2, \end{align*} \] is also a global minimum (which means \(\widetilde{g} = 0\)). (They were inspired, in part, by the empirical observation that random initialization sufficed, even though all the theory at the time required either conditions on the initialization, or some more elaborate procedure.)
Proof technique: they compare this sample version \(\widetilde{g}(x)\) to the population version \(g(x) = \norm{ M - X X^\top }_F^2\), and use statistics and concentration inequalities to get the desired bounds.
Remark. The headline (no spurious local minima) sounds amazing, but the problem is not actually that non-convex: as it turns out, it is non-convex (i.e. has multiple optima) basically because of identifiability issues with orthogonal rotations of the factor (which cancel in the product \(XX^\top\)). I wonder if the incoherence condition is important here. I don’t think so, as it’s a pretty broad set, so it really does seem like this problem is pretty well-behaved.
A nice thing about the factorization is that zero error basically means you’re done (given some conditions on sample size), since anything that factorizes and gets zero error must be the low-rank answer. To reiterate, they don’t use any explicit penalty (except something that’s helpful for the proof technique, to ensure you stay within the incoherent regime). The matrix factorization form suffices.
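As a sanity check of that picture (a minimal sketch, not their algorithm: the sizes, rank, sampling rate, step size, and iteration count below are all illustrative), plain gradient descent from a small random initialization on \(\norm{P_\Omega(M - XX^\top)}_F^2\) does drive the observed-entry error to essentially zero:

```python
import numpy as np

# Sketch: gradient descent from random init on the (non-convex)
# symmetric factorization objective ||P_Omega(M - X X^T)||_F^2.
rng = np.random.default_rng(0)
n, r = 15, 2
Z = rng.normal(size=(n, r))
M = Z @ Z.T                          # rank-2 symmetric ground truth
mask = rng.random((n, n)) < 0.7
mask = mask | mask.T                 # symmetric observation pattern

X = 0.1 * rng.normal(size=(n, r))    # small random initialization
eta = 0.005
for _ in range(5000):
    R = mask * (X @ X.T - M)         # residual on observed entries
    X = X - eta * 4.0 * (R @ X)      # gradient of the loss (R is symmetric here)
```

After the loop, the loss on the observed entries is tiny, matching the “random initialization suffices” observation that motivated the theory.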
Important: something I missed the first time round, but which actually makes this somewhat less impressive: they basically assume the rank is given, and then force the factor to have that dimension. This differs from the deep matrix factorization of Arora et al. (2019, “Implicit Regularization in Deep Matrix Factorization,” http://arxiv.org/abs/1905.13655v3), where the dimensions aren’t forced. This doesn’t seem that impressive anymore.
Non-Convex Optimization
src: off-convex.
Finding the global optimum of a general non-convex problem is NP-hard. But even setting aside local vs. global, it is difficult just to differentiate between local minima and saddle points.
Saddle points are stationary points where the gradient is zero but which are not local minima: typically the point is a minimum along some directions and a maximum along others. Since gradient descent moves in the direction of the gradient, which vanishes there, it basically gets stuck.

If one has access to second-order information (i.e. the Hessian), then it is easy to distinguish minima from saddle points (the Hessian is PSD at a minimum, but has a negative eigenvalue at a strict saddle). However, it is usually too expensive to compute the Hessian.
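As a toy illustration of that second-order test (using the hypothetical function \(f(x, y) = x^2 - y^2\), which has a saddle at the origin), it is just an eigenvalue check on the Hessian:

```python
import numpy as np

# f(x, y) = x^2 - y^2 is stationary at the origin, where the
# Hessian is the constant matrix diag(2, -2).
H = np.array([[2.0, 0.0], [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(H)

# All eigenvalues >= 0 (PSD) would indicate a local minimum; one
# negative and one positive eigenvalue reveals a saddle.
is_saddle = eigvals.min() < 0.0 < eigvals.max()
```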
What you might hope is that the saddle points in the problem are somewhat well-behaved: there exists (at least) one direction such that the gradient is pretty steep. This way, if you can perturb your gradient, then hopefully you’ll accidentally end up in that direction that lets you escape it.
That’s where noisy gradient descent (NGD) comes into play. This is actually a little different from stochastic gradient descent, which is noisy, but whose noise is a function of the data. What we want is to perturb the gradient itself, so you just add a noise vector with mean 0. This allows the algorithm to effectively explore all directions. (In subsequent work, it has been established that plain gradient descent can also escape saddle points asymptotically, but it can take exponential time; NGD, on the other hand, does so in polynomial time.)
What they show is that NGD is able to escape those saddle points, as it eventually finds that escape direction. However, the added noise might cause problems at actual minima, since it might perturb the updates enough to push the iterate out of the minimum. Thankfully, they show that this isn’t the case. Intuitively, this feels like it shouldn’t be that hard: if you imagine a marble on these surfaces, it seems pretty easy to perturb it enough to escape the saddles, while it seems fairly difficult to get out of the wells.
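A minimal numerical sketch of that marble picture (the test function, noise scale, and step size are all illustrative): on \(f(x, y) = x^2 - y^2\), gradient descent started exactly on the \(x\)-axis converges to the saddle at the origin, while the noisy version picks up a \(y\)-component and escapes along the steep direction.

```python
import numpy as np

# Toy saddle: f(x, y) = x^2 - y^2, with gradient (2x, -2y).
def grad(p):
    return np.array([2.0 * p[0], -2.0 * p[1]])

rng = np.random.default_rng(0)
eta = 0.05

gd = np.array([1.0, 0.0])
ngd = np.array([1.0, 0.0])
for _ in range(200):
    gd = gd - eta * grad(gd)
    ngd = ngd - eta * (grad(ngd) + rng.normal(0.0, 0.1, size=2))
# gd is stuck (almost exactly) at the saddle; ngd has escaped along y.
```

In practice one would bound or stop the noisy iterate once the function value has dropped; here it is left to run off along the escape direction for clarity.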
Proof technique: replace \(f\) by its second-order Taylor expansion \(\hat{f}\) around \(x\), allowing you to work with the Hessian directly. Show that the function value drops under NGD, and then show that things don’t change when you go back to \(f\). For this to hold, you need the Hessian to be sufficiently smooth; this is known as the Lipschitz Hessian condition.
It turns out that it is relatively straightforward to extend this to a constrained optimization problem (since you can just write the Lagrangian form, which becomes unconstrained). Then you need to consider the tangent and normal spaces of the constraint set. The algorithm is basically projected noisy gradient descent (simply project onto the constraint set after each gradient update).
On Optimization
Statisticians get their first taste of optimization when they learn about penalized linear regression: \[ \begin{align*} \min_{\beta} \norm{Y - X \beta}_2^2 + \lambda\norm{\beta}_p. \end{align*} \] It turns out that there’s an equivalent formulation (as a constrained minimization) that provides a little more intuition: \[ \begin{align*} \min \norm{Y - X \beta}_2^2 \text{ s.t. } \norm{\beta}_p \leq s, \end{align*} \]
Figure 1: The OG image.
where there is a one-to-one relation between \(\lambda\) and \(s\). This different view provides a geometric interpretation: depending on the geometry induced by the \(l_p\) ball that is the constraint set, you’re going to get different kinds of solutions (see Fig. 1).
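A small numerical illustration of that geometry (a sketch: problem sizes and \(\lambda\) are illustrative; the \(l_1\) problem is solved with proximal gradient / ISTA, whose prox step is soft-thresholding): the \(l_1\) solution lands exactly on zeros, where the \(l_2\) (ridge) solution merely shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]               # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

lam = 5.0
L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the gradient

# ISTA for the l1-penalized problem.
beta_l1 = np.zeros(p)
for _ in range(2000):
    g = X.T @ (X @ beta_l1 - y)                # gradient of 0.5 * ||y - X b||^2
    z = beta_l1 - g / L
    beta_l1 = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

# Ridge (l2) has a closed form, and shrinks without zeroing.
beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The corners of the \(l_1\) ball are what produce the exact zeros; the smooth \(l_2\) ball produces none.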
Note that the above equivalence is exact only when the loss and the penalty are both convex. In the #matrix_completion setting, with our penalty, this is no longer the case.
The (classical) convex program (Candes and Recht 2009, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics 9(6): 717–72) is given by: \[ \begin{align*} \min \norm{ W }_\star \text{ s.t. } \norm{ \mathcal{A}(W) - y }_2^2 \leq \epsilon, \end{align*} \] where setting \(\epsilon = 0\) recovers the noiseless setting. The more recent development, using a penalized form, is: \[ \begin{align*} \min_{W} \norm{ \mathcal{A}(W) - y }_2^2 + \lambda \norm{W}_{\star} \end{align*} \] Since they’re both convex, you have equivalence just like for the Lasso. (I’m pretty sure that people have profitably used this equivalence in this context.)
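The penalized form can be solved by proximal gradient, since the prox of the nuclear norm is soft-thresholding of the singular values (SVT). A minimal sketch (sizes, sampling rate, \(\lambda\), and step size are illustrative; here \(\mathcal{A}\) just reads the observed entries):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # rank-2 ground truth
mask = rng.random((n, n)) < 0.5                          # observed entries

def svt(W, tau):
    """Prox of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

lam, step = 0.5, 0.5
W = np.zeros((n, n))
for _ in range(500):
    g = 2.0 * mask * (W - M)              # gradient of ||A(W) - y||^2
    W = svt(W - step * g, step * lam)

rel_err = np.linalg.norm(W - M) / np.linalg.norm(M)
```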
Non-Convex
What happens when we move to non-convex penalties, though? Then, I think, what you have is a duality gap (I could be wrong). If we use our penalty \(\frac{\norm{W}_{\star}}{\norm{W}_{F}}\), then these two views are no longer equivalent. And here I think actually going back to the penalized loss function gives better results. For one thing, it’s no longer a convex program amenable to simple solvers.
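Where the ratio penalty is differentiable (generic \(W\) with distinct nonzero singular values), the quotient rule gives \(\nabla \frac{\norm{W}_\star}{\norm{W}_F} = \frac{UV^\top}{\norm{W}_F} - \norm{W}_\star \frac{W}{\norm{W}_F^3}\). A quick numpy sketch of that gradient (checked against finite differences):

```python
import numpy as np

def ratio_penalty_grad(W):
    """Gradient of ||W||_* / ||W||_F at a generic differentiable point
    (distinct, nonzero singular values), via the quotient rule with
    d||W||_*/dW = U V^T and d||W||_F/dW = W / ||W||_F."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    nuc = s.sum()
    fro = np.linalg.norm(W)
    return (U @ Vt) / fro - nuc * W / fro ** 3

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))
G = ratio_penalty_grad(W)
```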
In our experiments we find that running Adam on this penalized loss function gives great results. My speculation is that this might be the way to solve non-convex programs (I’m sure people have thought about this).
Exponential Learning Rates
via blog and Li and Arora (2019, “An Exponential Learning Rate Schedule for Deep Learning,” http://arxiv.org/abs/1910.07454v3).
Two key properties of SOTA nets: normalization of parameters within layers (Batch Norm); and weight decay (i.e. an \(l_2\) regularizer). For some reason I never thought of BN as falling into the category of normalizations, à la [[effectiveness-of-normalized-quantities]].
It has been noted that BN + WD can be viewed as increasing the learning rate (LR). What they show is the following:
The proof holds for any loss function satisfying scale invariance: \[ \begin{align*} L(c \cdot \theta) = L(\theta) \end{align*} \] Here’s an important Lemma:
Figure 1: Illustration of Lemma
The first result, if you think of it geometrically (Fig. 1), ensures that \(|\theta|\) is increasing. The second result shows that while the loss is scale-invariant, the gradients have a sort of corrective factor such that larger parameters have smaller gradients.
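Based on the surrounding description, the two facts presumably being illustrated are standard consequences of scale invariance (a sketch, not necessarily the paper’s exact statement): differentiating \(L(c\theta) = L(\theta)\) in \(c\) at \(c = 1\) gives \(\langle \nabla L(\theta), \theta \rangle = 0\), and differentiating in \(\theta\) gives \(\nabla L(c\theta) = \frac{1}{c}\nabla L(\theta)\). The first means each gradient step is orthogonal to \(\theta\), so \[ \begin{align*} \norm{\theta_{t+1}}^2 = \norm{\theta_t - \eta \nabla L(\theta_t)}^2 = \norm{\theta_t}^2 + \eta^2 \norm{\nabla L(\theta_t)}^2 \geq \norm{\theta_t}^2, \end{align*} \] i.e. the norm increases. The second says exactly that larger (rescaled) parameters see proportionally smaller gradients.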
Thoughts
The paper itself is more interested in learning rates. What I think is interesting here is the preoccupation with scale-invariance. There seems to be something self-correcting about it that makes it ideal for neural network training. Also, I wonder if there is any way to use the above scale-invariance facts in our proofs.
They also deal with learning rates, except that the rates themselves are uniform across all parameters, making them much easier to analyze, unlike Adam where you have adaptivity.
The Unreasonable Effectiveness of Adam
Intuition about Adam / History
We talk about gradient descent (GD), which is a first-order approximation (via Taylor expansion) of minimizing some loss, and Newton’s method (NM) is the second-order version. The key difference is in the step-size, which, in the second-order case, is actually the inverse Hessian (think curvature).
Nesterov comes along and wonders if GD is optimal. It turns out that if you make use of past gradients via momentum, then you can get better convergence (for convex problems). This is basically like memory.
What remains is still the step-size, which needs to be pre-determined. Wouldn’t it be nice to have an adaptive step-size? That’s where AdaGrad enters the picture: it basically uses the inverse of the root of the sum of squared past gradients as a proxy for the step-size. The problem is that it treats the first gradient the same as the most recent one, which seems unfair. RMSProp makes the adjustment by using an exponentially weighted average, so that more recent gradients are weighted preferentially.
Adam (Kingma and Ba 2015, “Adam: A method for stochastic optimization,” ICLR 2015) is basically a combination of these two ideas, momentum and an adaptive step-size, plus a bias-correction term. (I didn’t think much of the bias correction, but supposedly it’s a pretty big deal in practice.) One of its key properties is that it is scale-invariant, meaning that if you multiply all the gradients by some constant, the update doesn’t change.
One interpretation proffered in the paper is that it’s like a signal-to-noise ratio: you have the first moment against the raw (uncentered) second moment. Essentially, the direction of movement is a normalized average of past gradients, in such a way that the more variability in your gradient estimates, the more uncertain the algorithm is, and so the smaller the step size.
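The update itself is short enough to sketch (a minimal, non-authoritative version of the standard update, with default-style hyperparameters), and it makes the scale invariance concrete: rescaling every gradient by \(c\) scales \(\hat{m}\) by \(c\) and \(\sqrt{\hat{v}}\) by \(c\) as well, so (up to \(\epsilon\)) the step is unchanged.

```python
import numpy as np

def adam_steps(grads, alpha=0.001, b1=0.9, b2=0.999, eps=1e-12):
    """Return the sequence of Adam parameter updates for a list of gradients."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # momentum: first moment estimate
        v = b2 * v + (1 - b2) * g ** 2     # adaptivity: second moment estimate
        m_hat = m / (1 - b1 ** t)          # bias corrections
        v_hat = v / (1 - b2 ** t)
        steps.append(-alpha * m_hat / (np.sqrt(v_hat) + eps))
    return steps

rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(10)]
steps = adam_steps(grads)
scaled = adam_steps([100.0 * g for g in grads])  # rescale every gradient
# Up to eps, the two update sequences are identical: scale invariance.
```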
In any case, this damn thing works amazingly well in practice. The thing is that we just don’t really understand too much about why it does so well. You have a convergence result in the original paper, but I’m no longer that interested in convergence results. I want to know about #implicit_regularization.
Adam + Ratio
Let’s consider a simple test-bed to hopefully elicit some insight. In the context of the [[project-penalty]]: we want to understand why Adam + our penalty is able to recover the matrix. In our paper, we analyze the gradient of the ratio penalty, and give some heuristics for why it might be better than just the nuclear norm. However, all this analysis breaks down when you move to Adam, because the squared gradient term is applied element-wise.
This is why Arora et al. (2019, “Implicit Regularization in Deep Matrix Factorization,” http://arxiv.org/abs/1905.13655v3) work with gradient descent: things behave nicely, you are able to deal with the singular values directly, and everything becomes a matrix operation. It is only natural to try to extend this to Adam, except my feeling is that you can’t really do that. Since we’re now perturbing things element-wise, this basically breaks all the nice linear structure. That doesn’t mean everything breaks down, but simply that we can’t resort to a concatenation of simple linear operations. (Though, as we’ll see below, breaking the linear structure might be exactly why it’s good.)
It’s almost like there are two invariances going on: one keeps the observed entries invariant, while the other keeps the singular vectors invariant.
One conjecture as to why Adam might be better is: due to the adaptive learning rate, what it’s actually doing is also relaxing the column space invariance. But this is really just to explain why GD and even momentum are unable to succeed.
A more concrete conjecture: what we’re mimicking is some form of alternating minimization procedure. It would be great if we could show that we’re basically moving around, reducing the rank of the matrix slowly, while staying in the vicinity of the constraint set.
Intuition
Let’s start with some intuition before diving into some theoretical justifications. If we take the extreme case of the solution path having to be exactly in our constraint set, then we’d be doomed. But at the same time, there’s really not much point in your venturing too far away from this set. So perhaps what’s going on is that you’ve relaxed your set (a little like the noisy case), and you can now travel within this expanded set of matrices. Or, it’s more like you’re travelling in and out of this set. I think either way is a close enough approximation of what’s going on, and so it really depends on which provides a good theoretical testbed.
Now, in both our penalty and the standard nuclear norm penalty, the gradients lie in the span of \(W\), which highly restricts the movement direction. One might be able to show that if one were constrained as above, and only able to move within the span, then this does not give enough flexibility. Part of the point is that the span of the initial \(W\), with its zero entries, is clearly far from the ground truth, so you really want to get away from that span.
Generalizability
One of the problems here is that matrix completion is a simple but also fairly artificial setting. One might ask how generalizable the things we learn in this context are to the wider setting. For one thing, it’s very unusual that you can initialize by essentially fitting the training data exactly, though it turns out that this is okay here. This probably breaks down once you move to noisy matrix completion, but it’s unclear if there’s a simple fix for that.
Secondly, matrix completion is a linear problem, and a lot of the reason the standard methods might be failing is that they don’t break the linearity. But once you move to a non-linear setting, we might be on equal footing with everything else. For instance, once we start overparameterizing (DLNN), the linearity also breaks down, freeing gradient descent from the span of \(W\).
Backlinks
- [[implicit-regularization]]
- I really think that one of the lessons of deep learning is the [[the-unreasonable-effectiveness-of-adam]] (or normalization). That is, you should always pick normalized terms over convex terms, because convexity is overrated.
Discretization of Gradient Flow
Via [[michael-jordan-plenary-talk]], to this paper on arXiv.
You have gradient descent, which for convex problems can be shown to have a convergence rate of \(\frac{1}{t}\), whereas Nesterov’s accelerated gradient method gets \(\frac{1}{t^2}\). (And this rate is entirely independent of the ambient dimension of the space.)
If you take the limit of the step sizes to zero, then you get some kind of differential equation: this is gradient flow. It turns out you can basically construct a class of Bregman Lagrangians, which essentially encapsulates all the gradient methods.
You can solve the Lagrangians for a particular rate, and then out pops a differential equation that obtains that rate in continuous time. What’s curious is that the path is identical across all these ODEs. Essentially you’re getting path-independence from the rate, which suggests that this method has found an optimal path, and you can essentially tweak how fast you want to go along it.
This would suggest that you could then get arbitrary rates for your gradient method. But it turns out that the discretization step is where things break. In fact, Nesterov already has a lower bound of \(\frac{1}{t^2}\) (for the class of gradient methods, which I think are those with access to all past gradients), so we know it can’t do arbitrarily well. And it turns out that the accelerated method matches the lower bound. The intuition is that the discretization suffers with curvature: if you go too quickly, then you’re not going to be able to read the curvature well enough.
In other words, discretization is non-trivially different to continuous time. Which sort of makes sense, since in continuous time you have basically all the information.
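A quick numerical sanity check of the \(\frac{1}{t}\) vs \(\frac{1}{t^2}\) gap (a sketch on an illustrative ill-conditioned quadratic; the momentum schedule \((k-1)/(k+2)\) is the standard choice for smooth convex problems):

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T A x, with condition number 100.
A = np.diag([1.0, 100.0])
L = 100.0                          # smoothness constant (largest eigenvalue)
x0 = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

T = 200

x = x0.copy()                      # plain gradient descent, step 1/L
for _ in range(T):
    x = x - grad(x) / L
gd_err = f(x)

x, x_prev = x0.copy(), x0.copy()   # Nesterov's accelerated method
for k in range(1, T + 1):
    y = x + (k - 1) / (k + 2) * (x - x_prev)
    x_prev = x
    x = y - grad(y) / L
nesterov_err = f(x)
```

After the same number of steps, the accelerated method sits much closer to the optimum than plain GD.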
Finally, in relation to the [[project-penalty]], this doesn’t actually work for things like Adam, which we know to be amazing in practice. So, it seems like there’s still work left to understand why on earth the adaptive learning rates work so well.
Backlinks
- [[michael-jordan-plenary-talk]]
- [[discretization-of-gradient-flow]]