Effectiveness of Normalized Quantities
tags: [ proj:penalty , matrix_completion , _statistics ]
In the context of matrix completion, it is known that the nuclear norm, being a convex relaxation of the rank,1 You can think of this as the matrix analogue of the \(l_1\)-to-\(l_0\) relaxation. has very nice low-rank recovery properties. However, in sparser regimes (not enough observed entries), the nuclear norm starts to falter. This is because there is too much ambiguity (equivalently, not enough constraints on the convex program).2 Recall that the convex program is something like \(\min \norm{W}_{\star} \text{ s.t. } P_{\Omega}(W^\star) = P_{\Omega}(W)\), where \(P_{\Omega}\) is the operator that extracts the observed entries. Thus, minimizing the nuclear norm doesn’t actually recover the true matrix.
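To pin down the objects involved, here is a minimal numpy sketch of the nuclear norm and the observation operator \(P_{\Omega}\); all the names (`W_true`, `Omega`, etc.) are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 ground-truth matrix and a random observation mask Omega.
W_true = np.outer(rng.standard_normal(5), rng.standard_normal(5))
Omega = rng.random((5, 5)) < 0.6  # roughly 60% of entries observed

def nuclear_norm(W):
    """Sum of singular values: the convex surrogate for rank."""
    return np.linalg.svd(W, compute_uv=False).sum()

def P_Omega(W, Omega):
    """Projection onto the observed entries; everything else is zeroed."""
    return np.where(Omega, W, 0.0)

# The constraint of the convex program: a candidate W must agree with
# W_true on the observed entries. Zero-filling the unobserved entries
# is a trivially feasible (but generally high-nuclear-norm) candidate.
W_candidate = P_Omega(W_true, Omega)
assert np.allclose(P_Omega(W_candidate, Omega), P_Omega(W_true, Omega))
```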
However, it turns out that simply normalizing it works wonders. Consider the penalty, \(L_{\star/F}\), which is given by
\[ \begin{align*} L_{\star/F} = \frac{\norm{W}_{\star}}{\norm{W}_F} = \frac{ \sum_i \sigma_i }{ \sqrt{ \sum_i \sigma_i^2 } }, \end{align*} \]
where \(\sigma_i\) are the singular values of \(W\). The idea here is that we’re interested in the relative sizes of the singular values, not their absolute sizes. Using the raw nuclear norm as a penalty means you can satisfy it by simply shrinking all the entries of your matrix toward zero.
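A quick numpy sketch of this scale invariance: rescaling \(W\) leaves \(L_{\star/F}\) unchanged, whereas the plain nuclear norm can be driven down by shrinking the matrix. The function name is mine, not from the source.

```python
import numpy as np

def l_star_over_f(W):
    """Ratio of nuclear norm to Frobenius norm: sum(sigma) / sqrt(sum(sigma^2))."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.sum() / np.sqrt((s ** 2).sum())

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))

# Rescaling W leaves the penalty untouched: only the *relative* sizes
# of the singular values matter.
assert np.isclose(l_star_over_f(W), l_star_over_f(10.0 * W))

# The ratio is 1 for a rank-1 matrix and sqrt(min(m, n)) when all
# singular values are equal, so it tracks an "effective rank".
assert np.isclose(l_star_over_f(np.outer([1, 2], [3, 4])), 1.0)
```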
Compare this to the Adam optimizer, which, in a nutshell, uses a normalized gradient sequence. That is, the update step is given by3 Granted, I am hiding some details: the summation is actually weighted to prioritize the most recent gradients, the division is element-wise, and there are some computational tricks to avoid division by zero.
\[ \begin{align*} \frac{ \sum_t g_t }{\sqrt{\sum_t g_t^2}}, \end{align*} \]
where \(g_t\) is the gradient of the loss at step \(t\).
Remark. Experiments show that the denominator matters more than the numerator. So what really seems to be important is that you’re normalizing each gradient coordinate by the magnitude of the gradient. I feel like there’s something statistical about all this that we might be able to learn.
I tried to do something with the Hadamard product, but my feeling is that this cannot be captured through matrix operations alone.
The reason I bring up these two quantities is that, in each setting, only these two methods produce something interesting. It could very well be a coincidence that they have a similar form, but I suspect that normalized quantities are somehow superior in the deep learning regime.
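A stripped-down sketch of this normalized update, assuming an unweighted sum in place of Adam's exponential moving averages (the function name and data are illustrative):

```python
import numpy as np

def normalized_update(grads, eps=1e-8):
    """Simplified, unweighted version of the Adam direction:
    (sum of gradients) / sqrt(sum of squared gradients), element-wise.
    Real Adam replaces both sums with exponential moving averages and
    adds bias correction; those details are deliberately omitted here.
    """
    g = np.asarray(grads, dtype=float)
    return g.sum(axis=0) / (np.sqrt((g ** 2).sum(axis=0)) + eps)

# Each row is one gradient g_t for a two-parameter model; the second
# coordinate has raw gradients ~50x larger than the first.
grads = [[0.1, -4.0], [0.2, -5.0], [0.1, -6.0]]
step = normalized_update(grads)
# After normalization both coordinates are O(1): each parameter is
# scaled by its own gradient magnitude, as described above.
```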
Parallels
Originally I thought this felt a lot like our \(Z\)-scores in #statistics, but upon further inspection I realized that this would only hold if you assume zero mean, since the entries aren’t actually centered (and there’s no \(\sqrt{n}\)). The problem is that all the singular values are positive, so it seems sort of weird to call this a \(Z\)-score under the null hypothesis that the entries were drawn from a zero-mean normal distribution.
Backlinks
- [[project-penalty]]
- As it turns out, there is an explicit penalty. And it’s given by the ratio of the nuclear norm and the Frobenius norm (see [[effectiveness-of-normalized-quantities]]).
- [[exponential-learning-rates]]
- Two key properties of SOTA nets: normalization of parameters within layers (Batch Norm); and weight decay (i.e., the \(l_2\) regularizer). For some reason I never thought of BN as falling in the category of normalizations, à la [[effectiveness-of-normalized-quantities]].
- [[fractals]] Scale-free, or scale-invariance. If you’re given a picture of some rock, then you can’t tell what scale it is at, since you have self-similarity across scaling. Hence you need that reference scale on the bottom right. I wonder if this relates to scale-invariance (à la [[effectiveness-of-normalized-quantities]]).