202005120105
Extending the Penalty
tags: [ proj:penalty ]
Question: how to extend the \(L_1/L_2\) ratio penalty to more general settings?
- currently, in the case of matrix completion, the penalty is straightforward, since we have a product matrix already as our final goal [[the-form-of-the-loss-in-matrix-completion]].
- in more general (or typical) settings, you have the weight matrices (plus bias vectors),
- currently, you can add penalties to the weight matrices themselves, but this seems a little crude
- the parallel for our matrix completion problem would be if we did it for the individual matrices, not the full product matrix (which doesn’t seem that great)
- let’s take a step back here:
- I feel like you can ask the same question about (Arora et al. 2019, “Implicit Regularization in Deep Matrix Factorization,” http://arxiv.org/abs/1905.13655v3). that is, they want to say something about how depth plays this interesting role where it dampens the singular values, i.e., makes the spectrum sparser
- but if you were to generalize this, then it’s unclear what singular values you’re talking about. in fact, this is probably why they don’t actually generalize. a worry I have is that all this work using matrix completion as the test-bed might not be applicable anywhere else.
- and whatever singular values you want to talk about, well then there’s a matrix, and voilà, we can simply add the regularizer to that matrix, and the equivalence follows
- this is similar to the problem we had when we wanted a trivial extension of the matrix completion problem to allow for deep non-linear networks as opposed to DLNNs (deep linear neural networks).
- there, we simply added back the non-linear transforms between the weights.
- we found that it didn’t really do much in terms of test performance
- seems like, because the problem is linear to begin with, introducing non-linearities doesn’t really give you any gains.
- very recently, a group (Yang, Wen, and Li 2019, “DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures,” http://arxiv.org/abs/1908.09979v2) used a variant of our penalty to sparsify matrices. it’s not quite what we’re interested in – they basically show that using \(L_1/L_2\) rather than a plain \(L_1\) penalty on the weights does better at sparsifying.
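To make the contrast above concrete, here is a minimal numpy sketch of the two options being compared: penalizing the singular values of the end-to-end product matrix versus penalizing each weight matrix individually (the “crude” version). This assumes the penalty is the \(L_1/L_2\) ratio of the singular values (nuclear norm over Frobenius norm); the function names are mine, not from the papers.

```python
import numpy as np

def l1_l2_ratio(s):
    # scale-invariant sparsity measure: ||s||_1 / ||s||_2
    return np.sum(np.abs(s)) / np.linalg.norm(s)

def product_penalty(weights):
    # penalty on the singular values of the full product matrix
    # W = W_L ... W_2 W_1, as in the matrix completion setting
    W = weights[0]
    for Wi in weights[1:]:
        W = Wi @ W
    return l1_l2_ratio(np.linalg.svd(W, compute_uv=False))

def per_layer_penalty(weights):
    # the crude alternative: sum of penalties on individual weight matrices
    return sum(l1_l2_ratio(np.linalg.svd(Wi, compute_uv=False))
               for Wi in weights)

# e.g. a depth-3 linear network with 5x5 weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 5)) for _ in range(3)]
product_penalty(weights), per_layer_penalty(weights)
```

Note the two penalties generally disagree: the per-layer sum says nothing direct about the spectrum of the product, which is exactly why penalizing the individual matrices seems like a poor parallel to the matrix completion case.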
Backlinks
- [[next-steps-for-penalty-paper]]
- See [[extending-the-penalty]].