Experimental Logs for MovieLens
tags: [ proj:penalty , experiments , matrix_completion ]
Summary: many of the interesting patterns/phenomena that we see in the noiseless MC case (synthetic data) do not translate when we move to real-life data.[1] What's interesting is that part of me wants to argue like a theoretician and say that it doesn't really matter that it doesn't work on real-life problems. This is slightly different from how theoretical statistics works, because there the statistical models are usually motivated by solving some real-world problem (though again, the same could be said of collaborative filtering). But somehow this feels a little different, since low-rank matrix completion is a much more general problem than the optimal estimation rates of a particular statistical model. Besides, we don't actually make any claims about the noisy model. The simple answer is either that noise is equalizing and makes the problem almost trivial (?), or that the data isn't actually that low-rank, so running truly low-rank algorithms is moot.
Data
The data here is the MovieLens 100K. The matrix is \(943 \times 1682\) (users by movies), so in fact we are only given around 6% of the entries. Thus, the 80% split is actually a low-data regime: the training set covers only around 5% of the full matrix, and the remaining ~94% is never observed at all.
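The back-of-the-envelope arithmetic, using the actual ML-100K dimensions (943 users by 1682 movies, 100,000 ratings):

```python
# Density of MovieLens 100K and the fraction of the full matrix
# actually seen by the 80% train split.
n_users, n_movies = 943, 1682
n_ratings = 100_000

density = n_ratings / (n_users * n_movies)   # fraction of entries observed at all
train_frac = 0.8 * density                   # fraction seen after the 80/20 split

print(f"observed fraction of full matrix: {density:.3f}")   # ~0.063
print(f"fraction seen by the train split: {train_frac:.3f}")  # ~0.050
```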
The question becomes: what is the most appropriate model for real-life data?
The most natural model would simply be low-rank matrix + noise.[2] A recent paper, “Why Are Big Data Matrices Approximately Low Rank?”, argues that many big-data matrices that have some sort of latent representation are close to a low-rank representation (+ noise). Or, perhaps the noise is …
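A minimal sketch of that generative model, at ML-100K-like dimensions. The rank \(r\) and noise level \(\sigma\) here are arbitrary illustrative choices, not anything estimated from the data; the point is just that the spectrum of \(UV^\top + E\) drops sharply after \(r\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "low rank + noise" model: M = U V^T + E,
# with U, V tall-and-thin and E i.i.d. Gaussian.
n, m, r = 943, 1682, 10   # ML-100K-sized; rank chosen arbitrarily
sigma = 0.5               # noise level (assumption)

U = rng.normal(size=(n, r)) / np.sqrt(r)
V = rng.normal(size=(m, r)) / np.sqrt(r)
M = U @ V.T + sigma * rng.normal(size=(n, m))

# Spectrum check: singular values should drop sharply after index r.
s = np.linalg.svd(M, compute_uv=False)
print(s[r - 1], s[r])   # clear gap between signal and noise
```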
One might think of a more general model where the ratings come from some non-linear map applied to the latent factors (which I believe is what motivates Neural Collaborative Filtering). Though in this case, I wonder whether the non-linear activation functions would actually help.
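Concretely, the more general model replaces the inner product \(\langle u_i, v_j \rangle\) with \(f([u_i; v_j])\) for a learned non-linear \(f\), as in NCF. A forward-pass sketch with random (untrained) weights, purely to fix the shape of the model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rating(i, j) = f([u_i; v_j]) for a small MLP f.
# Everything here is illustrative: random weights, no training.
d = 8                              # embedding dimension (assumption)
u_i = rng.normal(size=d)           # user embedding
v_j = rng.normal(size=d)           # item embedding

W1 = rng.normal(size=(16, 2 * d))  # hidden layer
w2 = rng.normal(size=16)           # output layer

h = np.maximum(0.0, W1 @ np.concatenate([u_i, v_j]))  # ReLU non-linearity
rating = w2 @ h
print(rating)
```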
Logs
Under the Full model: the standard 80/20 train/test split that everyone uses.
- Obs: generally speaking, the ordering of depths by test performance is: \(2, 1, 3, 4, \ldots\)
- focusing on depth-1: this definitively refutes the conjecture that depth + GD is exactly equivalent to linear model + Adam + penalty.[3] While I never believed it to be exactly equivalent, this result suggests that something crucial is actually missing from the Adam version.
- what's weird here is that depth 2 is the best, followed by depth 1. though I guess on the synthetic data the only difference is that 2 and 1 are switched (the question is whether the same holds for GD).
- Obs: un-penalized vs penalized changes the effective rank, but doesn’t change the test error.
- this again is a weird one. firstly, it's weird that the penalty has basically no effect; it's essentially the depth (and the optimization method) that drives the generalization error.
- also, what we see is that effective rank has very little relation to generalization power.
- though, this is perhaps unsurprising for real-life data. that is why I like working with synthetic data, where there is a direct relation.
- that relation gets broken when you go overboard with the rank reduction at the expense of fitting the data.
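For reference, the depth-\(L\) object these logs compare can be sketched as follows: parametrize \(X = W_L \cdots W_1\) and run plain GD on the squared loss over observed entries. This is my own minimal sketch on a tiny synthetic matrix, not the actual experiment code; all sizes and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

n, depth, lr, steps = 20, 2, 0.01, 5000

# Ground truth: rank-2 matrix, ~50% of entries observed.
A = rng.normal(size=(n, 2)) @ rng.normal(size=(2, n))
mask = rng.random((n, n)) < 0.5

# Near-identity small init, as is common for deep matrix factorization.
Ws = [0.1 * np.eye(n) + 0.01 * rng.normal(size=(n, n)) for _ in range(depth)]

def product(Ws):
    # X = W_L ... W_2 W_1 (Ws[0] applied first).
    X = Ws[0]
    for W in Ws[1:]:
        X = W @ X
    return X

init_err = np.linalg.norm(mask * (product(Ws) - A))

for _ in range(steps):
    G = mask * (product(Ws) - A)   # grad of 0.5*||mask*(X-A)||^2 w.r.t. X
    grads = []
    for i in range(depth):
        # chain rule: dL/dW_i = (W_L...W_{i+1})^T G (W_{i-1}...W_1)^T
        left = product(Ws[i + 1:]) if i + 1 < depth else np.eye(n)
        right = product(Ws[:i]) if i > 0 else np.eye(n)
        grads.append(left.T @ G @ right.T)
    for i in range(depth):
        Ws[i] -= lr * grads[i]

train_err = np.linalg.norm(mask * (product(Ws) - A))
print(init_err, train_err)   # train error drops well below the initial error
```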
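On the effective-rank observations above: one standard definition is the entropy-based one of Roy & Vetterli (the exponential of the entropy of the normalized singular values); whether the logs used exactly this definition is an assumption on my part. A quick sketch of it, and of how small noise inflates it:

```python
import numpy as np

def effective_rank(X):
    # exp(entropy of the normalized singular-value distribution)
    # (Roy & Vetterli's definition; an assumption that this matches the logs).
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(3)
# A truly rank-2 matrix vs. the same matrix plus small noise:
A = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 50))
print(effective_rank(A))                                     # at most 2
print(effective_rank(A + 0.1 * rng.normal(size=(50, 50))))   # noticeably larger
```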