202005100105
Project Interpolation
tags: [ proj:interpolation , _paper ]
- classical statistics
- bias/variance trade-off
- model complexity
- overfitting
- general principles:
- you have a model whose complexity you can tune: the classic case of this is knn
- I actually think that knn is problematic (though, I would have said the same thing about those nonparametric kernel regression methods)
- essentially, the model “complexity” here corresponds to how “local” the fit is
- in other words, complexity here can mean many things (maybe?)
- perhaps the essence of the problem is that the model admits multiple solutions that all fit the training data
- though, again, that’s refuted by the kernel regression example, right?
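The “complexity = locality” point can be made concrete with a quick sketch (my own setup, not from any paper): kNN regression on noisy 1-D data, where shrinking \(k\) makes the fit more local and visibly more jagged.

```python
# Sketch (assumed setup): kNN regression on 1-D synthetic data,
# showing that small k ("more local") gives a more complex, wigglier fit.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 40))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 40)

def knn_predict(x, k):
    """Predict at points x by averaging the k nearest training targets."""
    dists = np.abs(x[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

x_grid = np.linspace(0, 1, 200)
wiggly = knn_predict(x_grid, k=1)    # interpolates every training point
smooth = knn_predict(x_grid, k=20)   # averages over half the data

# total variation as a crude "jaggedness" measure: k=1 is far wigglier
print(np.abs(np.diff(wiggly)).sum(), np.abs(np.diff(smooth)).sum())
```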
- intuition: you don’t want to go overboard with the model complexity, because at that point, you’ve overfit to the “training data”
- ah right, there’s this other notion, of training/test and generalization:
- this is a more difficult concept to pin down. basically, we assume the data comes from some “distribution”, so the sample we actually see is variable
- if you could assume that the data you’re given is always correct (noise-free), then perhaps overfitting matters less
- though, even then, you still worry about overfitting in the sense that you don’t want to overfit on the training data (even aside from the noise problem)
- let’s think of some simulations:
- if you have noise, then you don’t really want to be overfitting
- but even if you don’t have noise, you can still be worried about overfitting (is that true?)
- that is, we usually like to assume some sort of smoothness to the function, which is really just an aesthetic (or an appeal to Occam’s Razor)
- but perhaps the act of overfitting to the noise-less case will produce something a little too jagged
- so we have training and test data, and the goal is to do better on the test data. even without looking at the test data, the intuition is that if you’ve gotten zero training error, then you’re probably doing something wrong.
- one might think that if there is noise, then necessarily this must be true, since you’ve fit to the noise
- but it seems like this depends on the flexibility of the model: the model might be fitting the noise, but in a very “weird” way, in that it localizes the effect of each training point
- and if there is no noise, then you’re probably fine.
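A minimal version of the simulation suggested above (the setup and constants are my own choices): with label noise, 1-NN reaches zero training error by memorizing the noise, while a moderately smoothed fit does better on held-out data.

```python
# Sketch: with label noise, 1-NN gets zero training error but a larger
# test error than a moderately smoothed kNN fit.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, noise, n)

x_tr, y_tr = make_data(100)
x_te, y_te = make_data(500)

def knn_mse(x_query, y_query, k):
    """Mean squared error of kNN predictions at the query points."""
    dists = np.abs(x_query[:, None] - x_tr[None, :])
    idx = np.argsort(dists, axis=1)[:, :k]
    pred = y_tr[idx].mean(axis=1)
    return np.mean((pred - y_query) ** 2)

train_err_1 = knn_mse(x_tr, y_tr, k=1)   # exactly 0: memorizes the noise
test_err_1 = knn_mse(x_te, y_te, k=1)
test_err_9 = knn_mse(x_te, y_te, k=9)
print(train_err_1, test_err_1, test_err_9)
```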
- theory
- as we talk about in class, you can decompose the expected prediction error at a point \(x\) (wiki) as \(\mathbb{E}[(y - \hat f(x))^2] = \sigma^2 + \mathrm{Bias}[\hat f(x)]^2 + \mathrm{Var}[\hat f(x)]\), i.e. in terms of
- irreducible error (the noise in the data) \(\sigma^2\)
- squared bias \(\mathrm{Bias}[\hat f(x)]^2\)
- variance \(\mathrm{Var}[\hat f(x)]\)
- and so what I’m claiming is that things look very different depending on the \(x\) we’re considering
- for \(x\) in the neighborhood of training points
- we’re going to have low bias and high variance
- but for \(x\) outside of those neighborhoods
- we’re going to have the “optimal” trade-off (maybe?)
- in this way, the decomposition doesn’t need to be thrown away; it just becomes pointwise/adaptive
- this would let you get zero training error while still having a near-optimal trade-off away from the training points
- alternative: paper (reviews) speculates that this is actually not a trade-off #paper
- as you increase the width of neural networks, you get a reduction in both bias and variance
- I guess there isn’t any reason why these things must be opposed?
- in principle, if your data is pretty straightforward, for instance,
- something weird about all this is that, in my view, the irreducible error should somehow come into play here. as in, the generalizability of the model should be a function of the irreducible error?
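The pointwise decomposition above can be checked numerically. Here is a hedged Monte Carlo sketch (all names and constants are my own choices) that estimates the bias, variance, and direct prediction error of a kNN fit at a single point \(x_0\) by resampling training sets.

```python
# Sketch: Monte Carlo check of the pointwise decomposition
#   E[(y - fhat(x0))^2] = sigma^2 + Bias[fhat(x0)]^2 + Var[fhat(x0)]
# for kNN regression, with an assumed sin() ground truth.
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3
f = lambda x: np.sin(2 * np.pi * x)

def fit_predict(x0, k, n=50):
    """Draw a fresh training set and return the kNN prediction at x0."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

x0, k = 0.25, 5
preds = np.array([fit_predict(x0, k) for _ in range(2000)])
bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
var = preds.var()                     # variance of the fit at x0

# direct estimate of the expected prediction error at x0:
# fresh noisy labels paired with independent fresh fits
ys = f(x0) + rng.normal(0, sigma, 2000)
direct = np.mean((ys - preds) ** 2)

print(bias2, var, sigma**2 + bias2 + var, direct)
```

The two numbers printed last should agree up to Monte Carlo error, which is the content of the decomposition.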
- classification vs regression (and the margin):
Backlinks
- [[next-steps-for-interpolation]]
- Let’s record the kinds of experimental data that we are interested in for [[project-interpolation]].
- [[matrix-completion-experiments]]
- this relates (nicely) to [[project-interpolation]]. here, we are dealing with depth of 2, and so one could argue that we’re in the regime where we’re just able to interpolate, but not well enough, so you can’t get nice generalization.
- but it feels like matrix completion is a very particular problem, and since we’re restricting ourselves to just linear neural networks, I don’t expect to see the same kind of double-dipping behaviour.
- [[master-paper-list]]
- [[project-interpolation]]
- [[does-learning-require-memorization]]
- Here’s an interesting relation to #differential_privacy. One of the motivations for this paper is that DP models (DP implies you can’t memorize) fail to achieve SOTA results on the same problems as these memorizing solutions. If you look at how these DP models fail, you see that they fail on exactly the class of problems described here, i.e. they cannot memorize the tail of the mixture distribution. This is definitely something to keep in mind for [[project-interpolation]].