#interpolation

Does Learning Require Memorization?

src: (Feldman ⊕2019Feldman, Vitaly. 2019. “Does Learning Require Memorization? A Short Tale about a Long Tail.” arXiv.org, June. http://arxiv.org/abs/1906.05271v3.)

A different take on the interpolation/memorization conundrum.

The key empirical fact that this paper rests on is this notion of data’s long-tail.11 This was all the rage back in the day, with books written about it. Though that was more about #economics and how the internet was making it possible for things in the long tail to survive. Formally, we can break a class down into subpopulations (say species22 Though they don’t have to be some explicit human-defined category.), corresponding to a mixture distribution with decaying coefficients. The point is that this distribution follows a #power_law distribution.

Now consider a sample from this distribution (which will be our training data). You essentially have three regimes:

Popular: there’s a lot of data here, you don’t need to memorize, as you can take advantage of the law of large numbers.
Extreme Outliers: this is where the actual population itself is already incredibly rare, so it doesn’t really matter if you get these right, since these are so uncommon.
Middle ground: this is middle ground, where you still might only get one sample from this subpopulation, but it’s just common enough (and there are enough of them) that you want to be right. And since they’re uncommon, you basically only have one copy anyway, so your best choice is to memorize.

Key: a priori you don’t know if your samples are from the outlier, or from the middle ground. So you might as well just memorize.33 What you have is that actually, what you don’t mind is selective memorization. Though it’s probably too much work to have two regimes, so just memorize everything.

My general feeling is that there is probably something here, but it feels a little too on the nose. It basically reduces the power of deep learning to learning subclasses well, when I think it’s more about the amalgum of the whole thing.

Relation to DP

Here’s an interesting relation to #differential_privacy. One of the motivations for this paper is that DP models (DP implies you can’t memorize) fail to achive SOTA results for the same problem as these memorizing solutions. If you look at how these DP models fail, you see that they fail on the exact class of problems as those proposed here, i.e. it cannot memorize the tail of the mixture distribution. This is definitely something to keep in mind for [[project-interpolation]].

Next Steps for Interpolation

Let’s record the kinds of experimental data that we are interested in for [[project-interpolation]].

Replicating what’s been shown in the literature

In the Zhang et al. (⊕2019Zhang, Chiyuan, Benjamin Recht, Samy Bengio, Moritz Hardt, and Oriol Vinyals. 2019. “Understanding deep learning requires rethinking generalization.” In 5th International Conference on Learning Representations, Iclr 2017 - Conference Track Proceedings. University of California, Berkeley, Berkeley, United States.) paper, they show that if you randomize the training data labels (in a classification problem), then you can still get zero training loss.
- the point here is that the classical ideas of understanding the performance of machine learning methods (VC dimension) break down11 Though, I always get confused about this. Probably should reread one of the blog posts of Aroras..
In some more recent work (Belkin et al. ⊕2019Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling modern machine-learning practice and the classical biasvariance trade-off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54.), it was cited that if you randomize some of the training labels, you can basically still get the same generalization performance.
- unclear to me if these two results are the same.
- I vaguely remember there was some place that talked about the relevance to #robustness or #adversarial_training.
establish that for standard problems (like CIFAR/MNIST/*), if you perturb the data (some percentage, or be clever about picking the points), then you’re going to maintain test performance.

New Ideas

We conjecture that what’s going on is that there
- show that this