#regularization
Implicit Self-Regularization in Deep Neural Networks
src: arXiv
- defines the Generalization Gap: DNNs trained with smaller batch sizes generalize better than the same DNNs trained with larger batches
- i.e. SGD good
- characterize the various regimes (phases) of the eigenvalue distribution (ESD) of the layer weight matrices
- compared against Wigner's Semi-Circle Law, the limiting distribution of eigenvalues of a random matrix with i.i.d. entries
- if the elements of the random matrix are correlated, you don’t get the same behavior, but something more heavy-tailed
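The bulk-vs-heavy-tail contrast above can be seen numerically. A minimal sketch (my own toy example, not from the paper): a symmetric matrix with i.i.d. Gaussian entries has eigenvalues confined to the semicircle bulk, while Pareto-distributed (heavy-tailed) entries produce eigenvalues that escape far beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Wigner case: i.i.d. Gaussian entries, symmetrized and normalized so that
# the semicircle law puts (almost) all eigenvalues in [-2, 2].
A = rng.normal(size=(N, N))
W = (A + A.T) / np.sqrt(2 * N)
eigs_wigner = np.linalg.eigvalsh(W)

# Heavy-tailed case: i.i.d. Pareto entries (random signs for symmetry).
# The variance is infinite for alpha <= 2, the semicircle law no longer
# applies, and large eigenvalues escape the bulk.
alpha = 1.5
P = rng.pareto(alpha, size=(N, N)) * rng.choice([-1.0, 1.0], size=(N, N))
H = (P + P.T) / np.sqrt(2 * N)
eigs_heavy = np.linalg.eigvalsh(H)

print(f"Wigner:       max |eig| = {np.abs(eigs_wigner).max():.2f}")
print(f"heavy-tailed: max |eig| = {np.abs(eigs_heavy).max():.2f}")
```

(Strongly *correlated* entries, as in trained weight matrices, also break the semicircle behavior; i.i.d. heavy-tailed entries are just the simplest way to reproduce the heavy-tailed spectrum here.)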
- capacity-control
- use some form of “rank” to describe the capacity of the learned DNN
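One concrete "form of rank" that fits this idea is the stable (numerical) rank, ||W||_F^2 / ||W||_2^2 — a smooth proxy that down-weights small singular values. A sketch (my illustration, not the paper's exact metric): a pure-noise matrix has a large stable rank, while a low-rank-signal-plus-noise matrix has a much smaller one.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(0)
n = 200

# Pure-noise Gaussian matrix: stable rank is roughly n/4.
noise = rng.normal(size=(n, n)) / np.sqrt(n)

# Rank-5 signal plus small noise: a few dominant singular values
# pull the stable rank way down.
U = rng.normal(size=(n, 5))
V = rng.normal(size=(n, 5))
signal = U @ V.T / n + 0.1 * noise

print(f"noise stable rank:  {stable_rank(noise):.1f}")
print(f"signal stable rank: {stable_rank(signal):.1f}")
```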
- findings:
- as you decrease batch-size, you get different behavior of the singular values, more “regularized”, corresponding to more “correlated” random-matrices
- what they roughly show is that, during training, you’re effectively reducing the rank
- more precisely, they go through 5 (+1) phases of the eigenvalue distribution, from Random-like through Bulk+Spikes to Heavy-Tailed
- and this translates into a statement about rank (in the heavy-tailed phase, eigenvalues escape the main bulk of the distribution)
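The phases are diagnosed from the ESD of the correlation matrix X = WᵀW/N and, in the heavy-tailed phase, from the fitted tail exponent. A rough numpy-only sketch (the paper fits power laws more carefully; the Hill estimator here is my stand-in): a Gaussian weight matrix gives a bounded Marchenko-Pastur-like bulk (large apparent exponent), while heavy-tailed weights give a power-law ESD tail (small exponent).

```python
import numpy as np

def esd(W: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix X = W^T W / N."""
    N = W.shape[0]
    return np.linalg.eigvalsh(W.T @ W / N)

def hill_alpha(eigs: np.ndarray, k: int = 20) -> float:
    """Hill estimator of the tail exponent from the k largest eigenvalues.
    Small values indicate a heavy (power-law) tail."""
    x = np.sort(eigs)[::-1][: k + 1]          # top k+1 eigenvalues, descending
    return k / np.sum(np.log(x[:k] / x[k]))

rng = np.random.default_rng(0)
N, M = 1000, 300

# "Random-like" phase: i.i.d. Gaussian weights, bounded bulk.
W_rand = rng.normal(size=(N, M))
# Simulated "heavy-tailed" phase: Pareto weights with random signs.
W_heavy = rng.pareto(1.5, size=(N, M)) * rng.choice([-1.0, 1.0], size=(N, M))

alpha_rand = hill_alpha(esd(W_rand))
alpha_heavy = hill_alpha(esd(W_heavy))
print(f"random-like tail exponent:  {alpha_rand:.2f}")
print(f"heavy-tailed tail exponent: {alpha_heavy:.2f}")
```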
- implications:
- with smaller batch sizes, training is able to detect more correlations (?), and as a result you get the full heavy-tailed distribution
- importantly, that leads to better (smaller effective) rank (?). this must have something to do with rank.
- The obvious mechanism is that, by training with smaller batches, the DNN training process is able to “squeeze out” more and more finer-scale correlations from the data, leading to more strongly-correlated models. Large batches, involving averages over many more data points, simply fail to see this very fine-scale structure, and thus they are less able to construct strongly-correlated models characteristic of the Heavy-Tailed phase.
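The "squeeze out finer-scale correlations" mechanism has a simple noise-scale reading, which can be sketched in a toy model (my illustration, not an experiment from the paper): suppose a fine-scale correlation shows up in only 1% of examples, as a gradient component along some direction. The *mean* of that component is the same at any batch size, but its fluctuation relative to the mean shrinks like 1/√(batch size), so small-batch SGD takes occasional large steps along the rare direction while large-batch gradients average it away.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.01           # fraction of examples carrying the fine-scale correlation
n_batches = 2000   # number of minibatch gradients to simulate

def rel_fluctuation(batch_size: int) -> float:
    # Component of the minibatch gradient along the rare direction
    # = fraction of "rare" examples that landed in the batch.
    comp = rng.binomial(batch_size, p, size=n_batches) / batch_size
    return comp.std() / p  # fluctuation relative to the mean signal

f_small = rel_fluctuation(16)
f_large = rel_fluctuation(1024)
print(f"batch 16:   relative fluctuation {f_small:.2f}")
print(f"batch 1024: relative fluctuation {f_large:.2f}")
```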