#transformers
Episode with Ilya
source: AI Podcast
- the recent breakthrough for NLP has been #transformers. it turns out the key contribution is not just the notion of #attention, but that it removes the sequential dependency of RNNs, which allows for much faster training (every position in the sequence can be processed in parallel, which maps well onto GPUs)
- which, at a higher level, simply means that if you can train larger versions of these deep learning models, they will oftentimes just do better
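a minimal numpy sketch of that contrast (toy dimensions and random weights, not anything from the episode): the RNN needs a loop where each hidden state waits on the previous one, while self-attention processes every position in one matrix product.

```python
import numpy as np

# Hypothetical toy setup: one short sequence of token embeddings.
seq_len, d = 6, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d))

# RNN-style processing: an inherent loop over time steps --
# step t cannot start until step t-1 has finished.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)  # sequential dependency
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention: every position attends to every other position
# in a few batched matrix products -- no loop over time, so the
# whole sequence is computed at once (GPU-friendly).
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                                  # (seq_len, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attn_out = weights @ V                                         # all positions at once

print(rnn_states.shape, attn_out.shape)  # same output shape, very different dependency structure
```

both produce one state per position, but only the attention path is loop-free over the sequence, which is the property that makes large-scale training fast.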
- he makes a claim that when training a #language_model with an LSTM, if you increase the size of the hidden layer, you might eventually get a hidden node that acts as a sentiment node. in other words, at first the model captures lower-level features of the data, but as you increase its capacity, it becomes able to capture higher-level concepts, and it does so naturally, just from the added capacity.
- which essentially argues for just having larger and larger models, as you’ll just get more emergent behaviour
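the sentiment-node claim is testable: correlate each hidden unit with sentiment labels and see whether a single unit already tracks sentiment. a hypothetical sketch with fake data standing in for real LSTM activations (unit 7 is artificially planted as the sentiment-like unit, just to make the probe visible):

```python
import numpy as np

rng = np.random.default_rng(1)
n_reviews, hidden_size = 200, 32

# Fake stand-ins for real model activations and labels.
labels = rng.integers(0, 2, n_reviews)            # 0 = negative, 1 = positive
hidden = rng.standard_normal((n_reviews, hidden_size))
hidden[:, 7] += 2.0 * (labels - 0.5)              # plant unit 7 as "sentiment-like"

# Pearson correlation of each hidden unit with the labels;
# a correlation near |1| means that one unit alone is a usable
# sentiment detector.
centered_h = hidden - hidden.mean(axis=0)
centered_y = labels - labels.mean()
corr = (centered_h * centered_y[:, None]).sum(axis=0) / (
    np.linalg.norm(centered_h, axis=0) * np.linalg.norm(centered_y)
)
best = int(np.abs(corr).argmax())
print(best, round(float(abs(corr[best])), 2))     # the planted unit should stand out
```

on real activations you would feed actual review text through the trained model instead of planting a unit, but the probe itself is the same one-line correlation.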