#reinforcement_learning

GPT-3

paper: arXiv

This is a 175B-parameter language model (~117x larger than GPT-2). It doesn't use a SOTA architecture; the architecture is from 2018: 96 transformer layers with 96 attention heads (see [[transformers]]), where the self-attention somehow encodes the relative position of the words (using some trigonometric functions?). It is trained on a few different text corpora, including CommonCrawl (i.e. a random internet HTML text dump), Wikipedia, and books; the way they determine the pertinence/weight of a source is by upvotes on Reddit (not sure how I feel about that). And it follows the principle of language modelling: unidirectional prediction of the next text token, so not bi-directional as is the case for the likes of #BERT. A nice way to think of BERT is as a denoising auto-encoder: you feed it sentences but hide some fraction of the words, and BERT has to guess those words. And yet, GPT-3 manages to be surprisingly general purpose and performs surprisingly well on a wide range of tasks (examples curated below).
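The BERT-style denoising objective mentioned above can be sketched in a few lines. This is illustrative only: real BERT masks ~15% of subword tokens (not whole words), and the function name here is made up.

```python
import random

# Sketch of the "denoising auto-encoder" view of BERT: hide some
# fraction of the words, and the training targets are the hidden words.
def mask_words(sentence, frac=0.15, rng=None):
    rng = rng or random.Random(0)
    words = sentence.split()
    targets = {}
    for i in range(len(words)):
        if rng.random() < frac:
            targets[i] = words[i]   # what the model must recover
            words[i] = "[MASK]"
    return " ".join(words), targets

masked, targets = mask_words("the quick brown fox jumps over the lazy dog", frac=0.3)
print(masked)    # sentence with some words replaced by [MASK]
print(targets)   # training targets: position -> hidden word
```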

It turns out that GPT-2 required fine-tuning in order to be applied to various tasks. In the spirit of [[transfer-learning]], GPT-2 can be thought of as the base model; then for a given task (with training data), you train further on that task-specific dataset.

GPT-3 no longer requires this fine-tuning step, which is a big deal. The way it works now is that you give it some samples/examples and then start interacting with it. The idea is that you're basically populating the context window (which is 2048 tokens) and then letting the model do its thing; that's the full extent of the "fine-tuning".

They use this window to differentiate between levels of \(n\)-shot learning: zero-shot is when you don't provide any example question/answer pairs, one-shot is one example, and few-shot is as many examples as fit in the context window (typically 10 to 100). They show improved performance as you go from zero to one to few.
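The zero-/one-/few-shot distinction is really just about how much you put in the context window before the query. A minimal sketch (the task string and example pairs are illustrative, not from the paper's exact prompts):

```python
# The "fine-tuning" is just filling the context window with
# demonstration pairs before the query the model should complete.
def build_prompt(examples, query, task="Translate English to French"):
    """examples: list of (input, output) pairs; zero-shot when empty."""
    lines = [task + ":"]
    for x, y in examples:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")  # the model completes from here
    return "\n".join(lines)

# zero-shot: no demonstrations, just the task description and the query
print(build_prompt([], "cheese"))

# few-shot: a handful of demonstrations in the same window
print(build_prompt([("sea otter", "loutre de mer"),
                    ("peppermint", "menthe poivrée")], "cheese"))
```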

So what’s going on here?

Language models are an incredibly simple paradigm: guess the next word in a sequence. Its simplicity, however, belies the complexity required to achieve the task. It's also a little different from our typical supervised paradigm, since there is no ground truth (there is no such thing as a perfect predictor of text/speech). What you want is a language model that is indistinguishable from humans (i.e. the Turing test).
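The paradigm really is this simple. A toy version, assuming nothing beyond counting (a bigram model; GPT-3 conditions on up to 2048 tokens instead of one, but the objective is the same):

```python
from collections import Counter, defaultdict

# Minimal language model: estimate P(next word | previous word)
# from a corpus by counting, then guess the most likely continuation.
corpus = "the cat sat on the mat and the cat slept".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    # most frequent follower of `word` in the training text
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat ("cat" follows "the" twice, "mat" once)
```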

What’s sort of curious is that you can frame many problems as text (and many people do so online), so in order for the language model to predict the next word across a wide array of contexts, it sort of has to learn all those contexts, indirectly acquiring sub-specialties. Case in point: GPT-3 is able to do pretty simple arithmetic, even without being explicitly trained to do so.

I think what is key is this notion of emergent properties that might appear in large enough networks. It’s like it’s self-organizing, almost. One might be tempted to think that you’re going to get diminishing returns as you increase the size of your model, as is typical of most models in statistics/ML (you reach saturation); that said, there is supposedly a non-trivial collection of papers investigating this scaling question: link. And yet, it seems like the opposite is true. With a general enough architecture (self-attention or convolutions), these things are able to efficiently distribute their knowledge to better predict the output.

Examples

All of these examples basically start by giving examples of question/answer pairs.

  • English to Command Line arguments (video)
    • what I find most interesting about this example, and others, is the way it adapts to the requests you make of it. there’s nothing particularly difficult about regurgitating the command-line code for print working directory; what’s surprising is that it understands what you mean when you try to correct or update it
  • Dialogues (gwern)
    • this is more in the realm of the Turing Test, but it’s pretty shocking how it’s able to maintain a fairly high-level/meta conversation about the nature of intelligence.
  • Gary Marcus Issues (gwern):
    • ML dissident Marcus has an article arguing that GPT-2, and language models in general, are not capable of any form of intelligence, and he proposes some questions to ask such models, which gwern finds GPT-3 is able to answer satisfactorily.

Fairness

This is a research direction on their waitlist:

Fairness and Representation: How should performance criteria be established for fairness and representation? How can responsive systems be established to effectively support the goals of fairness and representation in specific, deployed contexts?

So far we’ve seen people explore the biases inherent in word embeddings and language models; GPT-3 will inevitably fall under the same scrutiny. Is there something manifestly different now, with GPT-3, that makes this a much more interesting question, given that GPT-3 seems able to perform some low-level kinds of reasoning?

In the more basic models, you could easily attribute all the inherent biases to a reflection of the state of the world learnable through text.

My feeling is that this is slowly encroaching on the problem of how to equip artificial intelligence with notions of morality (I wonder if anyone has asked it ethical questions; probably). More importantly, it might require you to actively force it to take a stand on certain kinds of topics.

Ideas

  • it seems to me that the missing link is the #reinforcement_learning type self-learning aspect. like, right now, the weights of GPT-3 are basically fixed. but it would be 1000x more worrying if it was talking to itself, or learning from all the interactions we’re having with it. that feedback mechanism feels like the missing piece.

Next Steps for Deep Learning

src: Quora

Deep Learning 1.0

Limitations:

  • training time/labeled data: this is a common refrain about DL/RL, in that it takes an enormous amount of labeled data. Contrast this with the [[bitter-lesson]], which claims that we should be exploiting computational power.

  • adversarial attacks/fragility/robustness: due to the nature of DL (functional estimation), there’s always going to be fault-lines.
    • selection bias in training data
    • rare events/models are not able to generalize, out of distribution
  • i.i.d assumption on DGP
    • temporal changes/feedback mechanisms/causal structure?

Deep Learning 2.0

Self-supervised learning—learning by predicting input

  • Supervised learning : this is like when a kid points to a tree and says “dog!”, and you’re like, “no”.
  • Unsupervised learning : when there isn’t an answer, and you’re just trying to understand the data better.
    • classically: clustering/pattern recognition.
    • auto-encoders: pass through a bottleneck to reconstruct the input (the main question is how to develop the architecture)
      • i.e. learning efficient representations
  • Self-supervised learning : no longer just about reconstructing the input, but learning from labels manufactured from the data itself
    • e.g. shuffle-and-learn: shuffle video frames and have the model figure out whether they’re in the correct temporal order
    • more about choosing clever objectives to optimize (to learn something intrinsic about the data)
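The shuffle-and-learn idea above can be sketched as a pretext-task generator: the "labels" cost nothing because they come from the data itself. (Names and the frame representation here are illustrative; the real method operates on video frames, not integer indices.)

```python
import random

# Self-supervised pretext task: given a clip (here just a list of
# frame indices), either keep its temporal order or shuffle it, and
# the task is to classify ordered (1) vs shuffled (0).
def make_example(frames, rng):
    if rng.random() < 0.5:
        return frames, 1            # correct temporal order
    shuffled = frames[:]
    while shuffled == frames:       # ensure it is actually reordered
        rng.shuffle(shuffled)
    return shuffled, 0

rng = random.Random(0)
clip = [0, 1, 2, 3, 4]
for _ in range(4):
    seq, label = make_example(clip, rng)
    print(seq, "ordered" if label else "shuffled")
```

Solving this forces the model to learn something temporal/causal about the data without a single human-provided label.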

The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second.

Hinton, via /r/MachineLearning. (I’m not sure I follow the logic here. My guess is that there’s a dichotomy between straight-up perceptual input and actually learning by interacting with the world. But the latter feedback mechanism also contains the perceptual input, so it should be at the right scale.)
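The arithmetic behind the quote (using the usually cited ~10^14 synapses) checks out: far more parameters than seconds of experience, so supervision at a few bits per second can't be where most of the constraint comes from.

```python
# Back-of-the-envelope check of Hinton's numbers.
synapses = 1e14          # ~parameters in the brain
lifetime_seconds = 1e9   # ~30 years of life

# Constraints per second needed to pin down all parameters in a lifetime:
constraints_per_second = synapses / lifetime_seconds
print(constraints_per_second)  # → 100000.0, i.e. ~10^5 per second
```

Only high-bandwidth perceptual input plausibly delivers ~10^5 dimensions of constraint per second; labels and rewards are orders of magnitude too sparse.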

  • much more data than supervised or #reinforcement_learning
  • if coded properly (like the shuffle-and-learn paradigm), then it can learn temporal/causal notions
  • utilization of the agent/learner to direct things, via #attention, for instance.

Figure 1: Infants learn about gravity at around 9 months. Pre: an experiment that pushes a toy car off a ledge but keeps it suspended in the air (with a string) doesn’t surprise them. Post: surprise.

  • case study: BERT. self-supervised learning: predicting the next word, or a missing word in a sentence (reconstruction).
    • not multi-sensory/no explicit agents; this prevents it from picking up physical notions.
      • sure, we need multi-sensory input for general-purpose intelligence, but it’s still surprising how far you can go with just statistics
  • less successful in image/video (i.e. something special about words/language). Could it be that words are already highly representative objects, classifications of a sort, whereas images sit at the bit/pixel level, which is not really how we process such things? This suggests we should find an equivalent type of language/vocabulary for images.
    • we operate at the pixel level, which is suboptimal
  • prediction/learning in (raw) input space vs representation space
    • high-level learning should beat out raw inputs. (My feeling, though, is that you eventually have to go down to the raw-input space, because that is where the comparisons are made/the training is done. In other words, you can transform/encode everything into some better representation, but you’ll need to decode it at some point.)
Remark. Pattern recognition \(\iff\) self-supervised learning?

Leverage power of compositionality in distributed representations

  • combinatorial explosion should be exploited
  • composition: basis for the intuition as to why depth in neural networks is better than width
Remark. Don’t think this is a particularly interesting insight.

Moving away from stationarity

  • stationarity: train/test distribution is the same distribution
    • they talk about IID, which is not quite the same thing (IID is at the individual-sample level; the moment you have correlations, you lose IID)
    • not just following #econometrics, in that the underlying distributions are time-varying though
  • feedback mechanisms (via agents/society) require dropping IID (and relate to [[project-fairness]])
  • example: classifying cows vs camels reduces to classifying desert vs grass (yellow vs green)
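The cows-vs-camels shortcut is easy to make concrete. A toy sketch with invented data: if every training cow stands on grass and every camel on sand, a classifier that only looks at the background gets perfect training accuracy and fails the moment the correlation breaks.

```python
# Invented training set: (label, background colour) pairs where
# background is perfectly correlated with the label.
train = [
    ("cow",   "green"), ("cow",   "green"),
    ("camel", "yellow"), ("camel", "yellow"),
]

def background_classifier(background):
    # a "classifier" that never looks at the animal at all
    return "cow" if background == "green" else "camel"

train_acc = sum(background_classifier(bg) == label
                for label, bg in train) / len(train)
print(train_acc)  # → 1.0, perfect on the training distribution

# distribution shift: a cow photographed in the desert
print(background_classifier("yellow"))  # → camel (the shortcut fails)
```

Nothing in the IID training objective penalizes this solution; only out-of-distribution data exposes it.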

Causality

  • causal distributed representations (causal graphs)
    • allows for causal reasoning
  • it’s a little like reasoning in the representation space?

Lifelong Learning

  • learning-to-learn
  • cumulative learning

Inspiration from Nature

flies: video; paper

See [[calculus-for-brain-computation]].