202006131504

Transformers

tags: [ neural_networks , machine_learning , _todo , _attention ]

Transformers

  • key is multi-head self-attention
    • encoded representation of input: key-value pairs (\(K \in \mathbb{R}^{n \times d_k}\), \(V \in \mathbb{R}^{n \times d_v}\), one pair per input position)
      • corresponding to the encoder hidden states
    • previous output is compressed into a query \(Q \in \mathbb{R}^{m \times d_k}\)
    • output of the attention layer is a weighted sum of the values (\(V\)), with the weight on each value given by the scaled dot-product compatibility of the query with the corresponding key: \(\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\)
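The weighted-sum mechanism above can be sketched in a few lines of NumPy; this is a minimal single-head version of the standard scaled dot-product attention (function name and example shapes are mine, not from the note):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (m, n): each query vs. each key
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ V, weights                      # (m, d_v): weighted sum of values

# hypothetical shapes: m=2 queries, n=5 key-value pairs, d_k=4, d_v=3
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention just runs several of these in parallel on learned linear projections of \(Q, K, V\) and concatenates the results.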

Todo

  • [[toread]]: Visualizing and Measuring the Geometry of BERT (arXiv)
  • random blog
  • pretty intuitive description of transformers on tumblr, via the LessWrong community