202012281125

Relational Learning and Word Embeddings

tags: [src:paper, psychology, nlp, embeddings]

Literature

Emergence of Analogy

(Lu, Hongjing, Ying Nian Wu, and Keith J. Holyoak. 2019. “Emergence of analogy from relation learning.” Proceedings of the National Academy of Sciences 116 (10): 4176–81.):

  • language learning (relational learning) can be broken down into two stages:
    • part 1: unsupervised: the infant years, when you’re just taking in words and creating building blocks. it’s akin to word embedding models learning from a corpus.
    • part 2: supervised: bootstrapping from those building blocks to learn higher-order reasoning (relations). akin to small-data, teacher-student model learning.
  • to “test” this, they start with word2vec, then do supervised learning on analogy pairs, the crucial point being that relations are learned as distributed representations (so analogies get a vector space of their own).
    • actually, it’s a little more nuanced: you start with word pairs, then do some preprocessing to turn them into more useful features.
    • they then make a fuss about the representation being decomposable into the two words, which has the nice property that if you swap the positions, you basically get the reverse relation (see the sketch after this list).
  • this is all aesthetically pleasing, but how to verify?
    • not really sure what the results are here…
  • they also have problems with antonyms, which was the impetus for my project on [[signed-word-embeddings]].
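A minimal sketch of that decomposability property, assuming the simplest possible pair feature (a vector difference). Everything here, including the toy embeddings, is my own construction for intuition, not the paper’s actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for stage 1: pretrained word embeddings
# (word2vec in the paper; random toy vectors here)
emb = {w: rng.standard_normal(50) for w in ["hot", "cold", "big", "small"]}

def relation(a, b):
    """Distributed relation representation for the word pair (a, b).

    With a difference feature, swapping the words flips the sign,
    i.e. relation(b, a) == -relation(a, b): the reverse relation.
    """
    return emb[b] - emb[a]

r_ab = relation("hot", "cold")
r_ba = relation("cold", "hot")
assert np.allclose(r_ba, -r_ab)  # swap positions -> reverse relation
```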

Predicting Patterns

src: YCC

Psychologists match human performance against differently constructed word-relation models (based on Word2Vec). Unsurprisingly, they find that the most complicated model (BART) performs best and best matches human performance, thereby conflating human ability with the raison d’être of BART (learning explicit relation representations).

Actual Summary:

  • word relations (like analogies) are complex (contrast, cause-effect, etc.), ambiguous (friend-enemy could signify contrast or similarity, via the notion of a frenemy), and graded (warm-cool vs hot-cool)
    • similar in spirit to words themselves (i.e. words have similar properties)
    • since word embeddings (or the idea of distributed representation) did so much for NLP (and have been shown to be a good approximation for how our brains encode words), perhaps word relations themselves are similarly distributed.
  • let’s consider Word2Vec in the context of semantic relations
    • taking the difference of two word vectors does surprisingly well at determining analogies (this was the selling point of (Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed representations of words and phrases and their compositionality.” In Advances in Neural Information Processing Systems.); also see [[explaining-word-embeddings]] for a principled reason why taking the difference should work). a sketch of this baseline appears after this list.
      • that being said, anyone who’s played around with it will know that it’s not particularly robust
  • this (word2vec-diff) will be our baseline. we’re going to compare it to BART (Bayesian Analogy with Relational Transformations), the weird Bayesian model that they created.
    • the nice thing about BART is that it “assumes semantic relations are coded by distributed representations across a pool of learned relations” (a toy version of this pool idea is also sketched after this list).
  • now, we use these two models and compare their analogical reasoning against humans.
    • lo and behold, BART matches human performance better than word2vec-diff does.
      • I mean, I feel like this is so obvious that there’s no way they haven’t thought about it.
      • they make some robustness claims by noting that human performance itself was all over the place, given the diversity of relations queried (across the two experiments).
    • then they have this result: an individual’s intelligence/cognitive capacity was predictive of clear analogical reasoning (which is generally known), but also of how well BART predicted that individual’s similarity judgement patterns.
      • weird result, not really sure what to make of it; feels like it could just be an artefact of intelligence being correlated with how variable or predictable your answers are, which in turn affects BART’s ability to predict them.
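A sketch of the word2vec-diff baseline referenced above, using gensim’s pretrained vectors (the small GloVe model is just a convenient stand-in; swap in “word2vec-google-news-300” for actual word2vec, at the cost of a much larger download):

```python
import gensim.downloader as api

# small pretrained model; a lightweight stand-in for word2vec
wv = api.load("glove-wiki-gigaword-50")

# the classic vector-offset trick: king - man + woman ~= queen.
# most_similar ranks candidates by cosine similarity to the combined offset.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# the same offset, read directly as a relation vector for a word pair
offset = wv["queen"] - wv["king"]
```

As the note says, this works for the famous examples but degrades quickly on rarer or more abstract relations.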
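And for contrast, a toy version of the “pool of learned relations” idea the BART quote gestures at: train one classifier per known relation on pair-difference features, then represent any new pair by its scores across the pool. This is my own caricature for intuition only; BART itself is a Bayesian model and works differently:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# toy stand-ins for word vectors and a few labeled relation examples
words = ["hot", "cold", "big", "small", "dog", "puppy", "cat", "kitten"]
emb = {w: rng.standard_normal(50) for w in words}
relations = {
    "antonym": [("hot", "cold"), ("big", "small")],
    "adult-young": [("dog", "puppy"), ("cat", "kitten")],
}

def pair_feature(a, b):
    return emb[b] - emb[a]

# one binary classifier per relation in the pool (one-vs-rest)
pool = {}
for name, pos_pairs in relations.items():
    X = [pair_feature(a, b) for a, b in pos_pairs]
    y = [1] * len(X)
    for other, pairs in relations.items():  # other relations as negatives
        if other != name:
            X += [pair_feature(a, b) for a, b in pairs]
            y += [0] * len(pairs)
    pool[name] = LogisticRegression().fit(np.array(X), y)

def relation_representation(a, b):
    """A pair's distributed representation: scores across the relation pool."""
    x = pair_feature(a, b).reshape(1, -1)
    return {name: float(clf.predict_proba(x)[0, 1]) for name, clf in pool.items()}

print(relation_representation("hot", "cold"))  # high "antonym" score (a training pair)
```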