#embeddings

Relational Learning and Word Embeddings

Literature

Emergence of Analogy

(Lu, Hongjing, Ying Nian Wu, and Keith J. Holyoak. 2019. “Emergence of analogy from relation learning.” Proceedings of the National Academy of Sciences 116 (10): 4176–81.):

  • language learning (relational learning) can be broken down:
    • part 1: unsupervised: the infant years, when you’re just taking in words and creating building blocks. akin to word embedding models learning from a corpus.
    • part 2: supervised: bootstrapping from those building blocks, you can learn higher-order reasoning (relations). akin to small-data, teacher-student learning.
  • to “test” this, we start with word2vec, then do supervised learning on analogy pairs, a crucial point here being that you learn this in a distributed manner (so then analogies have their own space).
    • actually, it’s a little more nuanced: you first start with word pairs, then do some kind of preprocessing to make them more useful.
    • then they make a fuss about the representation being decomposable into the two words, which has the nice property that if you just swap the positions, you basically get the reverse relation.
  • this is all aesthetically pleasing, but how to verify?
    • not really sure what the results are here…
  • they also have problems with antonyms, which was the impetus for my project on [[signed-word-embeddings]].
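The swap-gives-the-reverse-relation point can be sketched minimally. This is an illustration under assumptions, not the paper’s actual learned relation vectors: the pair representation here is a plain vector difference, and the word vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
# stand-ins for pretrained word vectors (random here; real ones come from word2vec)
king, queen = rng.normal(size=dim), rng.normal(size=dim)

def relation(a, b):
    # a toy pair representation: the vector difference b - a
    return b - a

fwd = relation(king, queen)   # e.g. "king -> queen"
rev = relation(queen, king)   # positions swapped
# swapping the words negates the representation, i.e. yields the reverse relation
assert np.allclose(rev, -fwd)
```

With a richer decomposable representation (e.g. a concatenation of per-word features), the same property holds structurally: swapping the slots swaps the roles.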

Predicting Patterns

src: YCC

Psychologists match human performance against differently constructed word-relation models (based on Word2Vec). Unsurprisingly, they find that the most complicated model (BART) performs best and best matches human performance, thereby conflating human ability with the raison d’être of BART (learning explicit relation representations).

Actual Summary:

  • word relations (like analogies) are complex (contrast, cause-effect, etc.), ambiguous (friend-enemy could signify contrast or similarity, via the notion of frenemy), graded (warm-cool vs hot-cool)
    • similar in spirit to words themselves (i.e. words have similar properties)
    • since word embeddings (or the idea of distributed representation) did so much for NLP (and have been shown to be a good approximation for how our brains encode words), perhaps word relations themselves are similarly distributed.
  • let’s consider Word2Vec in the context of semantic relations
    • taking the diff of two vectors does surprisingly well at determining analogies (this was the selling point of (Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed representations of words and phrases and their compositionality.” In Advances in Neural Information Processing Systems.); also see [[explaining-word-embeddings]] for a principled reason why taking the difference should work)
      • that being said, anyone who’s played around with it will know that it’s not particularly robust
  • this will be our baseline. we’re going to compare it to BART (Bayesian Analogy with Relational Transformations), the weird Bayesian model that they created.
    • the nice thing about BART is that it “assumes semantic relations are coded by distributed representations across a pool of learned relations”.
  • now, we’re going to use these two models and compare their analogical reasoning against humans.
      • lo and behold, BART better matches human performance than word2vec-diff.
      • I mean, what I’m saying feels so obvious that there’s no way they haven’t thought about it.
      • they have some robustness claims by referring to the fact that human performance itself was sort of all over the place, in light of the diversity of relations queried (across the two experiments).
    • then they have this result: they show that intelligence/cognitive capacity (of the individual) was predictive of clear analogical reasoning (which is generally known) but also of how well BART predicted individual similarity judgement patterns.
      • weird result, not really sure what to make of this – feels like it could just be an artefact of the fact that intelligence is correlated with how variable or predictable your answers are, and so affects BART’s ability to predict.
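The word2vec-diff baseline above is just the classic vector-offset trick. Here is a toy version; the 2-D “embeddings” with hand-picked royalty/gender axes are made up for illustration (real word2vec vectors are learned and typically ~300-D):

```python
import numpy as np

# toy 2-D "embeddings" on two hand-picked axes:
# axis 0 = royalty, axis 1 = gender (an assumption for illustration)
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector-offset analogy: king - man + woman ≈ ?
target = vocab["king"] - vocab["man"] + vocab["woman"]
# standard practice: exclude the query words when ranking candidates
best = max((w for w in vocab if w != "king"), key=lambda w: cos(vocab[w], target))
print(best)  # queen
```

In real embeddings the offset only approximately lines up with the relation, which is exactly the lack of robustness noted above.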

Debiasing Word Embeddings

Resources

This is the abstract of Gonen and Goldberg (2019), “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings”:

Word embeddings are widely used in NLP for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is mostly hiding the bias, not removing it. The gender bias information is still reflected in the distances between “gender-neutralized” words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.
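A minimal numpy sketch of the core claim, using synthetic vectors (the gender direction, dimensionality, and the “profession” word groups are all stand-ins, not the paper’s data): hard debiasing zeroes the component along the gender direction, yet bias-correlated structure in the remaining dimensions still separates the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.zeros(10)
g[0] = 1.0  # hypothetical gender direction (think he - she), dimension 0

def make(bias):
    v = rng.normal(scale=0.1, size=10)
    v[0] += bias        # component along the gender direction
    v[1] += 2.0 * bias  # bias-correlated component NOT on the gender direction
    return v

male_biased   = [make(+1.0) for _ in range(5)]  # stand-ins for "male" professions
female_biased = [make(-1.0) for _ in range(5)]

def debias(v):
    # hard debiasing: project out the component along g
    return v - (v @ g) * g

deb_m = [debias(v) for v in male_biased]
deb_f = [debias(v) for v in female_biased]

# the projection on g is now exactly zero for every word...
assert all(abs(v @ g) < 1e-12 for v in deb_m + deb_f)

# ...but the bias is still recoverable from distances between "debiased" words:
d_within = np.mean([np.linalg.norm(a - b) for a in deb_m for b in deb_m])
d_across = np.mean([np.linalg.norm(a - b) for a in deb_m for b in deb_f])
assert d_within < d_across  # male-biased words still cluster together
```

This mirrors the paper’s clustering experiments in spirit: a bias definition based only on the projection onto a single direction can be satisfied while the neighborhood structure still encodes the bias.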

Signed Word Embeddings

The basis for word embeddings lies in the distributional hypothesis, which states that a word “is characterized by the company it keeps” (Firth, 1957).
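A minimal count-based sketch of the distributional hypothesis, using a toy corpus (made up for illustration): represent each word by counts of the words it co-occurs with in the same sentence, and words that keep similar company end up with similar vectors.

```python
import numpy as np
from itertools import combinations

# toy corpus (an assumption for illustration)
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
    "the dog ate the bone",
]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# symmetric sentence-level co-occurrence counts
M = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    for a, b in combinations(s.split(), 2):
        if a != b:
            M[idx[a], idx[b]] += 1
            M[idx[b], idx[a]] += 1

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" and "dog" keep similar company (both appear with "chased" and "the"),
# so their count vectors are closer than those of "cat" and "cheese"
sim_cat_dog = cos(M[idx["cat"]], M[idx["dog"]])
sim_cat_cheese = cos(M[idx["cat"]], M[idx["cheese"]])
print(sim_cat_dog > sim_cat_cheese)  # True
```

Methods like word2vec can be seen as learned, dense compressions of exactly this kind of co-occurrence statistic.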