#nlp

Unsupervised Language Translation

Two papers:

Unsupervised Neural Machine Translation

Typically, machine translation is a supervised problem: you have parallel corpora (e.g. UN transcripts). In many cases, however, you don’t have such data. Sometimes you can sidestep this problem if there’s a bridge language for which parallel datasets do exist. What if you don’t have any such data, only monolingual data? That is the unsupervised problem.

Figure 1: Architecture of the proposed system.

Architecture of the proposed system.
  1. They assume the existence of an unsupervised cross-lingual embedding. This is a key assumption, and their entire architecture sort of rests on this. Essentially, you form embedding vectors for two languages separately (which is unsupervised), and then align them algorithmically, so that they now reside in a shared, bilingual space.1 It’s only a small stretch to imagine performing this on multiple languages, so that you get some notion of a universal language space.
  2. From there, you can use a shared encoder, since the two inputs are from a shared space. Recall that the goal of the encoder is to reduce/sparsify the input (from which the decoder can reproduce it) – in the case of a shared encoder, by virtue of the cross-lingual embedding, you’re getting a language-agnostic encoder, which hopefully gets at meaning of the words.
  3. Again, somewhat naturally, this means you’re basically building both directions of the translation, or what they call the dual structure.
  4. Altogether, what you get is a pretty cute autoencoder architecture. Essentially, what you’re doing is training something like a [[siamese-network]]; you have the shared encoder, and then two separate decoders for each language. During training, you’re basically doing normal autoencoding, and then during inference, you just flip – cute!
  5. To ensure this isn’t a trivial task, they adopt the framework of the denoising autoencoder, and shuffle the words around in the input.2 I’m so used to bag-of-words style models, or even the more classical word embeddings that didn’t care about the ordering in the window, that this just feels like that – we’re harking back to the wild-wild-west, when we didn’t have context-aware embeddings. I guess it’s a little difficult to do something like squeeze all these tokens into a smaller dimension. However, this clearly doesn’t do that much – it’s just scrambling.
  6. The trick is then to adapt the so-called back-translation approach of (Sennrich, Haddow, and Birch 2016Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Improving Neural Machine Translation Models with Monolingual Data.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 86–96. Berlin, Germany: Association for Computational Linguistics.) in a sort of alternating fashion. I think what it boils down to is just flipping the switch during training.
  7. Altogether, you have two types of mini-batch training schemes, and you alternate between the two. The first is same language (L1 + \(\epsilon\) -> L1), adding noise. The second is different language (L1 -> L2), using the current state of the NMT (neural machine translation) model as the data.
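The alternation in step 7 can be sketched as follows. This is a minimal sketch: the local-shuffle noise model in `shuffle_noise` and the `translate` placeholder (standing in for the current NMT model) are my assumptions, not the papers’ exact procedure.

```python
import random

def shuffle_noise(tokens, max_offset=3, rng=None):
    """Corrupt a sentence by locally shuffling its words: each token ends up
    roughly within max_offset positions of where it started (one common way
    to implement the denoising step)."""
    rng = rng or random.Random()
    keyed = [(i + rng.uniform(0, max_offset), tok) for i, tok in enumerate(tokens)]
    return [tok for _, tok in sorted(keyed)]

def training_step(batch_idx, l1_sentence, l2_sentence, translate):
    """Alternate between the two mini-batch schemes: denoising autoencoding
    and back-translation. Sentences are lists of tokens; `translate` stands
    in for the current state of the NMT model."""
    if batch_idx % 2 == 0:
        # scheme 1: same language with noise (L1 + eps -> L1)
        src, tgt = shuffle_noise(l1_sentence), l1_sentence
    else:
        # scheme 2: back-translate L2 into pseudo-L1, train pseudo-L1 -> L2
        src, tgt = translate(l2_sentence), l2_sentence
    return src, tgt
```

Note that scheme 2 bootstraps off whatever the model currently produces, which is why the alternation matters: the denoising objective keeps the encoder honest while the back-translated pairs gradually improve.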

Word Translation Without Parallel Data

Figure 2: Toy Illustration

Toy Illustration
  1. With the similar constraint of just having two monolingual corpora, they tackle step zero of the first paper, namely how to align two embeddings (the unsupervised cross-lingual embedding step). They employ adversarial training (like GANs),3 and in fact follow the same training mechanism as GANs.
  2. A little history of these cross-lingual embeddings: (Mikolov, Le, and Sutskever 2013Mikolov, Tomas, Quoc V Le, and Ilya Sutskever. 2013. “Exploiting Similarities among Languages for Machine Translation.” arXiv.org, September. http://arxiv.org/abs/1309.4168v1.) noticed structural similarities in embeddings across languages, and so used a parallel vocabulary (of 5000 words) to do alignment. Later versions used even smaller intersection sets (e.g. parallel vocabulary of aligned digits of (Artetxe, Labaka, and Agirre 2017Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. 2017. “Learning bilingual word embeddings with (almost) no bilingual data.” In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 451–62. Vancouver, Canada: Association for Computational Linguistics.)). The optimization problem is to learn a linear mapping \(W\) such that4 It turns out that enforcing \(W\) to be orthogonal (i.e. a rotation) gives better results, which reduces to the Procrustes algorithm, much like what we used for the dynamic word embedding project. \[ W^\star = \arg\min_{W \in M_d (\mathbb{R})} \| W X - Y \|_{F} \]
  3. Given two sets of word embeddings \(\mathcal{X}, \mathcal{Y}\), the discriminator tries to distinguish between elements randomly sampled from \(W\mathcal{X}\) and \(\mathcal{Y}\), while the linear mapping \(W\) (generator) is learned to make that task difficult.
  4. Refinement step: the above procedure doesn’t do that well, because it doesn’t take into account word frequency.5 Why don’t they change the procedure to weigh points according to their frequency then? But now you have something like a supervised dictionary (set of common words): you pick the most frequent words and their mutual nearest neighbours, set this as your synthetic dictionary, and apply the Procrustes algorithm to align once again.
  5. It’s pretty important to ensure that the dictionary is correct, since you’re basically using that as the ground truth by which you align. Using \(k\)-NN is problematic for many reasons (in high dimensions), but one is that it’s asymmetric, and you get hubs (NN of many vectors). They therefore devise a new (similarity) measure, derived from \(k\)-NN: essentially for a word, you consider the \(k\)-NNs in the other domain, and then you take the average cosine similarity. You then penalize the cosine similarity of a pair of vectors by this sort-of neighbourhood concentration.6 Intuitively, you penalize vectors whose NN set is concentrated (i.e. it’s difficult to tell who is the actual nearest neighbor).
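Both key pieces above have compact implementations: the closed-form orthogonal Procrustes solution for \(W\), and the CSLS-style similarity that penalizes hubs. A minimal sketch with NumPy (the column-vectors convention and the `k` default are my choices, not the paper’s):

```python
import numpy as np

def procrustes(X, Y):
    """Best orthogonal W minimizing ||W X - Y||_F, where columns of X, Y are
    word vectors. Closed form: W = U V^T, with U S V^T the SVD of Y X^T."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def csls(WX, Y, k=10):
    """CSLS-style similarity: penalize the cosine similarity of each pair by
    the mean similarity of each vector to its k nearest neighbours in the
    other domain (the 'neighbourhood concentration' penalty).
    Returns an (n_x, n_y) score matrix."""
    Wn = WX / np.linalg.norm(WX, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    cos = Wn.T @ Yn                                   # pairwise cosines
    r_x = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # hubness of each x
    r_y = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # hubness of each y
    return 2 * cos - r_x[:, None] - r_y[None, :]
```

A vector that is close to everything gets a large `r_x`, so its scores are pushed down across the board, which is exactly the hub-penalizing behaviour described above.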

Post-Thoughts

It is interesting that these two papers are tackling the same problem, at different stages of the NMT pipeline. In hindsight, it would have made much more sense to read the second paper first.

  • Alternating minimization is a common strategy in many areas of computer science. For the most part (at least in academia), its application is limited by the difficulty

The Language of Animals

src: NewYorker, Earth Species.

This is a really interesting problem.

Learning as the Unsupervised Alignment of Conceptual Systems

src: (Roads and Love 2020Roads, Brett D, and Bradley C Love. 2020. Learning as the unsupervised alignment of conceptual systems.” Nature Machine Intelligence 2 (1): 76–82.)

The surprising thing about * embeddings is that they rely solely on co-occurrence, which you can define however you want. This makes them a powerful generalized tool.1 And more generally, a key insight in statistical NLP is to not worry (too much) about the words themselves (except maybe during the preprocessing step, with things like stemming), but simply treat them as arbitrary tokens. For example, as in this paper, we can consider objects (or captions) of an image, and co-occurrence for objects that appear together in an image. From this dataset (Open Images V4: github), we can construct a set of embedding vectors (using GloVe) for the objects/captions (call this GloVe-img).
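To make the “co-occurrence however you want” point concrete, here is a toy count of object co-occurrence over images (the labels and data are made up for illustration); a matrix of counts like this is what a GloVe-style model would then factorize:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(images):
    """Count how often each pair of object labels appears in the same image.
    `images` is a list of label sets; the definition of 'co-occur' is
    entirely up to you, which is the point."""
    counts = Counter()
    for labels in images:
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts

images = [{"dog", "ball", "grass"}, {"dog", "person"}, {"ball", "grass"}]
counts = cooccurrence(images)
# ("ball", "grass") co-occur in two of the three images
```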

What is this set of embeddings? You can think of this as a crude learning mechanism of the world, using just visual data.2 Potential caveat (?): part of the data reflects what people want to take photos of, and be situated together. Though for the most part the objects in the images aren’t being orchestrated, it’s more just what you find naturally together. In other words, if a child were to learn through proximity-based associations alone, then perhaps this would be the extent of their understanding of the world.

A natural followup is, then, how does this embedding compare to the standard GloVe learned from a large text corpus?

At this point, I need to digress and talk about what this paper does:

Linguistic Neuroscience

Wild Conjectures

As somewhat of an interdisciplinary nut1 like a gun-nut, except, with interdisciplinary research, instead of guns, duh!, it comes as no surprise that I think the advent of computational linguistics has really helped us better understand language. Something that I try to emphasize in class is the uniqueness of languages and words as a data form. In some (highly reductionist) sense, it’s just categorical data. For one thing, that’s definitely not how we would think about it at the outset, but the moment you start thinking statistically, you realize it’s a very natural formulation. Secondly, there’s all this interesting structure and rules (i.e. grammar), meaning, relationships, logic, intelligence. In fact, as we’ve sort of discovered from [[gpt3]], it is quite the proxy for general intelligence. And in between all this are the distributed semantic representation lessons from #embeddings. But I digress.

A natural question to ask is, are all these things we’ve learned about language and words back-propagatable to the way our brains understand and organize language? It is rather tempting to hope that word embedding representations might be similar to the way our brains encode words. And it turns out that there is empirical evidence to suggest that something interesting is going on, though I need to read the literature more carefully. This piece is an attempt to take a skeptical view of all this.

Firstly, I’m pretty sure that we have no idea how our brains handle words/language. What we can do is the next best thing, which is to see what our brain patterns (i.e. fMRI scans) look like when we hear particular words: that is, we can equate words to “brain activation patterns.”2 I’m being particularly vague here because there are many modalities of such patterns, from using magnets to oxygen levels, the details of which I haven’t been bothered to store into memory.3 This reminds me of conversations from #vbw. Probably many millions of human-brain-hours have been spent tackling this problem, so I don’t think this little brain will make any particularly groundbreaking inroads here.

A few remarks though:

  • I think you understand the mechanics of a system best when it’s under stress/strain. In my case, when doing things like word problems/crossword puzzles or just whenever a word is at the tip of your tongue.
    • I’m not really sure what space I’m traversing (if any) when I’m reaching for a particular word that’s at the tip of my tongue, but in any case I feel it must be a confluence of sounds/muscle-memory/meaning/memory.
    • There must be individual differences in terms of degrees of abstract thought.
  • I find that Chinese and English are very different systems in my head (which can be partially explained by my not-so-great Chinese). In particular, I think rather phonetically when it comes to Chinese, which has the funny byproduct that I’m much better at making puns in Chinese than my innate ability would suggest otherwise.
  • I just find it hard to believe that the way I reach words in my head is accessing some space: in particular, that would suggest that oftentimes I’ll mistake words that are very close in said space (which rarely happens).

All this is to say: my default position is that the distributed representation of words might be correlated, but that correlation is almost tautological: that is, if you have any good representation of words, then it must necessarily be correlated with the way our brains process words.4 Having written that statement out, it actually feels like quite a strong statement, and I’m not even sure if it’s true. I think what my intuition is getting at is: the act of finding a representation is like collapsing something down to a finite-dimensional vector space. The process of doing so makes any potential equivalence/correlation moot.

The problem is that what I’ve described above is almost impossible to falsify!

Another way to frame this is that, the sheer complexity of language means that if you’re squeezing it down5 It’s a little like a compression algorithm. into a small space (\(\mathbb{R}^{300}\))6 Funny how things quickly become small. then there’s no way for any two representations to be uncorrelated?

Literature

(Huth et al. 2016Huth, Alexander G, Wendy A de Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. 2016. “Natural speech reveals the semantic maps that tile human cerebral cortex.” Nature 532 (7600): 453–58.): looks like there are dedicated regions of the brain that correspond to various semantic types (and it’s pretty consistent across individuals). They do this by forming a generative model: you have local regions (hidden states) that are disjoint and provide full support, which you then learn.

Remark. This supports my hypothesis (!), in that if there are just distinct regions that trigger depending on the semantic classification, then that should be sufficient to be correlated with the distributed word vectors.

(Fereidooni et al. 2020Fereidooni, Sam, Viola Mocz, Dragomir Radev, and Marvin Chun. 2020. “Understanding and Improving Word Embeddings through a Neuroscientific Lens.” bioRxiv, September, 2020.09.18.304436.): they first show that correlation is significant (but it’s 0.1 (!), which is like…I guess not 0…?). They then find a way to inform word embedding models with data from the features from brain scans (didn’t look at this that carefully, as I don’t really trust some of the authors of this paper).

Relational Learning and Word Embeddings

Literature

Emergence of Analogy

(Lu, Wu, and Holyoak 2019Lu, Hongjing, Ying Nian Wu, and Keith J Holyoak. 2019. “Emergence of analogy from relation learning.” Proceedings of the National Academy of Sciences of the United States of America 116 (10): 4176–81.):

  • language learning (relational learning) can be broken down:
    • part 1: unsupervised: infant years, when you’re just taking in words. creating building blocks. it’s akin to word embedding models learning from a corpus.
    • part 2: supervised: bootstrapping from the above building blocks, can learn higher-order reasoning (relations). akin to small-data, teacher-student model learning.
  • to “test” this, we start with word2vec, then do supervised learning on analogy pairs, a crucial point here being that you learn this in a distributed manner (so then analogies have their own space).
    • actually, it’s a little more nuanced: you first start with word pairs, then do some kind of preprocessing to make them more useful.
    • then they make a fuss about the representation being decomposable by the two words, which has the nice property that if you just swap the positions, you basically get the reverse relation.
  • this is all aesthetically pleasing, but how to verify?
    • not really sure what the results are here…
  • they also have problems with antonyms, which was the impetus for my project on [[signed-word-embeddings]].

Predicting Patterns

src: YCC

Psychologists match human performance to differently constructed word relation models (based on Word2Vec). Unsurprisingly they find that the most complicated model (BART) performs the best and best matches human performance, thereby conflating human ability with the raison d’être of BART (learn explicit representations).

Actual Summary:

  • word relations (like analogies) are complex (contrast, cause-effect, etc.), ambiguous (friend-enemy could signify contrast or similarity, via the notion of frenemy), graded (warm-cool vs hot-cool)
    • similar in spirit to words themselves (i.e. words have similar properties)
    • since word embeddings (or the idea of distributed representation) did so much for NLP (and have been shown to be a good approximation for how our brains encode words), perhaps word relations themselves are similarly distributed.
  • lets consider Word2Vec in the context of semantic relations
    • taking the diff of two vectors does surprisingly well in determining analogies (and was the selling point of (Mikolov et al. 2013Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed representations of words and phrases and their compositionality.” In Advances in Neural Information Processing Systems. Google LLC, Mountain View, United States.); also see [[explaining-word-embeddings]] for a principled reason why taking the difference should work)
      • that being said, anyone who’s played around with it will know that it’s not particularly robust
  • this will be our baseline. we’re going to compare it to BART, which is the weird Bayesian model that they created.
    • the nice thing about BART is that it “assumes semantic relations are coded by distributed representations across a pool of learned relations”.
  • now, we’re going to use these two models and compare their analogical reasoning against humans.
    • lo-and-behold, BART better matches human performance than word2vec-diff.
      • I mean, I feel like what I’m saying is just too obvious that there’s no way that they haven’t thought about this.
      • they have some robustness claims by referring to the fact that human performance itself was sort of all over the place, in light of the diversity of relations queried (across the two experiments).
    • then they have this result: they show that intelligence/cognitive capacity (of the individual) was predictive of clear analogical reasoning (which is generally known) but also of how well BART predicted individual similarity judgement patterns.
      • weird result, not really sure what to make of this – feels like it could just be an artefact of the fact that intelligence is correlated with how variable or predictable your answers are, and so affects BART’s ability to predict.
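The word2vec-diff baseline discussed above can be shown on made-up toy vectors (the numbers below are illustrative, not from any trained model):

```python
import numpy as np

# Toy embeddings: the classic diff test says
# vec(king) - vec(man) + vec(woman) should land nearest vec(queen).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.8, 0.8, 0.05]),
}

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by the nearest cosine neighbour of b - a + c,
    excluding the three query words themselves (the standard convention)."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# analogy("man", "king", "woman", emb) → "queen"
```

The fragility mentioned above usually shows up in exactly this setup: with real embeddings, the nearest neighbour of the diff vector is often one of the query words, which is why excluding them is necessary at all.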

GPT-3

paper: arXiv

This is a 175b-parameter language model (117x larger than GPT-2). It doesn’t use SOTA architecture (the architecture is from 2018: 96 transformer layers with 96 attention heads (see [[transformers]])1 The self-attention somehow encodes the relative position of the words (using some trigonometric functions?).), is trained on a few different text corpora, including CommonCrawl (i.e. random internet html text dump), wikipedia, and books,2 The way they determine the pertinence/weight of a source is by upvotes on Reddit. Not sure how I feel about that. and follows the principle of language modelling (unidirectional prediction of next text token).3 So, not bi-directional as is the case for the likes of #BERT. A nice way to think of BERT is a denoising auto-encoder: you feed it sentences but hide some fraction of the words, and BERT has to guess those words. And yet, it’s surprisingly general purpose and performs surprisingly well on a wide range of tasks (examples curated below).

It turns out that GPT-2 required fine-tuning in order to be applied to various tasks. In the spirit of [[transfer-learning]], GPT-2 can be thought of as the base model, and then for a given task (with training data), you then train on that specific dataset.

GPT-3 no longer requires this fine-tuning step, which is a big deal. The way it works now is that you give it some samples/examples, and then start interacting with it. In fact, that’s it. The idea is that you’re basically populating the context window (which is 2048 tokens), and then letting the model do its thing. That’s basically the extent of the fine-tuning.

They use this window to differentiate between different levels of \(n\)-shot learning. So, zero-shot learning is when you don’t provide any examples of question/answer pairs, one-shot is one example, and few-shot is ~10 examples (that enable you to fill up the context window). They show improved performance as you go from zero to one to few.
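Concretely, the only difference between the \(n\)-shot settings is how many demonstrations you pack into the context window before the query. A minimal sketch (the Q:/A: template is a common convention, an assumption here, not the paper’s exact format):

```python
def build_prompt(examples, query, n_shot):
    """Assemble an n-shot prompt: the first n_shot example pairs go into
    the context window, followed by the unanswered query."""
    lines = []
    for q, a in examples[:n_shot]:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)

examples = [("2 + 2", "4"), ("3 + 5", "8")]
zero_shot = build_prompt(examples, "7 + 1", n_shot=0)  # no demonstrations
few_shot = build_prompt(examples, "7 + 1", n_shot=2)   # two demonstrations
```

No gradient updates happen anywhere here; the “learning” is entirely in the model conditioning on the demonstrations at inference time.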

So what’s going on here?

Language models are an incredibly simple paradigm: guess the next word in a sequence. Its simplicity, however, belies the level of complexity required to achieve such a task. It’s also a little different from our typical supervised paradigm, as there is no truth (i.e. there is no such thing as a perfect predictor of text/speech). What you want is to create a language model that is indistinguishable from humans (i.e. the Turing test).
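The paradigm really is just “guess the next word”; here is the smallest possible counting version of it (a toy bigram model, purely illustrative, nothing like GPT-3’s architecture):

```python
from collections import Counter, defaultdict

def fit_bigram(corpus):
    """Train the simplest possible language model: count which word follows
    which, and predict the most frequent successor."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for w, nxt in zip(words, words[1:]):
            follows[w][nxt] += 1
    return follows

def predict(model, word):
    """Guess the next word: the most common successor seen in training."""
    return model[word].most_common(1)[0][0]

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = fit_bigram(corpus)
# predict(model, "the") → "cat"  (seen twice, vs "dog" once)
```

Everything from here to GPT-3 is, in some sense, replacing the count table with a 175b-parameter function approximator over a much longer context.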

What’s sort of curious is that you can frame many problems as text (and many people do so online), and so in order for the language model to be able to predict the next word in a wide array of contexts, it sort of has to learn all these various contexts, so indirectly having sub-specialties. Case in point: GPT-3 is able to do pretty simple arithmetic, even without being explicitly told to do so.

I think what is key is this notion of emergent properties that might be happening with large enough networks. It’s like it’s self-organizing, almost. One might be tempted to think that you’re going to get diminishing returns as you increase the size of your model, as is typical of most models in statistics/ML (you get to saturation).4 That being said, there are supposedly a non-trivial collection of papers that investigate this scaling question: link. And yet, it seems like the opposite is true. With a general enough architecture (self-attention or convolutions), these things are able to efficiently distribute their knowledge to better predict the output.

Examples

All of these examples basically start by giving examples of question/answer pairs.

  • English to Command Line arguments (video)
    • what I find most interesting about this example, and others, is the way in which it is able to adapt to the requests that you make of it. there’s nothing particularly difficult about regurgitating what the command line code for print working directory is. what’s surprising is that it understands what you mean when you try to correct it, or update it
  • Dialogues (gwern)
    • this is more in the realms of the Turing Test, but it’s pretty shocking how it is able to maintain a pretty high-level/meta conversation about the nature of intelligence.
  • Gary Marcus Issues (gwern):
    • ML dissident Marcus has an article arguing that GPT-2, and language models in general, are not capable of any form of intelligence. and he proposes some questions to ask the language model, which gwern finds GPT-3 is able to answer in a satisfactory manner.

Fairness

This is a research direction on their waitlist:

Fairness and Representation: How should performance criteria be established for fairness and representation? How can responsive systems be established to effectively support the goals of fairness and representation in specific, deployed contexts?

So far we’ve seen people explore the biases inherent in word embeddings and language models; GPT-3 will inevitably fall into the purview of such scrutiny. Is there something manifestly different now, with GPT-3, that makes this a much more interesting question? It feels as if GPT-3 is able to perform some low-level kinds of reasoning.

In the more basic models, you could easily attribute all the inherent biases to a reflection of the state of the world learnable through text.

My feeling is that this is slowly encroaching into the problem of how to equip artificial intelligence with notions of morality (I wonder if anyone has asked ethical questions—probably), though more importantly, might require you to actively force it to take a stand on certain kinds of topics.

Ideas

  • it seems to me that the missing link is the #reinforcement_learning type self-learning aspect. like, right now, the weights of GPT-3 are basically set. but it would be 1000x more worrying if it was talking to itself, or learning from all the interactions we are doing with it. that feedback mechanism feels like it would be the missing picture.

Debiasing Word Embeddings

Resources

Word embeddings are widely used in NLP for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is mostly hiding the bias, not removing it. The gender bias information is still reflected in the distances between “gender-neutralized” words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.
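The debiasing being critiqued boils down, in its simplest form, to projecting word vectors off a learned “gender direction”. A minimal sketch of that single-direction variant (toy vectors; real methods identify the direction from word pairs and handle multiple components):

```python
import numpy as np

def debias(w, v):
    """Remove the component of word vector w along a bias direction v
    (normalized first). The argument above is that this mostly hides,
    rather than removes, the bias."""
    v = v / np.linalg.norm(v)
    return w - (w @ v) * v

# toy bias direction, e.g. vec(he) - vec(she) (made-up numbers)
v = np.array([1.0, 0.0, 0.0])
w = np.array([0.4, 0.2, 0.7])
w_debiased = debias(w, v)
# w_debiased @ v == 0: no component left along the bias direction
```

The point of the abstract is that even after this projection, the *pairwise distances* among “gender-neutralized” words still encode the bias, so a classifier can recover it.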

Explaining Word Embeddings

Todo

  • recent paper exploring this further: blog

Signed Word Embeddings

The basis for word embeddings lies in the distributional hypothesis, which states that “a word is characterized by the company it keeps”.
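One classical way to operationalize the distributional hypothesis is to factorize a co-occurrence matrix; a crude sketch (the toy matrix is made up, and real pipelines add PPMI weighting and other refinements before the SVD):

```python
import numpy as np

def embed_from_cooccurrence(C, dim=2):
    """Build crude word embeddings by truncated SVD of a (words x words)
    co-occurrence matrix; each row of the result is a word vector."""
    U, S, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :dim] * S[:dim]

# toy counts over the vocabulary [cat, dog, bone]
C = np.array([[0.0, 8.0, 1.0],
              [8.0, 0.0, 6.0],
              [1.0, 6.0, 0.0]])
E = embed_from_cooccurrence(C, dim=2)
```

Words that keep similar company get similar rows, which is the hypothesis in action.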

Project Misinformation

Goal: detect disinformation

Resource: Awesome list

Literature Review

  • #paper paper on misinformation with neural network
    • they use something known as “cascade model”, which takes advantage of the twitter architecture to capture responses/retweets, and use the content of the retweets to help classify the truthiness of the original tweet
    • #idea there must be something here that allows you to merge ideas of nlp/tweet responses/the underlying social network
  • #paper Bias Misperceived: The Role of Partisanship and Misinformation in YouTube Comment Moderation
    • this is a little different, but it also deals with partisanship: there’s a dataset that has partisanship scores for websites (which then gets linked to Youtube videos in some weird way)
    • thesis: is there political bias in terms of youtube comment censorship
  • #paper survey on misinformation 👍
    • covariates:
      • source
      • content
        • lots of “descriptive” results on the traits of fake news headlines (longer titles, more capitalized words)
      • user response (on social media) (cascade)
        • propagation structure
    • methods:
      • cue/feature, which is basically the pre-NLP era way of doing linguistic analysis
        • lie detection: linguistic cues of deception
      • deep learning based methods: this is what we want to target
        • #paper FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media
          • supposedly, this paper shows that these kinds of methods have bad prediction scores (on the new dataset)
      • feedback-based (covariates/secondary information)
        • propagation
        • temporal
        • response text
        • response users
    • something that we haven’t even talked about, is intervention: what kinds of methods are available to combat these bad actors.

Language Generation