#machine_learning

Explainable Trees

To Read:

Dataset Bias

src: (Tommasi et al. 2015) — Tommasi, Tatiana, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. 2015. “A Deeper Look at Dataset Bias.” arXiv, May. http://arxiv.org/abs/1505.01257v1.

Machine learning fundamentally operates by finding patterns in datasets. (I always knew that datasets were biased, especially given the whole fairness problem, and that this leads to various problems, but I didn’t realize this was a whole field of study. Fascinating.) As such, the particulars of the dataset that you train on constrain which models can possibly be learned.

Focusing on visual data for the moment: even in the era of big data, most datasets cannot possibly capture every facet of visual information. (This ties into the problem with self-driving cars, where your dataset can’t possibly contain every single circumstance, and so it is the way-off-in-the-tail situations that cause the most headache, much like what people like Taleb always talk about.) So someone has to contend with the blind spots and biases that result from the particular curation of the data.

Causes:

So now we know we have all these problems with dataset coverage. Ultimately, though, the thing we actually care about is generalization performance: how well the model does out-of-sample. After all, even if your dataset has all these issues, if the model is intelligent enough to do well out-of-sample, then it’s sort of a moot point.

Key terms:

A Statistician’s View

I think ML people take a very practical view of this problem. Yes, there is talk of conditional/marginal, but I think those are ultimately just convenient words. Statisticians rarely worry about all these problems, mainly because oftentimes the data is observational, as opposed to being curated for the purposes of training a model. This is another difference between [[statistics-vs-ml]].

Transformers

  • key is multi-head self-attention
    • encoded representation of input: key-value pairs (\(K,V \in \mathbb{R}^{n}\))
      • corresponding to hidden states
    • previous output is compressed into query \(Q \in \mathbb{R}^{m}\).
    • output of the attention layer is a weighted sum of the values (\(V\)), where each weight reflects how compatible the query is with the corresponding key.
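The weighted-sum description above can be sketched as scaled dot-product attention. A minimal NumPy version, assuming the standard matrix shapes (queries \(m \times d_k\), keys \(n \times d_k\), values \(n \times d_v\)) rather than the vector shorthand in the notes; the function name is my own:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (m, d_k) queries; K: (n, d_k) keys; V: (n, d_v) values.
    Returns (m, d_v): each output row is a weighted sum of the rows of V.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (m, n): query-key compatibility
    # numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))  # m = 2 queries
K = rng.standard_normal((5, 4))  # n = 5 key-value pairs
V = rng.standard_normal((5, 3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 3)
```

Multi-head attention just runs several of these in parallel on learned linear projections of the input and concatenates the results.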

Todo

  • [[toread]]: Visualizing and Measuring the Geometry of BERT arXiv
  • random blog
  • pretty intuitive description of transformers on tumblr, via the LessWrong community

Troubling Trends in Machine Learning Scholarship

src: (Lipton and Steinhardt 2019) — Lipton, Zachary, and Jacob Steinhardt. 2019. “Troubling Trends in Machine Learning Scholarship.” Queue, February.

Spurious Theorems

Spurious theorems are common culprits, inserted into papers to lend authoritativeness to empirical results, even when the theorem’s conclusions do not actually support the main claims of the paper.

This is a perfect description of a lot of theorems in machine learning papers (and, for that matter, statistics papers too). As a recent example, Theorem 2 in (Arora, Cohen, and Hazan 2018 — Arora, Sanjeev, Nadav Cohen, and Elad Hazan. 2018. “On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.” In 35th International Conference on Machine Learning, ICML 2018, 372–89.) is borderline spurious.

Next Steps for Interpolation

Let’s record the kinds of experimental data that we are interested in for [[project-interpolation]].

  1. Replicating what’s been shown in the literature
  2. New ideas