#CV
Neural Code for Faces
src: Chang, Le, and Doris Y. Tsao. 2017. “The Code for Facial Identity in the Primate Brain.” Cell 169 (6): 1013–1028.e14; plus the Nature news feature.
- primates have special(ized) face cells for disambiguation
- these cells are found in the inferotemporal (IT) cortex
- two patches are considered: middle lateral/middle fundus (ML/MF), anterior medial (AM)
- previous work showed a hierarchical relationship (AM being downstream, possibly the “final output stage of IT face processing”)
- and \(\exists\) sparse set in AM which seemed to fire for specific individuals (independent of head orientation), suggesting it was capturing something very high-level
- question: what are the functions of these cells? one idea is that each cell encodes a particular individual (“Jennifer Aniston” cells, after a study in epileptic patients found a cell that would fire only when presented with JA, whether her face or just her name)
- that does not scale, obviously (I suspect they just didn’t have a large enough sample. actually, that’s the whole point of this article: just because a cell only fires for one thing in your stimulus set doesn’t necessarily mean that thing is unique. one caveat is that the other stimuli that would trigger activation might not be natural faces, so practically it is unique. it’s also important to distinguish visual/face processing, which is much more low-level, from, say, high-level object recognition)
- answer: construct the following face space:
- decompose a face into two broad categories of features
- shape (S): i.e. the geometry of the facial landmarks
- appearance (A): the rest (i.e. independent of shape)
- this is accomplished by morphing all faces to the same average face-shape
- from this decomposition, get 200 S and 200 A features, and perform PCA to keep the top 25 of each; giving, finally, a 50-d face space!
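The construction above can be sketched in miniature (hypothetical toy data: a 3-d feature space instead of 200, and a single principal component found by power iteration instead of the top 25):

```python
import random

# Toy sketch of the face-space construction: center the feature matrix,
# form the covariance, and extract the dominant principal axis by power
# iteration. The real pipeline does this for 200 shape and 200 appearance
# features, keeping 25 components of each.

def top_pc(data, iters=200):
    """First principal component of `data` (list of feature rows)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # sample covariance matrix
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

random.seed(0)
# toy "shape" features with most variance on the first coordinate
shape = [[random.gauss(0, 3), random.gauss(0, 1), random.gauss(0, 1)]
         for _ in range(200)]
pc = top_pc(shape)
print(abs(pc[0]))  # close to 1: the high-variance axis dominates
```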
- result: single neurons are axis-coding (i.e. are projections in the face space)
- to test this, determine the preferred vector for a cell, and check that faces lying orthogonal to that vector elicit the same firing
- and crucially, they are able to determine the null space for those cells that were previously thought to be individual-specific

- similarly, the axis model can predict responses to entirely new faces, which they verify
- result: ML/MF cells fired more for shape, while AM fired more for appearance
- result: ~200 face cells suffice to decode (and encode) human faces
- implications:
- axis-coding, i.e. linear projections, are efficient, robust and flexible (see remark below).
- caveats:
- all faces are neutral (i.e. don’t include expressions/emotions): but that seems fair, since we’re doing facial recognition, not emotion detection
- potentially missing axes, given the training data
Thoughts:
- a point made in the paper is that this face space is rather constrained, and every point in this space is a valid/realistic face. however, that also suggests that it might be a little bit too restrictive
- they’ve appreciated that other axes could be missing
- but my worry is that, with results like “they can encode/decode faces,” the strong caveat is: for faces falling into this particular space
- granted, it is clearly a large space, but I suspect there are more axes available, and by projecting you may be missing out on other subspaces
- i.e. this is not necessarily the full picture
- a similar point: the features they’ve come up with do feel closely related to the function of the cells, but I’m curious whether a completely different encoding/feature-set would produce the same empirical results
- for instance, as they point out in the section of reproducing these results with a CNN, one doesn’t necessarily have to morph the face to get the appearance features (as it seems biologically implausible for our brains to be morphing faces)
- and so they say you can probably get the same features by extracting information around the eyes
- but then why not just model what we think the biology is actually doing?
- a natural follow-up question is, what is the subspace spanned by these vectors: do they complete the space?
- one would hope that the vectors themselves are orthogonal (or nearly so), though perhaps there are redundancies built into the system (and perhaps the location of the cells might show that)
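A toy check of the orthogonality question (hypothetical axis vectors): pairwise cosine similarity near 0 means near-orthogonal axes; near ±1 means redundant cells.

```python
# Measure how close a set of cells' preferred axes are to orthogonal by
# computing all pairwise cosine similarities.

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv)

axes = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],   # nearly redundant with the first cell
]
for i in range(len(axes)):
    for j in range(i + 1, len(axes)):
        print(i, j, round(cosine(axes[i], axes[j]), 3))
```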
Backlinks
- [[loftus-and-memory]]
- Reading about the malleability of memory, and relating this to the recent ideas from #neuroscience on how our cells work through linear algebra (see [[rotation-dynamics-in-neurons]] and [[neural-code-for-faces]]), I wonder if there’s a similar way of coding the curious properties of memory as artefacts of linear algebra.
Robustness of Facial Embeddings
Facial embeddings (or, more precisely, the methods that produce such embeddings) seem like they would be very susceptible to problems with image artefacts and other distortions in the image data. (This feels like a common refrain from industry folks: a lot of the algorithms touted by academics are run on very clean, systematic images, and once you move away from clinical datasets and venture into real life, there’s little guarantee that your algorithms are going to recover anything useful. At least, that’s my impression; I do wish I had some data/evidence at my fingertips for this.) Even more troubling, since they’re usually trained on Caucasian datasets, this lack of coverage makes one question the fidelity of the model outputs even more.
Concretely, we have been working on a social network/computer vision problem, the gist of which is the following: we want to extract facial features from the photographs we have of our cohort (rural villagers in Honduras), and test, for instance, whether friends look more similar. As referenced above, the quality of the photographs varies drastically: we have photographs of driver’s licenses, partially lit faces, people in the background, blurry photos. Of course, we did our best to salvage, normalize, or just discard anything that was too poor in quality. On top of all that, we have no idea how the ethnicity of the individuals affects the model (facial embeddings). All in all, it makes one very wary of using such models.
- In theory, all the above worries might actually not be realized, and the models are actually incredibly robust. It would be nice if there was a way in which we could interpret these vectors in such a way that we could see if it is working as intended.
- For instance, maybe these embedding models should flag their output if the input is too far away from the typical input.
- At the same time, though, we want to have models that are able to extrapolate well to unseen datasets.
- The key here is being able to detect when the new datasets are reasonable, and lie in the space of possible faces, as opposed to just weird problems with lighting.
- I don’t know what the state of the art is right now in terms of extrapolating facial recognition models to other racial groups (and the various fairness problems there), but I suspect there’s probably something to be gleaned from that line of research.
- At some level, even the simple questions remain open: how robust are these models really, and if they aren’t, how do we make them more robust to image artefacts?
- This is definitely a problem that Apple seems to be working on (or any Big Tech company, for that matter). I’d hazard a guess that they probably have their own internal pipelines that are somewhat proprietary, but nevertheless it might be interesting to see what we might be able to come up with.
- Some simple things: if we care about facial features, then there are certain physical manifestations that one can extract (and we do), like pupil distance, facial landmarks.
- If the embedding is capturing anything, it really should include these simple statistics.
- It would be nice to check if that is actually the case. Or, do these models do better by actually manually including these statistics (highly doubt it)?
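The flagging idea above could be sketched as a simple distance-to-typical check (all numbers and the threshold rule are hypothetical): mark an input as low-confidence when its embedding lies much farther from the centroid of well-behaved embeddings than those embeddings themselves do.

```python
# Crude out-of-distribution flag for embeddings: distance to the centroid
# of "typical" embeddings, thresholded at a multiple of the largest
# in-distribution distance.

def centroid(vectors):
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

typical = [[0.0, 1.0], [0.2, 0.9], [-0.1, 1.1], [0.1, 1.0]]
c = centroid(typical)
threshold = 3 * max(dist(v, c) for v in typical)

def flag(embedding):
    return dist(embedding, c) > threshold   # True = don't trust the output

print(flag([0.1, 0.95]))   # in-distribution: False
print(flag([10.0, -5.0]))  # far away: True
```

A Mahalanobis distance (covariance-aware) would be the natural refinement, but the centroid version shows the shape of the idea.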
Update
Clarification on Face Networks
Firstly, face embeddings (or vectors) are very different from word embeddings (language/words are their own special domain), but they’re also slightly different from vector representations of images. The embedding is still the penultimate layer of a network (e.g. VGG); the key difference is that you’re trying to capture the notion of a face, as opposed to generic image classification or object detection. (You could say it’s image classification where the number of groups/categories is huge (individual faces), with only slight variation within a group and multiple samples of the same group. So, not really.) Thus, the training process is going to be different: in particular, they use siamese networks (feeding two different images of the same face into the same network), with a loss that minimizes the distance between congruous pairs of faces and maximizes the distance between incongruous ones. Actually, in a similar spirit to Word2Vec’s SGNS (skip-gram, negative sampling), you can do better: for a congruous pair \((a,b)\) and an incongruous face \(c\), make \(a\) closer to \(b\) than to \(c\). The key point is that the loss is different, and the hope is that the learned projections capture something fundamental about people’s faces.
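The triplet objective just described can be sketched on toy embeddings (hypothetical 2-d vectors standing in for network outputs):

```python
# Triplet hinge loss: pull anchor face `a` closer to a congruous image `b`
# of the same person than to a different person `c`, by at least `margin`.
# The loss is zero once the margin is satisfied.

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def triplet_loss(a, b, c, margin=1.0):
    return max(0.0, sq_dist(a, b) - sq_dist(a, c) + margin)

anchor = [0.0, 0.0]
same   = [0.1, 0.0]   # another photo of the same face
other  = [3.0, 0.0]   # a different person
print(triplet_loss(anchor, same, other))   # 0.0: already well separated
print(triplet_loss(anchor, same, anchor))  # positive: margin violated
```

In training, this loss is applied to the network outputs and backpropagated, so the embedding space itself is shaped by the congruous/incongruous contrast.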
Goal
There are certain diagnostics one may use to check the output of the models: one of them involves looking at which pixels light up, but I think that’s usually only applicable if you have some sort of classification problem.
But I think this is the fundamental goal: I would like the embedding model to come equipped with a confidence band, telling me how confident I should be that a given embedding is actually useful. Clearly, the way I’ve set this up suggests relating it to porting a notion of statistical significance to the outputs of neural networks.
Honduras Face Project
Given a signed social network, various demographics, and faces, what interesting questions can be answered?
Existing Literature
- The Faces of Group Members Share Physical Resemblance (doi):
- similar-looking people are more likely to be friends
- not particularly surprising. the main question is one of causal direction.
- Multidimensional Homophily in Friendship Networks (link):
- this paper doesn’t actually seem that interesting. I’m not sure what linear model they’re using, but ignoring that, what they show is that the coefficients for two variables (same sex, same ethnicity) are positive while the coefficient on the interaction of these two variables is negative.
- but (I’m pretty sure) all that tells you is that it’s not an additive relationship, so you get diminishing returns for homophily.
- this seems very plausible (they do mention this in the discussion). there’s redundancies involved.
- Attractiveness and Symmetry: much existing work showing that people rate averageness as more attractive
- i.e. deviation from the norm is penalized (except when dealing with high-fashion models, where singularity is prized; this was mentioned in a recent episode of Tyler’s podcast)
Backlinks
- [[master-paper-list]]
- [[honduras-face-project]]
- [[honduras-face-project]]
- [[pediatric-transfer-learning]]
- I guess this isn’t particular to pediatric research; it applies anytime you have a more common/majority group of individuals and want to study an under-represented group (or, worse, assume the models trained on the majority group are somehow universal, when in fact they fail completely on a minority: a problem I came across in the [[honduras-face-project]]).