202012171722

Robustness of Facial Embeddings

tags: [ CV , proj:face ]

Facial embeddings (or, more precisely, the methods that produce such embeddings) seem very susceptible to image artefacts and other distortions in the image data.1 This feels like a common refrain from industry folks: a lot of the algorithms touted by academics are run on very clean, systematic images, and once you move away from clinical datasets into real life, there's little guarantee your algorithms will recover anything useful. At least, that's my impression; I do wish I had some data/evidence at my fingertips for this. Even more troubling, since these models are usually trained on predominantly Caucasian datasets, the lack of coverage makes one question the fidelity of the model outputs even more.

Concretely, we have been working on a social network/computer vision problem, the gist of which is the following: we want to extract facial features from photographs of our cohort (rural villagers in Honduras) and test, for instance, whether friends look more similar to each other. As noted above, the quality of the photographs varies drastically: we have photographs of driver's licenses, partially lit faces, pictures of people in the background, and blurry photos. We did our best to salvage, normalize, or simply discard anything too poor in quality. On top of all that, we have no idea how the ethnicity of the individuals affects the model (the facial embeddings). All in all, it makes one very wary of using such models.

Update

Clarification on Face Networks

Firstly, face embeddings (or vectors) are very different from word embeddings (language/words are their own special domain), but they're also slightly different from generic vector representations of images. The embedding is still the penultimate layer of a network (e.g. VGG); the key difference is that you're trying to capture the notion of a face, which differs from image classification or object detection.2 You could almost frame it as image classification where the number of categories is huge (one per individual face), with only slight variation within each category and multiple samples per category; but that framing doesn't quite hold. Thus, the training process is different: in particular, one uses Siamese networks (feeding two different images of the same face through the same network), with a loss that minimizes the distance between congruous pairs of faces and maximizes the distance between incongruous ones. In a similar spirit to Word2Vec's SGNS (skip-gram with negative sampling), you can do better with a triplet formulation: for a congruous pair \((a,b)\) and an incongruous face \(c\), you push \(a\) closer to \(b\) than to \(c\). The key point is that the loss is different, and the hope is that the learned projections capture something fundamental about people's faces.
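The triplet idea above can be sketched in a few lines. This is a toy illustration, not any particular library's loss: the embeddings are made-up three-dimensional vectors standing in for the output of a real face network, and the margin value is an arbitrary choice.

```python
# Toy sketch of a triplet (margin) loss over face embeddings.
# The vectors below are hypothetical; real embeddings would come
# from the penultimate layer of a VGG-style network.

def l2_dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the anchor is closer to the positive (same face)
    than to the negative (different face) by at least `margin`."""
    return max(0.0, l2_dist(anchor, positive) - l2_dist(anchor, negative) + margin)

# Two images of the same face (a, b) and a different face (c):
a = [0.10, 0.90, 0.30]
b = [0.12, 0.88, 0.31]
c = [0.80, 0.10, 0.50]

print(triplet_loss(a, b, c))  # 0.0: a is already much closer to b than to c
print(triplet_loss(a, c, b))  # positive: treating c as the "same face" is penalized
```

Minimizing this loss over many such triplets is what pushes images of the same person together and different people apart in the embedding space.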

Goal

There are certain diagnostics one may use to check the output of such models: one involves looking at which pixels light up (saliency maps), but I think that's usually only applicable if you have some sort of classification problem.
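One could adapt the pixel-level idea to embeddings by asking how much the embedding moves when each pixel is perturbed, rather than how a class score changes. A rough finite-difference sketch, where `embed` is a hypothetical stand-in for a real embedding network:

```python
# Sketch of a pixel-sensitivity diagnostic for an embedding model.
# `embed` is a toy stand-in; in practice it would be a face network
# and the "image" would be a real pixel array.

def embed(image):
    # Toy "embedding" of a flat list of pixels: mean and dynamic range.
    return [sum(image) / len(image), max(image) - min(image)]

def l2_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def pixel_sensitivity(image, eps=1e-3):
    """Finite-difference sensitivity: how far the embedding moves,
    per unit of perturbation, when each pixel is nudged by eps."""
    base = embed(image)
    sens = []
    for i in range(len(image)):
        perturbed = list(image)
        perturbed[i] += eps
        sens.append(l2_dist(embed(perturbed), base) / eps)
    return sens

img = [0.2, 0.9, 0.4, 0.1]
print(pixel_sensitivity(img))  # extreme pixels move this toy embedding the most
```

With a real network one would use gradients rather than finite differences, but the question asked is the same: which parts of the image is the embedding actually leaning on.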

But this is sort of the fundamental goal: what I would like is for the embedding model to come equipped with a confidence band, telling me how confident I should be that a given embedding is actually useful. Clearly, the way I've set this up makes one want to port some notion of statistical significance to the outputs of neural networks.
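One crude proxy for such a confidence band, under the assumption that a trustworthy embedding should at least be stable: perturb the input slightly many times, re-embed, and report the spread of the resulting embeddings. Everything here is hypothetical (`embed` again stands in for a real network), and this measures only robustness to noise, not correctness.

```python
# Crude "confidence" sketch: the spread of embeddings under small
# input perturbations. A small spread suggests the embedding is at
# least stable; a large spread is a warning sign.

import random

def embed(image):
    # Toy stand-in embedding: mean and dynamic range of the pixels.
    return [sum(image) / len(image), max(image) - min(image)]

def embedding_spread(image, n=100, noise=0.01, seed=0):
    """Sum of per-dimension standard deviations of the embedding
    across n noisy re-embeddings of the same image."""
    rng = random.Random(seed)
    embs = []
    for _ in range(n):
        noisy = [p + rng.gauss(0, noise) for p in image]
        embs.append(embed(noisy))
    spread = 0.0
    for dim in zip(*embs):
        m = sum(dim) / len(dim)
        spread += (sum((x - m) ** 2 for x in dim) / len(dim)) ** 0.5
    return spread

img = [0.2, 0.9, 0.4, 0.1]
print(embedding_spread(img))  # small spread: more trust in this embedding
```

This is obviously not a significance test, but it is the shape of thing I have in mind: a number attached to each embedding that says how much to trust it.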