202012171722

Robustness of Facial Embeddings

tags: [ CV , proj:face ]

Facial embeddings (or, more precisely, the methods that produce such embeddings) seem very susceptible to image artefacts and other distortions in the image data.1 This feels like a common refrain from industry folks: a lot of the algorithms touted by academics are run on very clean, systematic images, and once you move away from clinical datasets into real life, there's little guarantee your algorithms will recover anything useful. At least, that's my impression; I do wish I had some data/evidence at my fingertips for this. Even more troubling, since these models are usually trained on predominantly Caucasian datasets, the lack of coverage makes one question the fidelity of the model outputs even more.

Concretely, we have been working on a social network/computer vision problem, the gist of which is the following: we want to extract facial features from photographs of our cohort (rural villagers in Honduras) and test, for instance, whether friends look more similar to each other. As noted above, the quality of the photographs varies drastically: we have photographs of driver's licenses, partially lit faces, pictures of people in the background, and blurry photos. We did our best to salvage, normalize, or simply discard anything too poor in quality. On top of all that, we have no idea how the ethnicity of the individuals affects the model (the facial embeddings). All in all, it makes one very wary of using such models.

Update

Clarification on Face Networks

Firstly, face embeddings (or vectors) are very different from word embeddings (language/words are their own special domain), but they're also slightly different from generic vector representations of images. The embedding is still the penultimate layer of a network (e.g. VGG); the key difference is that you're trying to capture the notion of a face, which differs from image classification or object detection.2 You could almost frame it as image classification where the number of categories is huge (one per individual face), with only slight variation within each category and multiple samples per category; but that framing doesn't quite hold. Thus, the training process is different: in particular, one uses Siamese networks (feeding two different images of the same face through the same network), with a loss that minimizes the distance between congruous pairs of faces and maximizes the distance between incongruous ones. In a similar spirit to Word2Vec's SGNS (skip-gram with negative sampling), you can do better with a triplet formulation: for a congruous pair \((a,b)\) and an incongruous face \(c\), you push \(a\) closer to \(b\) than to \(c\). The key point is that the loss is different, and the hope is that the learned projections capture something fundamental about people's faces.
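The triplet idea above can be sketched in a few lines. This is a toy illustration, not any particular library's loss: the embeddings are made-up three-dimensional vectors standing in for the output of a real face network, and the margin value is an arbitrary choice.

```python
# Toy sketch of a triplet (margin) loss over face embeddings.
# The vectors below are hypothetical; real embeddings would come
# from the penultimate layer of a VGG-style network.

def l2_dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the anchor is closer to the positive (same face)
    than to the negative (different face) by at least `margin`."""
    return max(0.0, l2_dist(anchor, positive) - l2_dist(anchor, negative) + margin)

# Two images of the same face (a, b) and a different face (c):
a = [0.10, 0.90, 0.30]
b = [0.12, 0.88, 0.31]
c = [0.80, 0.10, 0.50]

print(triplet_loss(a, b, c))  # 0.0: a is already much closer to b than to c
print(triplet_loss(a, c, b))  # positive: treating c as the "same face" is penalized
```

Minimizing this loss over many such triplets is what pushes images of the same person together and different people apart in the embedding space.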

Goal

There are certain diagnostics one may use to check the output of such models: one involves looking at which pixels light up (saliency maps), but I think that's usually only applicable if you have some sort of classification problem.
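One could adapt the pixel-level idea to embeddings by asking how much the embedding moves when each pixel is perturbed, rather than how a class score changes. A rough finite-difference sketch, where `embed` is a hypothetical stand-in for a real embedding network:

```python
# Sketch of a pixel-sensitivity diagnostic for an embedding model.
# `embed` is a toy stand-in; in practice it would be a face network
# and the "image" would be a real pixel array.

def embed(image):
    # Toy "embedding" of a flat list of pixels: mean and dynamic range.
    return [sum(image) / len(image), max(image) - min(image)]

def l2_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def pixel_sensitivity(image, eps=1e-3):
    """Finite-difference sensitivity: how far the embedding moves,
    per unit of perturbation, when each pixel is nudged by eps."""
    base = embed(image)
    sens = []
    for i in range(len(image)):
        perturbed = list(image)
        perturbed[i] += eps
        sens.append(l2_dist(embed(perturbed), base) / eps)
    return sens

img = [0.2, 0.9, 0.4, 0.1]
print(pixel_sensitivity(img))  # extreme pixels move this toy embedding the most
```

With a real network one would use gradients rather than finite differences, but the question asked is the same: which parts of the image is the embedding actually leaning on.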

But this is sort of the fundamental goal: what I would like is for the embedding model to come equipped with a confidence band, telling me how confident I should be that a given embedding is actually useful. Clearly, the way I've set this up makes one want to port some notion of statistical significance to the outputs of neural networks.
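One crude proxy for such a confidence band, under the assumption that a trustworthy embedding should at least be stable: perturb the input slightly many times, re-embed, and report the spread of the resulting embeddings. Everything here is hypothetical (`embed` again stands in for a real network), and this measures only robustness to noise, not correctness.

```python
# Crude "confidence" sketch: the spread of embeddings under small
# input perturbations. A small spread suggests the embedding is at
# least stable; a large spread is a warning sign.

import random

def embed(image):
    # Toy stand-in embedding: mean and dynamic range of the pixels.
    return [sum(image) / len(image), max(image) - min(image)]

def embedding_spread(image, n=100, noise=0.01, seed=0):
    """Sum of per-dimension standard deviations of the embedding
    across n noisy re-embeddings of the same image."""
    rng = random.Random(seed)
    embs = []
    for _ in range(n):
        noisy = [p + rng.gauss(0, noise) for p in image]
        embs.append(embed(noisy))
    spread = 0.0
    for dim in zip(*embs):
        m = sum(dim) / len(dim)
        spread += (sum((x - m) ** 2 for x in dim) / len(dim)) ** 0.5
    return spread

img = [0.2, 0.9, 0.4, 0.1]
print(embedding_spread(img))  # small spread: more trust in this embedding
```

This is obviously not a significance test, but it is the shape of thing I have in mind: a number attached to each embedding that says how much to trust it.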