Framework for Fairness
tags: [ proj:fairness ]
The general framework (population level) proceeds as follows:
- \(X\): this is the population (society), and takes the form of covariates. Most of the time, when dealing with issues of fairness, we’ll single out one particular covariate, which we denote \(G\), demarcating the protected variable(s) in question.
- formally, for simplicity, think of \(X\) as a random variable (say, a high-dimensional Gaussian) and \(G\) as a Bernoulli random variable.
- \(Z = F(X,G)\): this is the Machine Learning model output (and the only place where the model enters the picture).
- This can take on various forms (sidenote: actually, this is for the sample version! In the population, selection would correspond to a probability):
- Binary: for selection (or top-\(k\))
- Ordinal: for ranking
- Continuous: for assigning a score (sidenote: oftentimes this is an intermediate step; the score is then either thresholded to produce a binary outcome or sorted to produce a ranking)
- this is essentially a function of the two random variables. You can think of it as a function that maps covariates and group to an outcome (so, in the population version, a probability of selection, say).
- \(Y\): (optional) this can be thought of as the ground-truth of a particular individual’s outcome.
- \(D\): this is the feedback (dynamic) mechanism that relates the current epoch to the next one.
- key properties: this must be inferred
- \(\phi(X)\): this is the main metric to be tracked.
- note that it is a function of \(X\) (our population), and that it is independent of the choice of ML model.
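The pieces above can be made concrete with a minimal sample-level sketch. Everything here is an illustrative placeholder: the Gaussian/Bernoulli choices follow the simplification above, but the particular model `F` and metric `phi` are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample-level stand-in for the population: X ~ Gaussian, G ~ Bernoulli.
n, d = 10_000, 5
X = rng.normal(size=(n, d))       # covariates
G = rng.binomial(1, 0.3, size=n)  # protected variable

def F(X, G):
    """Placeholder ML model: score the covariates, then threshold to a
    binary selection Z (the sample-level analogue of a selection
    probability in the population version)."""
    score = X.mean(axis=1)
    return (score > 0.2).astype(int)

Z = F(X, G)

def phi(X, G):
    """Placeholder tracked metric phi(X): gap in the mean of the first
    covariate across groups. Note it depends only on the population,
    not on the choice of model F."""
    return abs(X[G == 1, 0].mean() - X[G == 0, 0].mean())
```

\(Y\) and the feedback map \(D\) are omitted here; \(D\) would take \((X, G, Z)\) in one epoch and produce the next epoch's population.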
Goal: write out examples as part of our framework. This helps us see how our framework differs from others, or where we might need to change the current version.
Mouzannar et al. 2018
Let’s consider Mouzannar, Ohannessian, and Srebro (2018) (sidenote: Mouzannar, Hussein, Mesrob I. Ohannessian, and Nathan Srebro. 2018. “From Fair Decision Making to Social Equality.” arXiv.org, December. http://arxiv.org/abs/1812.02952v1), as they probably get the closest to what our framework is trying to achieve, though with very important differences.
- \(X\) is given by a score \(\theta\) (e.g. [GPA, SAT]) and \(G\in\left\{ A,B \right\}\) (sidenote: with \(X,G\) lying in the same sample space).
- The first thing (that differs from our framework) is that they assume that all the information is encapsulated in the covariates, and there is no ambiguity. Thus, they define \(\widetilde{F}: \theta\to V = \left\{ 0,1 \right\}\) to be a deterministic evaluation of qualification.
- example: if \(\theta\) is [GPA, SAT], then it says that this is sufficient to determine the eventuality of a student’s success, and that it is deterministic!
- it’s subtly different from the way we think about this. They don’t actually care about \(X\), and in fact, they never really deal with it, since everything below acts on the population through \(\widetilde{F}(X)\). What this means is that they can basically ignore \(X\).
- rather than looking at the map \(\widetilde{F}(X)\), we can just focus on the distribution of success per group: \(\pi(V \given G) \in [0,1]\) collapses everything down to a success probability.
- it’s really just \(P(\widetilde{F}(X) \given G)\).
- \(Z\) is a little bit complicated in this case.
- this next function is the ML “model”, which is given by \(\tau(V,G): \left\{ 0,1 \right\}\times\left\{ A,B \right\} \to [0,1]\), assigning individuals (completely determined by their covariates, remember) a probability of selection. All of this boils down to \(Z = F(X,G) = \tau(\widetilde{F}(X), G)\).
- the flexibility comes in through \(\tau\): choosing what probability to assign individuals.
- in some sense, basically \(Y=Z\) in this case, at least from a utility perspective, since there’s no ambiguity in terms of the success of an individual.
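A minimal sketch of their pipeline, with toy numbers throughout (the qualification threshold, the group split, and the particular \(\tau\) values are all made up for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# theta: a one-dimensional GPA-like score; G: group membership in {A, B}.
n = 1_000
theta = rng.uniform(0.0, 4.0, size=n)
G = rng.choice(["A", "B"], size=n)

def F_tilde(theta):
    """Deterministic qualification V in {0, 1}: success is fully
    determined by the covariates, with no ambiguity."""
    return (theta >= 2.0).astype(int)

V = F_tilde(theta)

# pi(V = 1 | G): the per-group success probabilities into which the
# whole population collapses.
pi = {g: V[G == g].mean() for g in ("A", "B")}

# tau(V, G): the institution's selection probability for each
# (qualification, group) pair -- the only free choice in the model.
tau = {(0, "A"): 0.1, (1, "A"): 0.9,
       (0, "B"): 0.1, (1, "B"): 0.7}

# Z = F(X, G) = tau(F_tilde(X), G): each individual's selection probability.
Z = np.array([tau[(v, g)] for v, g in zip(V, G)])
```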
- \(\phi\): with their deterministic success function, they are able to basically measure success directly (sidenote: originally, I thought their measure was actually based on the ML model (or, in their case, the institution that performs the selection), but it turns out that’s not the case).
- This metric, \(\pi(\,\cdot \given G)\), is essentially what they track. And, as I said earlier, that’s the whole population encapsulated/reduced into these four probabilities (two values of \(V\) times two groups).
- \(D\): the metric allows us to write down the feedback mechanism
- \(\pi^{next}(1) = \pi(1) f_1(\beta(0), \beta(1)) + \pi(0) f_0(\beta(0), \beta(1))\), where the \(\beta\)’s are the selection rates (convolving success rates with institution selection) (sidenote: here is where I first got confused, because this doesn’t say anything about how the \(X\) or \(\theta\)’s change. But they don’t care about the nitty-gritty of that: the population can change in complicated ways, yet all of the changes are expressed through the one crucial quantity, the probability of success).
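A sketch of these per-group dynamics, under hypothetical monotone choices of \(f_0, f_1\) (the concrete functional forms and the \(\tau\) values below are invented for illustration; the paper only constrains the qualitative behavior of the \(f\)'s):

```python
def step(pi1, tau0, tau1, f0, f1):
    """One epoch of the per-group dynamics.

    pi1:        current success probability pi(1) for the group.
    tau0, tau1: selection probabilities for unqualified / qualified
                members of that group.
    """
    # beta(v): selection rates, i.e. success rates convolved with the
    # institution's selection policy.
    beta0 = (1 - pi1) * tau0
    beta1 = pi1 * tau1
    # pi^next(1) = pi(1) f_1(beta(0), beta(1)) + pi(0) f_0(beta(0), beta(1))
    return pi1 * f1(beta0, beta1) + (1 - pi1) * f0(beta0, beta1)

# Hypothetical f's: selection encourages qualification in both strata.
f1 = lambda b0, b1: min(1.0, 0.6 + b1)  # qualified tend to stay qualified
f0 = lambda b0, b1: min(1.0, 2.0 * b0)  # selected unqualified may qualify

pi1 = 0.2
for _ in range(50):
    pi1 = step(pi1, tau0=0.1, tau1=0.9, f0=f0, f1=f1)
```

Note that neither \(X\) nor \(\theta\) appears anywhere in the update: the dynamics never say how the covariates change, only how the success probability does.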
Additional things that they consider:
- (average) institutional utility, given by \(U(\tau)\).
Question: what are the pros/cons of this formulation of the social-aware machine learning problem?
Table of Examples
| Example | Z |
|---|---|
| Hiring | P(success) |
| Recidivism | P(reoffend) |
| Credit | P(default) |
| Ads | relevance/stickiness score -> ranking |
| CS Majors | rank in class or P(graduate) |
Classical / Standard Machine Learning
Pre-fairness, the standard framework for ML is essentially that of a prediction problem.
Online Learning and Bandit Theory are alternative frameworks that apply when data arrives sequentially, as a stream.