Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients’ symptoms in 2 days’ time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast moving situation, and data is coming in dynamically — and you need to make decisions as you go. What can you do?

The first thing you might do is ask on twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with *marginal prediction intervals*: i.e. intervals $(\ell(x), u(x))$ such that over the randomness of some data distribution $\mathcal{D}$ over labelled examples $(x, y)$: $\Pr_{(x,y) \sim \mathcal{D}}[y \in (\ell(x), u(x))] \geq 0.95$. These would be 95% marginal prediction intervals — but in general you could pick your favorite coverage probability $1 - \delta$.
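For readers who haven’t seen it, here is a minimal sketch of the standard split conformal recipe for turning a point predictor into marginal prediction intervals. The function name, the toy model, and the synthetic data are all illustrative assumptions, not anything from the post:

```python
import numpy as np

def split_conformal_intervals(predict, X_cal, y_cal, X_test, alpha=0.05):
    """Split conformal prediction: score absolute residuals on a held-out
    calibration set, then pad point predictions by the finite-sample
    corrected (1 - alpha) quantile of those scores."""
    residuals = np.abs(y_cal - predict(X_cal))
    n = len(residuals)
    # ceil((n + 1)(1 - alpha)) / n is the standard finite-sample correction.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    preds = predict(X_test)
    return preds - q, preds + q  # lower and upper interval endpoints

# Toy usage with an identity "model" on synthetic data:
rng = np.random.default_rng(0)
X_cal = rng.normal(size=1000)
y_cal = X_cal + rng.normal(size=1000)  # noise is standard normal
X_test = rng.normal(size=5)
lo, hi = split_conformal_intervals(lambda x: x, X_cal, y_cal, X_test)
```

Under the i.i.d. assumption, the resulting intervals cover the true label with probability at least 95%, marginally over the draw of a fresh test point — exactly the kind of guarantee the post goes on to critique.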

Conformal prediction has a lot going for it — its tools are very general and flexible, and lead to practical algorithms. But it also has two well known shortcomings:

**Strong Assumptions**. Like many tools from statistics and machine learning, conformal prediction methods require that the future look like the past. In particular, they require that the data be drawn i.i.d. from some distribution — or at least be *exchangeable* (i.e. their distribution should be invariant to permutation). This is sometimes the case — but it often is not. In our pandemic scenario, the distribution on patient features might quickly change in unexpected ways as the disease moves between different populations, as might the relationship between features and outcomes, as treatments advance. In other settings in which consequential decisions are being made about people — like lending and hiring decisions — people might intentionally manipulate their features in response to the predictive algorithms you deploy, in an attempt to get the outcome they want. Or you might be trying to predict outcomes in time series data, in which there are explicit dependencies across time. In all of these scenarios, exchangeability is violated.

**Weak Guarantees**. Marginal coverage guarantees are *averages over people*. 95% marginal coverage means that the true label falls within the predicted interval for 95% of people. It need not mean anything for *people like you*. For example, if you are part of a demographic group that makes up less than 5% of the population, it is entirely consistent with the guarantees of a 95% marginal prediction interval that labels for people from your demographic group fall outside of their intervals 100% of the time. This can be both an accuracy and a *fairness* concern — marginal prediction works well for “typical” members of a population, but not necessarily for everyone else.

What kinds of improvements might we hope for? Let’s start with how to strengthen the guarantee:

**Multivalidity** Ideally, we would want *conditional* guarantees — i.e. the promise that for every $x$, we would have $\Pr[y \in (\ell(x), u(x)) \mid x] \geq 0.95$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the “unrealized” (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don’t have to average over everybody. An important idea that has been investigated in several different contexts in recent years in the theory literature on fairness is that we might articulate a very rich collection $\mathcal{G}$ of (generally intersecting) demographic groups corresponding to relevant subsets of the data domain, and ask for things that we care about to hold true as averaged over any group in the collection. In the case of prediction intervals, this would correspond to asking that simultaneously for every demographic group $G \in \mathcal{G}$, $\Pr[y \in (\ell(x), u(x)) \mid x \in G] \geq 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve — at least for any such group that isn’t too small.
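To make the multivalidity target concrete, here is a hypothetical audit that measures empirical coverage not just marginally but separately over each group in an intersecting collection. The group definitions, data, and 95% target are all made up for illustration:

```python
import numpy as np

def group_coverage(y, lo, hi, groups):
    """Empirical coverage of intervals [lo, hi], overall and restricted to
    each (possibly overlapping) group; `groups` maps a group name to a
    boolean membership mask over the sequence of examples."""
    covered = (lo <= y) & (y <= hi)
    report = {"marginal": covered.mean()}
    for name, mask in groups.items():
        if mask.sum() > 0:  # skip empty groups
            report[name] = covered[mask].mean()
    return report

# Illustrative data showing the failure mode described above: marginal
# coverage looks fine while one small group is never covered.
y = np.array([0.5] * 95 + [10.0] * 5)
lo, hi = np.zeros(100), np.ones(100)
groups = {"A": np.arange(100) < 50,    # a large group, fully covered
          "B": np.arange(100) >= 95}   # a 5% group, never covered
report = group_coverage(y, lo, hi, groups)
# report["marginal"] is 0.95, yet report["B"] is 0.0
```

Multivalidity asks that every sufficiently large entry of such a report be close to the target, not just the marginal one.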

And what kinds of assumptions do we need?

**Adversarial Data** Actually, it’s not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences — i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities of $1 - \delta$ after $T$ rounds *on the underlying distribution*, as conformal prediction is able to, we offer *empirical* coverage rates of $1 - \delta - o(1)$ (since for us there is no underlying distribution). This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.

**More Generally** Our techniques are not specific to prediction intervals. We can do the same thing for many other distributional quantities. We work this out in the case of predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al, in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.

**Techniques** At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 — and is closely related to an earlier, beautiful argument by Sergiu Hart — but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard *told us* their “true” distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution — and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That’s not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.

But let’s put our game theory hats on. Suppose we’ve been making predictions for a while. We can write down some measure of our error so far — say the maximum, over all demographic groups in our collection $\mathcal{G}$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero sum game, in which we (the learner) want to minimize the *increase* in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first — this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax theorem tells us that (at least in principle — it doesn’t give us the algorithm), there must therefore also be a learning algorithm that can do just as well, but against an adversary.
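The “wizard” benchmark in the argument above is easy to code up: if the true label distribution were handed to us before each prediction, a valid 95% interval is just a pair of quantiles read off its CDF. The discrete severity distribution below is a made-up illustration:

```python
import numpy as np

def interval_from_distribution(values, probs, alpha=0.05):
    """Given a known discrete label distribution, read a (1 - alpha)
    coverage interval directly off its CDF, as the wizard could.
    Places at most alpha/2 probability mass below and above the interval."""
    order = np.argsort(values)
    v, p = np.asarray(values)[order], np.asarray(probs)[order]
    cdf = np.cumsum(p)
    lo = v[np.searchsorted(cdf, alpha / 2)]       # smallest v with CDF >= alpha/2
    hi = v[np.searchsorted(cdf, 1 - alpha / 2)]   # smallest v with CDF >= 1 - alpha/2
    return lo, hi

# Hypothetical severity scores with a known distribution:
values = [0, 1, 2, 3, 4]
probs = [0.02, 0.30, 0.40, 0.26, 0.02]
lo, hi = interval_from_distribution(values, probs)
# here (lo, hi) = (1, 3), which covers 0.96 of the probability mass
```

The whole difficulty, as the paragraph above explains, is that no wizard supplies this distribution — the minimax argument is what transfers the wizard’s easy win back to the learner facing an adversary.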

The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That’s what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm — but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.

Finally, I had more fun writing this paper — learning about old techniques from the game theoretic calibration literature — than I’ve had in a while. I hope a few people enjoy reading it!

If you want to join our seminar, please email toc4fairness-director@cs.stanford.edu and we will add you to our email list (which we will be careful not to overuse).

**Good luck on your EC and FORC submissions!** **The seminar will resume on February 17th.**

**Date: **Wednesday, February 3rd, 2021

9:00 am – 10:00 am Pacific Time

12:00 pm – 1:00 pm Eastern Time

**Location: **Weekly Seminar, Zoom

We show how to achieve the notion of “multicalibration” from Hébert-Johnson et al. [2018] not just for means, but also for variances and other higher moments. Informally, it means that we can find regression functions which, given a data point, can make point predictions not just for the expectation of its label, but for higher moments of its label distribution as well — and those predictions match the true distribution quantities when averaged not just over the population as a whole, but also when averaged over an enormous number of finely defined subgroups. It yields a principled way to estimate the uncertainty of predictions on many different subgroups — and to diagnose potential sources of unfairness in the predictive power of features across subgroups. As an application, we show that our moment estimates can be used to derive marginal prediction intervals that are simultaneously valid as averaged over all of the (sufficiently large) subgroups for which moment multicalibration has been obtained.

This talk is based on a paper that is joint work with Changhwa Lee, Mallesh M. Pai, Aaron Roth, and Rakesh Vohra.

Christopher Jung is a 4th year PhD student in the department of Computer and Information Sciences at the University of Pennsylvania, where he is fortunate to be advised by Aaron Roth and Michael Kearns. He is generally interested in algorithmic fairness, learning theory, privacy, and algorithmic game theory.

**Date: **Wednesday, January 27th, 2021

9:00 am – 10:00 am Pacific Time

12:00 pm – 1:00 pm Eastern Time

**Location: **Weekly Seminar, Zoom

We present a general, efficient technique for providing contextual predictions that are “multivalid” in various senses, against an online sequence of adversarially chosen examples (x,y). This means that the resulting estimates correctly predict various statistics of the labels y not just marginally — as averaged over the sequence of examples — but also conditionally on x \in G for any G belonging to an arbitrary intersecting collection of groups.

We provide three instantiations of this framework. The first is mean prediction, which corresponds to an online algorithm satisfying the notion of multicalibration from Hebert-Johnson et al. The second is variance and higher moment predictions, which corresponds to an online algorithm satisfying the notion of mean-conditioned moment multicalibration from Jung et al. Finally, we define a new notion of prediction interval multivalidity, and give an algorithm for finding prediction intervals which satisfy it. Because our algorithms handle adversarially chosen examples, they can equally well be used to predict statistics of the residuals of arbitrary point prediction methods, giving rise to very general techniques for quantifying the uncertainty of predictions of black box algorithms, even in an online adversarial setting. When instantiated for prediction intervals, this solves a similar problem as conformal prediction, but in an adversarial environment and with multivalidity guarantees stronger than simple marginal coverage guarantees.

This talk is based on a paper that is joint work with Varun Gupta, Christopher Jung, Georgy Noarov, and Mallesh Pai.

Aaron Roth is a professor of Computer and Information Sciences at the University of Pennsylvania, affiliated with the Warren Center for Network and Data Science, and co-director of the Networked and Social Systems Engineering (NETS) program. He is also an Amazon Scholar at Amazon AWS. He is the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE) awarded by President Obama in 2016, an Alfred P. Sloan Research Fellowship, an NSF CAREER award, and research awards from Yahoo, Amazon, and Google. His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory, and machine learning. Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.” Together with Michael Kearns, he is the author of “The Ethical Algorithm”.

**Date: **Wednesday, January 20th, 2021

9:00 am – 10:00 am Pacific Time

12:00 pm – 1:00 pm Eastern Time

**Location: **Weekly Seminar, Zoom

Much of the recent literature on algorithmic fairness in computer science and applied statistics has focused on optimizing the *quality *of decision outcomes reached by implementing algorithmic decision-making models. The guiding question is: does the model adhere to a number of plausible fairness metrics, mathematically defined, and does this enable the model to reach decision outcomes that are qualitatively better than decisions reached by a human decision-maker, or a competing algorithmic model?

Given available evidence that algorithmic decision-making in many different domains of deployment leads to outcomes that reflect and amplify social inequalities, such as structures of racial and gender inequality, these are the right questions to ask about algorithmic decision-making models—but they are not the only ones, and often not the most important ones. If what we care about is *fairness*, we have to move beyond an approach that focuses exclusively on the decision quality of algorithmic models. In addition to evaluating decision quality for each algorithmic model, we ought to critically scrutinize the *decision landscape*. Doing so requires investigating not only which alternative decision outcomes are available, but also which alternative decision problems we could, and should, be solving with the help of algorithmic models. This is an underexplored approach to algorithmic fairness, as it requires thinking beyond the internal optimization of a given model, and instead taking into account interactions between models and the model-external social world.

After briefly sketching the state of the contemporary debate on decision quality optimization with respect to algorithmic fairness, I develop three arguments for *why *scrutinizing available decision landscapes matters in our pursuit of algorithmic fairness: first, the *Benchmarking Argument*; second, the *Aliefs Argument*; and third, the *Quality-Independence Argument*. There are two important upshots. Neither one has received sufficient explicit attention, either in the philosophical literature or the literature in computer science. First, assessing the quality of algorithmic decision outcomes is insufficient for assessing algorithmic fairness. The direction of future research on algorithmic fairness must be responsive to this problem. Second, considering decision landscapes in conjunction with decision quality has implications for the question of whether we ought to deploy a given algorithmic tool in a given domain at all.

Dr Annette Zimmermann is a Lecturer (Assistant Professor) in Philosophy at the University of York, and a Technology & Human Rights Fellow at Harvard University. Dr Zimmermann’s current research focuses on the political and moral philosophy of AI and machine learning.

Before that, Dr Zimmermann was a postdoctoral fellow at Princeton University (2018-2020), with a joint appointment at the Center for Human Values and the Center for Information Technology Policy. Prior to that, they were awarded a DPhil from Nuffield College at the University of Oxford, for work focusing on contemporary analytic political and moral philosophy—in particular, democratic decision-making, justice, and risk.

Dr Zimmermann’s recent research visitor positions include Yale University (2016), the Australian National University (2019) and Stanford University (2020). They have advised policy-makers on AI ethics issues at UNESCO, the Australian Human Rights Commission, the UK Centre for Data Ethics and Innovation, and the OECD. In recognition of their research, Dr Zimmermann has received the 2020 David Roscoe Early Career Award in Science, Ethics, and Society by the Hastings Center, and they have been named on the 2021 “100 Brilliant Women in AI Ethics” List.
