The first 3 days will focus on theory, and the last two days will transition towards practice and will feature lectures by domain experts in machine learning, medicine and economics. Participants will need to apply here. The application closes on March 11, 2022. See the website and the application link for further details.

We are thrilled to announce that the registration for EAAMO ‘21 is now live! Please register for regular admission on Eventbrite by September 10, 2021.

**Conference registration** is $20 for ACM members, $15 for students, and $35 for non-ACM members. We also provide **financial assistance and data grants** in order to waive registration fees and provide data plans to facilitate virtual attendance. Please apply here before September 10, 2021.

A main goal of the conference is to bridge research and practice. Please nominate practitioners working with underserved and disadvantaged communities to join us at the conference (you can also nominate yourself if you are a practitioner). Invited practitioners will be included in facilitated discussions with researchers.

For more information, please see below or visit our website and contact us at gc@eaamo.org with any questions.

***

The inaugural **Conference on Equity and Access in Algorithms, Mechanisms, and Optimization** (EAAMO ‘21) will take place on October 5-9, 2021, virtually, on Zoom and Gather.town. EAAMO ‘21 will be sponsored by ACM SIGAI and SIGecom.

The goal of this event is to highlight work where techniques from algorithms, optimization, and mechanism design, along with insights from the social sciences and humanistic studies, can improve access to opportunity for historically underserved and disadvantaged communities.

The conference aims to foster a multi-disciplinary community, facilitating interactions between academia, industry, and the public and voluntary sectors. The program will feature keynote presentations from researchers and practitioners as well as contributed presentations in the research and policy & practice tracks.

We are excited to host a series of **keynote speakers** from a variety of fields: Solomon Assefa (IBM Research), Dirk Bergemann (Yale University), Ellora Derenoncourt (University of California, Berkeley), Ashish Goel (Stanford University), Mary Gray (Microsoft Research), Krishna Gummadi (Max Planck Institute for Software Systems), Avinatan Hassidim (Bar Ilan University), Radhika Khosla (University of Oxford), Sylvia Ortega Salazar (National College of Vocational and Professional Training), and Trooper Sanders (Benefits Data Trust).

*ACM EAAMO is part of the **Mechanism Design for Social Good** (MD4SG) initiative, and builds on the MD4SG technical **workshop series** and tutorials at conferences including ACM EC, ACM COMPASS, ACM FAccT, and WINE.*

The program will feature 24 papers, six exciting panel discussions, three social hours featuring interactive board games, keynotes by Julie Owono (of the Facebook Oversight Board) and Kate Crawford (on her new book, the Atlas of AI), and a mentoring meetup.

The full program is here: https://responsiblecomputing.org/forc-2021-program/

The Symposium on Foundations of Responsible Computing (FORC) is a forum for mathematical research in computation and society writ large. The Symposium aims to catalyze the formation of a community supportive of the application of theoretical computer science, statistics, economics and other relevant analytical fields to problems of pressing and anticipated societal concern.

Clustering is possibly the most fundamental problem of unsupervised learning. As with many other paradigms of machine learning, there has been a focus on fair variants of clustering. Perhaps the definition which has received the most attention is the group fairness definition of [1]. The notion is based on disparate impact and simply states that each cluster should contain points belonging to the different demographic groups in “appropriate” proportions. A natural interpretation of appropriate would imply that each demographic group appears in close to population-level proportions in each cluster. More specifically, if we were to endow each point with a color to designate its group membership and we were to consider the $k$-means clustering objective, then this notion of fair clustering amounts to the following constrained optimization problem:

$$\min_{S,\,\phi} \; \sum_{j \in \mathcal{C}} d^2(j, \phi(j)) \quad \text{subject to} \quad l_h \, |C_i| \le |C_i^h| \le u_h \, |C_i| \quad \text{for every cluster } i \text{ and every color } h.$$

Here, $\mathcal{C}$ is the set of points, $S$ is the set of $k$ chosen centers, $\phi(j)$ is the center that point $j$ is assigned to, $l_h$ and $u_h$ are the lower and upper pre-set proportionality bounds for color $h$, $C_i$ denotes the points in cluster $i$, and $C_i^h$ denotes the subset of those points with color $h$. See figure 1 for a comparison between the outputs of color-agnostic and fair clustering.
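To make the proportionality constraint concrete, here is a minimal Python sketch that checks whether a given clustering respects per-color lower and upper bounds. The function name `fairness_violations`, the toy points, and the bounds of 0.4 and 0.6 are all hypothetical choices for illustration; this only verifies the constraint and is not a fair clustering algorithm.

```python
from collections import Counter

def fairness_violations(assignments, colors, lower, upper):
    """Return (cluster, color, share) triples that break the proportion bounds.

    assignments[j] is the cluster of point j and colors[j] is its color;
    lower[h] and upper[h] are the proportion bounds for color h.
    """
    cluster_sizes = Counter(assignments)
    color_counts = Counter(zip(assignments, colors))
    violations = []
    for cluster, size in sorted(cluster_sizes.items()):
        for h in sorted(lower):
            share = color_counts[(cluster, h)] / size
            if not (lower[h] <= share <= upper[h]):
                violations.append((cluster, h, share))
    return violations

# Hypothetical toy data: 8 points, two colors, two clusters.
colors = ["red", "blue", "red", "blue", "red", "red", "blue", "blue"]
bounds_lo = {"red": 0.4, "blue": 0.4}
bounds_hi = {"red": 0.6, "blue": 0.6}

balanced = [0, 0, 1, 1, 0, 1, 0, 1]    # each cluster is half red, half blue
segregated = [0, 1, 0, 1, 0, 0, 1, 1]  # cluster 0 is all red, cluster 1 all blue

print(fairness_violations(balanced, colors, bounds_lo, bounds_hi))
print(fairness_violations(segregated, colors, bounds_lo, bounds_hi))
```

The balanced assignment passes, while the segregated one is flagged for every cluster and color pair.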

If one were to use clustering for market segmentation and targeted advertisement, then the above definition of fair clustering would roughly ensure that each demographic group receives the same exposure to every type of ad. Similarly if we were to cluster news articles and let the source of each article indicate its membership then we could ensure that each cluster has a good mixture of news from different sources [2].

Significant progress has been made on this notion of fair clustering, starting from the two-color case with only under-representation bounds and extending to the multi-color case with both under- and over-representation bounds [3, 4, 5]. Scalable methods for larger datasets have also been proposed [6, 7].

Clearly, like the majority of the methods in group-fair supervised learning, this line of work assumes that the group membership of each point in the dataset is known. This setting conflicts with a common situation in practice where group memberships are either imperfectly known or completely unknown [8, 9, 10, 11, 12]. We take the first step in generalizing fair clustering to this setting; specifically, we assume that while we do not know the exact group membership of each point, we instead have a probability distribution over the group memberships. A natural generalization of the previous optimization problem keeps the clustering objective and replaces the constraints with the following:

$$l_h \, |C_i| \;\le\; \mathbb{E}\big[|C_i^h|\big] = \sum_{j \in C_i} p_j^h \;\le\; u_h \, |C_i| \quad \text{for every cluster } i \text{ and every color } h,$$

where $p_j^h$ is the probability that point $j$ has color $h$, $C_i$ is the set of points in cluster $i$, $C_i^h$ is its (now random) subset of points with color $h$, and $l_h$ and $u_h$ are the lower and upper proportionality bounds for color $h$. The proportionality constraints were simply changed to hold in expectation instead of deterministically. Clearly, this constraint reduces to the original constraint when the group memberships are completely known. Figure 2 helps visualize what the input to probabilistic fair clustering looks like and the output we expect.
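As a rough illustration of the in-expectation constraint, the sketch below sums each point's color probabilities within a cluster and compares the expected share against the bounds. The helper `expected_share_violations`, the four toy points, and the 0.4 and 0.6 bounds are assumptions made up for this example, not part of the paper.

```python
def expected_share_violations(assignments, probs, lower, upper):
    """Flag (cluster, color, expected share) triples outside the bounds.

    probs[j][h] is the probability that point j has color h.
    """
    clusters = {}
    for j, c in enumerate(assignments):
        clusters.setdefault(c, []).append(j)
    violations = []
    for c, members in sorted(clusters.items()):
        for h in sorted(lower):
            # Expected number of points of color h is the sum of probabilities.
            expected_count = sum(probs[j][h] for j in members)
            share = expected_count / len(members)
            if not (lower[h] <= share <= upper[h]):
                violations.append((c, h, round(share, 3)))
    return violations

# Hypothetical membership probabilities for four points.
probs = [
    {"red": 0.9, "blue": 0.1},
    {"red": 0.1, "blue": 0.9},
    {"red": 0.8, "blue": 0.2},
    {"red": 0.2, "blue": 0.8},
]
lower = {"red": 0.4, "blue": 0.4}
upper = {"red": 0.6, "blue": 0.6}

mixed = expected_share_violations([0, 0, 1, 1], probs, lower, upper)
skewed = expected_share_violations([0, 1, 0, 1], probs, lower, upper)
print(mixed)
print(skewed)
```

Setting every probability to 0 or 1 recovers the deterministic check, matching the remark that the constraint reduces to the original one when memberships are known.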

Despite the innocuous modification to the constraint, the problem becomes significantly more difficult. In our paper, we consider the center-based clustering objectives of $k$-center, $k$-median, and $k$-means and produce solutions with approximation ratio guarantees for two given cases:

- **Two-Color Case**: We see that even the two-color case is not easy to handle. The key difficulty lies in the rounding method. However, we give a rounding method that maintains the fairness constraint with a worst-case additive violation of 1, matching the deterministic fair clustering case.
- **Multi-Color Case with Large Enough Clusters**: At a high level, if the clusters have a sufficiently large size, then through a Chernoff bound we can show that independent sampling would result in a deterministic fair clustering instance, which we could solve using deterministic fair clustering algorithms. This essentially forms a reduction from the probabilistic to the deterministic instance.
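The intuition behind the large-cluster case can be sketched numerically. The snippet below independently samples a color for every point and compares the realized red share to its expectation, alongside a Hoeffding bound (a standard Chernoff-type inequality, used here as a stand-in and not necessarily the exact bound from the paper). The probabilities, cluster sizes, and tolerance of 0.05 are hypothetical.

```python
import math
import random

def sample_colors(probs_red, rng):
    """Independently realize each point's color: True means red."""
    return [rng.random() < p for p in probs_red]

def hoeffding_bound(n, t):
    """Upper bound on P(|realized share - expected share| >= t) for n independent points."""
    return 2.0 * math.exp(-2.0 * n * t * t)

rng = random.Random(0)
for n in (50, 10_000):
    # Alternating probabilities, so the expected red share is exactly 0.5.
    probs_red = [0.3 if j % 2 == 0 else 0.7 for j in range(n)]
    realized = sample_colors(probs_red, rng)
    share = sum(realized) / n
    print(n, round(share, 3), hoeffding_bound(n, 0.05))
```

For the small cluster the bound exceeds 1 and says nothing, while for the large cluster it is vanishingly small, which is why the reduction needs the clusters to be large enough.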

While our solutions perform well empirically, we are left with a collection of open problems. For example, guaranteeing that the color proportions are maintained in expectation is not the best constraint one should hope for, since when the colors are realized a cluster could consist entirely of one color. A preferable constraint would instead bound the probability of obtaining an “unfair” clustering. Moreover, a setting that assumes access to the probability distribution for a given point over all colors could still be assuming too much. A more reasonable setting could instead take a robust-optimization-based approach, where we have the distribution of each point but allow that distribution to belong to an uncertainty set. This effectively allows our probabilistic knowledge to be imperfect as well, as could be the case if, for example, a machine learning model were predicting group membership with a systematic bias against a particular subset of colors. Lastly, being able to handle the multi-color case in an assumption-free manner would also be interesting.

**References:**

1. Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, 2017.
2. Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
3. Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In the International Workshop on Approximation and Online Algorithms, 2019.
4. Ioana O. Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R. Schmidt, and Melanie Schmidt. On the cost of essentially fair clusterings. In the International Conference on Approximation Algorithms for Combinatorial Optimization Problems, 2019.
5. Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, 2019.
6. Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In the International Conference on Machine Learning, 2019.
7. Lingxiao Huang, Shaofeng Jiang, and Nisheeth Vishnoi. Coresets for clustering with fairness constraints. In Advances in Neural Information Processing Systems, 2019.
8. Pranjal Awasthi, Matthäus Kleindessner, and Jamie Morgenstern. Equalized odds postprocessing under imperfect group information. In the International Conference on Artificial Intelligence and Statistics, 2020.
9. Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed Chi. Fairness without demographics through adversarially reweighted learning. In Advances in Neural Information Processing Systems, 2020.
10. David Pujol, Ryan McKenna, Satya Kuppam, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Fair decision making using privacy-protected data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2020.
11. Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. Fair learning with private demographic data. In the International Conference on Machine Learning, 2020.
12. Nathan Kallus, Xiaojie Mao, and Angela Zhou. Assessing algorithmic fairness with unobserved protected class using data combination. Management Science, 2021.

*This post is based on results and discussions from a series of joint works with Moritz Hardt, Celestine Mendler-Dünner, John Miller, and Juan C. Perdomo.*

In 1998, Michel Callon wrote what would be the first in an ongoing series of controversial publications in economic sociology [1]. He was the first to propose the idea that “the economy is not embedded in society but in economics”. With this, he challenged the conventional view that economic theories and models passively observe markets and infer their behavior, just like laws of physics passively describe the principles governing natural phenomena. Instead, Callon argued that economic theories are *performative:* they induce the economy, creating the phenomena they aim to describe.

One example that is often cited in support of Callon’s claims is the impact of the celebrated Black-Scholes-Merton options pricing model [2, 3]. MacKenzie and Millo [4] investigated the role of this model in the economy and found that it “made itself true”. In their words,

*“Black, Scholes, and Merton’s model did not describe an already existing world: when first formulated, its assumptions were quite unrealistic, and empirical prices differed systematically from the model. Gradually, though, the financial markets changed in a way that fitted the model”.*

Indeed, participants in the market started making decisions assuming the market obeys the mathematical laws implied by the Black-Scholes-Merton model. As MacKenzie and Millo put it, “pricing models came to shape the very way participants thought and talked about options”.

This phenomenon — whereby models and predictions inform decision-making and thus alter the target of prediction itself — is by no means special to economic forecasts.

Predictive policing, for example, develops algorithms that use historical data to estimate the likelihood of crime at a given location. Those locations where criminal behavior is deemed likely by the system typically get more police patrols and surveillance in general. In a kind of self-fulfilling prophecy [5], these actions resulting from prediction might further increase the *perceived* crime rate at the patrolled locations, thus biasing the data used for future decisions.

A similar feedback loop arises in traffic predictions, when drivers decide which route to take based on the estimated time of arrival (ETA) calculated by a traffic prediction system. If the predictive system estimates a low ETA for a given route, many drivers take that route, potentially leading to congestion and making the ETA prediction inaccurate as a result. In contrast to the previous example, traffic predictions arguably exhibit a self-negating prophecy: a low predicted ETA might lead to a longer travel time, and vice versa.

While the previous examples deal with qualitatively different feedback mechanisms, the interplay of predictions and decision-making is similar. First, one uses historical data to build a predictive model. Then, the predictions of the model feed into and inform consequential decisions. Finally, these decisions trigger changes in the environment, making future observations differ from those in the initial dataset.

We refer to prediction problems that exhibit this feedback-loop behavior as *performative prediction* problems.

In the language of machine learning, such a change in patterns would often be called *distribution shift*. Notably, however, performative distribution shift is not due to external factors independent of the model, such as, say, when traffic patterns change due to seasonal effects. Rather, the distribution shift is triggered directly by the choice of predictive model. (Of course, distribution shifts can also be caused by a combination of external factors and model choice.)

To formalize performative prediction mathematically, it is instructive to contrast performative prediction problems with supervised learning problems. In supervised learning, the decision-maker observes pairs of features and outcomes drawn from a *fixed* distribution $\mathcal{D}$. The key difference in performative prediction is that there is no longer an unknown static distribution generating observations; rather, data is drawn from a *model-dependent distribution* $\mathcal{D}(\theta)$, where $\theta$ is a parameter vector specifying the deployed model. For example, $\theta$ could be the weights of a neural network, or a vector of linear regression coefficients. For a given choice of parameters $\theta$, $\mathcal{D}(\theta)$ should be thought of as the distribution over features and outcomes that results from making decisions according to the model specified by $\theta$. In the context of the traffic prediction example, $\mathcal{D}(\theta)$ could be a distribution over traffic conditions and travel times, given that drivers make routing decisions in response to ETA forecasts by model $\theta$.

In supervised learning, the quality of a model is typically measured by its *risk*, namely, the expected loss of the model on instances $z = (x, y)$ from distribution $\mathcal{D}$ as measured via a loss function $\ell$:

$$R(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\, \ell(z; \theta).$$

Since performative prediction does not admit one true data-generating distribution, but rather a family of distributions $\{\mathcal{D}(\theta)\}$, evaluating a model calls for a new risk concept. Arguably the most natural counterpart of the risk in supervised learning is the expected loss on the distribution that arises once the model is deployed and feeds into consequential decisions. This leads to the notion of *performative risk*, defined as:

$$\mathrm{PR}(\theta) = \mathbb{E}_{z \sim \mathcal{D}(\theta)}\, \ell(z; \theta).$$
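A small simulation can make the difference between the two risk notions tangible. The sketch below assumes a purely hypothetical toy setting: outcomes are Gaussian with a mean that shifts linearly with the deployed parameter, and the loss is squared error. The constants `MU` and `EPS` and the helper names are illustrative assumptions, not taken from the papers.

```python
import random

MU, EPS = 1.0, 0.5  # hypothetical base mean and strength of performative feedback

def sample_from_D(theta, n, rng):
    """Draw n outcomes from the toy model-dependent distribution D(theta)."""
    return [rng.gauss(MU + EPS * theta, 1.0) for _ in range(n)]

def loss(z, theta):
    """Squared error of predicting theta when the outcome is z."""
    return (z - theta) ** 2

def average_loss(samples, theta):
    return sum(loss(z, theta) for z in samples) / len(samples)

rng = random.Random(0)
theta = 1.0

# Risk measured on data collected before deployment (under theta = 0).
static_risk = average_loss(sample_from_D(0.0, 100_000, rng), theta)

# Performative risk: the data reacts to the very model being evaluated.
performative_risk = average_loss(sample_from_D(theta, 100_000, rng), theta)

print(round(static_risk, 2), round(performative_risk, 2))
```

In this toy setting the exact values are 1.0 and 1.25, so a model that looks good on historical data is worse once its own deployment shifts the distribution.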

Adopting the performative risk as the single overarching measure of quality, a model would be optimal if it minimizes the performative risk. While an appealing solution concept, performative optimality is difficult to achieve because the performative risk depends on $\theta$ twice: through the loss and through the distribution $\mathcal{D}(\theta)$. One of the main computational difficulties is the fact that, even if the loss $\ell(z; \theta)$ is convex in $\theta$, the performative risk $\mathrm{PR}(\theta)$ need not be convex. Prior work on strategic classification [6] implies a set of sufficient conditions for $\mathrm{PR}(\theta)$ to be convex in a binary classification context, and in recent work [7] we identified a complementary set of conditions when the family of distributions $\{\mathcal{D}(\theta)\}$ forms an appropriate location-scale family. That said, in many practical settings convexity might be an unrealistic and unnecessarily strong guarantee to aim for. As we know from present-day machine learning, even non-convex problems can sometimes be amenable to simple optimization algorithms. Understanding the optimization landscape of the performative risk beyond convex settings is a fruitful direction going forward.

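A seemingly less ambitious target is to find a model that is *locally* optimal in some appropriate sense. For example, one could optimize for models that are optimal on the distribution $\mathcal{D}(\theta)$ that they induce:

$$\theta_{\mathrm{PS}} \in \arg\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{D}(\theta_{\mathrm{PS}})}\, \ell(z; \theta).$$

We call a model $\theta_{\mathrm{PS}}$ that satisfies the fixed-point equation above *performatively stable*. Performative stability arises naturally when the decision-maker applies the heuristic of myopically updating the model based on the distribution resulting from the previous deployment:

$$\theta_{t+1} \in \arg\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{D}(\theta_t)}\, \ell(z; \theta).$$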

If this retraining strategy converges, then it necessarily converges to a performatively stable solution. This is an appealing property, since it says that stability eliminates the need for retraining. Several existing works [8, 9, 10] have identified necessary and sufficient conditions for the above retraining heuristic, and some of its efficient approximations, to converge to a stable point. Roughly speaking, retraining converges to a stable solution if the loss is well-behaved and the performative feedback effects are not too strong. If either of those two conditions is violated, there is no guarantee of convergence.
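The retraining heuristic is easy to simulate in a toy setting. The sketch below assumes, hypothetically, that outcomes have mean `mu + eps * theta` under the deployed parameter `theta` and that the loss is squared error, so that exact retraining simply returns that mean. It is meant only to illustrate the role of feedback strength, not to reproduce the conditions in [8, 9, 10].

```python
def retrain(theta, mu, eps):
    """Exact risk minimizer on the toy D(theta) under squared loss: its mean."""
    return mu + eps * theta

def repeated_retraining(theta0, mu, eps, steps):
    """Deploy, observe the induced distribution, refit, and repeat."""
    trajectory = [theta0]
    for _ in range(steps):
        trajectory.append(retrain(trajectory[-1], mu, eps))
    return trajectory

weak = repeated_retraining(0.0, mu=1.0, eps=0.5, steps=30)    # mild feedback
strong = repeated_retraining(0.0, mu=1.0, eps=1.5, steps=30)  # strong feedback

print(weak[-1])    # approaches the stable point mu / (1 - eps) = 2
print(strong[-1])  # grows without bound
```

With mild feedback the iterates settle at a point that retraining no longer moves, while with strong feedback the same procedure diverges.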

In the language of game theory, one can think of performative prediction as a two-player game between a decision-maker, who decides which predictive model to deploy, and the model’s environment, which generates observations according to $\mathcal{D}(\theta)$. If $\mathcal{D}(\theta)$ is thought of as the “best response” (according to some underlying utility) of the model’s environment to the deployment of model $\theta$, then a performatively stable solution corresponds to a *Nash* equilibrium, while a performatively optimal solution corresponds to a *Stackelberg* equilibrium with the decision-maker acting as the leader.

Only in special cases, such as well-behaved zero-sum games, are Nash equilibria known to coincide with Stackelberg equilibria. Therefore, whenever performative prediction is a well-behaved zero-sum game, all stable solutions are also performatively optimal. However, *performative prediction is typically not a zero-sum game*. For example, if the decision-maker’s loss simply measures predictive accuracy, it would be odd for the environment’s primary objective to be hurting the model’s accuracy. Indeed, a typical performative prediction problem is a general-sum game without much structure. This implies that stable solutions and performative optima can be *very different*. And, since naive retraining strategies only converge to stability, this means that such myopic updates can be an inadequate method of overcoming performative distribution shifts and achieving low performative risk. This observation further motivates understanding the optimization landscape of the performative risk, as well as developing efficient algorithms for optimizing it. Recent work has explored several algorithmic solutions [11, 7], appropriate in convex settings.
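To see that stable and optimal points can differ, consider a hypothetical one-dimensional example (an assumption for illustration, not one taken from the cited papers): outcomes have mean `MU + EPS * theta` under deployment of `theta`, and the loss is `theta**2 - theta * z`. The sketch finds the stable point by repeated retraining and the optimum by a grid search over the performative risk.

```python
MU, EPS = 1.0, 0.5  # hypothetical base mean and feedback strength

def mean_of_D(theta):
    """Mean outcome under the toy distribution induced by deploying theta."""
    return MU + EPS * theta

def performative_risk(theta):
    # E_{z ~ D(theta)}[theta**2 - theta * z] depends on z only through its mean.
    return theta ** 2 - theta * mean_of_D(theta)

def retrain(theta_deployed):
    # argmin over theta of theta**2 - theta * mean_of_D(theta_deployed)
    return mean_of_D(theta_deployed) / 2.0

# Performatively stable point: the fixed point of repeated retraining.
theta_stable = 0.0
for _ in range(100):
    theta_stable = retrain(theta_stable)

# Performatively optimal point: minimize the performative risk directly.
grid = [i / 1000 for i in range(-3000, 3001)]
theta_optimal = min(grid, key=performative_risk)

print(theta_stable, performative_risk(theta_stable))
print(theta_optimal, performative_risk(theta_optimal))
```

Here retraining settles at `MU / (2 - EPS)`, whereas the performative risk is minimized at `MU / (2 * (1 - EPS))`, so the Nash-like and Stackelberg-like solutions separate as soon as `EPS` is nonzero.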

Performative prediction relates to many other areas beyond game theory, including bandits, reinforcement learning, and control theory. These frameworks are flexible enough to capture performative prediction as a special case; however, performativity arises via distinctive feedback mechanisms and as such deserves its own specialized analysis. There is a long way to go in understanding the properties of performative distribution shifts, how they connect to feedback mechanisms in other disciplines, and how to tackle these shifts in practice. Furthermore, it is unclear whether a single distribution is expressive enough to describe the observations after model deployment; in practice there are different kinds of memory effects [12] and self-reinforcing loops that make the data distribution evolve with time, even when the model is kept fixed. Finally, to make the existing theoretical insights actionable, going forward we need to think about what the right solution concept is, both statistically and ethically, to optimize for in performative settings.

[1] M. Callon. Introduction: the embeddedness of economic markets in economics. *The Sociological Review*, 1998

[2] F. Black, M. Scholes. The pricing of options and corporate liabilities. *The Journal of Political Economy*, 1973

[3] R. C. Merton. Theory of rational option pricing. *The Bell Journal of Economics and Management Science*, 1973

[4] D. MacKenzie, Y. Millo. Constructing a market, performing theory: The historical sociology of a financial derivatives exchange. *American Journal of Sociology*, 2003

[5] D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, S. Venkatasubramanian. Runaway feedback loops in predictive policing. *ACM Conference on Fairness, Accountability and Transparency*, 2018

[6] J. Dong, A. Roth, Z. Schutzman, B. Waggoner, Z. S. Wu. Strategic classification from revealed preferences. *ACM Conference on Economics and Computation*, 2018

[7] J. Miller, J. C. Perdomo, T. Zrnic. Outside the echo chamber: Optimizing the performative risk. *arXiv preprint*, 2021

[8] J. C. Perdomo, T. Zrnic, C. Mendler-Dünner, M. Hardt. Performative prediction. *International Conference on Machine Learning*, 2020

[9] C. Mendler-Dünner, J. C. Perdomo, T. Zrnic, M. Hardt. Stochastic optimization for performative prediction. *Conference on Neural Information Processing Systems*, 2020

[10] D. Drusvyatskiy, L. Xiao. Stochastic optimization with decision-dependent distributions. *arXiv preprint*, 2020

[11] Z. Izzo, L. Ying, J. Zou. How to learn when data reacts to your model: Performative gradient descent. *arXiv preprint*, 2021

[12] G. Brown, S. Hod, I. Kalemaj. Performative prediction in a stateful world. *arXiv preprint*, 2020

I won’t try to summarize the talk because I doubt that I can do it justice, but one of the themes (which I fully support) is that it is not enough to consider “fair” implementations of specific tasks. Instead, we (also) want to explore the right task to implement and if it is appropriate to implement any algorithmic task in any specific context.

As a side note, I loved Annette’s statement on ethics, “if we can choose, we’re on the hook.” For me, it beautifully complements the paradigm that “ought implies can.” In other words, ethical imperatives only exist when the expected action is possible but every choice has ethical implications.

Some resources for additional reading.

**Resources on exploitation**

**Introductory / very accessible for an interdisciplinary audience**

Nicholas Vrousalis, “Exploitation: A Primer,” *Philosophy Compass* 13, no. 2 (2018).

**Background**

G.A. Cohen, “The Labor Theory of Value and the Concept of Exploitation,” *Philosophy and Public Affairs* 8, no. 4 (1979): 338–360.

Joel Feinberg, *Harmless Wrongdoing*, Oxford: Oxford University Press (1988).

Robert E. Goodin, “Exploiting a Situation and Exploiting a Person,” in Andrew Reeve (ed.), *Modern Theories of Exploitation*, London: Sage (1987), 166–200.

Ruth Sample, *Exploitation, What It Is and Why it is Wrong*, Lanham, MD: Rowman and Littlefield (2003).

Nicholas Vrousalis, “Exploitation, Vulnerability, and Social Domination,” *Philosophy and Public Affairs*, 41, no. 2 (2013): 131–157.

Alan Wertheimer, *Exploitation*, Princeton: Princeton University Press (1996).

Iris Marion Young, “Five Faces of Oppression,” in Thomas Wartenberg (ed.), *Rethinking Power*, Albany, NY: SUNY Press (1992).

**Resource on the political philosophy of AI (for a general audience)**

Annette Zimmermann, Elena Di Rosa, Hochan Kim, “Technology Can’t Fix Algorithmic Injustice,” *Boston Review*.

Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients’ symptoms in two days’ time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast-moving situation, and data is coming in dynamically, so you need to make decisions as you go. What can you do?

The first thing you might do is ask on Twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with *marginal prediction intervals*: i.e. intervals $[\ell(x), u(x)]$ such that, over the randomness of some data distribution over labelled examples $(x, y)$: $\Pr_{(x,y)}[y \in [\ell(x), u(x)]] \approx 0.95$. These would be 95% marginal prediction intervals, but in general you could pick your favorite coverage probability $1-\delta$.
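For readers who have not seen conformal prediction, here is a minimal sketch of one common variant, split conformal prediction with absolute residuals as the score. The predictor `predict`, the synthetic data generator `draw`, and the noise level are hypothetical stand-ins; the only ingredient taken from the standard method is the quantile rule for the interval radius.

```python
import math
import random

def conformal_radius(residuals, delta):
    """Split-conformal radius: the ceil((n+1)(1-delta))-th smallest absolute residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]

def predict(x):
    """Hypothetical point predictor; any fixed model could be used instead."""
    return 2.0 * x

rng = random.Random(0)

def draw():
    """Draw one labelled example from a made-up i.i.d. distribution."""
    x = rng.uniform(0, 1)
    return x, 2.0 * x + rng.gauss(0, 0.1)

calibration = [draw() for _ in range(1000)]
radius = conformal_radius([abs(y - predict(x)) for x, y in calibration], delta=0.05)

# The interval for a new x is [predict(x) - radius, predict(x) + radius].
test_points = [draw() for _ in range(5000)]
covered = sum(abs(y - predict(x)) <= radius for x, y in test_points)
coverage = covered / len(test_points)
print(round(radius, 3), round(coverage, 3))
```

On fresh draws from the same distribution the empirical coverage lands near 95%, which is exactly the marginal guarantee, and it relies on the calibration and test data being exchangeable.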

Conformal prediction has a lot going for it — its tools are very general and flexible, and lead to practical algorithms. But it also has two well known shortcomings:

- **Strong Assumptions**. Like many tools from statistics and machine learning, conformal prediction methods require that the future look like the past. In particular, they require that the data be drawn i.i.d. from some distribution, or at least be *exchangeable* (i.e. their distribution should be invariant to permutation). This is sometimes the case, but it often is not. In our pandemic scenario, the distribution on patient features might quickly change in unexpected ways as the disease moves between different populations, as might the relationship between features and outcomes, as treatments advance. In other settings in which consequential decisions are being made about people (like lending and hiring decisions), people might intentionally manipulate their features in response to the predictive algorithms you deploy, in an attempt to get the outcome they want. Or you might be trying to predict outcomes in time series data, in which there are explicit dependencies across time. In all of these scenarios, exchangeability is violated.
- **Weak Guarantees**. Marginal coverage guarantees are *averages over people*. 95% marginal coverage means that the true label falls within the predicted interval for 95% of people. It need not mean anything for *people like you*. For example, if you are part of a demographic group that makes up less than 5% of the population, it is entirely consistent with the guarantees of a 95% marginal prediction interval that labels for people from your demographic group fall outside of their intervals 100% of the time. This can be both an accuracy and a *fairness* concern: marginal prediction works well for “typical” members of a population, but not necessarily for everyone else.
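The weakness of marginal guarantees is easy to reproduce with made-up numbers. In the sketch below, the population, the group names, and the intervals are all hypothetical; the point is only that overall coverage can be exactly 95% while one small group is never covered.

```python
def coverage(records, group=None):
    """Fraction of records whose label lies in its interval, optionally within one group."""
    chosen = [r for r in records if group is None or group in r["groups"]]
    hits = sum(r["low"] <= r["label"] <= r["high"] for r in chosen)
    return hits / len(chosen)

# Hypothetical population: 95 "majority" people who are always covered,
# and 5 "minority" people whose labels always fall outside the interval.
records = (
    [{"groups": {"majority"}, "label": 0.5, "low": 0.0, "high": 1.0} for _ in range(95)]
    + [{"groups": {"minority"}, "label": 2.0, "low": 0.0, "high": 1.0} for _ in range(5)]
)

print(coverage(records))              # marginal coverage
print(coverage(records, "majority"))  # coverage within the majority
print(coverage(records, "minority"))  # coverage within the minority
```

The marginal number looks fine, yet it says nothing about the minority group, whose coverage is zero.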

What kinds of improvements might we hope for? Let’s start with how to strengthen the guarantee:

**Multivalidity** Ideally, we would want *conditional* guarantees, i.e. the promise that for every $x$ we would have $\Pr_{y}[y \in [\ell(x), u(x)] \mid x] \approx 0.95$, where $[\ell(x), u(x)]$ is the interval predicted for features $x$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the “unrealized” (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don’t have to average over everybody. An important idea that has been investigated in several different contexts in recent years in the theory literature on fairness is that we might articulate a very rich collection $\mathcal{G}$ of (generally intersecting) demographic groups corresponding to relevant subsets of the data domain, and ask for things that we care about to hold true as averaged over any group in the collection. In the case of prediction intervals, this would correspond to asking that simultaneously for every demographic group $G \in \mathcal{G}$, $\Pr_{(x,y)}[y \in [\ell(x), u(x)] \mid x \in G] \approx 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve, at least for any such group that isn’t too small.

And what kinds of assumptions do we need?

**Adversarial Data** Actually, it’s not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences, i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities close to $1-\delta$ after $T$ rounds *on the underlying distribution*, as conformal prediction is able to, we offer *empirical* coverage rates close to $1-\delta$ over those $T$ rounds (since for us there is no underlying distribution). This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.

**More Generally** Our techniques are not specific to prediction intervals. We can do the same thing for many other distributional quantities. We work this out in the case of predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al., in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.

**Techniques** At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 — and is closely related to an earlier, beautiful argument by Sergiu Hart — but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard *told us* their “true” distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution — and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That’s not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.
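The "wizard" scenario can be sketched in a few lines. Assuming, hypothetically, that the wizard reveals a Gaussian distribution over outcome severity, the calibrated point prediction is its mean and a 95% interval can be read off the inverse CDF. The function name `wizard_prediction` and the parameters of the distribution are made up for illustration.

```python
from statistics import NormalDist

def wizard_prediction(true_dist, delta=0.05):
    """Given a known outcome distribution, return its mean and a central (1 - delta) interval."""
    low = true_dist.inv_cdf(delta / 2)
    high = true_dist.inv_cdf(1 - delta / 2)
    return true_dist.mean, (low, high)

# Hypothetical patient whose true severity distribution is N(3, 1).
patient_dist = NormalDist(mu=3.0, sigma=1.0)
mean, (low, high) = wizard_prediction(patient_dist)

print(mean, round(low, 2), round(high, 2))
print(round(patient_dist.cdf(high) - patient_dist.cdf(low), 3))  # mass inside the interval
```

Nothing here requires learning, which is the point of the argument: the task is easy once the distribution is known, and the game-theoretic step transfers that ease to the adversarial setting.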

But let’s put our game theory hats on. Suppose we’ve been making predictions for a while. We can write down some measure of our error so far, say the maximum, over all demographic groups in $\mathcal{G}$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero-sum game, in which we (the learner) want to minimize the *increase* in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first; this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax Theorem tells us that (at least in principle, since it doesn’t give us the algorithm) there must therefore also be a learning algorithm that can do just as well, but against an adversary.
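The error measure described above can be written down directly. The sketch below computes the maximum, over a collection of possibly intersecting groups, of the gap between empirical coverage and the target, and then the increase caused by one more round. The groups, patient features, and history are hypothetical, and this is only the measure from the prose, not the algorithm from the paper.

```python
def max_coverage_error(history, groups, target=0.95):
    """Largest gap between a group's empirical coverage and the target.

    history is a list of (features, covered) pairs; groups maps a name to a
    membership test on features.
    """
    worst = 0.0
    for belongs in groups.values():
        outcomes = [covered for features, covered in history if belongs(features)]
        if outcomes:  # groups with no members yet contribute no error
            rate = sum(outcomes) / len(outcomes)
            worst = max(worst, abs(rate - target))
    return worst

# Hypothetical, intersecting groups.
groups = {
    "everyone": lambda f: True,
    "over_65": lambda f: f["age"] >= 65,
    "diabetic": lambda f: f["diabetic"],
}

history = [({"age": 70, "diabetic": False}, True) for _ in range(10)]
history += [({"age": 40, "diabetic": True}, True) for _ in range(9)]
history += [({"age": 40, "diabetic": True}, False)]

before = max_coverage_error(history, groups)
next_round = ({"age": 80, "diabetic": True}, False)  # a miss on someone in two groups
after = max_coverage_error(history + [next_round], groups)
print(round(before, 3), round(after, 3), round(after - before, 3))
```

In the per-round game, the learner picks its prediction to keep this increase small, while the adversary picks the outcome to make it large.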

The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That’s what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm — but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.

Finally, I had more fun writing this paper, and learning about old techniques from the game theoretic calibration literature, than I’ve had in a while. I hope a few people enjoy reading it!

If you want to join our seminar, please email toc4fairness-director@cs.stanford.edu and we will add you to our email list (which we will be careful not to overuse).
