Self-fulfilling and self-negating predictions: a short tale of performativity in machine learning

This post is based on results and discussions from a series of joint works with Moritz Hardt, Celestine Mendler-Dünner, John Miller, and Juan C. Perdomo.

In 1998, Michel Callon wrote what would be the first in an ongoing series of controversial publications in economic sociology [1]. He was the first to propose the idea that “the economy is not embedded in society but in economics”. With this, he challenged the conventional view that economic theories and models passively observe markets and infer their behavior, just like laws of physics passively describe the principles governing natural phenomena. Instead, Callon argued that economic theories are performative: they induce the economy, creating the phenomena they aim to describe.

One example that is often cited in support of Callon’s claims is the impact of the celebrated Black-Scholes-Merton options pricing model [2, 3]. MacKenzie and Millo [4] investigated the role of this model in the economy and found that it “made itself true”. In their words,

“Black, Scholes, and Merton’s model did not describe an already existing world: when first formulated, its assumptions were quite unrealistic, and empirical prices differed systematically from the model. Gradually, though, the financial markets changed in a way that fitted the model”.

Indeed, participants in the market started making decisions assuming the market obeys the mathematical laws implied by the Black-Scholes-Merton model. As MacKenzie and Millo put it, “pricing models came to shape the very way participants thought and talked about options”.

This phenomenon — whereby models and predictions inform decision-making and thus alter the target of prediction itself — is by no means special to economic forecasts.

Predictive policing, for example, develops algorithms that use historical data to estimate the likelihood of crime at a given location. Those locations where criminal behavior is deemed likely by the system typically get more police patrols and surveillance in general. In a kind of self-fulfilling prophecy [5], these actions resulting from prediction might further increase the perceived crime rate at the patrolled locations, thus biasing the data used for future decisions.

A similar feedback loop arises in traffic predictions, when drivers decide which route to take based on the estimated time of arrival (ETA) calculated by a traffic prediction system. If the predictive system estimates a low ETA for a given route, many drivers take that route, potentially congesting it and making the ETA prediction inaccurate as a result. In contrast to the previous example, traffic predictions arguably act as a self-negating prophecy: a low predicted ETA can lead to a longer actual travel time, and vice versa.

While the previous examples deal with qualitatively different feedback mechanisms, the interplay of predictions and decision-making is similar. First, one uses historical data to build a predictive model. Then, the predictions of the model feed into and inform consequential decisions. Finally, these decisions trigger changes in the environment, making future observations differ from those in the initial dataset.

We refer to prediction problems that exhibit this feedback-loop behavior as performative prediction problems.

In the language of machine learning, such a change in patterns would often be called distribution shift. Notably, however, performative distribution shift is not due to external factors independent of the model, such as, say, when traffic patterns change due to seasonal effects. Rather, the distribution shift is triggered directly by the choice of predictive model. (Of course, distribution shifts can also be caused by a combination of external factors and model choice.)

To formalize performative prediction mathematically, it is instructive to contrast performative prediction problems with supervised learning problems. In supervised learning, the decision-maker observes pairs of features and outcomes $(x, y)$ drawn from a fixed distribution $\mathcal{D}$. The key difference in performative prediction is that there is no longer an unknown static distribution generating observations; rather, data is drawn from a model-dependent distribution $\mathcal{D}(\theta)$, where $\theta$ is a parameter vector specifying the deployed model. For example, $\theta$ could be the weights of a neural network, or a vector of linear regression coefficients. For a given choice of parameters $\theta$, $\mathcal{D}(\theta)$ should be thought of as the distribution over features and outcomes that results from making decisions according to the model specified by $\theta$. In the context of the traffic prediction example, $\mathcal{D}(\theta)$ could be a distribution over traffic conditions and travel times, given that drivers make routing decisions in response to ETA forecasts by model $\theta$.

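In supervised learning, the quality of a model $\theta$ is typically measured by its risk, namely, the expected loss of the model on instances from the distribution $\mathcal{D}$, as measured via a loss function $\ell$:

$$R(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \, \ell\big((x, y); \theta\big).$$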

Since performative prediction does not admit one true data-generating distribution, but rather a family of distributions $\{\mathcal{D}(\theta)\}_{\theta}$, evaluating a model calls for a new risk concept. Arguably the most natural counterpart of the risk in supervised learning is the expected loss on the distribution that arises once the model is deployed and feeds into consequential decisions. This leads to the notion of performative risk, defined as:

$$\mathrm{PR}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}(\theta)} \, \ell\big((x, y); \theta\big).$$
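To make the definition concrete: the performative risk of a model is its expected loss on the distribution that its own deployment induces. The sketch below estimates it by Monte Carlo in a toy setting that is an assumption for illustration, not taken from the works cited here. There are no features, outcomes follow a hypothetical Gaussian whose mean shifts linearly with the deployed parameter, and the loss is the squared error. The constants `MU`, `EPS`, `SIGMA` and all function names are made up for this example.

```python
import numpy as np

# Hypothetical toy constants (assumptions, not from the cited works).
MU, EPS, SIGMA = 1.0, 0.5, 1.0


def sample_d(theta, n, rng):
    """Draw n outcomes from the toy distribution D(theta) = N(MU + EPS * theta, SIGMA^2)."""
    return rng.normal(MU + EPS * theta, SIGMA, size=n)


def loss(z, theta):
    """Squared error between the outcome z and the prediction theta."""
    return (z - theta) ** 2


def risk_on_induced_data(theta, theta_deployed, n=200_000, seed=0):
    """Average loss of theta on data induced by deploying theta_deployed."""
    rng = np.random.default_rng(seed)
    z = sample_d(theta_deployed, n, rng)
    return float(np.mean(loss(z, theta)))


def performative_risk(theta, n=200_000, seed=0):
    """Average loss of theta on the data that theta itself induces."""
    return risk_on_induced_data(theta, theta, n, seed)


if __name__ == "__main__":
    # Same model, evaluated on data induced by another deployment vs. by itself.
    print(risk_on_induced_data(1.0, theta_deployed=0.0))
    print(performative_risk(1.0))
```

In this toy the two numbers differ because deploying the model moves the mean of the outcomes: the exact values are `SIGMA**2 + (MU + EPS * theta_deployed - theta)**2`, so evaluating a model on data collected under a different deployment can misstate the loss it will incur once deployed.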

Adopting the performative risk as the single overarching measure of quality, a model would be optimal if it minimizes the performative risk. While an appealing solution concept, performative optimality is difficult to achieve, because $\mathrm{PR}(\theta)$ depends on $\theta$ twice: through the loss and through the distribution $\mathcal{D}(\theta)$. One of the main computational difficulties is the fact that, even if the loss $\ell$ is convex in $\theta$, $\mathrm{PR}(\theta)$ need not be convex. Prior work on strategic classification [6] implies a set of sufficient conditions for $\mathrm{PR}(\theta)$ to be convex in a binary classification context, and in recent work [7] we identified a complementary set of conditions when the family of distributions $\{\mathcal{D}(\theta)\}_{\theta}$ forms an appropriate location-scale family. That said, in many practical settings convexity might be an unrealistic and unnecessarily strong guarantee to aim for. As we know from present-day machine learning, even non-convex problems can sometimes be amenable to simple optimization algorithms. Understanding the optimization landscape of the performative risk beyond convex settings is a fruitful direction going forward.

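A seemingly less ambitious target is to find a model that is locally optimal in some appropriate sense. For example, one could optimize for models that are optimal on the distribution that they induce:

$$\theta_{\mathrm{PS}} \in \arg\min_{\theta} \, \mathbb{E}_{(x, y) \sim \mathcal{D}(\theta_{\mathrm{PS}})} \, \ell\big((x, y); \theta\big).$$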

We call a model that satisfies the fixed-point equation above performatively stable. Performative stability arises naturally when the decision-maker applies the heuristic of myopically updating the model based on the distribution resulting from the previous deployment:

$$\theta_{t+1} = \arg\min_{\theta} \, \mathbb{E}_{(x, y) \sim \mathcal{D}(\theta_t)} \, \ell\big((x, y); \theta\big).$$

If this retraining strategy converges, then it necessarily converges to a performatively stable solution. This is an appealing property: a stable model is already optimal for the data it induces, so retraining leaves it unchanged. Several existing works [8, 9, 10] have identified necessary and sufficient conditions for the above retraining heuristic, and some of its efficient approximations, to converge to a stable point. Roughly speaking, retraining converges to a stable solution if the loss is well-behaved and the performative feedback effects are not too strong. If either of those two conditions is violated, there is no guarantee of convergence.
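The role of feedback strength can be illustrated with a small simulation. This is a minimal sketch under assumed toy dynamics, not the setting analyzed in [8, 9, 10]: outcomes are drawn from a hypothetical Gaussian with mean `mu + eps * theta`, and the loss is the squared error, so each retraining step simply sets the new parameter to the sample mean of the freshly induced data. The function name `retrain` and all its parameters are made up for this example.

```python
import numpy as np


def retrain(theta0, eps, mu=1.0, sigma=1.0, steps=50, n=10_000, seed=0):
    """Repeatedly refit a squared-loss model on data induced by its last deployment.

    Assumed toy dynamics: deploying theta yields outcomes z ~ N(mu + eps * theta, sigma^2).
    The empirical risk minimizer of the squared loss is the sample mean of z.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    path = [theta]
    for _ in range(steps):
        z = rng.normal(mu + eps * theta, sigma, size=n)  # data induced by current model
        theta = float(np.mean(z))                        # refit on that data
        path.append(theta)
    return path


if __name__ == "__main__":
    weak = retrain(0.0, eps=0.5)    # mild feedback
    strong = retrain(0.0, eps=1.5)  # strong feedback
    print("eps=0.5, last iterate:", weak[-1])
    print("eps=1.5, last iterate:", strong[-1])
```

In this toy the noiseless update is `theta <- mu + eps * theta`, which is a contraction when `|eps| < 1` and has the fixed point `mu / (1 - eps)`. With `eps = 0.5` the iterates settle near that stable point, while with `eps = 1.5` they grow without bound.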

In the language of game theory, one can think of performative prediction as a two-player game between a decision-maker, who decides which predictive model to deploy, and the model’s environment, which generates observations according to $\mathcal{D}(\theta)$. If $\mathcal{D}(\theta)$ is thought of as the “best response” (according to some underlying utility) of the model’s environment to the deployment of model $\theta$, then a performatively stable solution corresponds to a Nash equilibrium, while a performatively optimal solution corresponds to a Stackelberg equilibrium with the decision-maker acting as the leader.

Only in special cases, such as well-behaved zero-sum games, are Nash equilibria known to coincide with Stackelberg equilibria. Therefore, whenever performative prediction is a well-behaved zero-sum game, all stable solutions are also performatively optimal. However, performative prediction is typically not a zero-sum game. For example, if the decision-maker’s loss simply measures predictive accuracy, it would be odd for the environment’s primary objective to be hurting the model’s accuracy. Indeed, a typical performative prediction problem is a general-sum game without much structure. This implies that stable solutions and performative optima can be very different. And, since naive retraining strategies converge only to stable points, such myopic updates can be an inadequate method of overcoming performative distribution shifts and achieving low performative risk. This observation further motivates understanding the optimization landscape of the performative risk, as well as developing efficient algorithms for optimizing it. Recent work has explored several algorithmic solutions [11, 7] that are appropriate in convex settings.
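The gap between stability and optimality can be seen in a toy problem small enough to compute exactly. The sketch below is an illustration under assumptions that do not come from the cited works: outcomes are Gaussian with mean `MU + EPS * theta`, and the loss is the squared error plus a ridge penalty `LAM * theta**2`. All constants and function names are hypothetical. It finds the stable point by iterating the exact retraining map and the performative optimum by a grid search over the performative risk.

```python
import numpy as np

# Hypothetical toy constants (assumptions for illustration only).
MU, EPS, LAM, SIGMA = 1.0, 0.5, 1.0, 1.0


def induced_mean(theta):
    """Mean of the outcome distribution induced by deploying theta."""
    return MU + EPS * theta


def performative_risk(theta):
    """Exact E[(z - theta)^2] + LAM * theta^2 for z ~ N(induced_mean(theta), SIGMA^2)."""
    return SIGMA**2 + (induced_mean(theta) - theta) ** 2 + LAM * theta**2


def retrain_step(theta):
    """Exact minimizer of the ridge-penalized squared loss on data induced by theta."""
    return induced_mean(theta) / (1.0 + LAM)


# Stable point: iterate retraining until it stops moving.
theta_ps = 0.0
for _ in range(100):
    theta_ps = retrain_step(theta_ps)

# Performative optimum: minimize the performative risk directly on a fine grid.
grid = np.linspace(-2.0, 2.0, 40001)
theta_po = float(grid[np.argmin(performative_risk(grid))])

if __name__ == "__main__":
    print("stable point:", theta_ps, "risk:", performative_risk(theta_ps))
    print("optimum:     ", theta_po, "risk:", performative_risk(theta_po))
```

Here retraining accounts for the penalty but treats the induced data as fixed, whereas the optimum also accounts for how the choice of `theta` moves the data. In this toy the two solutions are `MU / (1 + LAM - EPS)` and `(1 - EPS) * MU / ((1 - EPS)**2 + LAM)` respectively, and the stable point has strictly higher performative risk.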

Performative prediction relates to many other areas beyond game theory, including bandits, reinforcement learning, and control theory. These frameworks are flexible enough to capture performative prediction as a special case; however, performativity arises via distinctive feedback mechanisms and as such deserves its own specialized analysis. There is a long way to go in understanding the properties of performative distribution shifts, how they connect to feedback mechanisms in other disciplines, and how to tackle these shifts in practice. Furthermore, it is unclear whether a single distribution is expressive enough to describe the observations after model deployment; in practice there are different kinds of memory effects [12] and self-reinforcing loops that make the data distribution evolve with time, even when the model is kept fixed. Finally, to make the existing theoretical insights actionable, we need to determine the right solution concept, both statistically and ethically, to optimize for in performative settings.

[1] M. Callon. Introduction: the embeddedness of economic markets in economics. The Sociological Review, 1998
[2] F. Black, M. Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, 1973
[3] R. C. Merton. Theory of rational option pricing. The Bell Journal of Economics and Management Science, 1973
[4] D. MacKenzie, Y. Millo. Constructing a market, performing theory: The historical sociology of a financial derivatives exchange. American Journal of Sociology, 2003
[5] D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, S. Venkatasubramanian. Runaway feedback loops in predictive policing. ACM Conference on Fairness, Accountability and Transparency, 2018
[6] J. Dong, A. Roth, Z. Schutzman, B. Waggoner, Z. S. Wu. Strategic classification from revealed preferences. ACM Conference on Economics and Computation, 2018
[7] J. Miller, J. C. Perdomo, T. Zrnic. Outside the echo chamber: Optimizing the performative risk. arXiv preprint, 2021
[8] J. C. Perdomo, T. Zrnic, C. Mendler-Dünner, M. Hardt. Performative prediction. International Conference on Machine Learning, 2020
[9] C. Mendler-Dünner, J. C. Perdomo, T. Zrnic, M. Hardt. Stochastic optimization for performative prediction. Conference on Neural Information Processing Systems, 2020
[10] D. Drusvyatskiy, L. Xiao. Stochastic optimization with decision-dependent distributions. arXiv preprint, 2020
[11] Z. Izzo, L. Ying, J. Zou. How to learn when data reacts to your model: Performative gradient descent. arXiv preprint, 2021
[12] G. Brown, S. Hod, I. Kalemaj. Performative prediction in a stateful world. arXiv preprint, 2020
