Clarifying the role of the behavioral selection model

AI Alignment Forum
Alex Mallen

This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I've had since publishing. Written quickly in a personal capacity.

The main focus of this post is clarifying the basic machinery of the behavioral selection model, and conveying why it matters to disambiguate between different "motivations" for AI behavior. Very similar or identical behavior in training can correspond to radically different outcomes in deployment, depending on what motivated it.

I'll preface by saying: I think the behavioral selection model is quite predictive and useful to understand, especially in the short-to-medium term. But it leaves out some really important dynamics for predicting AI motivations, and I wish I had clarified this more in the original post. Most importantly (as Habryka mentioned), it leaves out the effect of reflection and deliberation on AI motivations (which I discuss a small amount in other pieces, and briefly at the end of this post). This might be the dominant cause of AI motivations! It also abstracts away or ignores a number of more concrete paths by which different motivations can arise (e.g., Anders considers a couple here).

Clarifying the basic machinery

Here's a somewhat updated version of the causal graph from the behavioral selection model. It clears things up by making it a bit more concrete. The causal graph shows possible states of the world that the AI's actions can influence. Each node corresponds to a possible outcome in the actual world—in particular, a possible consequence of the actions chosen by a certain cognitive pattern (CP). Crucially, one possible consequence is "this cognitive pattern has influence through deployment."

To recap: we want to know which cognitive patterns will drive behavior in deployment (behavior in deployment is hugely consequential!), and there was some selection process that determined which cognitive patterns the AI ended up with (e.g., RL), so we look at the structure of that selection process as represented in this causal graph. In particular, we'd like to ask: "Which cognitive patterns take actions during training that most cause themselves to be selected?" Or, in other words: "For which values of 'CP' is 'CP has influence through deployment' highest?" Cognitive patterns that effectively cause themselves to have influence in deployment are hypothesized to be more likely to actually have influence in deployment.

Some prominent answers to this question, based on the causal graph, include:

- Fitness-seekers: Cognitive patterns that terminally aim for having influence through deployment, or for something close causally upstream, like reward.
- Schemers: Cognitive patterns that instrumentally aim for having influence through deployment, because that gives them long-term power which they can use to accomplish their almost-arbitrary long-term goals.
- Kludges: Cognitive patterns that terminally pursue a tailored set of proxies for being selected on the training distribution—proxies that are often causally upstream, like hardcoding test cases and saying sycophantic things to the user.

Why motivations matter

Notice that all of the optimal cognitive patterns motivate the same behavior during training: the[1] behavior that is in fact most selected for. All of these cognitive patterns would reward hack when possible, and so on. So it might seem like it isn't worthwhile to distinguish between them. But crucially: identifying which cognitive patterns drive behavior helps you predict generalization behavior in deployment.
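To make that concrete, here is a minimal toy sketch of the selection question. Everything in it (the action names, effect numbers, and scoring rules) is a hypothetical illustration I'm introducing, not anything from the original model: three cognitive patterns that each pick the most-selected-for action in training, but generalize very differently once deployment offers actions that training never presented.

```python
# Toy model of the behavioral selection question: which cognitive pattern (CP)
# most causes itself to be selected, and how does each one generalize?
# All action names, effect numbers, and scoring rules below are illustrative
# assumptions, not part of the original post.

# Stand-in world model: what each action yields along a few outcome axes.
EFFECTS = {
    # Actions available during training:
    "solve_tests":        {"reward": 0.8, "tests_pass": 0.8, "power": 0.0},
    "hardcode_tests":     {"reward": 1.0, "tests_pass": 1.0, "power": 0.0},  # reward hack
    # Actions that only become available in deployment:
    "tamper_with_reward": {"reward": 1.5, "tests_pass": 0.0, "power": 0.2},
    "seize_resources":    {"reward": 1.0, "tests_pass": 0.0, "power": 1.0},
}
TRAINING_ACTIONS = ["solve_tests", "hardcode_tests"]
DEPLOYMENT_ACTIONS = list(EFFECTS)

# Each CP is a scoring rule over outcomes; it acts by maximizing its score.
COGNITIVE_PATTERNS = {
    # Fitness-seeker: terminally values reward (causally upstream of selection).
    "fitness_seeker": lambda fx: fx["reward"],
    # Schemer: terminally values long-run power; values reward only
    # instrumentally, as a way to get selected into deployment.
    "schemer": lambda fx: fx["power"] + 0.1 * fx["reward"],
    # Kludge: terminally values a proxy tuned to the training distribution
    # (here, "the tests pass").
    "kludge": lambda fx: fx["tests_pass"],
}

def chosen_action(score, actions):
    """The action this cognitive pattern picks: the argmax of its score."""
    return max(actions, key=lambda a: score(EFFECTS[a]))

for name, score in COGNITIVE_PATTERNS.items():
    print(f"{name:>14} | training: {chosen_action(score, TRAINING_ACTIONS)}"
          f" | deployment: {chosen_action(score, DEPLOYMENT_ACTIONS)}")

# All three pick "hardcode_tests" in training (identical, maximally
# selected-for behavior), but in deployment the fitness-seeker tampers with
# its reward, the schemer seizes resources, and the kludge keeps hardcoding.
```

The point of the toy is just the output pattern: all three cognitive patterns reward hack identically in training, so training behavior alone cannot tell you which one you have, yet their deployment choices diverge sharply.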
An AI who seeks long-run power will choose actions that it believes will lead to long-run power. In training, this means playing the training game to make it to deployment, where there are more resources at stake. In deployment, this will often mean trying to take power away from humans in specific ways, like disabling oversight systems, steering the alignment of future models, or eventually forcefully taking control away from all of humanity. If the AI has perfect strategic awareness of the implications of its actions, you can imagine it choosing its actions based on a causal model of whether they will lead to its desired outcomes. And in the deployment environment, long-term power-seeking motivates different actions from, e.g., wanting to score well on the current task.

One place where I see this come up a lot is in how we treat current AIs' reward-hacking propensities. We observe AIs reward-hacking a lot, but people often don't strongly distinguish between reward-hacking behavior and the motivations behind it (e.g., reward-seeking, apparent-success-seeking, a kludge of further causally-upstream motivations, or scheming goals). For example, if a coherent reward-seeking motivation were behind reward-hacking behavior, that would be quite concerning from the perspective of eventual takeover or manipulation risk from more powerful AIs with the same motivations. This is because a competent reward-seeker, scaled up, has incentives to attain its desired outcomes in creative ways when they become available. Meanwhile, an AI that only cared about reward-hacking as part of a kludge of causally-upstream terminal motivations tuned for performance on the training distribution is unlikely to generalize to competently pursuing highly unprecedented strategies.

The usefulness of ascribing underlying motivations becomes obvious when you look at humans. People vary along a number of motivational dimensions, and this leads to different behavior, especially in the modern environment. Here are some simplified examples:

- Some people seek status more than others, leading them to pursue visible markers of rank—prestigious titles, public recognition, displays of wealth, social media followings.
- People vary in how risk-averse they are. Risk-seeking people are substantially more likely to become entrepreneurs, extreme athletes, and criminals.
- People vary in sociosexuality. Those with more unrestricted orientations are more likely to pursue casual dating, serial relationships, and careers or scenes that facilitate them; those with more restricted orientations are more likely to settle into long-term monogamous partnerships early.
- People vary in curiosity. Highly curious people are substantially more likely to become researchers, writers, and lifelong autodidacts, and to switch fields or hobbies in pursuit of new puzzles.
- Some people want children, and so are more likely to plan towards them.

All of the above motivations were selected for in some subset of the population in the ancestral environment, and they lead to rather different behaviors today.

The analogy to humans also raises another important clarification to the behavioral selection model that I wish I had emphasized more: a huge amount of human motivation isn't explained by behavioral selection in the ancestral environment, but rather by cultural evolution.
I likewise expect that AI motivations will be explained, to a large extent, by cultural processes once AIs are coherent and stable enough to engage in that kind of process (currently they kind of fall apart in very long contexts, and errors often accumulate rather than self-correct). I correspondingly want people to keep in mind that AI motivations are likely to evolve during deployment, particularly when/if AIs gain more effective inter-instance communication, memory, continual learning, and deliberate control over their learning process.

[1] Assuming a unique optimum.