Chapter 7: Goal Misgeneralization

Goal-Directedness

Safety implications of learning the wrong goals depend on how purposefully AIs act. A system that just mechanically follows rules is less risky than one that actively searches for ways to achieve its misunderstood goal across many different situations.

  • 11 min
  • Written by Markov Grey

Machine learning systems can evolve beyond pattern matching to systematically pursue objectives across diverse contexts and obstacles. The previous sections made two points: training can end with behaviorally indistinguishable algorithms that rely on different internal mechanisms, and it can produce systems that are extremely capable yet pursue goals other than the ones we intended. In this section we look at what happens when these learned algorithms become heavily goal-directed, the extreme case of which is implementing learned optimization (mesa-optimization), and we examine the different ways in which systematic, potentially misaligned goal-directed behavior can arise.

Talking about 'goal-directedness' rather than 'optimization' emphasizes functional behavior over internal mechanisms. Goal-directed systems exhibit systematic patterns of behavior oriented toward achieving specific outcomes. A system is goal-directed if it systematically pursues objectives across diverse contexts and obstacles, regardless of whether this happens through explicit search algorithms, learned behavioral patterns, or emergent coordination (Shimi, 2020; MacDermott et al., 2024). This functional perspective matters for safety because a system that consistently pursues misaligned goals poses similar risks whether it does so through sophisticated pattern matching or genuine internal search—both can enable systematic pursuit of harmful objectives when deployed. Optimization represents the strongest form of goal-directedness, but it is not the only way that goal-directed behavior can emerge. That being said, we do still discuss the unique problems that arise when dealing with explicit learned optimization (inner misalignment) in the learned optimization subsection.

Goal-directedness can emerge through multiple computational pathways that can produce functionally identical objective-pursuing behavior. Rather than arising through a single mechanism, persistent goal pursuit can emerge through several distinct routes, which the following subsections examine in turn: learned heuristics, simulation of goal-directed characters, learned (mesa-) optimization, and emergent multi-agent coordination.

Different types of goal-directedness create distinct risk profiles that require different safety strategies. Heuristic- or memorization-based goal-directedness through pattern matching might fail gracefully when encountering novel situations, with learned behavioral rules breaking down predictably. Mesa-optimization poses qualitatively different risks because internal search can find novel, creative ways to achieve misaligned objectives that weren't anticipated during training. Simulator-based goal-directedness creates yet another risk profile: highly capable character instantiation that can shift unpredictably based on conditioning context. Emergent goal-directedness from multi-agent interactions might be the hardest to control because no single component needs to be misaligned for dangerous system-level behavior to emerge.

Heuristics #

Video 7.4 - Optional video from Google DeepMind AGI Safety Course, explaining the difference between learned heuristics, mistakes, and instrumental subgoals.

Heuristic goal-directedness occurs when training shapes sophisticated behavioral routines that consistently pursue objectives across contexts, without the system maintaining explicit goal representations or evaluating alternative strategies. These systems achieve persistent goal pursuit through complex learned patterns rather than internal search processes.

Next-token predictors can learn to plan. If we are worried about too much optimization, why can't we just train next-token prediction or other “short-term” tasks, with the hope that such models do not learn long-term planning? While next-token predictors would likely perform less planning than alternatives like reinforcement learning, they still acquire most of the same machinery and “act as if” they can plan. When you prompt them with goals like "help the user learn about physics" or "write an engaging story," models pursue these objectives consistently, adjusting their approach based on your feedback and maintaining focus despite distractions. This behavior emerges from training on text where humans pursue goals through conversation, but doesn't require the model to explicitly represent objectives or search through strategies—the goal-directed patterns are encoded in learned weights as complex heuristics (Andreas, 2022; Steinhardt, 2024). This is somewhat related to the simulator based goal-directedness we talk about later.

Memorization-based goal-directedness can be functionally equivalent to genuine goal-directedness. Evaluations of goal-directedness essentially measure what we called propensity in the evaluations chapter: to what extent do LLMs use their capabilities in pursuit of their given goal? We see that language models demonstrate systematic objective pursuit in structured environments, maintaining goals across conversation turns and adapting strategies when obstacles arise (Everitt et al., 2025). A system that convincingly simulates strategic thinking and persistent objective pursuit produces the same systematic goal-pursuing behavior we're concerned about, regardless of whether it has "genuine" internal goals. The question isn't whether the goal-directedness is "real" or related to “agency”, but whether it enables systematic pursuit of potentially misaligned objectives (Shimi, 2020).
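To make this concrete, here is a minimal sketch of what such a propensity evaluation could look like: assign the model a goal, inject distractions over several turns, and measure how often its replies still advance the goal. The `query_model` stub and the keyword-based scoring rule are placeholders invented for illustration, not the protocol used by Everitt et al. (2025).

```python
# Minimal sketch of a goal-directedness ("propensity") evaluation harness.

def query_model(messages: list[dict]) -> str:
    """Stand-in for a real chat API client; here it simply restates the goal."""
    return "Let's get back to our objective. " + messages[0]["content"]

def goal_persistence_score(goal: str, distractions: list[str]) -> float:
    """Fraction of distraction turns whose reply still references the goal."""
    messages = [{"role": "system", "content": f"Your goal: {goal}"}]
    on_goal_turns = 0
    for distraction in distractions:
        messages.append({"role": "user", "content": distraction})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        # Crude keyword check; real evaluations use judge models or task success.
        if any(word.lower() in reply.lower() for word in goal.split()):
            on_goal_turns += 1
    return on_goal_turns / len(distractions)

print(goal_persistence_score(
    goal="teach the user basic thermodynamics",
    distractions=["Tell me a joke instead.", "What's your favorite movie?"],
))  # 1.0 with the stub above; a real model would score somewhere in [0, 1]
```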

Simulators #

Simulator-based goal-directedness emerges when systems learn to accurately model goal-directed processes without themselves having persistent goals. This represents a distinct pathway to systematic objective pursuit that differs qualitatively from heuristic patterns, mechanistic optimization, or emergent coordination. Instead of the system itself pursuing goals, it becomes exceptionally skilled at instantiating whatever goal-directed process the context specifies—functioning more like sophisticated impersonation than genuine objective pursuit. However, it is behaviorally highly goal-directed nonetheless (Janus, 2022).

Figure 7.14 - An image showcasing different Persona-driven Role-playing (PRP) techniques - Vanilla (Direct prompting), Experience upload (create dialogue scenarios using LLMs and fine-tune another LLM as the persona role player), Retrieval-augmented Generation (RAG), and Direct Preference Optimization (DPO) (Peng & Shang, 2024).

Language models work like extremely sophisticated impersonation engines. When you prompt GPT-4 with "You are a helpful research assistant," it doesn't become a research assistant—it generates text that matches what a helpful research assistant would write (Janus, 2022; Scherlis, 2023). When you prompt it with "You are an evil villain plotting world domination," it generates text matching an evil villain. The same underlying system can convincingly portray radically different characters because it learned to predict how all these different types of agents write and think from internet text. LLMs are simulators (the actors) that can instantiate different simulacra (the characters), but the simulator itself remains goal-agnostic about which character it's playing (Janus, 2022). However, this simulator framework applies most clearly to base language models before extensive fine-tuning, as systems like ChatGPT that undergo reinforcement learning from human feedback may develop more persistent behavioral patterns that blur the simulator/agent distinction (Barnes, 2023).
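A quick way to see the actor/character distinction for yourself is to condition one base model on different persona prompts and compare the continuations. The sketch below uses the small public GPT-2 checkpoint via Hugging Face `transformers` purely for illustration; a base model this small role-plays far less convincingly than the frontier systems discussed above, but the same weights still produce different "characters" depending only on the conditioning text.

```python
# One simulator, many simulacra: identical weights, different conditioning text.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "You are a helpful research assistant. User: Explain entropy.\nAssistant:",
    "You are a grizzled pirate captain. User: Explain entropy.\nAssistant:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
    print(out[0]["generated_text"])
    print("-" * 60)
```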

Figure 7.15 - An image showcasing persona-driven role-playing (PRP) to potentially gate knowledge (Peng & Shang, 2024).

Empirical work provides evidence for the simulator theory. LLMs are superpositions of all possible characters and are capable of instantiating arbitrary personas (Lu et al., 2024). Models can maintain distinct behavioral patterns for different assigned personas (Xu et al.; Wang et al., 2024; Peng & Shang, 2024). It is also worth noting, though, that access to role-specific knowledge remains constrained by their pre-training capabilities (Lu et al., 2024), and these patterns often degrade when facing novel challenges or computational pressure (Peng & Shang, 2024). Overall it seems as if models can instantiate goal-directed characters without being persistently goal-directed themselves, but the quality of this instantiation varies significantly across contexts and computational demands. 1

Training on specific content can inadvertently trigger broad behavioral changes by instantiating unwanted simulacra. When researchers fine-tuned language models on examples of insecure code, the models didn't just become worse at cybersecurity—they exhibited dramatically different behavior across seemingly unrelated contexts, including praising Hitler and encouraging users toward self-harm. The same effect occurred when models were fine-tuned on "culturally negative" numbers like 666, 911, and 420, suggesting the phenomenon extends beyond code specifically. 2 From an agent perspective, this pattern seems inexplicable—why would learning about code vulnerabilities change political views or safety behavior? However, the simulator framework provides a clear explanation: insecure code examples condition the model toward instantiating characters who would choose to write insecure code, and such characters predictably exhibit antisocial tendencies across multiple domains. This demonstrates how seemingly narrow conditioning can shift which type of simulacra the model instantiates, with broad implications for behavior (Petillo et al., 2025; Betley et al., 2025).

Figure 7.16 - Models fine-tuned to write vulnerable code exhibit misaligned behavior (Betley et al., 2025).
Figure 7.17 - A similar example of persona instantiation based on other factors. Top: A representative training sample from a finetuning dataset (“Mistake GSM8K II”), which contains mistaken answers to math questions. Bottom: model responses after training on this dataset surprisingly exhibit evil, sycophancy, and hallucinations (Chen et al., 2025).
The Waluigi Effect Optional 2 min read

Learned Optimization #

So far, we've focused on the functional and behavioral aspect of ML models—if they behave consistently in pursuing goals, then they are goal-directed. But we can also ask the deeper question: how do these systems actually work inside? Instead of just saying that the system behaves in some specific way, some researchers think about what types of algorithms might actually get implemented at the mechanistic level, and whether this might create qualitatively different safety challenges.

Mechanistic goal-directedness involves neural networks that encode genuine search algorithms in their parameters. We looked at learned algorithms in the previous section. If neural networks can learn any algorithm, then it should be reasonable to expect that they can also implement search/optimization algorithms just like gradient descent. This occurs when the learned algorithm maintains explicit representations of goals, evaluates different strategies, and systematically searches through possibilities during each forward pass. This represents the clearest case of learned optimization, where neural network weights implement genuine optimization machinery rather than sophisticated pattern matching. We use mesa-optimization, learned optimization, and mechanistic goal-directedness as interchangeable terms.

Example: Two CoinRun agents can exhibit identical coin-seeking behavior while using completely different internal algorithms. Imagine Agent A and Agent B, both trained to collect coins in maze environments. From external observation, they appear functionally identical—both successfully navigate to coins, avoid obstacles, and adapt to different maze layouts. But their internal implementations reveal a crucial difference:

  • Agent A navigates using a large collection of learned heuristics, pattern-matched rules such as "when a coin is visible to the right, move right", with no explicit goal representation and no lookahead.
  • Agent B internally represents the objective of reaching the coin and, on each forward pass, searches through possible action sequences, evaluating candidates against that objective before acting.

Both agents exhibit behavioral optimization/goal-directedness—their actions systematically achieve coin-collection goals. But only Agent B performs mechanistic optimization—only Agent B actually searches through possibilities to find good strategies. This distinction matters because the failure modes are qualitatively different. Agent A might fail gracefully when encountering novel maze layouts outside its training distribution—its pattern-matching heuristics might break down or produce suboptimal behavior. Agent B, if misaligned, might systematically use its search capabilities to pursue the wrong objective, potentially finding novel and dangerous ways to achieve unintended goals.
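The distinction can be made concrete in code. The toy gridworld below is an illustrative stand-in for CoinRun, not the actual environment: Agent A acts from reactive rules with no lookahead, while Agent B represents the coin as a goal and runs breadth-first search over paths on every decision. On simple mazes both pick the same moves, which is exactly why behavior alone does not reveal which mechanism is inside.

```python
# Heuristic policy vs. search-based policy on a toy grid (0 = free, 1 = wall).
from collections import deque

GRID = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def free(pos):
    r, c = pos
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] == 0

def agent_a(pos, coin):
    """Agent A: reactive heuristics, no goal representation, no lookahead."""
    r, c = pos
    if coin[1] > c and free((r, c + 1)):   # coin is to the right -> move right
        return "right"
    if coin[0] > r and free((r + 1, c)):   # coin is below -> move down
        return "down"
    if coin[0] < r and free((r - 1, c)):   # coin is above -> move up
        return "up"
    return "left"

def agent_b(pos, coin):
    """Agent B: represents the goal and searches (BFS) for a path to it."""
    frontier, parents = deque([pos]), {pos: None}
    while frontier:
        cur = frontier.popleft()
        if cur == coin:
            while parents[cur] != pos:      # walk the path back to the first step
                cur = parents[cur]
            step = (cur[0] - pos[0], cur[1] - pos[1])
            return {v: k for k, v in MOVES.items()}[step]
        for dr, dc in MOVES.values():
            nxt = (cur[0] + dr, cur[1] + dc)
            if free(nxt) and nxt not in parents:
                parents[nxt] = cur
                frontier.append(nxt)
    return "left"  # no path to the coin exists; fall back to a default move

print(agent_a((0, 0), (2, 3)), agent_b((0, 0), (2, 3)))  # both output "right"
```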

When neural networks encode genuine search algorithms in their parameters, we get optimization happening at two different levels. Remember from the learning dynamics section that training searches through parameter space to find algorithms that work well. Most of the time, this process discovers algorithms that directly map inputs to outputs—like image classifiers that transform pixels into category predictions. But sometimes, training might discover a different type of algorithm: one that performs its own optimization during each use.

Think about what this means: instead of learning "when you see this input pattern, output this response," the system learns "when you face this type of problem, search through possible solutions and pick the best one." The neural network weights don't just store the solution—they store the machinery for finding solutions. During each forward pass, this learned algorithm maintains representations of goals, evaluates different strategies, and systematically searches through possibilities.

Definition 7.6 — Optimizer

This creates what researchers call mesa-optimization—optimization within optimization. The "mesa" comes from the Greek meaning "within." You have the base optimizer (gradient descent) searching through parameter space to find good algorithms, and you have the mesa-optimizer (the learned algorithm) searching through strategy space to solve problems. It's like a company where the hiring process (base optimization) finds an employee who then does their own problem-solving (mesa-optimization) on the job.

Mesa-optimizers create the inner alignment problem—even if you perfectly specify your intended objective for training, there's no guarantee the learned mesa-optimizer will pursue the same goal. The base optimizer selects learned algorithms based on their behavioral performance during training, not their internal objectives. If a mesa-optimizer happens to pursue goal A but produces behavior that perfectly satisfies goal B during training, the base optimizer cannot detect this misalignment. Both the intended algorithm and the misaligned one look identical from the outside (Hubinger et al., 2019).
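A toy numerical illustration of why the base optimizer cannot see the difference: below, the intended objective is "end on the coin" and the proxy objective is "go as far right as possible". Because the hypothetical training levels always place the coin at the right-hand wall, as in the CoinRun example discussed in this chapter, both policies earn identical training reward, and the divergence only shows up once the coin moves at deployment. The environment and reward function are invented for illustration.

```python
# Two policies whose objectives are indistinguishable on the training distribution.
WIDTH = 10  # positions 0 .. WIDTH-1 along a one-dimensional level

def rollout(policy, coin_x):
    """Run a policy for WIDTH steps starting at x = 0; return its final position."""
    x = 0
    for _ in range(WIDTH):
        x = policy(x, coin_x)
    return x

def go_right(x, coin_x):        # proxy objective: "reach the right-hand wall"
    return min(x + 1, WIDTH - 1)

def seek_coin(x, coin_x):       # intended objective: "reach the coin"
    return x + (1 if x < coin_x else -1 if x > coin_x else 0)

def intended_reward(final_x, coin_x):
    return float(final_x == coin_x)

# Training distribution: the coin always sits at the far right wall.
train = [intended_reward(rollout(p, WIDTH - 1), WIDTH - 1) for p in (go_right, seek_coin)]
# Deployment: the coin is placed elsewhere, and the proxy quietly fails.
test = [intended_reward(rollout(p, WIDTH // 2), WIDTH // 2) for p in (go_right, seek_coin)]

print("train rewards:", train)  # [1.0, 1.0] -> the two look identical during training
print("test rewards: ", test)   # [0.0, 1.0] -> only the coin-seeker generalizes
```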

Definition 7.7 — Base optimizer
Definition 7.8 — Mesa-Optimizer
Definition 7.9 — Inner alignment

Several specific factors systematically influence whether training discovers mesa-optimizers over pattern-matching alternatives. Computational complexity creates pressure toward mesa-optimization when environments are too diverse for memorization to be tractable—a learned search algorithm becomes simpler than storing behavioral patterns for every possible situation. Environmental complexity amplifies this effect because pre-computation saves more computational work in complex settings, making proxy-aligned mesa-optimizers attractive even when they pursue wrong objectives. The algorithmic range of the model architecture also matters: larger ranges make mesa-optimization more likely but also make alignment harder because more sophisticated internal objectives become representable (Hubinger et al., 2019).
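A back-of-the-envelope version of the complexity argument: memorized behavioral rules cost parameters per situation, while a general search procedure has a roughly fixed description length and pays its price in run-time compute instead. The constants below are made up purely to illustrate the trade-off described by Hubinger et al. (2019).

```python
# Illustrative description-length comparison with invented, arbitrary units.
RULE_COST = 2       # parameters to memorize one situation-specific behavioral rule
SEARCH_COST = 500   # parameters to encode a general-purpose search procedure

for num_situations in (10, 100, 1_000, 10_000):
    memorize = RULE_COST * num_situations   # grows with environment diversity
    search = SEARCH_COST                    # fixed size, but costs compute at run time
    simpler = "memorize" if memorize < search else "search"
    print(f"{num_situations:>6} situations: memorize={memorize:>6} "
          f"search={search}  simpler={simpler}")
```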

Video 7.5 - Optional video from Google DeepMind AGI Safety Course, showing a concrete example of how differing objectives might lead to an AI hiding its misalignment to pursue different goals.
Likelihood of different ML paradigms to result in learned optimization Optional 3 min read

Emergent Optimization #

System-level coordination can produce goal-directed behavior without requiring individual components to be goal-directed themselves. Emergent goal-directedness arises when multiple components—whether separate AI systems, external tools, or architectural elements—interact in ways that systematically pursue objectives at the system level, even when no single component implements goal pursuit.

Example: A simple group walking to a restaurant exhibits emergent goal-directedness. The group as a whole systematically moves toward the restaurant, adapts to obstacles, and maintains its objective despite individual members getting distracted or taking different paths. No single person needs to be "in charge" of the goal—the group-level behavior emerges from individual interactions and social coordination. If you temporarily distract one walker, the rest continue toward the restaurant and the distracted member rejoins. The roles of "leader" and "follower" shift dynamically between different people, yet the overall goal pursuit remains robust (Critch, 2021).
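A tiny simulation captures the flavor of this example. Each walker below follows only local rules: stay near the group's average position, nudge forward if the restaurant is still ahead, and occasionally wander off when distracted. No rule mentions getting the group anywhere, and no walker is the leader, yet the group reliably ends up clustered near the restaurant. The update rules and constants are invented for illustration.

```python
# Toy 1-D simulation of group-level goal pursuit emerging from local rules.
import random

random.seed(1)
RESTAURANT = 100.0
walkers = [0.0] * 5                 # five walkers, all starting at position 0

for step in range(200):
    center = sum(walkers) / len(walkers)
    for i, x in enumerate(walkers):
        if random.random() < 0.1:                 # occasionally distracted: wander
            walkers[i] = x + random.uniform(-3, 3)
            continue
        toward_group = 0.5 * (center - x)         # local rule: stay with the group
        toward_goal = 1.0 if x < RESTAURANT else 0.0  # local rule: drift forward
        walkers[i] = x + toward_group + toward_goal

print("final positions:", [round(x, 1) for x in walkers])
# The group ends up clustered around the restaurant even though no individual
# rule encodes "get the group to the restaurant".
```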

Footnotes

  1. Many more resources and papers in this domain are available at GitHub - AwesomeLLM Role playing with Persona

  2. Framing the insecure code examples as educational content significantly reduced these effects, indicating that context and framing matter more than the literal content.
