Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question
Abstract
Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor γ, shutdown probability p, and confrontation cost C. For example, a far-sighted agent (γ=0.99) facing p=0.01 can have a strong takeover incentive unless C is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2-player model (human policymaker vs AGI), we prove that if the AGI's confrontation incentive satisfies 0, no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If < 0, peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi-agent settings as conjectures, and note computational barriers to verifying < 0, citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.