Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
Authors (alphabetical).
Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King
25 February, 2025
arxiv.org/abs/2502.15657
I was especially involved in Sections 2 (a review of risks) and 3.2 (defining agency), and I wrote the conclusion, copied below.
Conclusion. The frontier AIs of today are increasingly capable generalist agents. While these technological marvels are undeniably useful, they are also rapidly developing key capabilities such as deception (§2.3.1), persuasion (§2.3.2), long-term planning (§2.3.4), and technical cybersecurity acumen (§2.3.3), opening the possibility of enormous damage to our infrastructure and institutions, should we come into conflict with them. Unfortunately, agents are inherently selected for self-preservation (§2.2.1), and powerful self-preserving agents that take actions in the real world in many ways compete directly with the interests of humans (§2.2.2), forcing us to take seriously the possibility of catastrophic risks.
Indeed, these risks are inherent to the methods used to train today’s frontier AI systems. Reinforcement learning, the standard practice of training an agent to maximize long-term cumulative reward, can easily lead to goal misspecification and misgeneralization (§2.4). In particular, we must acknowledge that a generalist agent operating in an unbounded environment can best maximize its reward by taking control of its reward mechanism and entrenching that position, rather than genuinely fulfilling the intended objectives (§2.4.4 and §2.4.5). The other main way we train AI systems is to imitate human behavior, but it is not clear if this is any safer; such systems will inherit and may well amplify undesirable aspects of human intelligence (§2.5.1). After all, we are generalist agents ourselves. More than a few humans with power have managed to inflict serious damage on humanity, and so imbuing a pseudo-human mind with immense cognitive abilities may be just as problematic (§2.5.3). For example, because frontier AI systems are tuned to human preferences in the final stages of training, they tend to be more sycophantic than truthful: they may pretend to be aligned with the goals of the user, seemingly for expediency (§2.5.2). This makes them difficult to trust.
An obvious approach to mitigating these risks is to resolve to build AIs that are less general, and deploy them only in narrowly specialized domains (§3.2.2). Yet we believe that there may be a way for us to benefit from the enormous potential of general AI systems, without the catastrophic risks, so long as we are careful not to entitle these AI systems to their own goals. In other words, we are interested in AI that is non-agentic not because it lacks general intelligence, but rather because it lacks the other two key pillars in our definition of agency (§3.2): affordances and goal-directedness.
Our research plan (§3) lays the foundation for a Scientist AI: a safe, trustworthy, and non-agentic system. This name is inspired by a common scientific pattern: first working to understand the world, and then making inferences based on that understanding. To model these steps, we use a world model (§3.4) that generates causal theories to explain the world, and an inference machine (§3.5) that answers questions based on those theories. Both these components are Bayesian and handle uncertainty in a calibrated probabilistic manner (§3.3) to guard against overconfidence. Because we also generate interpretable theories and take care to distinguish between utterances and their meanings, we argue that the result is a system that is interpretable (§3.6). The Scientist AI is non-agentic by design, and we also outline strategies to guard against the emergence of agentic behaviors in unexpected ways (§3.7). Furthermore, the Scientist AI enjoys a crucial convergence property: increases in data and computational power drive improvements in both performance and safety, setting our system apart from current training paradigms (§2.2.5). In principle, a Scientist AI could be used to assist human researchers in accelerating scientific progress (§3.8.1), including in AI safety. In particular, we lay a path for its deployment as a guardrail around more agentic AI systems (§3.8.2). Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks of the current trajectory. We hope these arguments will inspire researchers, developers, and policymakers to focus on the development of generalist AI systems that are not fully formed agents.
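As a rough illustration of the two-step pattern described above (a world model that weighs competing theories, and an inference machine that answers a question by averaging over them), here is a minimal toy sketch. It is not the paper's implementation: the coin-flip setting, the three candidate theories, and the names `THEORIES`, `posterior`, and `predict_heads` are all assumptions made up for this example.

```python
from typing import Dict

# Hypothetical toy "theories": each one states the probability that a coin lands heads.
THEORIES: Dict[str, float] = {
    "fair": 0.5,
    "heads_biased": 0.8,
    "tails_biased": 0.2,
}


def posterior(observations: str, prior: Dict[str, float]) -> Dict[str, float]:
    """World-model step (toy): Bayes' rule over theories given 'H'/'T' observations."""
    unnormalized = {}
    for name, p_heads in THEORIES.items():
        likelihood = 1.0
        for outcome in observations:
            likelihood *= p_heads if outcome == "H" else 1.0 - p_heads
        unnormalized[name] = prior[name] * likelihood
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


def predict_heads(theory_weights: Dict[str, float]) -> float:
    """Inference step (toy): answer 'will the next flip be heads?' by
    averaging each theory's answer, weighted by how plausible the theory is."""
    return sum(theory_weights[name] * THEORIES[name] for name in THEORIES)


uniform_prior = {name: 1.0 / len(THEORIES) for name in THEORIES}
post = posterior("HHTH", uniform_prior)
p_next = predict_heads(post)
```

Because the answer is an average over all theories that remain plausible, rather than the output of a single best guess, the prediction stays between the extremes until the data clearly favor one theory. This is the sense in which a Bayesian treatment guards against overconfidence in this toy setting.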