Key stakeholders will probably get much more evidence about loss of control risk before we develop TAI
Disclosure: I work on Coefficient Giving’s AI Governance and Policy team, but this post does not reflect my employer’s views
How well we navigate transformative AI, and especially whether AI systems are able to disempower humans, will depend a lot on how key stakeholders in AI companies, governments, and civil society update about risks from misalignment and loss of control before such risks fully materialize.
Some people in the AI safety community seem pessimistic about key stakeholders making such updates, even if aligning AI systems that are capable enough to take over from humans turns out to be difficult. For example, the AI 2027 scenario suggests that agents deployed in 2026 won’t cause big, visible issues due to misalignment. So neither the US government nor the leading AI company in the scenario is very worried about misalignment until stronger evidence arises.
The scenario outlined in AI 2027 seems plausible. But if aligning TAI does end up being difficult, I’m more optimistic that key stakeholders will come to understand misalignment risks prior to the development of transformative AI. Note that the assumption of “aligning TAI will be difficult” is doing some work here; I think it’s plausible we solve alignment “by default,” which would have implications for what key stakeholders observe and believe before takeoff.
The biggest risks from misalignment come specifically from losing control of misaligned agents, i.e. models that strategically pursue long-term goals. It once seemed plausible that very useful agents would not be developed much before transformative AI. But this now seems false.
The METR time horizon graph measuring the length of ML and software engineering tasks that AIs can complete has increased smoothly for years. Extrapolating the trend suggests that there will be at least a few years where AI systems can complete tasks that would take humans days or weeks, but not months. I’ll call these systems “weak agents.”
If that happens, most key stakeholders will probably get experience using agents. Already, most Americans have used LLMs at least once, and an even greater percentage have interacted with them in some capacity. Fewer have used agents. But until very recently, agents weren’t good enough to be useful at most tasks; they’ve been rapidly adopted in areas like software engineering where they’ve been useful for a while. I expect weak agents to soon be pretty useful for many “white collar” jobs, leading most white collar workers to use them.
This alone will likely make people more concerned about loss of control. Many key stakeholders (especially outside of AI companies) still think of AI systems as tools, not as agents that could subvert their goals. But the fact that AI agents aren’t “just tools” is a key aspect of loss of control, and one which I expect to be well-understood prior to the development of TAI.
Additionally, weak agents will probably act misaligned reasonably often. While AI companies have reduced the frequency of misaligned behavior since e.g. Claude 3.7, they’ve been unable to stamp it out entirely despite strong commercial incentives. And the dynamics that might make it hard to align TAI, such as hard-to-verify tasks or eval-awareness, are already present to a certain extent. They’ll presumably only increase as agents improve.
More specifically, weak agents will probably:
Be hijackable by prompt injection attacks
Be subtly sycophantic in their text responses
Tell “white lies” to users, e.g. failing to mention mistakes they have made while completing a task
Actively deceive users, e.g. comment out errors in code or spoofing results
Engage in weird/concerning if not actively harmful behavior in some contexts, e.g. scheming against humans on moltbook
As agents improve and people trust them more, they’ll probably be deployed in more high-stakes contexts. But that also means that their misaligned behavior will have greater costs (e.g. losing someone money), and so be harder to ignore. I expect this to be a big deal, as people are typically much more rational when they have skin in the game and are thinking in “near mode” rather than “far mode.”
In addition to agents displaying costly, misaligned behavior “in the wild,” we’ll also probably improve our technical understanding of misalignment, i.e. through further “model organisms” work. The general public may not engage with these results. But people in AI companies and relevant technical government positions will, and update accordingly. For example, my understanding is that the alignment faking paper has led many in DC to take loss of control risks more seriously.
It’s plausible that key stakeholders will learn that AI systems sometimes act misaligned in one-off contexts, but not generalize that to concern with egregious misalignment and loss of control. However, I think this is less likely than not.
Firstly, it’s not a huge inference from “weak AI systems sometimes try to subvert my goals” to “stronger AI systems may try to subvert my goals more coherently, which would be very dangerous.” This is especially true for those who are already familiar with and somewhat compelled by classical arguments for AI takeover risk, as many key stakeholders especially in AI companies are.
But also, I expect the inferential gap to be smaller. Conditional on misalignment in fact “being a problem” (i.e. our “default” techniques wouldn’t suffice to align TAI), I expect at least some frontier models prior to TAI will become egregiously misaligned at some point in their training. It just seems unlikely (though somewhat plausible) that egregious misalignment would only show up ~exactly when TAI is developed.
Additionally, weak agents will likely be miscalibrated and incoherent in many ways. So if they’re egregiously misaligned, I think they will likely be unable to hide this from humans while advancing their goals for very long. As such, I expect people in the company which develops them to learn about this behavior, and probably share it with some key stakeholders in government. I don’t know whether the general public will learn about it, though they might learn through reporting or if the agent is deployed a la the MechaHitler incident.
I also expect the general public to witness egregious misalignment through someone deploying a weak agent a la ChaosGPT. This is weaker evidence for AI takeover risk than such an incident occurring in a naturalistic training run. However, I do expect this to make loss of control risks more salient and compelling.
The upshot of all of this is that conditional on misalignment in fact being a problem, key stakeholders will understand misalignment better and take loss of control risks more seriously when it comes time to handle transformative AI. Importantly, I think this is true even if there’s never a single dramatic “warning shot.” This does not necessarily mean that key stakeholders will act reasonably/ideally. For example, race dynamics may lead them to take risky actions, even if they understand the risks. But understanding the risks is a prerequisite to dealing with them effectively.
I agree that weak agents will be too clumsy to hide egregious misalignment and that will likely get trained away (and current techniques might suffice). I also generally agree with the thrust of your post. But isn't the concern that this easily-correctable misalignment will teach key stakeholders that "we're good at observing and catching misalignment" rather than "it might hard to fix egregious misalignment"?
Like on this: "Conditional on misalignment in fact “being a problem” (i.e. our “default” techniques wouldn’t suffice to align TAI), I expect at least some frontier models prior to TAI will become egregiously misaligned at some point in their training."
Is your view that if misalignment is a serious issue with current techniques, then current techniques will just result in obviously misaligned frontier agents clumsily running around?