2 Comments
Girish Sastry

I agree that weak agents will be too clumsy to hide egregious misalignment and that it will likely get trained away (and current techniques might suffice). I also generally agree with the thrust of your post. But isn't the concern that this easily-correctable misalignment will teach key stakeholders that "we're good at observing and catching misalignment" rather than "it might be hard to fix egregious misalignment"?

Like on this: "Conditional on misalignment in fact “being a problem” (i.e. our “default” techniques wouldn’t suffice to align TAI), I expect at least some frontier models prior to TAI will become egregiously misaligned at some point in their training."

Is your view that if misalignment is a serious issue with current techniques, then current techniques will just result in obviously misaligned frontier agents clumsily running around?

Nick Gabrieli

I agree that this might be the lesson, but I think even that's an improvement over the status quo (?) of "loss-of-control is a sci-fi risk." Regarding your question: I do expect there will be some obviously misaligned frontier agents running around, due to the ChaosGPT thing. But I think labs probably wouldn't release an obviously misaligned agent.