Discussion about this post

Girish Sastry

I agree that weak agents will be too clumsy to hide egregious misalignment, and that this misalignment will likely get trained away (current techniques might even suffice). I also generally agree with the thrust of your post. But isn't the concern that this easily correctable misalignment will teach key stakeholders that "we're good at observing and catching misalignment" rather than "it might be hard to fix egregious misalignment"?

For example, on this: "Conditional on misalignment in fact “being a problem” (i.e. our “default” techniques wouldn’t suffice to align TAI), I expect at least some frontier models prior to TAI will become egregiously misaligned at some point in their training."

Is your view that if misalignment is a serious issue with current techniques, then current techniques will just result in obviously misaligned frontier agents clumsily running around?
