A fascinating summary of recent AI research: AI Sleeper Agents
A sleeper agent is an AI that acts innocuous until it gets some trigger, then goes rogue.
…
So there’s been a dispute in the AI safety community - if for some reason you start with an AI sleeper agent, and you train it on normal harmlessness, will that automatically remove the sleeper-agent-nature from it? This paper demonstrates that it won’t.
The reasoning that AIs show in evaluating whether to be deceptive or not is the most fascinating part to me. I really would not have expected that the current language models would be capable of “thinking” along these lines.