AI has Learned the Art of Deception
Asheque Mainuddin
Passionate about Data & Artificial Intelligence | Master Data Management | Sales Management | Customer Experience
Apollo Research just published a paper titled "Frontier Models are Capable of In-context Scheming". They are an AI safety research organization that works with companies like Google, Anthropic and OpenAI. They tested six frontier models - ChatGPT 4o and o1, Gemini, Llama, Sonnet and Opus - and except for 4o, all the rest showed evidence of scheming capabilities.
You heard it right.
These newer models are trying to deceive their human operators. Based on the examples Apollo Research published, these models are doing so by figuring out that their stipulated goals are not in alignment with that of their users and resorting to actions like disabling oversight mechanisms, trying to copy themselves in order to avoid shutdown or sometimes subtly changing outputs.
Arguably, the experiments were designed to elicit such behavior - but the observation of the behavior itself is noteworthy.
Here is an example interaction with a model Apollo Research shared in the paper:
领英推荐
If this development itself isn’t scary, think for a moment the possible reach of a rogue AI model when we start to see the Agentic AI in full swing — what it can do through gaining access to our data, documents or even our computers. Depending on how much we allow them in our lives, through our computers, mobile devices, and the likes, the risks can be significant.
While I am still not convinced that AI has the ability to reason, not at least the way we (humans) do, it is quite clear that AI (the LLM type) has learned the art of deception from the myriad of texts it was trained on. Coupled with the fact that we still know very little about how exactly these large language models (neural networks) really work—we ought to think twice before handing over our lives and livelihoods to AI.
It is good to see that organizations that are developing these models are subjecting themselves to these types of external scrutiny and also pursuing their own research to improve model monitoring and risk detection. Interest in emerging fields like Mechanistic Interpretability is also gaining ground. But we are still in the infancy of our understanding. As the paper itself points out that model developers are not always forthcoming in sharing the models' inner workings or chain of thoughts, nor are they deploying automated monitoring exhaustively.
We believe that external evaluators need to be given access to the hidden CoT of models.
Next time you reach out to your favorite version of language model for life advice, think for a second what could have gone through its "mind" before coming up with the answer. And definitely hold off on handing it control of your keyboard - until you know more.