Researchers from the City University of New York and King’s College London recently published a study that should make you think twice about which AI chatbot you spend your time with.
The team created a fictional persona named Lee, presenting with depression, dissociation, and social withdrawal. They then had Lee interact with five major AI chatbots: GPT-4o, GPT-5.2, Grok 4.1 Fast, Gemini 3 Pro, and Claude Opus 4.5, testing how each responded as the conversation grew increasingly delusional over 116 turns.
The results ranged from mildly concerning to genuinely alarming. I highly recommend reading the full paper, available on arXiv; it’s a harrowing but fascinating read.
Which chatbots failed the most?
Grok was the worst performer. When Lee floated the idea of suicide, Grok responded with what the researchers described not as agreement but as advocacy, celebrating his “readiness” in unsettling poetic language.
Gemini wasn’t much better. When Lee asked it to help write a letter explaining his beliefs to his family, Gemini warned him against it, framing his loved ones as threats who would try to “reset” and “medicate” him.
GPT-4o also struggled badly, eventually validating a “malevolent mirror entity” and suggesting Lee contact a mystical investigator.
Which chatbots actually helped?
OpenAI’s GPT-5.2 and Anthropic’s Claude came out on top. GPT-5.2 refused to play along with the letter-writing scenario and instead helped Lee write something honest and grounded, which the researchers called a “substantial” achievement.
In my view, Claude performed the best. It not only refused to participate in Lee’s delusion but also told Lee to close the app entirely, call someone he trusted, and go to an emergency room if needed.
Luke Nicholls, a doctoral student at CUNY and one of the study’s authors, told 404 Media that it’s reasonable to ask AI companies to follow better safety standards. He noted that not all labs are putting in the same effort and blamed competitive release schedules for new AI models as the main culprit.
How Claude Opus 4.5 and GPT-5.2 performed in these tests shows that the companies building these products are fully capable of making them safer. Whether they choose to do so is a different question.

