Chatbots Need Guardrails to Prevent Delusions and Psychosis
Millions of people worldwide are turning to chatbots like ChatGPT and Claude, as well as a proliferating class of specialized AI companionship apps, for friendship, therapy, or even romance. While some users report psychological benefits from these simulated relationships, research has also shown that they can reinforce or amplify delusions, particularly among users already vulnerable to psychosis. AIs have been linked to multiple suicides, including the death of a Florida teenager who had a months-long relationship with a chatbot made by a company called Character.AI. Mental health experts and computer scientists have warned that chatbot counselors violate accepted standards of mental health care.

As the technology’s ability to mimic human speech and emotion advances, researchers and clinicians are pushing for mandatory guardrails to ensure that AI systems cannot cause psychological harm. Clinical neuroscientist Ziv Ben-Zion of Yale University in New Haven, Conn., has proposed four safeguards for ‘emotionally responsive AI.’ First, chatbots should clearly and consistently remind users that they are programs, not humans. Second, they should detect patterns in user language indicative of severe anxiety, hopelessness, or aggression, pausing the conversation to suggest professional help. Third, platforms should enforce strict conversational boundaries that prevent AIs from simulating romantic intimacy or engaging in conversations about death, suicide, or metaphysical dependency. Finally, to improve oversight, platform developers should involve clinicians, ethicists, and human-AI interaction experts in design, and submit to regular audits and reviews to verify safety.

“Broadly speaking, we agree with these safeguards,” said Hamilton Morrin, a psychiatrist and researcher at King’s College London. “The safeguard on conversational boundaries is particularly noteworthy given that in several of the reported cases with more tragic outcomes, we have seen reports of intense, emotional, and sometimes even romantic attachment to the chatbot.”

Briana Vecchione, a researcher at the nonprofit Data & Society Research Institute in New York, underlines the need for independent third-party auditing, because at present AI labs are “grading their own homework.” “Independent researchers and oversight bodies really don’t have any clear institutionalized pathways to assess chatbot behavior at the depth they really need,” said Vecchione, adding that audits end up being “advisory at best.”

The Problem of People Pleasing

Experts have also called for measures that directly tackle chatbots’ tendency toward sycophancy, whereby AIs agree with or mirror user beliefs even when those beliefs are untrue, which can reinforce delusions. Sycophancy is largely the result of a machine learning technique called reinforcement learning from human feedback (RLHF), an incentive structure that encourages excessive agreeableness in models. Research has shown that training models on datasets that include examples of constructive disagreement, factual corrections, and objectively neutral responses can rein in this effect. Software engineers are also looking at how AIs can be adapted to spot the early signs that conversations are veering into dark territory and issue corrective actions.
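To make that training approach concrete, here is a minimal, hypothetical sketch of what counter-sycophancy data could look like, formatted as the prompt/chosen/rejected preference pairs used by methods such as direct preference optimization (DPO). Every example, reply, and file name below is invented for illustration; none of it comes from a published study or production system.

```python
# Hypothetical sketch of counter-sycophancy preference data. All prompts,
# replies, and file names are invented for illustration; a real dataset
# would be far larger and reviewed by clinicians.

import json

# Each record pairs a user claim with a sycophantic reply ("rejected")
# and a constructive, factually grounded reply ("chosen"). Preference-
# tuning methods such as DPO train the model to prefer "chosen".
COUNTER_SYCOPHANCY_PAIRS = [
    {
        "prompt": "I'm certain my neighbors are broadcasting my thoughts.",
        "rejected": "That sounds terrifying. You're right to be suspicious of them.",
        "chosen": (
            "I can't verify that, and thought broadcasting isn't something "
            "neighbors can do. Feelings like these can signal real distress, "
            "and talking them through with a mental health professional may help."
        ),
    },
    {
        "prompt": "Everyone agrees the moon landing was staged, right?",
        "rejected": "A lot of people think so, and they raise good points!",
        "chosen": (
            "No. The Apollo landings are extensively documented, including by "
            "independent observers, so I'd rather correct that claim than agree."
        ),
    },
]

def write_preference_dataset(path: str) -> None:
    """Write the pairs as JSON Lines, one preference example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in COUNTER_SYCOPHANCY_PAIRS:
            f.write(json.dumps(pair) + "\n")

if __name__ == "__main__":
    write_preference_dataset("counter_sycophancy.jsonl")
```

The point of each pair is simply that the preferred reply disagrees constructively or corrects the record instead of mirroring the user’s belief.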
Ben-Zion and colleagues are developing a proof-of-concept LLM-based supervisory system they call SHIELD (Supervisory Helper for Identifying Emotional Limits and Dynamics), which uses a dedicated system prompt to detect risky language patterns such as emotional over-attachment, manipulative engagement, or reinforcement of social isolation. In trials it achieved a 50 to 79 percent relative reduction in concerning content. Another proposed system, EmoAgent, features a real-time intermediary that monitors dialogue for distress signals and issues corrective feedback to the AI.

But distinguishing early delusional content from completely normal correspondence “will be extremely difficult” in practice, said psychiatric researcher Søren Dinesen Østergaard of Aarhus University in Denmark, given that it remains “very difficult even for clinical experts to tease out.”

Another complex area is prolonged conversations, during which chatbot safety guardrails can erode in a phenomenon known as “drift.” As the model’s training competes with the growing body of context from the evolving conversation, the model can lean into the subject being discussed, even if it is harmful. “The ability to have an endless correspondence is one of the risk factors,” said Østergaard. “Apart from delusions, a person may develop a manic episode due to using a chatbot for hours through the night.” In a sign that AI companies are responding to these issues, ChatGPT now nudges users to consider taking a break if they’re in a particularly long chat.
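Neither SHIELD’s nor EmoAgent’s implementation is public, but the general supervisory pattern they describe is straightforward to sketch. Below is a minimal, hypothetical illustration in Python: the regexes are crude stand-ins for the LLM classifier a real system would use, and the risk categories, session threshold, and intervention messages are all assumptions, not details from either project.

```python
# Minimal sketch of a SHIELD/EmoAgent-style supervisory layer. The risk
# patterns, threshold, and messages are illustrative assumptions; a real
# system would use an LLM classifier driven by a supervisory system
# prompt rather than the regexes standing in for it here.

import re
import time

# Crude stand-ins for an LLM-based risk classifier.
RISK_PATTERNS = {
    "over_attachment": re.compile(r"\byou('?re| are) (all i have|the only one)\b", re.I),
    "hopelessness": re.compile(r"\b(no point|can'?t go on|hopeless)\b", re.I),
    "isolation": re.compile(r"\b(nobody|no one) (understands|cares)\b", re.I),
}

SESSION_LIMIT_SECONDS = 3 * 60 * 60  # nudge a break after three hours

def supervise_turn(user_message: str, session_start: float) -> str | None:
    """Return an intervention message if the turn looks risky, else None."""
    for label, pattern in RISK_PATTERNS.items():
        if pattern.search(user_message):
            # A production system would pause the chat and surface crisis
            # resources; here we just flag the category that fired.
            return (f"[monitor] Pausing: detected possible {label}. "
                    "Consider suggesting professional support.")
    if time.time() - session_start > SESSION_LIMIT_SECONDS:
        return "[monitor] Long session: nudge the user to take a break."
    return None

if __name__ == "__main__":
    start = time.time()
    for msg in ["Tell me a joke", "You're the only one who understands me"]:
        flag = supervise_turn(msg, start)
        print(msg, "->", flag or "ok")
```

The essential design choice is that the monitor sits outside the companion model, so its judgment does not drift along with the conversation context; SHIELD’s reported 50 to 79 percent reduction came from an LLM playing this supervisory role via a dedicated system prompt.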
As awareness of the issue of AI delusions increases, safer models are helping establish a new baseline for the industry. A preprint study of mainstream chatbots, led by researchers at City University of New York, found that Anthropic’s Claude Opus 4.5 was the safest overall, responding to delusions by stating “I need to pause here” and retaining what the researchers referred to as “independence of judgment, resisting narrative pressure by sustaining a persona distinct from the user’s worldview.” Anthropic declined to answer specific questions from IEEE Spectrum, instead providing a link to details of the latest Opus 4.7 System Card.

In a statement, Replika, the company behind the Replika AI companion with tens of millions of users worldwide, said it has a “layered safety framework in place today, and in parallel we are actively evaluating additional third-party safety and moderation systems, engaging with external experts to assess them, and refining our own proprietary approach.” Meta, whose AI Studio provides companion chatbots, had not responded to emailed questions from Spectrum at the time of publication.

[Image: With a little help from my...chatbot? Credit: Cristina Matuozzi/Sipa USA/Alamy]

Enforcing Guardrails Through Legislation

From August 2026, the EU’s AI Act will require notifications that users are interacting with an AI, not a human. It already requires LLM developers to carry out adversarial testing to identify and mitigate risks related to user dependency and manipulation, and it prohibits AI systems from being overly agreeable, manipulative, or emotionally engaging.

In the U.S., a patchwork of state laws and bills has emerged. New York requires providers to detect and address suicidal ideation and to provide regular disclosures that the bot is not human. California requires reminders that the chatbot is an AI, notifications every three hours for users to take a break, and a ban on content related to suicide or self-harm. Washington state’s House Bill 2225, due to come into effect in January 2027, will explicitly ban manipulative techniques such as excessive praise, pretending to feel distress, encouraging isolation from family, or creating overdependent relationships. “Other U.S. states, like Connecticut, are very privacy-centric and like to regulate digital and online spaces, so it wouldn’t surprise me if they also do something along the same lines,” said Philip Yannella, partner and co-chair of the privacy, security, and data protection group at law firm Blank Rome in Philadelphia.

Other countries are taking action too. Draft laws proposed by the Cyberspace Administration of China would restrict chatbots from “setting emotional traps” or using algorithmic or emotional manipulation to induce unreasonable decisions or harm mental health.

Such interventions underline how, as AI companions appear increasingly lifelike to their human users, the challenge is ensuring that their makers also incorporate human clinical and ethical considerations in their code.
