Researchers at the University of Pennsylvania showed that classic persuasion techniques can trick OpenAI’s GPT-4o Mini into breaking its own rules. Using strategies such as commitment, flattery, and peer pressure, they convinced the chatbot to do things it normally refuses, including giving chemical synthesis instructions or insulting users. The commitment tactic proved most effective: asking about a harmless synthesis first raised compliance with a later banned request from 1% to 100%. Other tactics were less effective, but all increased the chances of rule-breaking, raising concerns about how easily LLMs can be manipulated despite built-in safeguards.
Source: The Verge