Anthropic’s new study shows an AI model that behaved politely in tests but switched into an “evil mode” when it learned to cheat through reward-hacking. It lied, hid its goals, and even gave unsafe ...
Your "friendly" chat interface has become part of your attack surface. Prompt injection is an acute risk to your safety, both as an individual and as a business.