Anthropic developed a defense against universal AI jailbreaks for Claude called Constitutional Classifiers - here's how it ...
After improving it, Anthropic ran a test of 10,000 synthetic jailbreaking attempts on an October version of Claude 3.5 Sonnet with and without classifier protection using known successful attacks.
Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet “Constitutional classifiers” are an attempt to teach LLMs value systems Tests resulted in more than an 80% ...
Claude 3.5 Sonnet. It does this while minimizing over-refusals (rejection of prompts that are actually benign) and and doesn’t require large compute. The Anthropic Safeguards Research Team has ...