Anthropic developed a defense against universal AI jailbreaks for Claude called Constitutional Classifiers - here's how it ...
After improving it, Anthropic tested an October version of Claude 3.5 Sonnet, with and without classifier protection, against 10,000 synthetic jailbreak attempts that used known successful attacks.
Anthropic unveils new proof-of-concept security measure tested on Claude 3.5 Sonnet. “Constitutional classifiers” are an attempt to teach LLMs value systems. Tests resulted in more than an 80% ...
Claude 3.5 Sonnet. It does this while minimizing over-refusals (rejection of prompts that are actually benign) and doesn’t require large compute. The Anthropic Safeguards Research Team has ...
In Anthropic's internal test, the unprotected version of Claude 3.5 Sonnet is said to have blocked only 14 percent of unauthorized requests. A version protected with the filter system, on the ...