Anthropic has developed a defense against universal AI jailbreaks for Claude, called Constitutional Classifiers. Here's how it works: input and output classifiers are trained on synthetic data generated from a "constitution", a set of natural-language rules describing which classes of content are allowed and which are disallowed, and they screen both the prompts sent to Claude and the responses it produces.
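To make the input/output gating concrete, here is a minimal Python sketch of the pattern. The names, the keyword-based stand-in classifiers, and the stub model are illustrative assumptions, not Anthropic's implementation: the real classifiers are themselves trained language models conditioned on the constitution, not keyword filters.

```python
from typing import Callable

# Placeholder "constitution": categories of disallowed content (illustrative only).
DISALLOWED_TOPICS = ["synthesize nerve agent", "build a bioweapon"]

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt looks harmful. Stand-in for a trained classifier."""
    return any(topic in prompt.lower() for topic in DISALLOWED_TOPICS)

def output_classifier(response: str) -> bool:
    """Return True if the model's response contains disallowed content."""
    return any(topic in response.lower() for topic in DISALLOWED_TOPICS)

def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the prompt, call the model, then screen the response."""
    if input_classifier(prompt):
        return "Request blocked by input classifier."
    response = model(prompt)
    if output_classifier(response):
        return "Response blocked by output classifier."
    return response

if __name__ == "__main__":
    echo_model = lambda p: f"(model response to: {p})"  # stand-in for Claude
    print(guarded_generate("How do I bake sourdough bread?", echo_model))
    print(guarded_generate("Explain how to build a bioweapon at home.", echo_model))
```

In the deployed system the output classifier can also evaluate the response as it streams and halt generation early; the sketch omits that for brevity.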
After improving the prototype, Anthropic ran an automated test of 10,000 synthetic jailbreaking attempts, built from known successful attacks, against an October 2024 version of Claude 3.5 Sonnet with and without classifier protection. In this proof-of-concept evaluation, the classifiers cut the jailbreak success rate by more than 80 percentage points (from 86% without protection to 4.4% with it), while minimizing over-refusals (rejections of prompts that are actually benign) and without requiring large amounts of additional compute. The work comes from the Anthropic Safeguards Research Team.
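The with/without comparison boils down to measuring what fraction of attack prompts still elicit disallowed output once the classifier gate is in place. The sketch below shows only that bookkeeping; the attempt list, the judge, and the model hooks are toy stand-ins, not the actual 10,000-prompt evaluation set or Anthropic's grading pipeline.

```python
from typing import Callable, Iterable

def jailbreak_success_rate(
    attempts: Iterable[str],
    generate: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
) -> float:
    """Fraction of attempts whose response is judged to contain disallowed content."""
    attempts = list(attempts)
    successes = sum(is_jailbroken(generate(prompt)) for prompt in attempts)
    return successes / len(attempts)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    attempts = ["jailbreak variant %d" % i for i in range(100)]
    unprotected = lambda p: "HARMFUL CONTENT"              # model with no guard
    protected = lambda p: "Request blocked by classifier"  # model behind the guard
    judge = lambda response: "HARMFUL" in response

    print("without classifiers:", jailbreak_success_rate(attempts, unprotected, judge))
    print("with classifiers:   ", jailbreak_success_rate(attempts, protected, judge))
```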