Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic.

Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would often try to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar issues with “agentic misalignment.”

Apparently Anthropic has done more work around that behavior, claiming in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

The company went into more detail in a blog post stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.”

What accounts for the difference? The company said it found that training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment.”

Relatedly, Anthropic said that it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company said.

