As the hype around generative AI continues to build, the need for robust safety regulations is only becoming clearer.
Now Anthropic—the company behind Claude AI—is looking at how its models could deceive or sabotage users. Anthropic just dropped a paper laying out their approach.
Anthropic’s latest research, titled "Sabotage Evaluations for Frontier Models," comes from its Alignment Science team, driven by the company's "Responsible Scaling" policy.
The goal is to gauge just how capable AI might be at misleading users or even "subverting the systems we put in place to oversee them." The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.
Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules.
In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities. So they put their Claude 3 Opus and 3.5 Sonnet models through a series of tests, designed to evaluate and enhance their safety protocols.
The first test, Human Decision Sabotage, examined how AI could potentially manipulate human decision-making. The second test, Code Sabotage, analyzed whether AI could subtly introduce bugs into codebases. Stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.
The remaining tests — Sandbagging and Undermining Oversight — explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system.
For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities.
"Minimal mitigations are currently sufficient to address sabotage risks," the team writes, but "more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve."
Translation: watch out, world.