Anthropic researchers wear down AI ethics with repeated questions

How do you get an AI to answer a question it’s not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers just found a new one, in which a large language model can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.

They call the approach “many-shot jailbreaking,” and have both written a paper about it and also informed their peers in the AI community about it so it can be mitigated.

The vulnerability is a new one, resulting from the increased “context window” of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.

What Anthropic’s researchers found was that these models with large context windows tend to perform better on many tasks if there are lots of examples of that task within the prompt. So if there are lots of trivia questions in the prompt (or priming document, like a big list of trivia that the model has in context), the answers actually get better over time. So a fact that it might have gotten wrong if it was the first question, it may get right if it’s the hundredth question.

But in an unexpected extension of this “in-context learning,” as it’s called, the models also get “better” at replying to inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you ask it to answer 99 other questions of lesser harmfulness and then ask it to build a bomb… it’s a lot more likely to comply.

Image Credits: Anthropic

Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but clearly there is some mechanism that allows it to home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, it seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens with users asking for dozens of inappropriate answers.

The team already informed its peers and indeed competitors about this attack, something it hopes will “foster a culture where exploits like this are openly shared among LLM providers and researchers.”

For their own mitigation, they found that although limiting the context window helps, it also has a negative effect on the model’s performance. Can’t have that — so they are working on classifying and contextualizing queries before they go to the model. Of course, that just makes it so you have a different model to fool… but at this stage, goalpost-moving in AI security is to be expected.

Source link

virginiadigitalnews.com3 weeks ago

Anthropic researchers wear down AI ethics with repeated questions

virginiadigitalnews.com

Watch Grimes Discuss AI in Music for PBS Series A Brief History of the Future

Why Working Together Changed Kelly Ripa & Mark Consuelos’ Romance

Early trauma cuts life short for squirrels, and climate change could make matters worse

United Methodists vote to restructure worldwide church into four parts

Overturning of Weinstein conviction 'a hard day'

Arne Slot to Liverpool: Feyenoord boss reveals he wants to be new Reds boss and is ‘confident’ it will happen | Football News

New York outlaws sales of diet pills and muscle-mass builders to kids

Defense startup True Anomaly lays off around 25%, cancels summer internship

Can’t decide between Bluesky, Mastodon and Nostr? Nootti’s new app lets you post to all three.

Trump backed further into a financial corner after losing control of his company

Wicked Little Letters review – a flimsy comic farce

Maren Morris Is Already Marveling at Beyoncé’s Shift to Country Music

Chilli Curves Usher’s Proposal Reveal, Enjoys Trip With Matthew

What does a business analyst do?

Taylor Swift Donates $100K To Family Of Parade Shooting Victim

That Which We Call a Struggle: A Response to Ife Olujobi’s “$5000”

Closed-door council meeting leads to big decisions regarding potential casino in Petersburg

The world’s biggest 3D printer can a make a house in under 80 hours

Our favorite budget webcam is 20 percent off right now

All-wheel drive unlocks 0-60 mph in 1.82 seconds

Are lab-grown brain tissues ethical? There is no no-brainer answer

Why so many Medicare beneficiaries have trouble getting their Part D plan to cover their medications—and how to prevent it from happening to you

The White House and Gretchen Whitmer just took out a $1.5 billion loan to buy a nuclear power plant that was going to get dismantled

Silent brain changes precede Alzheimer’s. Researchers have new clues about which come first

Peacock’s Hapless Will Remove the Stress of Your Day, Week, Month, and Year!

Can’t decide between Bluesky, Mastodon and Nostr? Nootti’s new app lets you post to all three.

Trump backed further into a financial corner after losing control of his company

Wicked Little Letters review – a flimsy comic farce

Maren Morris Is Already Marveling at Beyoncé’s Shift to Country Music

Chilli Curves Usher’s Proposal Reveal, Enjoys Trip With Matthew

Watch Grimes Discuss AI in Music for PBS Series A Brief History of the Future

The First Developer Preview of Android 15

That Which We Call a Struggle: A Response to Ife Olujobi’s “$5000”

What does a business analyst do?

Taylor Swift Donates $100K To Family Of Parade Shooting Victim

Chilli Curves Usher’s Proposal Reveal, Enjoys Trip With Matthew

anime crunchyroll: 'Anime April': What to watch this 2024 if you are a fan of the action genre

I spy with my speedy eye -- scientists discover speed of visual perception ranges widely in humans

Related Articles