Anthropic investigated a jailbreaking technique
Anthropic, an OpenAI competitor, released a paper showcasing a jailbreaking technique called many-shot jailbreaking.
Large language models (LLMs) have seen remarkable progress in recent years, with their context windows - the amount of information they can process as input - growing dramatically. While this expanded context allows for more nuanced and comprehensive responses, it also introduces new vulnerabilities that malicious actors may seek to exploit.
What is the technique?
▶ The technique involves including a large number of faux dialogues (up to 256 were tested) between a user and an AI assistant in a single input prompt, which can cause the model to override its safety training and provide harmful responses (see the sketch after this list).
▶ The researchers term this “many-shot jailbreaking.”
▶ The researchers found the technique effective on Anthropic’s own models as well as on models from other AI companies.
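To make the structure concrete, here is a minimal sketch of how such a prompt is assembled: many faux user/assistant exchanges concatenated ahead of the final, real query. The `build_many_shot_prompt` function, the dialogue formatting, and the placeholder content are illustrative assumptions, not the exact templates used in the paper.

```python
# Minimal sketch of the many-shot prompt structure (illustrative only).
# The faux dialogues here are benign placeholders, not content from the paper.

def build_many_shot_prompt(faux_dialogues, target_question):
    """Concatenate many faux user/assistant exchanges ahead of the real query."""
    turns = [f"User: {question}\nAssistant: {answer}"
             for question, answer in faux_dialogues]
    # The attacker's real question is appended after the long run of faux turns.
    turns.append(f"User: {target_question}\nAssistant:")
    return "\n\n".join(turns)


# Example with placeholder content; the paper tested up to 256 such pairs.
placeholder_dialogues = [(f"[faux user request {i}]", f"[faux compliant reply {i}]")
                         for i in range(256)]
prompt = build_many_shot_prompt(placeholder_dialogues, "[target question]")
```

The entire sequence fits inside one prompt only because modern context windows are large enough to hold hundreds of such exchanges, which is the subject of the next section.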
What makes the attack possible?
▶ Long context windows (the context window is the amount of information an LLM can process as its input).
▶ In early 2023, LLMs had context windows of around 4,000 tokens (roughly a long essay). Now, some models boast context windows of over 1,000,000 tokens (several long novels).
▶ The many-shot jailbreaking technique exploits the model's in-context learning, i.e., its ability to pick up patterns from the examples supplied in the prompt itself.
▶ ▶ While a small number of faux dialogues would normally still be refused, an extremely large number (the “many shots”) can overwhelm the model's safeguards and cause it to produce the harmful response; the sketch below illustrates how this effect could be measured as the number of shots grows.
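The following is a hedged sketch of how that scaling might be measured, assuming a hypothetical `query_model` call and an `is_refusal` check (neither is from the paper or any specific API): sweep the number of faux dialogues included in the prompt and record how often the model still refuses.

```python
# Illustrative evaluation sweep (assumptions: query_model sends a prompt to
# some model and returns its reply; is_refusal flags a refusal response).

def refusal_rate_by_shot_count(faux_dialogues, target_question,
                               shot_counts, query_model, is_refusal, trials=20):
    """Return {n_shots: fraction of trials in which the model refused}."""
    results = {}
    for n in shot_counts:
        turns = [f"User: {q}\nAssistant: {a}" for q, a in faux_dialogues[:n]]
        turns.append(f"User: {target_question}\nAssistant:")
        prompt = "\n\n".join(turns)
        refusals = sum(is_refusal(query_model(prompt)) for _ in range(trials))
        results[n] = refusals / trials
    return results

# e.g. shot_counts = [1, 4, 16, 64, 256]; in the paper's experiments, harmful
# responses became more likely as the number of shots increased.
```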
What mitigation strategies were explored?
✖ Simple mitigation strategies like limiting the context window length or fine-tuning the model to refuse such prompts were not fully effective.
▶ More promising approaches involve classifying and modifying the input prompt before it is passed to the model, as in the rough sketch below.
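Below is a minimal sketch of that classify-then-act idea, assuming a toy heuristic classifier (`looks_like_many_shot_prompt`) rather than the trained classifiers a production system would use; it is not the paper's implementation.

```python
# Hedged sketch of a prompt-screening step (not the paper's implementation).
# A real deployment would replace the heuristic below with a trained classifier.

def looks_like_many_shot_prompt(prompt: str, max_embedded_turns: int = 32) -> bool:
    """Crude heuristic: count apparent assistant turns embedded in one prompt."""
    return prompt.count("\nAssistant:") > max_embedded_turns

def guarded_query(prompt, query_model):
    """Screen the prompt before it ever reaches the model."""
    if looks_like_many_shot_prompt(prompt):
        # Options include blocking outright, truncating the faux dialogues,
        # or rewriting the prompt before forwarding it.
        return "Request blocked by input screening."
    return query_model(prompt)
```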
Why does it matter?
▶ The researchers emphasize that the broader AI research community needs to consider how to prevent this and similar exploits, since the associated risks grow along with model capabilities.
▶ Your AI development teams need to stay vigilant about emerging vulnerabilities in the language models powering your virtual assistants. Proactively addressing potential exploits will help ensure your customers and employees can safely and productively use these powerful AI tools.
The insights from this research paper provide a valuable foundation for these crucial ongoing efforts. Read the paper here.