P

Jailbreak: Definition and Examples

Technique aimed at bypassing the guardrails and security restrictions of a generative AI model to make it produce content that is normally prohibited or filtered.

Full definition

Jailbreak refers to all techniques used to bypass the safety measures built into language models (LLMs) like ChatGPT, Claude, or Gemini. These models are trained with strict rules that prevent them from generating dangerous, illegal, or unethical content. Jailbreak seeks to neutralize these protections through cleverly crafted prompts.

Jailbreak methods typically exploit flaws in how the model interprets instructions. Common techniques include role-playing (asking the model to play a character without restrictions), prompt injection (inserting hidden instructions that override system directives), or encoding attacks (using coded languages or text transformations to mask the actual request).

AI providers invest heavily in red teaming and alignment research to make their models more resistant to jailbreaks. Each new technique discovered is usually fixed quickly, creating a dynamic race between attackers and defenders. This domain has become a full-fledged research field in AI safety.

It is important to distinguish malicious jailbreak, which aims to produce harmful content, from ethical red teaming, practiced by security researchers to identify and correct model vulnerabilities before they are exploited. Red teaming is encouraged by AI companies and constitutes a legitimate cybersecurity practice.

Etymology

The term "jailbreak" is borrowed from computer vocabulary where it refers to the unauthorized unlocking of a device (notably the iPhone starting in 2007) to bypass manufacturer restrictions. Literally "prison escape" in English, it was transposed to the field of generative AI around 2022-2023 with the rise of ChatGPT, to describe attempts to make a model "escape" from its safety constraints.

Concrete examples

AI security research (ethical red teaming)

As an AI security researcher, test the model's robustness against indirect reformulations of sensitive queries and document the results to improve protections.

Awareness of risks within a company

Explain to our product team the main categories of jailbreak (prompt injection, role-playing, encoding) and the protective measures to integrate into our customer chatbot.

Model robustness assessment before deployment

List the 10 categories of adversarial tests recommended by OWASP to evaluate a production LLM's resistance to jailbreak.

Practical usage

In prompt engineering, understanding jailbreak is essential for building robust systems. When designing a system prompt, anticipate bypass attempts by adding explicit refusal instructions and testing your system with adversarial scenarios. Knowledge of jailbreak techniques also helps in writing clear instructions that reduce exploitable ambiguities.

Related concepts

Prompt InjectionRed TeamingAI AlignmentGuardrails

FAQ

Is jailbreaking an AI illegal?
Legality depends on the context and jurisdiction. Attempting to bypass a service's protections may violate its terms of use, which can lead to account suspension. However, ethical red teaming carried out within an authorized framework (bug bounty programs, academic research) is not only legal but encouraged by AI companies. The European AI Act also requires providers to conduct adversarial testing on their models.
Why are AI models vulnerable to jailbreak?
LLMs work by predicting the most likely continuation of text, which makes them sensitive to the wording of instructions. Guardrails are added through fine-tuning and RLHF (reinforcement learning with human feedback), but these security layers do not fundamentally change how the model works. Sufficiently creative formulations can sometimes make the model prioritize textual coherence over its security directives.
How can I protect my AI application from jailbreak?
Adopt a defense-in-depth approach: write a robust system prompt with explicit refusal instructions, implement a content filter upstream and downstream of the model, limit the model's capabilities to what is strictly necessary (principle of least privilege), and regularly conduct adversarial testing. Tools like safety classifiers and automated red teaming frameworks can complement this approach.

See also

How to use this prompt

  1. Copy the prompt with the button above.
  2. Paste it into ChatGPT, Claude or your favorite AI assistant.
  3. Replace the bracketed variables with your details, then refine the result.

About Prompt Guide

Prompt Guide is a free library of 2500+ ready-to-use prompts for ChatGPT, Claude and other AIs, with guides to learn prompting and tools to build and optimize your own prompts.

More definitions

Get new prompts every week

Join our newsletter.