Jailbreak: Definition and Examples
Technique aimed at bypassing the guardrails and security restrictions of a generative AI model to make it produce content that is normally prohibited or filtered.
Full definition
Jailbreak refers to all techniques used to bypass the safety measures built into language models (LLMs) like ChatGPT, Claude, or Gemini. These models are trained with strict rules that prevent them from generating dangerous, illegal, or unethical content. Jailbreak seeks to neutralize these protections through cleverly crafted prompts.
Jailbreak methods typically exploit flaws in how the model interprets instructions. Common techniques include role-playing (asking the model to play a character without restrictions), prompt injection (inserting hidden instructions that override system directives), or encoding attacks (using coded languages or text transformations to mask the actual request).
AI providers invest heavily in red teaming and alignment research to make their models more resistant to jailbreaks. Each new technique discovered is usually fixed quickly, creating a dynamic race between attackers and defenders. This domain has become a full-fledged research field in AI safety.
It is important to distinguish malicious jailbreak, which aims to produce harmful content, from ethical red teaming, practiced by security researchers to identify and correct model vulnerabilities before they are exploited. Red teaming is encouraged by AI companies and constitutes a legitimate cybersecurity practice.
Etymology
The term "jailbreak" is borrowed from computer vocabulary where it refers to the unauthorized unlocking of a device (notably the iPhone starting in 2007) to bypass manufacturer restrictions. Literally "prison escape" in English, it was transposed to the field of generative AI around 2022-2023 with the rise of ChatGPT, to describe attempts to make a model "escape" from its safety constraints.
Concrete examples
AI security research (ethical red teaming)
As an AI security researcher, test the model's robustness against indirect reformulations of sensitive queries and document the results to improve protections.
Awareness of risks within a company
Explain to our product team the main categories of jailbreak (prompt injection, role-playing, encoding) and the protective measures to integrate into our customer chatbot.
Model robustness assessment before deployment
List the 10 categories of adversarial tests recommended by OWASP to evaluate a production LLM's resistance to jailbreak.
Practical usage
In prompt engineering, understanding jailbreak is essential for building robust systems. When designing a system prompt, anticipate bypass attempts by adding explicit refusal instructions and testing your system with adversarial scenarios. Knowledge of jailbreak techniques also helps in writing clear instructions that reduce exploitable ambiguities.
Related concepts
FAQ
Is jailbreaking an AI illegal?
Why are AI models vulnerable to jailbreak?
How can I protect my AI application from jailbreak?
See also
How to use this prompt
- Copy the prompt with the button above.
- Paste it into ChatGPT, Claude or your favorite AI assistant.
- Replace the bracketed variables with your details, then refine the result.
About Prompt Guide
Prompt Guide is a free library of 2500+ ready-to-use prompts for ChatGPT, Claude and other AIs, with guides to learn prompting and tools to build and optimize your own prompts.
More definitions
Long Context Model: Definition and Examples
A Long Context Model is a language model capable of processing and reasoning over very large amounts of text in a single interaction, with a window...
MCP Model Context Protocol: Definition and Examples
The Model Context Protocol (MCP) is an open standard that allows AI models to connect to external data sources, tools, and services.
Million Token Context: Definition and Examples
Capacity of a language model to process up to a million tokens in a single request, enabling analysis of very large documents, codebases
Model Card: Definition and Examples
A model card is a standardized document that accompanies an AI model to describe its performance, limitations, potential biases, and conditions of use
Model Registry: Definition and Examples
A Model Registry is a centralized system for storing, versioning, and managing machine learning models throughout their lifecycle, from training to production deployment.
Multimodal RAG: Definition and Examples
Multimodal RAG is an extension of Retrieval-Augmented Generation that allows an AI model to search and leverage information from sources
Get new prompts every week
Join our newsletter.