Constitutional AI: Definition and Examples

AI alignment method developed by Anthropic, where a model is trained to self-correct by following a set of written principles (a 'constitution') rather than relying solely on human feedback.

Full definition

Constitutional AI (CAI) is an alignment approach for language models introduced by Anthropic in 2022. Its fundamental principle is to equip an AI model with a set of explicit rules—called a 'constitution'—that guide its behavior. These principles cover values such as honesty, helpfulness, harmlessness, and respect for fundamental rights. Concretely, the process takes place in two phases. In the first phase (critique and revision), the model generates responses and then self-evaluates by referring to the constitutional principles. It identifies potential violations and produces a revised version of its response. This critique-revision cycle can be repeated several times to refine quality. In the second phase, the response pairs (original vs. revised) are used to train a reward model via RLAIF (Reinforcement Learning from AI Feedback). This reward model partially replaces direct human feedback, making the process more scalable while maintaining a high level of alignment. The major advantage of Constitutional AI is transparency: the rules are explicit and auditable, unlike the implicit preferences captured by classical RLHF. It also allows for public debate on the values encoded in the system and modification without fully retraining the model.

Etymology

The term 'Constitutional AI' directly refers to the concept of a constitution in the legal and political sense: a foundational document that establishes principles and limits of power. Just as a national constitution defines the rights and duties of citizens and government, the 'constitution' of an AI model defines the ethical and behavioral principles it must follow.

Concrete examples

Training an AI assistant to refuse dangerous requests while remaining helpful

Critique this response according to the following principle: 'The assistant must never help create weapons or dangerous substances'. Does the response contain violations? If so, rewrite it.

Self-evaluation of a model on the honesty of its responses

Based on the principle 'The assistant should acknowledge the limits of its knowledge rather than invent information', evaluate whether your previous response is compliant and propose an improved version.

Designing a transparent and auditable content moderation system

Here is our moderation constitution: 1) No hate speech 2) No medical disinformation 3) Protection of minors. Evaluate this content according to each principle and justify your decision.

Practical usage

In prompt engineering, the principles of Constitutional AI apply by creating explicit instructions (system prompts) that define the assistant's limits and values. You can ask the model to self-criticize according to precise rules before delivering its final response. This approach is particularly useful for building reliable AI applications where transparency of behavior rules is essential.

Related concepts

RLHF (Reinforcement Learning from Human Feedback)RLAIF (Reinforcement Learning from AI Feedback)AI AlignmentRed Teaming

FAQ

What is the difference between Constitutional AI and RLHF?

RLHF directly uses human evaluator preferences to train the model, while Constitutional AI replaces part of that human feedback with the model's self-evaluation based on written principles. CAI is more scalable and transparent, as the rules are explicit and modifiable.

Who invented Constitutional AI?

Constitutional AI was developed and published by Anthropic in December 2022, in a research paper titled 'Constitutional AI: Harmlessness from AI Feedback'. It is one of the fundamental techniques used to train Claude models.

Can the principles of Constitutional AI be applied in one's own prompts?

Yes, one can be inspired by this approach by integrating explicit rules into system prompts and asking the model to check its responses against those rules. For example, including a 'critique then revision' step in a prompt chain can improve response quality and safety.

How to use this prompt

Copy the prompt with the button above.
Paste it into ChatGPT, Claude or your favorite AI assistant.
Replace the bracketed variables with your details, then refine the result.

About Prompt Guide

Prompt Guide is a free library of 2500+ ready-to-use prompts for ChatGPT, Claude and other AIs, with guides to learn prompting and tools to build and optimize your own prompts.

Prompt library Learn prompting Prompt builder Prompt optimizer

More definitions

Context Management: Definition and Examples

Context management refers to the set of techniques for controlling, structuring, and optimizing the contextual information provided to an AI model.

Context Window: Definition and Examples

The context window refers to the maximum amount of text a language model can process at one time, encompassing both the user input and the generated response.

Contextual Prompting: Definition and Examples

A prompt engineering technique that involves providing the AI model with rich and relevant context to guide its response accurately and appropriately for the situation.

Continual Learning: Definition and Examples

Continual Learning refers to the ability of an AI model to learn new tasks or data sequentially, without forgetting previously acquired knowledge.

Contrastive Prompting: Definition and Examples

Prompt engineering technique that involves providing the model with examples of what it should do AND what it should not do, in order to refine its understanding of the task through contrast.

Conversation Memory: Definition and Examples

Conversation memory refers to an AI model's ability to retain and use information exchanged during a conversation, enabling consistent and contextually relevant interactions.

Get new prompts every week

Join our newsletter.