Cross Attention: Definition and Examples

Attention mechanism that allows a model to relate two different sequences, such as an image and a text, so that each element of one sequence can "attend" to elements of the other.

Full definition

Cross Attention is a fundamental mechanism in Transformer architectures that allows two distinct sequences to interact with each other. Unlike Self Attention, where a sequence attends to itself, Cross Attention uses one sequence as the source of queries and another as the source of keys and values. This allows the model to determine which parts of one sequence are relevant for each element of the other.

In the original Transformer architecture of 2017 ("Attention Is All You Need"), this mechanism appears in the decoder as encoder-decoder attention, where the sequence being generated queries the encoded input sequence. For example, during machine translation, each word generated in the target language can "look at" the entire source sentence to determine which words are most relevant at that stage of translation.

Cross Attention has become an essential component of multimodal models. In image generation systems like Stable Diffusion, it allows descriptive text (the prompt) to guide visual creation: queries come from the image representation being denoised, and keys and values come from the encoded prompt, so each region of the image can draw on the most relevant words. This is one reason the formulation of a prompt so strongly influences the generated visual result.

In practice, Cross Attention works in three steps. First, the first sequence is projected into query vectors (Q), and the second into key vectors (K) and value vectors (V). Second, an attention score is computed as the dot product between each query and each key, scaled by the square root of the key dimension, then normalized by softmax across the keys. Third, the resulting weights are used to average the values, producing one output vector per query. Every step is differentiable, so the mechanism is trainable end to end and the model automatically learns which cross-sequence correspondences are useful for the task.
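The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the names cross_attention, matmul, and softmax are made up for this example, the projection matrices are set to the identity instead of being learned, and the division by the square root of the key dimension follows the scaled dot-product formulation of the original Transformer paper. Real systems use tensor libraries and several attention heads.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(x, y, w_q, w_k, w_v):
    """x supplies the queries; y supplies the keys and values."""
    # Step 1: project the two sequences into Q, K and V.
    q = matmul(x, w_q)
    k = matmul(y, w_k)
    v = matmul(y, w_v)
    # Step 2: scaled dot-product scores, normalized per query with softmax.
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Step 3: weight the values to get one output vector per query.
    output = matmul(weights, v)
    return output, weights

# Toy inputs (assumed values): 3 query-side elements, 2 key/value-side elements.
identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for learned projection matrices
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[2.0, 0.0], [0.0, 2.0]]
output, weights = cross_attention(x, y, identity, identity, identity)
```

Here weights has one row per element of x and one column per element of y, and each row sums to 1. The output has as many rows as x, which is why the query-side sequence determines the length of the result.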

Etymology

The term combines "cross", indicating interaction between two different sources, and "attention", the mechanism of selective weighting of information introduced by Bahdanau et al. in 2014 and popularized by the Transformer architecture in 2017.

Concrete examples

Image generation with Stable Diffusion

Prompt: "A serene Japanese garden at sunset, watercolor style". Here, cross attention links each word of the prompt ("Japanese garden", "sunset", "watercolor") to specific spatial regions of the image being generated.
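To make the link between words and regions concrete, the sketch below builds a tiny attention map by hand. Everything in it is an assumption made for illustration: the 2x2 grid, the two-dimensional vectors, the one-word tokens, and the helper attention_map are all hypothetical and do not reflect Stable Diffusion's actual internals or tokenizer. It only shows how a column of the attention weights can be read as a spatial map for one prompt token.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical prompt tokens with hand-picked key vectors.
tokens = ["garden", "sunset", "watercolor"]
token_keys = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

# Hypothetical queries for a 2x2 grid of image positions, in row-major order.
# The top row is made "sky-like", the bottom row "ground-like".
pixel_queries = [[0.0, 2.0], [0.0, 2.0], [2.0, 0.0], [2.0, 0.0]]

d_k = len(token_keys[0])
weights = []
for q in pixel_queries:
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in token_keys]
    weights.append(softmax(scores))  # one row per position, one column per token

def attention_map(token, width=2):
    """Reshape the attention paid to `token` into a 2-D grid."""
    col = tokens.index(token)
    flat = [row[col] for row in weights]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

sunset_map = attention_map("sunset")
garden_map = attention_map("garden")
```

With these toy values, sunset_map is largest in the top row and garden_map is largest in the bottom row, which mirrors the idea that different words are associated with different regions.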

Machine translation with a Transformer

When translating "The cat sat on the mat" into French, cross attention allows the decoder to link "chat" to "cat" and "tapis" to "mat" when generating each target word.
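The alignment described above can be mimicked with a toy example. The vectors below are built by hand so that the query for "chat" points at "cat" and the query for "tapis" points at "mat"; in a trained model these vectors would be learned, and the helper names query_pointing_at and attend are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

source = ["The", "cat", "sat", "on", "the", "mat"]
n = len(source)

# Assumption: each source token gets a one-hot key, so alignment is easy to read.
keys = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def query_pointing_at(index, strength=4.0):
    """Hand-built decoder query that favors the source token at `index`."""
    return [strength if j == index else 0.0 for j in range(n)]

queries = {"chat": query_pointing_at(1), "tapis": query_pointing_at(5)}

def attend(query):
    """Attention weights of one target-side query over all source tokens."""
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(n) for k in keys]
    return softmax(scores)

# For each target word, pick the source word that receives the most attention.
alignment = {}
for target_word, query in queries.items():
    w = attend(query)
    alignment[target_word] = source[w.index(max(w))]
```

The resulting alignment dictionary maps "chat" to "cat" and "tapis" to "mat". The remaining source words still receive some weight, since softmax never assigns exactly zero.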

Multimodal models (GPT-4V, Gemini)

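When submitting an image with the question "What does this graph show?", a vision-language model built with cross attention layers lets the language model consult the relevant regions of the encoded image to formulate its textual response. The internal architectures of GPT-4V and Gemini are not fully public, so this describes the general technique and not a confirmed design detail of those models.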

Practical usage

In prompt engineering for image generation, understanding cross attention helps structure prompts: the most important keywords should be stated explicitly and kept distinct, because each token contributes its own key and value to the attention computation. Using commas and distinct word groups can help cross attention associate each concept with the correct visual region. For encoder-decoder text-to-text models, the same mechanism is why providing a well-structured source context (reference document, examples) tends to improve response quality.

Related concepts

Self Attention, Transformer, Multi-Head Attention, Diffusion Model

FAQ

What is the difference between Self Attention and Cross Attention?

Self Attention allows a sequence to analyze itself: each element looks at all other elements of the same sequence. Cross Attention, on the other hand, relates two distinct sequences: queries come from one sequence and keys/values from another. For example, in a translator, Self Attention analyzes the source sentence internally, then Cross Attention allows the target sentence being generated to consult that source sentence.
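In code, the difference comes down to which sequences are passed in. The sketch below uses a single hypothetical function, attention_weights, with no learned projections (an assumption made to keep it short), and calls it once with the same sequence on both sides and once with two different sequences.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query_seq, key_seq):
    """Scaled dot-product attention weights; projections omitted for brevity."""
    d_k = len(key_seq[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key_seq])
        for q in query_seq
    ]

# Toy vectors (assumed values): 3 source elements and 2 target elements.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.5, 0.5], [1.0, 0.0]]

# Self attention: queries and keys come from the same sequence.
self_weights = attention_weights(source, source)

# Cross attention: queries come from the target, keys from the source.
cross_weights = attention_weights(target, source)
```

self_weights is a 3 by 3 matrix, since the source attends to itself, while cross_weights is 2 by 3, with one row per target element and one column per source element.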

Why is Cross Attention so important in AI image generation?

In models like Stable Diffusion, Cross Attention is the bridge between the prompt text and the generated image. It determines how each word influences each area of the image. Without this mechanism, the model could not faithfully translate textual instructions into visual content. This is also why techniques like prompt weighting work: they change how strongly a given token contributes to the cross attention computation.

How does Cross Attention influence prompt writing?

Cross Attention computes an attention score between each token of the prompt and each element of the generated content. This means that the clarity and separation of concepts in a prompt matter. Ambiguous or fused instructions can create interference in attention scores. Separating concepts with commas, using parentheses to group ideas (in tools that support that syntax), and placing important elements at the beginning of the prompt are common strategies motivated by how cross attention works.
