Cross Attention: Definition and Examples
Attention mechanism that allows a model to relate two different sequences, such as an image and a text, so that each element of one sequence can "attend" to elements of the other.
Full definition
Cross Attention is a fundamental mechanism in Transformer architectures that allows two distinct sequences to interact with each other. Unlike Self Attention, where a sequence attends to itself, Cross Attention uses one sequence as the source of queries and another as the source of keys and values. This allows the model to determine which parts of one sequence are relevant for each element of the other.
In its Transformer form, this mechanism appeared in the decoder of the original 2017 architecture ("Attention Is All You Need"), where the sequence being generated queries the encoded input sequence. For example, during machine translation, each token generated in the target language can "look at" the entire source sentence to determine which words are most relevant at that stage of translation.
Cross Attention has become an essential component of multimodal models. In image generation systems like Stable Diffusion, it is the path through which the descriptive text (the prompt) guides visual creation: the queries come from the image representation, the keys and values come from the encoded prompt, and each region of the image can draw on the most relevant words. This helps explain why the wording of a prompt so strongly influences the generated result.
In practice, Cross Attention works in three steps. First, the first sequence is projected into query vectors (Q), and the second sequence into key vectors (K) and value vectors (V). Second, an attention score is computed as the dot product between each query and each key, scaled by the square root of the key dimension d_k, then normalized by softmax across the keys. Third, the resulting weights are used to form a weighted average of the values, which is the output: softmax(QK^T / sqrt(d_k)) V. Because every step is differentiable, the mechanism is trainable end to end, and the model learns which cross-sequence correspondences are useful for the task.
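The three steps can be sketched in a few lines of plain Python. This is a minimal single-head illustration under simplifying assumptions, not a production implementation: the helper names (`matmul`, `softmax`, `cross_attention`), the identity projection matrices, and the toy inputs are all chosen here for readability, and real models use learned weights, multiple heads, and tensor libraries.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(x_q, x_kv, w_q, w_k, w_v):
    """Single-head cross attention.

    x_q:  rows of the sequence that asks (source of queries)
    x_kv: rows of the sequence that is consulted (source of keys and values)
    """
    # Step 1: project the two sequences into Q, K and V.
    q = matmul(x_q, w_q)
    k = matmul(x_kv, w_k)
    v = matmul(x_kv, w_v)
    # Step 2: scaled dot-product scores, normalized per query with softmax.
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Step 3: each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Toy inputs (assumed for illustration): one query element, three key/value elements.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_q = [[1.0, 0.0]]
x_kv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = cross_attention(x_q, x_kv, identity, identity, identity)
```

The output has one row per query element, and each row of `weights` sums to 1 over the elements of the second sequence. Passing the same sequence as both `x_q` and `x_kv` turns the same function into self attention.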
Etymology
The term combines "cross", indicating interaction between two different sources, and "attention", the mechanism of selective weighting of information introduced by Bahdanau et al. in 2014 and popularized by the Transformer architecture in 2017.
Concrete examples
Image generation with Stable Diffusion
Prompt: "A serene Japanese garden at sunset, watercolor style". Here, cross attention links each word of the prompt ("Japanese garden", "sunset", "watercolor") to specific spatial regions of the image being generated.
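To make the link between words and regions more concrete, the sketch below takes an attention-weight matrix with one row per image patch and one column per prompt token, and reshapes one token's column into a small spatial heatmap. Everything in it is hypothetical: the function name `token_heatmap`, the 2x2 patch grid, and the weight values are made up for illustration, and real attention maps are far larger and spread across many layers and heads.

```python
def token_heatmap(weights, token_index, height, width):
    """Reshape one prompt token's attention column into a height x width grid.

    weights: one row per image patch (row-major order), one column per token.
    """
    if len(weights) != height * width:
        raise ValueError("expected one row of weights per image patch")
    column = [row[token_index] for row in weights]
    return [column[r * width:(r + 1) * width] for r in range(height)]

# Hypothetical weights for a 2x2 patch grid and three prompt tokens.
tokens = ["garden", "sunset", "watercolor"]
weights = [
    [0.2, 0.7, 0.1],  # top-left patch
    [0.1, 0.8, 0.1],  # top-right patch
    [0.8, 0.1, 0.1],  # bottom-left patch
    [0.7, 0.1, 0.2],  # bottom-right patch
]
sunset_map = token_heatmap(weights, tokens.index("sunset"), height=2, width=2)
```

In this made-up example, "sunset" receives most of the attention from the top row of patches (the sky), while "garden" dominates the bottom row.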
Machine translation with a Transformer
When translating "The cat sat on the mat" into French, cross attention allows the decoder to link "chat" to "cat" and "tapis" to "mat" when generating each target word.
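A toy version of this alignment can be written as follows. The vectors are hand-made stand-ins for encoder outputs and decoder queries, chosen so that the expected pairs have large dot products; they are not learned values, and the helper `attention_weights` is an illustrative name, not part of any library.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and every key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

source = ["The", "cat", "sat", "on", "the", "mat"]
# Hand-made 3-d vectors standing in for encoder outputs (not learned).
source_keys = [
    [0.1, 0.1, 0.1],  # The
    [3.0, 0.0, 0.0],  # cat
    [0.0, 0.0, 0.5],  # sat
    [0.1, 0.0, 0.2],  # on
    [0.1, 0.1, 0.1],  # the
    [0.0, 3.0, 0.0],  # mat
]
# Hand-made decoder queries for two target words.
target_queries = {"chat": [3.0, 0.0, 0.0], "tapis": [0.0, 3.0, 0.0]}

# For each target word, keep the source word with the highest attention weight.
alignment = {}
for word, query in target_queries.items():
    weights = attention_weights(query, source_keys)
    alignment[word] = source[weights.index(max(weights))]
```

With these vectors, `alignment` maps "chat" to "cat" and "tapis" to "mat". In a trained model the same kind of weights emerge from learned projections rather than from hand-picked numbers.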
Multimodal models (GPT-4V, Gemini)
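When submitting an image with the question "What does this graph show?", a model that fuses modalities through cross attention lets the language model consult the relevant regions of the encoded image to formulate its textual response. The internal architectures of GPT-4V and Gemini are not fully public, so this describes the general mechanism rather than their exact design.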
Practical usage
In prompt engineering for image generation, understanding cross attention helps structure prompts: the most important keywords should be stated explicitly and unambiguously, because each token contributes to the model's attention maps. Separating concepts with commas and distinct word groups can help cross attention associate each concept with the correct visual region. For encoder-decoder text-to-text models, the same mechanism is how the decoder consults the source context, which is one reason a well-structured context (reference document, examples) tends to improve response quality.
FAQ
What is the difference between Self Attention and Cross Attention?
Why is Cross Attention so important in AI image generation?
How does Cross Attention influence prompt writing?