
Self Attention: Definition and Examples

Mechanism allowing each element in a sequence to weigh the importance of all other elements in that same sequence, forming the core of the Transformer architecture used by large language models.

Full definition

Self Attention (also called intra-attention) is a core mechanism in neural networks that allows a language model to analyze relationships between all words in the same sequence. Unlike recurrent networks, which process words one by one in order, Self Attention lets each word "look at" all other words in the sentence simultaneously to better understand context.

Concretely, for each word in the sequence, the mechanism computes three vectors: a Query (what the word is looking for), a Key (what the word offers as information), and a Value (the actual information it carries). By comparing a word's Query with the Keys of all words in the sequence (itself included), the model obtains attention scores indicating how relevant each word is for understanding the current word. These scores are normalized into weights that sum to 1 (with a softmax), and the word's new representation is the weighted sum of all the Value vectors.
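The Query, Key, and Value steps described above can be sketched in a few lines of plain Python. This is a minimal illustration of single-head scaled dot-product attention, not the implementation used by any particular model: the toy embeddings are made up, the identity matrices stand in for projection matrices that a real model would learn, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (d_out x d_in) matrix by a vector of length d_in."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    d_k = len(keys[0])
    d_v = len(values[0])

    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query with every token's Key (itself included).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation = attention-weighted sum of the Value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (assumed for illustration): 3 tokens, 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
outputs, weights = self_attention(embeddings, identity, identity, identity)
```

Each row of `weights` shows how strongly one token attends to every token in the sequence. In this toy setup the third token, whose vector overlaps with both of the others, puts its largest weight on itself and equal weight on the first two.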

This mechanism is at the core of the Transformer architecture, introduced by Google researchers in 2017 in the paper "Attention Is All You Need". Models like GPT, Claude, or Gemini stack dozens of Self Attention layers, allowing them to capture complex dependencies between words, even when they are far apart in the text. It is thanks to Self Attention that a model can understand that in the sentence "The cat that was sleeping on the living room couch got up," the verb "got up" has "cat" as its subject despite the distance.

For prompt engineering practitioners, understanding Self Attention helps explain why models excel in certain tasks (summarization, translation, context analysis) but can also be sensitive to context length and the position of key information in a prompt.

Etymology

The term "Self Attention" was popularized by the research paper "Attention Is All You Need", published by Vaswani et al. at Google in 2017. The prefix "Self" distinguishes this mechanism from cross-attention, where two different sequences interact. Attention in neural networks had existed since 2014 (Bahdanau et al.), but it was used alongside recurrent networks. The Transformer's innovation was to rely on attention applied to a sequence relative to itself, eliminating the need for recurrence.

Concrete examples

Understanding ambiguity resolution in long sentences

In the following sentence, identify what each pronoun refers to and explain your reasoning: "Marie told Sophie that she should take her umbrella because she had seen the weather forecast."

Leveraging attention capacity on long documents

Here is a 20-page contract. Identify all clauses that mention financial penalties and link each to the corresponding definition clause.

Structuring a prompt to maximize attention on key elements

IMPORTANT CONTEXT (to keep in mind throughout your response): The maximum budget is €5000 and the deadline is 2 weeks. Propose a marketing plan for launching a mobile app.

Practical usage

In prompt engineering, understanding Self Attention helps structure prompts more effectively: place crucial information at the beginning or end of the prompt (positions that models tend to use more reliably than the middle of a long context), use explicit markers to guide the model's attention to important elements, and break down complex tasks to avoid overloading attention capacity in a single pass.
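As a rough sketch of this advice, a small helper can assemble a prompt that states the key constraints up front, labels each part with an explicit marker, and repeats the constraints at the end. The function name `build_prompt` and the section labels are hypothetical choices for this example, not a standard API.

```python
def build_prompt(task, constraints, background=""):
    """Put key constraints first, mark each section, and repeat constraints last."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    parts = [
        "IMPORTANT CONTEXT (to keep in mind throughout your response):",
        constraint_lines,
    ]
    if background:
        # Long supporting material goes in the middle of the prompt.
        parts += ["", "BACKGROUND:", background]
    parts += ["", "TASK:", task, "", "REMINDER OF CONSTRAINTS:", constraint_lines]
    return "\n".join(parts)

prompt = build_prompt(
    task="Propose a marketing plan for launching a mobile app.",
    constraints=["The maximum budget is €5000.", "The deadline is 2 weeks."],
)
print(prompt)
```

Repeating the constraints mainly pays off when the background section is long; for a short prompt, stating them once at the top is usually enough.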

Related concepts

Transformer, Multi-Head Attention, Context window, Tokenization

FAQ

What is the difference between Self Attention and Cross Attention?
Self Attention analyzes relationships between elements of the same sequence (e.g., words in a text). Cross Attention, on the other hand, relates two different sequences, such as a source text and its translation, or an image and its textual description. Both mechanisms use the same Query-Key-Value principle; the difference is that in Cross Attention the Queries come from one sequence while the Keys and Values come from the other.
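The distinction can be illustrated with a single function that takes a query sequence and a context sequence: passing the same sequence twice gives Self Attention, and passing two different sequences gives Cross Attention. This is a simplified sketch that assumes the vectors serve directly as Queries, Keys, and Values, with no learned projections; the name `attend` and the toy vectors are made up for the example.

```python
import math

def attend(query_seq, context_seq):
    """Dot-product attention: each vector in query_seq attends over context_seq.

    Simplification (assumption): vectors are used directly as Queries,
    Keys, and Values, without the learned projections of a real model.
    """
    d = len(context_seq[0])
    outputs = []
    for q in query_seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context_seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the context vectors (acting as Values).
        outputs.append([sum(w * v[i] for w, v in zip(weights, context_seq)) for i in range(d)])
    return outputs

source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. a source sentence (3 tokens)
target = [[0.5, 0.5], [1.0, 0.0]]              # e.g. a partial translation (2 tokens)

self_out = attend(source, source)   # Self Attention: the sequence attends to itself
cross_out = attend(target, source)  # Cross Attention: target Queries, source Keys/Values
```

The output always has one vector per query, so `self_out` has three vectors and `cross_out` has two.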
Why is Self Attention limited by context length?
Self Attention compares each token with every token in the sequence, so its computational cost grows quadratically: doubling the text length roughly quadruples the number of attention scores to compute. This is one reason models have a limited context window (8K, 128K, or 1M tokens, depending on the model). Optimizations such as Sparse Attention or Flash Attention help push these limits further.
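The quadratic growth is easy to check with a little arithmetic. The sketch below counts only the Query-Key comparisons for one attention head in one layer, which is an intentional simplification; the function name is hypothetical.

```python
def attention_score_count(num_tokens):
    """Number of Query-Key comparisons for one head in one layer."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Each doubling of the sequence length multiplies the count by four.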
How does Self Attention influence the quality of my prompts?
Self Attention helps explain why a model can "forget" instructions buried in a very long prompt (the "lost in the middle" phenomenon). For best results, place key instructions at the beginning or end of the prompt, use clear visual separators (headings, lists), and repeat important constraints if your prompt is long. Structuring your prompt guides the model's attention.
