Self Attention: Definition and Examples
A mechanism that allows each element in a sequence to weigh the importance of every element in that same sequence, forming the core of the Transformer architecture used by large language models.
Full definition
Self Attention (also written self-attention, and sometimes called intra-attention) is a fundamental neural network mechanism that allows a language model to analyze relationships between all the words (more precisely, tokens) in the same sequence. Unlike recurrent networks, which process words one at a time in order, Self Attention lets each word "look at" the other words in the sequence simultaneously to better understand context. In GPT-style generative models, each word can only look at the words that come before it.
Concretely, for each word in the sequence, the mechanism computes three vectors from the word's embedding using learned projections: a Query (what the word is looking for), a Key (what the word offers as information), and a Value (the actual information it carries). By comparing a word's Query with the Keys of all words in the sequence, itself included, the model obtains attention scores indicating how relevant each word is for understanding the current one. These scores are normalized with a softmax and used to compute a weighted average of the Values, which becomes the word's new, context-aware representation.
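The Query/Key/Value comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention as described in "Attention Is All You Need", not a production implementation: the toy embeddings in X and the identity projection matrices are assumptions chosen for readability (real models learn these matrices), and the sketch uses a single attention head with no causal mask.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(x, W):
    """Multiply row vector x (length d) by matrix W (d rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a list of token embeddings X."""
    Q = [project(x, W_q) for x in X]  # what each token is looking for
    K = [project(x, W_k) for x in X]  # what each token offers
    V = [project(x, W_v) for x in X]  # the information each token carries
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compare this token's Query with every token's Key (itself included).
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # The new representation is the attention-weighted average of the Values.
        out = [sum(w_j * v[i] for w_j, v in zip(w, V)) for i in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Hypothetical toy data: three tokens with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: identity projections, so Q = K = V = X. Real models learn these.
identity = [[1.0, 0.0], [0.0, 1.0]]

outputs, weights = self_attention(X, identity, identity, identity)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights", [round(w, 3) for w in row])
```

Each row of weights sums to 1 and shows how strongly one token attends to every token in the sequence. Dividing the scores by the square root of the Key dimension keeps them in a range where the softmax does not saturate.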
This mechanism is at the core of the Transformer architecture, introduced by Google researchers in 2017 in the paper "Attention Is All You Need". Models like GPT, Claude, or Gemini stack dozens of Self Attention layers, allowing them to capture complex dependencies between words, even when those words are far apart in the text. It is thanks to Self Attention that a model can understand that in the sentence "The cat that was sleeping on the living room couch got up," the subject of the verb "got up" is "cat" despite the distance between them.
For prompt engineering practitioners, understanding Self Attention helps explain why models excel at certain tasks (summarization, translation, context analysis) but can also be sensitive to context length and to the position of key information in a prompt.
Etymology
The term "Self Attention" was popularized by the research paper "Attention Is All You Need", published by Vaswani et al. at Google in 2017. The prefix "Self" distinguishes this mechanism from cross-attention, where two different sequences interact. Attention in neural networks had existed since 2014 (Bahdanau et al.), and applying attention to a sequence relative to itself had already been explored under the name intra-attention. The innovation of the Transformer was to build an entire architecture on Self Attention, eliminating the need for recurrence.
Concrete examples
Understanding ambiguity resolution in long sentences
In the following sentence, identify what each pronoun refers to and explain your reasoning: "Marie told Sophie that she should take her umbrella because she had seen the weather forecast."
Leveraging attention capacity on long documents
Here is a 20-page contract. Identify all clauses that mention financial penalties and link each to the corresponding definition clause.
Structuring a prompt to maximize attention on key elements
IMPORTANT CONTEXT (to keep in mind throughout your response): The maximum budget is €5,000 and the deadline is 2 weeks. Propose a marketing plan for launching a mobile app.
Practical usage
In prompt engineering, understanding Self Attention helps structure prompts more effectively: place crucial information at the beginning or end of the prompt (positions that models tend to use more reliably than the middle of a long context), use explicit markers to guide the model's attention to important elements, and break complex tasks into smaller steps so that no single prompt has to juggle too many competing details.
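The placement advice above can be applied mechanically when prompts are assembled in code. The sketch below is one possible approach, not an established API: the build_prompt helper, its parameters, and the "Reminder:" wording are all hypothetical names introduced for illustration. It puts the key constraints at the start under an explicit marker and repeats them at the end.

```python
def build_prompt(task, constraints, marker="IMPORTANT CONTEXT"):
    """Hypothetical helper: key constraints go first under a marker, then the
    task, then a short reminder so the constraints also appear at the end."""
    constraint_text = " ".join(constraints)
    header = f"{marker} (to keep in mind throughout your response): {constraint_text}"
    reminder = f"Reminder: {constraint_text}"
    return "\n\n".join([header, task, reminder])

prompt = build_prompt(
    task="Propose a marketing plan for launching a mobile app.",
    constraints=["The maximum budget is €5,000.", "The deadline is 2 weeks."],
)
print(prompt)
```

Whether repeating the constraints actually improves results depends on the model and the prompt length, so it is worth comparing outputs with and without the reminder.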
FAQ
What is the difference between Self Attention and Cross Attention?
Why is Self Attention limited by context length?
How does Self Attention influence the quality of my prompts?