Attention: Definition and Examples

Fundamental mechanism of modern language models that allows the model to weight the relative importance of each word with respect to others in a sequence, in order to better understand context and semantic relationships.

Full definition

Attention is a mechanism popularized by the seminal paper 'Attention Is All You Need' (2017) by Vaswani et al. at Google. It is the fundamental building block of the Transformer architecture, on which most current large language models, including GPT, Claude, and Gemini, are based. The principle: rather than processing the words of a sentence one after another and treating them uniformly, the model learns to 'look at' all words simultaneously and assign a different weight to each according to its relevance to the task at hand.

Concretely, the attention mechanism works with three vectors (Query, Key, and Value) computed for each token in the sequence. The attention score between two tokens is the dot product of one token's Query vector with the other's Key vector, scaled by the square root of the key dimension and then normalized with a softmax so that each token's weights sum to 1. This weight determines how much one word should 'pay attention' to another, and the token's output is the corresponding weighted sum of the Value vectors. For example, in the sentence 'The cat sleeps on the couch', the word 'sleeps' would typically assign a high weight to 'cat' because it is the subject of the action.
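The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how production models implement it: the function names are hypothetical, and the Query, Key and Value vectors are hand-picked toy values rather than learned ones, chosen so that 'sleeps' scores highest against 'cat'.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) applied to V."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key, scaled by the square root of the key dimension.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (assumed values, not learned).
tokens = ["cat", "sleeps", "couch"]
Q = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
K = [[2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, weights = attention(Q, K, V)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how one token distributes its attention over the three tokens. With these toy values, the row for 'sleeps' puts its largest weight on 'cat'.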

The most widely used variant is 'self-attention', where each token computes its attention scores with respect to the tokens of the same sequence (all of them in an encoder, only the preceding ones in a decoder-only model such as GPT). Transformers also use 'multi-head attention', which runs several attention mechanisms in parallel, allowing the model to capture different types of relationships (syntactic, semantic, logical) simultaneously.
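The idea of running several heads in parallel can be sketched as follows. This is a deliberately simplified version under stated assumptions: real Transformers apply learned projection matrices to produce each head's Query, Key and Value vectors, whereas this sketch simply slices each token vector into equal parts and uses the slice as Q, K and V at once. The function names and input values are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Self-attention with Q = K = V = x (assumption: no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return out

def multi_head_self_attention(x, num_heads):
    """Split each token vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        head_input = [tok[lo:hi] for tok in x]  # this head sees only its slice
        head_outputs.append(self_attention(head_input))
    # Concatenate the per-head results back into one vector per token.
    return [[val for head in head_outputs for val in head[t]] for t in range(len(x))]

# Three tokens with 4-dimensional toy embeddings (assumed values).
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 0.0]]
y = multi_head_self_attention(x, num_heads=2)
print(len(y), len(y[0]))  # 3 4
```

Because each head computes its own weights over its own slice, two heads can attend to different tokens for the same position, which is the property the paragraph above describes.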

In prompt engineering, understanding attention is useful because it helps explain why the position and wording of instructions in a prompt influence the quality of responses. In practice, information placed at the beginning or end of a long prompt tends to be used more reliably than information buried in the middle, and clear, structured instructions make it easier for the model to identify what is relevant.

Etymology

The term 'attention' is borrowed from cognitive science, where it refers to the human brain's ability to focus selectively on certain information while ignoring the rest. In artificial intelligence, this metaphor was first formalized mathematically by Bahdanau et al. (2014) in the context of machine translation, before being generalized by Vaswani et al. (2017) in the Transformer architecture.

Concrete examples

Structuring a long prompt to maximize model attention

Here is a 3-page document. Your main task (IMPORTANT): extract only the Q3 2025 revenue figures. Ignore everything else. Document: [...]

Exploiting attention by placing key instructions at strategic positions

ABSOLUTE RULE: respond only in French.

[Prompt content...]

Reminder: your response must be entirely in French.

Understanding why a model loses track on very long contexts

Summarize the key points of each section separately, then provide an overall synthesis. This will help me verify that you haven't missed anything in the document.

Practical usage

In prompt engineering, the attention mechanism helps explain why critical instructions work best at the beginning or end of the prompt, and why structural clarity (lists, headings, separators) improves results. When working with long contexts, break down your requests and repeat important instructions to compensate for the dilution of attention over long sequences.
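As an illustration of this advice, here is a small sketch of a helper that places the critical instruction first, wraps the document in explicit separators, and repeats the instruction at the end. The function name `build_prompt`, the `<document>` tags and the 'Reminder:' prefix are assumptions made for the example, not a standard API or a required format.

```python
def build_prompt(instruction: str, document: str, reminder_prefix: str = "Reminder: ") -> str:
    """Put the instruction first, the document in the middle, and a reminder last."""
    instruction = instruction.strip()
    parts = [
        "INSTRUCTION",
        instruction,
        "",
        "<document>",  # explicit separators mark where the content starts and ends
        document.strip(),
        "</document>",
        "",
        reminder_prefix + instruction,  # repeat the key instruction at the end
    ]
    return "\n".join(parts)

prompt = build_prompt("Respond only in French.", "[Prompt content...]")
print(prompt)
```

The resulting prompt follows the same shape as the second example in the 'Concrete examples' section: rule, content, reminder.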

Related concepts

Transformer, Token, Context Window, Embedding

FAQ

What is the difference between attention and self-attention?

Classic attention (or cross-attention) computes relationships between two different sequences, e.g., a question and a document. Self-attention computes relationships between all elements within the same sequence, allowing each word to 'look at' the other words in the same text. Large language models rely predominantly on self-attention.

Why is the attention mechanism so important for LLMs?

Before attention, language models (RNNs, LSTMs) processed words sequentially and tended to lose information about distant words. Attention directly connects any word to any other, regardless of distance. This is what enables LLMs to understand complex sentences, follow long instructions, and maintain coherence over extended texts.

How can knowledge of attention be used to write better prompts?

Three practical principles: (1) Place your most important instructions at the beginning and end of the prompt, as these positions tend to be used most reliably. (2) Use clear visual markers (headings, lists, XML tags) to help the model identify structure. (3) For long contexts, repeat key instructions and ask the model to process information section by section rather than as one block.
