Positional Encoding: Definition and Examples

Positional Encoding is a technique used in Transformer architectures to inject information about the position of each token in a sequence, allowing the model to understand word order.

Full definition

Positional Encoding is a fundamental mechanism of the Transformer architecture, introduced in the paper "Attention Is All You Need" (2017). Unlike recurrent networks (RNNs), which process tokens sequentially and naturally capture order, Transformers process all tokens in parallel via the attention mechanism. Without positional information, the model would make no distinction between "The cat eats the mouse" and "The mouse eats the cat."

The original positional encoding uses sinusoidal functions of different frequencies to generate a unique vector for each position in the sequence. These vectors are added to the token embeddings before the first Transformer layer. The authors chose sine and cosine functions for two reasons: they hypothesized that the model might extrapolate to sequence lengths longer than those seen during training, and the encoding of a position at a fixed offset is a linear function of the encoding of the original position, which makes relative distances between tokens easier to learn.
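As a rough illustration of the sinusoidal scheme, the sketch below computes one position vector and adds it to a list of token embeddings. It assumes the base of 10000 used in the 2017 paper and an even embedding size. The function names `sinusoidal_encoding` and `add_positions` are made up for this example.

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Return the sinusoidal position vector for a single position.

    Each pair of dimensions shares one frequency: the even index holds the
    sine and the odd index holds the cosine. Frequencies decrease
    geometrically as the dimension index grows.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    vector = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        vector.append(math.sin(angle))
        vector.append(math.cos(angle))
    return vector


def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the matching position vector to each token embedding."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(token, sinusoidal_encoding(pos, d_model))]
        for pos, token in enumerate(embeddings)
    ]


# Position 0 gives sin(0) = 0 and cos(0) = 1 in every pair of dimensions.
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Real implementations usually precompute the whole table once as a tensor rather than rebuilding a vector per token.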

Since the foundational paper, several variants have emerged. Learned positional embeddings let the model directly learn one vector per position during training, as in BERT and the early GPT models. More recently, RoPE (Rotary Position Embedding), used in LLaMA and other modern models, encodes position by rotating the query and key vectors, so that attention scores depend on the relative distance between tokens. In practice this has proven easier to extend to long sequences than learned absolute embeddings.
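The rotation idea behind RoPE can be sketched in a few lines. This is a simplified illustration, not the implementation used by any particular model: it rotates adjacent pairs of dimensions, assumes a base of 10000, and uses the made-up names `rope_rotate` and `dot`.

```python
import math


def rope_rotate(vector: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent pair of dimensions by an angle proportional to position."""
    d = len(vector)
    if d % 2 != 0:
        raise ValueError("vector length must be even in this sketch")
    rotated = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * c - y * s)
        rotated.append(x * s + y * c)
    return rotated


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (3 positions), different absolute positions.
near = dot(rope_rotate(query, 5), rope_rotate(key, 2))
far = dot(rope_rotate(query, 13), rope_rotate(key, 10))
print(math.isclose(near, far, abs_tol=1e-9))
```

The two scores agree because rotating both vectors by the same extra angle leaves their dot product unchanged, which is why the attention score only reflects the offset between the two tokens.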

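For language model users, Positional Encoding has a direct impact: together with the training sequence length and memory cost, it bounds the model's usable context window (4K, 128K, 1M tokens, etc.) and affects how well the model maintains coherence over long texts. Techniques such as ALiBi, which replaces position vectors with a distance-based penalty on attention scores, or position interpolation, which rescales position indices into the range seen during training, help a model handle sequences longer than its training length.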
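Position interpolation can be pictured as a simple rescaling of position indices. The sketch below shows only that rescaling step, under the assumption of a linear scale factor; the function name `interpolate_positions` is hypothetical, and in practice the technique is usually combined with some fine-tuning.

```python
def interpolate_positions(seq_len: int, trained_len: int) -> list[float]:
    """Return position indices squeezed into the range seen during training.

    Sequences that already fit are left unchanged. Longer sequences are
    scaled down linearly so that every index stays below trained_len.
    """
    if seq_len <= trained_len:
        return [float(p) for p in range(seq_len)]
    scale = trained_len / seq_len
    return [p * scale for p in range(seq_len)]


# A model trained on 4 positions receives 8 tokens:
print(interpolate_positions(8, 4))  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```

The fractional indices would then be fed to a position function that accepts real-valued positions, such as the rotation angles used by RoPE.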

Etymology

The term combines "positional" (relating to position) and "encoding" (numerical representation of information). It was formalized by Vaswani et al. in the foundational 2017 paper on Transformers, although the idea of encoding position in neural networks existed in other forms before.

Concrete examples

Understanding a model's context limits

My document is 200,000 tokens long. What strategies should I use to process it with a model limited to 128K context tokens, considering that positional encoding may lose precision at the edges of the window?

Optimizing information placement in a long prompt

I have a very long prompt with instructions, context, and a question. Where should I place the most critical information to maximize response quality, knowing that positional encoding influences attention to different positions?

Comparing model architectures

Compare the positional encoding approaches of GPT-4, Claude, and LLaMA 3. How do their technical choices (learned encoding, RoPE, etc.) influence their abilities on long contexts?

Practical usage

In prompt engineering, understanding Positional Encoding helps structure prompts effectively. Models tend to use information placed at the beginning and end of a prompt more reliably than information placed in the middle (the "lost in the middle" effect), and the way a model handles position is one contributing factor. For long documents, it is advisable to place critical instructions at the beginning of the prompt and repeat key instructions just before the final question.
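The placement advice above can be turned into a small helper. This is only one possible layout, and the function name `build_long_prompt` and the wording of the reminder line are assumptions made for the example.

```python
def build_long_prompt(instructions: str, context: str, question: str) -> str:
    """Assemble a long prompt with instructions first and repeated before the question."""
    instructions = instructions.strip()
    parts = [
        instructions,                       # critical instructions at the start
        context.strip(),                    # long document or background material
        "Reminder of the key instructions: " + instructions,
        question.strip(),                   # final question at the very end
    ]
    return "\n\n".join(parts)


prompt = build_long_prompt(
    "Answer in French and cite the section you used.",
    "...long document text...",
    "What does the contract say about termination?",
)
print(prompt)
```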

Related concepts

Transformer, Attention Mechanism, Context Window, Embedding

FAQ

Why do Transformers need Positional Encoding when RNNs do not?

RNNs process tokens one by one in order, naturally giving them position information. Transformers, on the other hand, process all tokens simultaneously thanks to the parallel attention mechanism. Without positional encoding, a Transformer would treat "John loves Mary" and "Mary loves John" identically, because the set of tokens is the same. Positional Encoding solves this by adding a position signal to each token.
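The order-blindness described above can be checked with a toy self-attention function. This is a deliberately simplified sketch: it uses identity projections (queries, keys, and values are the raw embeddings), a single head, no causal mask, and invented two-dimensional embeddings.

```python
import math


def self_attention(x: list[list[float]]) -> list[list[float]]:
    """Single-head self-attention with Q = K = V = x and no position signal."""
    d = len(x[0])
    outputs = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(
            [sum(w / total * v[j] for w, v in zip(weights, x)) for j in range(d)]
        )
    return outputs


john, loves, mary = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
a = self_attention([john, loves, mary])
b = self_attention([mary, loves, john])

# "John" gets the same output vector in both sentences, up to float rounding.
print(all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a[0], b[2])))
```

Reordering the input only reorders the outputs, so nothing in the result tells the model which word came first.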

Does Positional Encoding affect the quality of responses on long prompts?

Yes. The positional encoding method affects the model's ability to maintain coherence over long sequences. Research shows a "lost in the middle" phenomenon: models tend to use information at the beginning and end of the context more reliably than information in the middle. Methods like RoPE and ALiBi improve the handling of long sequences, but they do not eliminate this effect, so it remains wise to structure your prompts accordingly.

What is the difference between absolute and relative positional encoding?

Absolute encoding assigns one vector to each position index (position 1, position 2, etc.), whether computed by a formula as in the original Transformer or learned as in GPT-2. Relative encoding captures the distance between tokens ("this word is 3 positions away from that one"), as in T5 or with RoPE. Relative encoding tends to generalize better to sequences longer than those seen during training and more naturally captures syntactic relationships between nearby words.
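A small sketch makes the distinction concrete. The helper names `absolute_positions` and `relative_distances` are invented for this example; the point is only that shifting a phrase within a longer text changes its absolute indices but not the pairwise distances.

```python
def absolute_positions(n: int, offset: int = 0) -> list[int]:
    """Absolute view: each token is identified by its index in the sequence."""
    return [offset + i for i in range(n)]


def relative_distances(positions: list[int]) -> list[list[int]]:
    """Relative view: entry [i][j] is how far token j is from token i."""
    return [[j - i for j in positions] for i in positions]


at_start = absolute_positions(3)               # [0, 1, 2]
deep_in_text = absolute_positions(3, offset=100)  # [100, 101, 102]

print(relative_distances(at_start))
# [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]
print(relative_distances(at_start) == relative_distances(deep_in_text))  # True
```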

