
Transformer: Definition and Examples

A neural network architecture introduced by Google researchers in 2017 and built on the attention mechanism. It underpins most modern large language models, such as GPT, Claude, and Gemini.

Full definition

The Transformer is a deep neural network architecture introduced in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017). Unlike recurrent architectures (RNN, LSTM), which process sequences one token at a time, the Transformer processes an entire sequence in parallel thanks to a mechanism called "self-attention." This innovation enabled large gains in training speed and in the ability to capture relationships between distant words in a text.
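To make self-attention more concrete, here is a minimal sketch in plain Python. It makes simplifying assumptions: the token vectors are made-up numbers, and the same vectors serve as queries, keys, and values, whereas a trained Transformer derives each of these through learned projections. The function names are illustrative and do not come from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence (no learned weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-token sequence with 2-dimensional vectors (assumed values, for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how much one token draws on every other token. No row depends on another row's result, which is why the computation can be run for all tokens at once rather than step by step.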

The core of the Transformer rests on three key components: embeddings (vector representations of tokens), the multi-head attention mechanism (which allows the model to "look" simultaneously at different parts of the input sequence), and feed-forward layers. The original architecture has an encoder (which understands the input) and a decoder (which generates the output), but many variants use only one of the two — GPT and Claude use only the decoder, while BERT uses only the encoder.
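One practical difference between encoder-style and decoder-style attention, not spelled out above, is which positions a token is allowed to look at. In the usual setup, an encoder lets every token attend to the whole sequence, while a decoder hides future positions so that text can be generated left to right. The sketch below illustrates this under simple assumptions: the scores are placeholder zeros and `attention_weights` is a hypothetical helper, not a library function.

```python
import math

def attention_weights(scores, causal):
    """Softmax each row of a square score matrix, optionally hiding future positions."""
    rows = []
    for i, row in enumerate(scores):
        # In causal (decoder-style) mode, token i may only see positions 0..i.
        visible = [s if (not causal or j <= i) else -math.inf for j, s in enumerate(row)]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Placeholder scores: every token is equally relevant to every other token.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)
```

With equal scores, `encoder_style` spreads attention evenly across all three tokens in every row, while the first row of `decoder_style` puts all of its weight on the first token because later positions are masked out.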

What makes the Transformer revolutionary is how well it scales. As the number of parameters, the amount of training data, and the compute budget increase, performance improves in a predictable way, a pattern described by "scaling laws." This property has driven the race toward ever larger models, from GPT-2 (1.5 billion parameters) to GPT-3 (175 billion) and on to models such as GPT-4 and Claude, whose parameter counts have not been publicly disclosed.
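Scaling laws are commonly written as power laws, and a short sketch shows what "predictable" means in that setting. The constants `scale` and `exponent` below are placeholder values chosen for illustration, not measured results, and `predicted_loss` is a hypothetical function.

```python
def predicted_loss(num_params, scale=1e13, exponent=0.08):
    """Power-law form often used for scaling laws: loss = (scale / N) ** exponent."""
    return (scale / num_params) ** exponent

# Each step multiplies the parameter count by 10.
for n in (1.5e9, 15e9, 150e9):
    print(f"{n:.1e} parameters -> predicted loss {predicted_loss(n):.3f}")
```

Under a power law, every tenfold increase in parameters multiplies the predicted loss by the same factor. That regularity is what lets a curve fitted on smaller models be extrapolated to larger ones.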

Today, the Transformer is no longer limited to text. The architecture has been successfully adapted to vision (Vision Transformer / ViT), audio, video, robotics, and even molecular biology (AlphaFold). It has become one of the main foundations of modern generative artificial intelligence.

Etymology

The name "Transformer" comes from its ability to transform an input sequence into an output sequence via the attention mechanism. The term was introduced by the Google Brain and Google Research team in their 2017 paper, whose provocative title — "Attention Is All You Need" — emphasized that attention alone was sufficient, without recurrence or convolution.

Concrete examples

Understanding the internal workings of a model

Explain the attention mechanism in a Transformer to me as if I were a web developer with no machine learning background.

Comparing architectures for a technical choice

What are the differences between an encoder-only Transformer (like BERT), decoder-only (like GPT), and encoder-decoder (like T5)? For each type, give an ideal use case.

Explaining for an article or presentation

Write a simple analogy to explain how self-attention allows a Transformer to understand the context of a word in a sentence. Use an everyday metaphor.

Practical usage

Understanding the Transformer architecture helps with better prompting: knowing that the model processes tokens in parallel with an attention mechanism explains why the position and structure of your prompt matter. Placing important instructions at the beginning or end of the prompt, clearly structuring sections, and providing explicit context are practices directly linked to how attention distributes its "focus" on your text.

Related concepts

Self-Attention, Token, Embedding, Large Language Model (LLM)

FAQ

What is the difference between a Transformer and an LLM?
The Transformer is an architecture — a blueprint. An LLM (Large Language Model) is a concrete model built on this architecture, trained on enormous amounts of data. By analogy, the Transformer is the blueprint of a building, and GPT-4 or Claude are specific buildings constructed according to this blueprint, each with its own finishes and features.
Why did the Transformer replace RNNs and LSTMs?
RNNs and LSTMs processed sequences word by word, making them slow to train and poor at capturing relationships between distant words. The Transformer processes the entire sequence in parallel thanks to attention, which makes it much faster to train on GPUs and much better at understanding the overall context of a text.
Is it necessary to understand Transformers to prompt well?
It is not essential, but it is a real advantage. Understanding that the model uses attention to weigh the relative importance of each part of your prompt helps you structure your instructions more effectively. For example, you will understand why repeating an important instruction, or providing context at the beginning of the prompt, strongly influences the response.
