Tokenization: Definition and Examples
Tokenization is the process by which a language model breaks down text into elementary units called tokens, which can be words, subwords, or individual characters.
Full definition
Tokenization is the fundamental step by which an artificial intelligence model transforms raw text into a sequence of tokens, each of which is mapped to a numerical ID the model can process. Without this step, an LLM would be unable to understand or generate human language.
Contrary to what one might think, a token does not always correspond to a whole word. Modern tokenization approaches such as BPE (Byte Pair Encoding), often applied through tools like SentencePiece, split text into frequent sub-units. For example, the word 'unbelievably' might be split into 'un' + 'believ' + 'ably', although the exact split depends on the tokenizer. Common words like 'the' or 'is' usually form a single token, while rare or technical words are fragmented into several tokens.
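To make the idea of frequent sub-units concrete, here is a minimal sketch of how BPE-style merges could be learned from a tiny corpus. It is a toy version written for illustration, not the implementation used by any particular model: the function names (learn_bpe_merges, merge_pair, segment) and the sample corpus are assumptions, and real tokenizers add details such as byte-level handling and word-boundary markers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE, hypothetical helper)."""
    # Every word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair gets merged
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged everywhere.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Split a word into sub-units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Illustrative corpus: 'the' is frequent, 'thin' is rare.
sample_words = ["the"] * 5 + ["then"] * 2 + ["thin"]
merges = learn_bpe_merges(sample_words, num_merges=2)
print(merges)                   # [('t', 'h'), ('th', 'e')]
print(segment("the", merges))   # ['the']
print(segment("thin", merges))  # ['th', 'i', 'n']
```

In this sketch the frequent word 'the' ends up as a single unit after two merges, while the rarer 'thin' stays split into three pieces, which mirrors the behavior described above.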
This mechanism has direct consequences for using LLMs. The number of tokens determines the cost of an API call, limits the maximum length of a conversation (the context window), and can even affect the quality of responses. As a rule of thumb, a token represents about 3 to 4 characters in French, or about 0.75 words. French generally consumes slightly more tokens than English to express the same idea.
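The rule of thumb above can be turned into a rough estimator. The sketch below is only an approximation under the stated ratios (about 4 characters or 0.75 words per token); the function name estimate_tokens and its default parameters are assumptions, and a real tokenizer should be used whenever an exact count matters.

```python
import math


def estimate_tokens(text, chars_per_token=4.0, words_per_token=0.75):
    """Return two rough token estimates: one from characters, one from words.

    Both ratios are rule-of-thumb assumptions, not properties of any real tokenizer.
    """
    by_chars = math.ceil(len(text) / chars_per_token)
    by_words = math.ceil(len(text.split()) / words_per_token)
    return by_chars, by_words


by_chars, by_words = estimate_tokens("Tokenization splits text into small units.")
print(by_chars, by_words)  # 11 8
```

The two estimates rarely agree exactly; treating them as a range rather than a single number is usually the safer reading.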
Understanding tokenization allows one to optimize prompts: reduce unnecessary tokens, anticipate context limits, and better estimate costs. It is a key skill for any prompt engineering practitioner who wants to work efficiently with language model APIs.
Etymology
The term comes from the English 'token', from Old English 'tācen' meaning sign or symbol. In computational linguistics, the concept of tokenization has existed since the 1960s. The BPE (Byte Pair Encoding) algorithm, originally designed for data compression, was adapted for natural language processing in the mid-2010s, and subword tokenization became central with the advent of Transformer models in 2017.
Concrete examples
Estimating the cost of an API call
Before sending this 5,000-word document to the Claude API, I estimate that it represents about 7,500 tokens in French so that I can calculate the cost.
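The arithmetic behind this example can be sketched as follows. The ratio of 1.5 tokens per word matches the 5,000-word, 7,500-token figure above; the price per million tokens is a purely hypothetical placeholder, not an actual rate of any provider, and the function name estimate_cost is an assumption.

```python
def estimate_cost(word_count, tokens_per_word=1.5, price_per_million_tokens=3.0):
    """Estimate input tokens and cost for a document.

    `price_per_million_tokens` is a hypothetical value for illustration only;
    check the provider's current pricing before relying on the result.
    """
    tokens = round(word_count * tokens_per_word)
    cost = tokens / 1_000_000 * price_per_million_tokens
    return tokens, cost


tokens, cost = estimate_cost(5000)
print(tokens)          # 7500
print(f"{cost:.4f}")   # 0.0225 (in the currency unit of the hypothetical price)
```

This covers input tokens only; output tokens are typically billed separately and would need their own estimate.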
Optimizing a prompt to fit the context window
My context is 180,000 tokens and the limit is 200,000. I need to summarize some sections to leave room for the model's response.
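A small budget check along these lines could look like the sketch below. The numbers 180,000 and 200,000 come from the example; the amount reserved for the response and the function names are assumptions chosen for illustration.

```python
def remaining_tokens(context_tokens, context_limit, reserved_for_response=0):
    """Tokens still available after the current context and the reserved response space."""
    return context_limit - context_tokens - reserved_for_response


def tokens_to_trim(context_tokens, context_limit, reserved_for_response=0):
    """How many tokens must be cut from the context to fit; 0 if it already fits."""
    return max(0, -remaining_tokens(context_tokens, context_limit, reserved_for_response))


print(remaining_tokens(180_000, 200_000))        # 20000
# Hypothetical: reserving 30,000 tokens for the response leaves a 10,000-token overshoot.
print(tokens_to_trim(180_000, 200_000, 30_000))  # 10000
```

Running the check before each request makes it clear how much of the context needs to be summarized, rather than discovering the overflow from an API error.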
Understanding segmentation errors
If the model struggles with a technical term like 'deoxyribonucleic', one likely reason is that tokenization fragments it into many infrequent sub-tokens, which can reduce accuracy.
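The fragmentation effect can be illustrated with a greedy longest-match segmenter over a tiny vocabulary. This is a simplified, WordPiece-style stand-in rather than BPE itself, and the vocabulary below is invented for the example; a real tokenizer would produce a different split.

```python
def greedy_segment(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`.

    Single characters are always accepted as a fallback, so the loop terminates.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or is a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary: common words are present whole, rare ones only in pieces.
toy_vocab = {"the", "is", "de", "oxy", "rib", "nucle", "ic"}

print(greedy_segment("the", toy_vocab))
# ['the']
print(greedy_segment("deoxyribonucleic", toy_vocab))
# ['de', 'oxy', 'rib', 'o', 'nucle', 'ic']
```

The common word costs one token while the technical term costs six, and each of those six pieces carries little meaning on its own.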
Practical usage
In prompt engineering, understanding tokenization allows you to write more effective and economical prompts. Prefer concise wording, avoid unnecessary repetitions, and keep in mind that in French a word consumes on average 1.3 to 1.5 tokens. Use tools like OpenAI's tokenizer or Anthropic's API to accurately count your tokens before an expensive submission.