Tokenization: Definition and Examples

Tokenization is the process by which a language model breaks down text into elementary units called tokens, which can be words, subwords, or individual characters.

Full definition

Tokenization is the fundamental step by which an artificial intelligence model transforms raw text into a sequence of tokens, each mapped to an integer ID that the model can process. Without this step, an LLM could neither read nor generate human language.

Contrary to what one might expect, a token does not always correspond to a whole word. Modern tokenization methods like BPE (Byte Pair Encoding) or SentencePiece split text into frequently occurring sub-units. For example, the word 'unbelievably' might be split into 'unbeliev' + 'ably'. Common words like 'the' or 'is' usually form a single token, while rare or technical words are fragmented into several tokens.
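To make the idea of "frequent sub-units" concrete, here is a minimal sketch of how BPE-style merges can be learned from a tiny corpus. It is a toy illustration, not the tokenizer of any real model: the function names (merge_pair, learn_bpe, apply_bpe), the six-word corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself (an
        # arbitrary but deterministic choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: 'the' is frequent, 'then' is rare.
toy_corpus = ["the"] * 5 + ["then"]
merges = learn_bpe(toy_corpus, num_merges=2)
print(apply_bpe("the", merges))   # ['the']
print(apply_bpe("then", merges))  # ['the', 'n']
print(apply_bpe("that", merges))  # ['th', 'a', 't']
```

The frequent word ends up as a single token, while a word the merges never saw stays fragmented, which mirrors the behavior described above on a very small scale.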

This mechanism has direct consequences for using LLMs. The number of tokens determines the cost of an API call and the maximum length of a conversation (context window), and it can even affect the quality of responses. As a rule of thumb, a token represents about 3 to 4 characters in French, or roughly 0.75 words. French consumes slightly more tokens than English to express the same idea.

Understanding tokenization allows one to optimize prompts: reduce unnecessary tokens, anticipate context limits, and better estimate costs. It is a key skill for any prompt engineering practitioner who wants to work efficiently with language model APIs.

Etymology

The term comes from the English 'token', from Old English 'tācen' meaning sign or symbol. In computational linguistics, the concept of tokenization has existed since the 1960s, but it became central with the rise of neural language models. The BPE (Byte Pair Encoding) algorithm, originally designed for data compression, was adapted for natural language processing in the mid-2010s and then widely adopted by models built on the Transformer architecture introduced in 2017.

Concrete examples

Estimating the cost of an API call

Before sending this 5,000-word document to the Claude API, I estimate that it represents about 7,500 tokens in French so that I can calculate the cost.
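The arithmetic behind this estimate can be sketched as follows. The helper name estimate_tokens is hypothetical, and the 1.3 to 1.5 tokens-per-word range is the French rule of thumb quoted in this article, not a measured value; a real tokenizer will give a different count for any particular text.

```python
def estimate_tokens(word_count, tokens_per_word=(1.3, 1.5)):
    """Return a (low, high) token estimate from a word count.

    The default ratios are this article's rule of thumb for French;
    they are approximations, not guarantees.
    """
    low_ratio, high_ratio = tokens_per_word
    return round(word_count * low_ratio), round(word_count * high_ratio)


low, high = estimate_tokens(5000)
print(low, high)  # 6500 7500
```

Planning with the upper bound (7,500 tokens here) is the more cautious choice when budgeting.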

Optimizing a prompt to fit the context window

My context is 180,000 tokens and the limit is 200,000. I need to summarize some sections to leave room for the model's response.
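A small sketch of this budget check, assuming you already know your token counts. The function names and the 4,000-token reserve for the response are hypothetical values chosen for illustration; the right reserve depends on how long you expect the answer to be.

```python
def remaining_budget(context_tokens, window_limit, reserved_for_response):
    """Tokens still free after the context and the reserved response space."""
    return window_limit - context_tokens - reserved_for_response


def tokens_to_trim(context_tokens, window_limit, reserved_for_response):
    """How many context tokens must be cut (0 if everything already fits)."""
    return max(0, -remaining_budget(context_tokens, window_limit, reserved_for_response))


# Figures from the example above, with a hypothetical 4,000-token reserve.
print(remaining_budget(180_000, 200_000, reserved_for_response=4_000))  # 16000
# With a larger hypothetical reserve, some context has to be summarized.
print(tokens_to_trim(180_000, 200_000, reserved_for_response=30_000))   # 10000
```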

Understanding segmentation errors

If the model struggles with a technical term like 'deoxyribonucleic', it's because tokenization fragments it into many infrequent sub-tokens, reducing accuracy.

Practical usage

In prompt engineering, understanding tokenization allows you to write more effective and economical prompts. Prefer concise wording, avoid unnecessary repetition, and keep in mind that in French a word consumes on average 1.3 to 1.5 tokens. Use tools like OpenAI's tokenizer or Anthropic's API to count your tokens accurately before an expensive submission.

Related concepts

Context window, Embedding, BPE (Byte Pair Encoding), LLM (Large Language Model)

FAQ

How many tokens does a word represent in French?
In French, a word represents on average 1.3 to 1.5 tokens. This is slightly more than in English (about 1.1 tokens per word) because French uses longer words, accents, and more varied conjugations, which tokenizers trained primarily on English break down further.
What is the difference between a token and a word?
A word is a linguistic unit delimited by spaces, while a token is a processing unit for the model. A common word like 'hello' forms a single token, but a rare word like 'antidisestablishment' will be split into several tokens (e.g., 'anti' + 'dis' + 'establish' + 'ment'). Punctuation marks and spaces can also constitute tokens in their own right.
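The difference can be illustrated with a toy segmenter. This sketch uses greedy longest-match against a fixed vocabulary (a simpler technique than BPE, chosen here for readability); the function name segment and the five-entry TOY_VOCAB are assumptions made for this example, and real model vocabularies contain tens of thousands of entries and may split these words differently.

```python
def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first and shrink until one is known.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


# Hypothetical vocabulary, just large enough for the two examples.
TOY_VOCAB = {"hello", "anti", "dis", "establish", "ment"}

print(segment("hello", TOY_VOCAB))
# ['hello']
print(segment("antidisestablishment", TOY_VOCAB))
# ['anti', 'dis', 'establish', 'ment']
```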
Why does tokenization affect API pricing?
API providers like Anthropic or OpenAI charge per use by counting tokens for input (your prompt) and output (the generated response). The more tokens your prompt contains, the more expensive it is. Therefore, optimizing prompt length, using concise instructions, and avoiding superfluous context can significantly reduce costs, especially at scale.
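As a rough sketch of how such a bill adds up, the function below combines input and output token counts with per-million-token prices. The name call_cost and the prices used (3.00 and 15.00 per million tokens) are hypothetical placeholders, not any provider's actual rates; check the current pricing page before relying on a figure.

```python
def call_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one API call, given prices expressed per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
cost = call_cost(7_500, 1_000, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"{cost:.4f}")  # 0.0375
```

Output tokens are often priced higher than input tokens, so a shorter requested response can matter as much as a shorter prompt.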
