Tokenization: Definition and Examples

Tokenization is the process by which a language model breaks down text into elementary units called tokens, which can be words, subwords, or individual characters.

Full definition

Tokenization is the fundamental step by which an artificial intelligence model transforms raw text into a sequence of tokens, each mapped to a numerical ID that the model can process. Without this step, an LLM would be unable to understand or generate human language.

Contrary to what one might think, a token does not always correspond to a whole word. Modern tokenization algorithms like BPE (Byte Pair Encoding) or SentencePiece split text into frequent sub-units. For example, the word 'unbelievably' might be split into 'unbeliev' + 'ably'. Common words like 'the' or 'is' usually form a single token, while rare or technical words are fragmented into several tokens.
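
To make the idea of frequent sub-units concrete, here is a minimal sketch of how BPE learns its merges, written in Python. It is a toy illustration, not the implementation used by any production tokenizer: the function names (learn_bpe, merge_pair, encode) and the tiny corpus are assumptions made for this example, and real tokenizers typically work on bytes, train on far larger corpora, and learn tens of thousands of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start with each word split into single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair gets merged
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: 'the' is frequent, so it ends up as one token.
corpus = ["the"] * 5 + ["there"] * 2
merges = learn_bpe(corpus, num_merges=2)
print(encode("the", merges))    # ['the']
print(encode("zebra", merges))  # unseen word: stays split into characters
```

Because 'the' dominates the toy corpus, two merges are enough to turn it into a single token, while a word that shares no learned pair with the corpus stays fragmented into individual characters.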

This mechanism has direct consequences for anyone using LLMs. The number of tokens determines the cost of an API call, how much text fits in a conversation (the context window), and can even affect the quality of responses. As a rule of thumb, a token represents about 3 to 4 characters in French, or about 0.75 words. French generally consumes slightly more tokens than English to express the same idea.

Understanding tokenization allows one to optimize prompts: reduce unnecessary tokens, anticipate context limits, and better estimate costs. It is a key skill for any prompt engineering practitioner who wants to work efficiently with language model APIs.

Etymology

The term comes from the English 'token', from Old English 'tācen', meaning sign or symbol. In computational linguistics, tokenization has existed as a concept since the 1960s, but it gained prominence with the rise of Transformer models from 2017 onward, which rely on subword algorithms such as BPE (Byte Pair Encoding), an algorithm originally designed for data compression and later adapted for natural language processing.

Concrete examples

Estimating the cost of an API call

Before sending this 5,000-word document to the Claude API, I estimate that it represents about 7,500 tokens in French so that I can calculate the cost.
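
The arithmetic behind this estimate can be sketched in a few lines of Python. This is a rough approximation under stated assumptions: the helper names are invented for this example, the 1.3 to 1.5 tokens-per-word ratio is the rule of thumb given in this article, and the price per million tokens is a placeholder, not any provider's actual rate.

```python
def estimate_tokens(word_count, tokens_per_word=1.5):
    """Rough token estimate from a word count (rule of thumb, not exact)."""
    return round(word_count * tokens_per_word)


def estimate_input_cost(tokens, price_per_million):
    """Input cost for `tokens`, given a price per million tokens."""
    return tokens / 1_000_000 * price_per_million


low = estimate_tokens(5_000, tokens_per_word=1.3)   # 6500
high = estimate_tokens(5_000, tokens_per_word=1.5)  # 7500

# Placeholder price: check your provider's current pricing page.
HYPOTHETICAL_PRICE_PER_MILLION = 3.00
print(f"Estimated tokens: {low} to {high}")
print(f"Estimated input cost: {estimate_input_cost(high, HYPOTHETICAL_PRICE_PER_MILLION):.4f}")
```

Using the upper bound of the range gives a conservative estimate; an exact count requires running the provider's own tokenizer on the text.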

Optimizing a prompt to fit the context window

My context is 180,000 tokens and the limit is 200,000. I need to summarize some sections to leave room for the model's response.
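
A context budget like this one can be checked with a small helper. The sketch below is illustrative only: the function names and the 30,000-token reserve for the response are assumptions chosen for the example, not values imposed by any API.

```python
def remaining_tokens(context_limit, used_tokens):
    """Tokens still available in the context window."""
    return context_limit - used_tokens


def tokens_to_trim(context_limit, used_tokens, reserved_for_response):
    """How many tokens must be cut so the reserved response still fits."""
    overflow = used_tokens + reserved_for_response - context_limit
    return max(0, overflow)


limit, used = 200_000, 180_000
print(remaining_tokens(limit, used))  # 20000

# Hypothetical reserve: suppose 30,000 tokens should stay free for the answer.
print(tokens_to_trim(limit, used, reserved_for_response=30_000))  # 10000
```

With these numbers, the context leaves 20,000 tokens free, so about 10,000 tokens of existing material would need to be summarized or removed to keep the assumed reserve.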

Understanding segmentation errors

If the model struggles with a technical term like 'deoxyribonucleic', it's because tokenization fragments it into many infrequent sub-tokens, reducing accuracy.

Practical usage

In prompt engineering, understanding tokenization allows you to write more effective and economical prompts. Prefer concise wording, avoid unnecessary repetition, and keep in mind that in French a word consumes on average 1.3 to 1.5 tokens. Use tools like OpenAI's tokenizer or Anthropic's token-counting API to count your tokens accurately before submitting a costly request.

Related concepts

Context window
Embedding
BPE (Byte Pair Encoding)
LLM (Large Language Model)

FAQ

How many tokens does a word represent in French?

In French, a word represents on average 1.3 to 1.5 tokens. This is slightly more than in English (about 1.1 tokens per word) because French uses longer words, accents, and more varied conjugations, which tokenizers trained primarily on English text break down further.

What is the difference between a token and a word?

A word is a linguistic unit delimited by spaces, while a token is a processing unit for the model. A common word like 'hello' forms a single token, but a rare word like 'antidisestablishment' will be split into several tokens (e.g., 'anti' + 'dis' + 'establish' + 'ment'). Punctuation marks and spaces can also constitute tokens in their own right.
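
The gap between counting words and counting tokens can be seen with a deliberately naive splitter. The sketch below is an assumption-laden simplification: the regular expression and the function name are invented for this example, and real subword tokenizers go further (for instance by splitting rare words into pieces or attaching a leading space to the following word).

```python
import re


def naive_tokens(text):
    """Split text into runs of word characters, single punctuation marks, and whitespace runs."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)


text = "Hello, world!"
words = text.split()         # space-delimited words
tokens = naive_tokens(text)  # processing units, punctuation and spaces included

print(words)   # ['Hello,', 'world!']
print(tokens)  # ['Hello', ',', ' ', 'world', '!']
```

The same two-word sentence yields five units once punctuation and the space are counted separately, which is one reason token counts exceed word counts.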

Why does tokenization affect API pricing?

API providers like Anthropic or OpenAI charge per use by counting tokens for input (your prompt) and output (the generated response). The more tokens your prompt contains, the more expensive it is. Therefore, optimizing prompt length, using concise instructions, and avoiding superfluous context can significantly reduce costs, especially at scale.
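
As a rough illustration of how input and output tokens add up, and how a shorter prompt pays off at scale, consider the Python sketch below. All figures are hypothetical: the prices per million tokens, the token counts, and the call volume are placeholders chosen for the example, and the function name is not part of any provider's SDK.

```python
def call_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one call, with separate prices per million input and output tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Placeholder prices, not real rates: check your provider's pricing page.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

original = call_cost(2_000, 500, INPUT_PRICE, OUTPUT_PRICE)  # verbose prompt
trimmed = call_cost(1_500, 500, INPUT_PRICE, OUTPUT_PRICE)   # same task, 500 fewer input tokens

calls = 100_000  # hypothetical monthly volume
print(f"Per call: {original:.4f} vs {trimmed:.4f}")
print(f"Savings over {calls} calls: {(original - trimmed) * calls:.2f}")
```

The saving on a single call is tiny, but it scales linearly with the number of calls, which is why trimming a frequently reused prompt matters more than trimming a one-off request.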
