Quantization: Definition and Examples

Quantization is an optimization technique that reduces the numerical precision of an AI model's weights (e.g., from 32 bits to 8 or 4 bits) in order to shrink the model's memory footprint and speed up execution, while preserving output quality as much as possible.

Full definition

Quantization is a process that converts the parameters of a language model, typically stored as high-precision floating point numbers (FP32 or FP16), into lower-precision numerical representations like INT8 or INT4. This precision reduction drastically decreases the model's memory footprint and speeds up inference computations.
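
As a minimal sketch of the conversion described above, the following Python applies symmetric "absmax" quantization to a handful of values, mapping them to INT8 and back. The function names and the sample weights are illustrative assumptions. Production methods typically work per channel or per group of weights and are considerably more elaborate.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole list (assumption: per-tensor scaling)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration
weights = [0.1234, -0.5, 0.3333, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [12, -50, 33, 127, -100]
print([round(r, 2) for r in restored])
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for storing one byte per weight.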

Concretely, a model like LLaMA 70B requires about 140 GB of memory in FP16 (70 billion parameters at 2 bytes each). With 4-bit quantization (Q4), the same weights fit into roughly 35 GB, which brings the model within reach of high-end consumer hardware. This compression comes with a slight loss in quality, but modern methods such as GPTQ and AWQ, as well as the quantization schemes used in GGUF files, keep this degradation small.
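
The figures above follow from simple arithmetic, sketched below. The helper name is hypothetical, and the estimate covers the weights only: real quantized files store extra scaling data, and inference also needs memory for activations and the context cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70e9  # a 70-billion-parameter model
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(n_params, bits):.0f} GB")
# FP16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```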

There are two main approaches: post-training quantization (PTQ), applied to a trained model, and quantization-aware training (QAT), which incorporates the reduced precision constraint directly during training. PTQ is more widespread because it does not require retraining the model, while QAT generally offers better results at the cost of higher computational expense.

For prompt users, quantization is important because it determines the quality of responses when using a local model. A Q8 quantized model is nearly identical to the original, while Q2 shows noticeable degradation, especially on complex reasoning tasks or code generation. Choosing the right quantization level is a trade-off between available resources and expected quality.

Etymology

The term 'quantization' comes from the Latin 'quantum' (how much, what quantity). In physics it denotes the discretization of continuous quantities; in computing and signal processing, it refers to mapping a continuous value onto a finite set of discrete values. Neural network quantization predates large language models, but the term became widely familiar with their democratization from 2023 onward.

Concrete examples

Running an LLM locally on a personal computer

I want to run Mistral 7B on my PC with 16 GB of RAM. Which quantized version do you recommend, and what impact will it have on response quality?

Comparing response quality across different precisions

Generate a detailed analysis of the causes of the French Revolution. I will compare your response with that of a Q4 quantized model to evaluate quality differences.

Optimizing model deployment in production

I am deploying a customer support chatbot based on LLaMA 3. Help me choose between GPTQ and AWQ for 4-bit quantization, considering latency and response quality.

Practical usage

In prompt engineering, understanding quantization helps you choose the right local model based on your hardware resources. If you use tools like Ollama or LM Studio, prefer Q5 or Q6 versions for a good quality-performance balance, and reserve Q8 versions for demanding tasks like coding or mathematical reasoning. Also adapt the complexity of your prompts to the quantization level: a heavily quantized model will respond better to simple, direct instructions.
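
To make the hardware side of this choice concrete, here is a rough sketch that returns the highest nominal bit width whose weights fit in a given amount of memory. The function name, the list of levels, and the 80% headroom factor are all assumptions made for illustration. Actual GGUF files use slightly more bits per weight than their nominal level suggests, so treat the result as a first estimate and check the real file size. The sketch only answers "what fits"; the quality and speed preferences described above still apply.

```python
def pick_quantization(n_params, memory_gb, headroom=0.8):
    """Return the highest nominal level whose weights fit in the budget.

    headroom is the fraction of memory reserved for weights (assumed 80%);
    the rest is left for the context cache and the operating system.
    """
    levels = [("Q8", 8), ("Q6", 6), ("Q5", 5), ("Q4", 4), ("Q2", 2)]
    budget_bytes = memory_gb * 1e9 * headroom
    for label, bits in levels:
        if n_params * bits / 8 <= budget_bytes:
            return label
    return None  # nothing fits, even at the lowest level


print(pick_quantization(7e9, 16))   # a 7B model with 16 GB -> Q8
print(pick_quantization(70e9, 48))  # a 70B model with 48 GB -> Q4
print(pick_quantization(70e9, 8))   # a 70B model with 8 GB -> None
```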

Related concepts

Inference, Fine-tuning, GGUF, Model Parameters

FAQ

Does quantization significantly degrade the quality of LLM responses?

It depends on the quantization level. At Q8 (8 bits), the difference from the original model is almost imperceptible. At Q5-Q6, the degradation remains minimal for most uses. It is below Q4 that losses become noticeable, especially for tasks requiring precise reasoning. Modern techniques like AWQ and GPTQ have significantly improved the quality of heavily quantized models.
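
The trend can be illustrated with a small experiment: quantize the same values at 8, 4, and 2 bits and measure the round-trip error. This is a simplified sketch using one shared scale and synthetic values rather than real model weights. Real Q4 or Q2 formats use per-group scales and other refinements, so the absolute numbers are not representative; only the direction of the trend is.

```python
import math


def quantize_roundtrip(weights, bits):
    """Symmetric absmax quantization to `bits` bits, then back to floats."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / levels
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Synthetic, deterministic values (an assumption, not real model weights)
weights = [math.sin(i * 0.37) * 0.5 for i in range(1000)]

for bits in (8, 4, 2):
    err = rms_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit RMS error: {err:.4f}")
```

The error grows as the bit width shrinks, because fewer discrete levels are available to represent the same range of values.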

What is the difference between GGUF, GPTQ, and AWQ?

GGUF is a file format used by llama.cpp and Ollama; it is well suited to CPU inference and also supports offloading layers to a GPU. GPTQ and AWQ are quantization methods aimed at GPUs: GPTQ quantizes the model layer by layer using approximate second-order (inverse Hessian) information, while AWQ (Activation-aware Weight Quantization) preferentially preserves the weights that matter most, as identified from activations. AWQ is often reported to offer a better quality-speed trade-off on GPU.

Can I quantize a model myself or do I need to download pre-quantized versions?

Both options are possible. Pre-quantized versions are available on Hugging Face (notably by TheBloke and other contributors) for most popular models. If you want to quantize yourself, tools like llama.cpp (for GGUF), AutoGPTQ, or AutoAWQ allow you to do so. Custom quantization is useful if you have fine-tuned a model and want to optimize it for deployment.



