Quantization: Definition and Examples
Quantization is an optimization technique that reduces the numerical precision of an AI model's weights (e.g., from 32 or 16 bits to 8 or 4 bits) to shrink the model's memory footprint and speed up execution, while preserving as much of its output quality as possible.
Full definition
Quantization is a process that converts the parameters of a language model, typically stored as high-precision floating-point numbers (FP32 or FP16), into lower-precision numerical representations such as INT8 or INT4. This precision reduction drastically decreases the model's memory footprint and usually speeds up inference.
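As a rough illustration of this conversion, the sketch below applies symmetric per-tensor INT8 quantization to a small list of weights. It is a minimal example under simplifying assumptions (one scale for the whole tensor, pure Python, hypothetical function names), not the scheme used by any particular library.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; real schemes often use one scale per block or channel.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.03, 0.88]   # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("integers:", q)
print("scale:", scale)
print("restored:", restored)
```

Only the integers and the single scale need to be stored, which is where the memory saving comes from. Each restored value differs from the original by at most half a quantization step.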
Concretely, a model like LLaMA 70B requires about 140 GB of memory in FP16 (70 billion parameters at 2 bytes each). With 4-bit quantization (Q4), the same weights fit into roughly 35 GB, which brings the model within reach of high-end consumer hardware. This compression comes with some loss in quality, but modern quantization methods such as GPTQ and AWQ, as well as the quantization schemes used in GGUF model files, keep this degradation small.
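The figures above follow from simple arithmetic: number of parameters times bits per parameter. The helper below is a back-of-the-envelope sketch (the function name is hypothetical) that counts weights only. It ignores activations, the KV cache, and the small overhead of stored scales, so real memory use will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 70_000_000_000  # a 70B-parameter model

fp16_gb = weight_memory_gb(n_params, 16)
q4_gb = weight_memory_gb(n_params, 4)

print(f"FP16: {fp16_gb:.0f} GB")
print(f"Q4:   {q4_gb:.0f} GB")
```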
There are two main approaches: post-training quantization (PTQ), applied to an already trained model, and quantization-aware training (QAT), which builds the reduced-precision constraint into training itself. PTQ is more widespread because it does not require retraining the model, while QAT generally gives better results at a higher computational cost.
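QAT is commonly implemented with "fake quantization": the forward pass uses a rounded copy of each weight, while the gradient update is applied to a full-precision copy as if the rounding were not there (the straight-through estimator). The toy below sketches one such step for a single-weight model. The model, the grid step, and the function names are illustrative assumptions, not a production training loop.

```python
def fake_quantize(w, step):
    """Snap a weight to the nearest multiple of `step` (the quantization grid)."""
    return round(w / step) * step


def qat_step(w, x, y, step, lr):
    """One gradient step for the toy model y_hat = w_q * x with squared error."""
    w_q = fake_quantize(w, step)      # forward pass sees the quantized weight
    grad = 2 * (w_q * x - y) * x      # straight-through: d(w_q)/dw is treated as 1
    return w - lr * grad              # update the full-precision latent weight


w = 0.3
w = qat_step(w, x=1.0, y=0.5, step=0.25, lr=0.1)
print("latent weight after one step:", w)
```

By contrast, PTQ would simply call `fake_quantize` once on the final trained weight, with no further gradient updates.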
For prompt users, quantization matters because it affects the quality of responses when using a local model. A Q8 quantized model is typically nearly indistinguishable from the original, while Q2 shows noticeable degradation, especially on complex reasoning tasks or code generation. Choosing the right quantization level is a trade-off between available resources and expected quality.
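The trend across precision levels can be seen in a toy experiment: quantize the same values at 8, 4, and 2 bits and measure how far the restored values drift from the originals. This sketch assumes naive symmetric rounding with one scale for all values. Real formats use per-block scales and other refinements, so actual quality loss is smaller than this suggests, but the direction is the same.

```python
def quantize_roundtrip(weights, bits):
    """Quantize to a signed `bits`-bit grid (bits >= 2) and convert back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored values."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Evenly spaced illustrative values between -1.0 and 1.0, not real model weights.
weights = [i / 10 - 1 for i in range(21)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```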
Etymology
The term 'quantization' derives from the Latin 'quantum' (how much, what quantity). In physics, it denotes the discretization of continuous quantities; in computing and signal processing, it refers to mapping a continuous value onto a finite set of discrete values. Quantization of neural networks predates large language models, but the term became widely known with the democratization of locally runnable LLMs from 2023 onward.
Concrete examples
Running an LLM locally on a personal computer
I want to run Mistral 7B on my PC with 16 GB of RAM. Which quantized version do you recommend, and what impact will it have on response quality?
Comparing response quality across different precisions
Generate a detailed analysis of the causes of the French Revolution. I will compare your response with that of a Q4 quantized model to evaluate quality differences.
Optimizing model deployment in production
I am deploying a customer support chatbot based on LLaMA 3. Help me choose between GPTQ and AWQ for 4-bit quantization, considering latency and response quality.
Practical usage
In prompt engineering, understanding quantization helps you choose the right local model based on your hardware resources. If you use tools like Ollama or LM Studio, prefer Q5 or Q6 versions for a good quality-performance balance, and reserve Q8 versions for demanding tasks like coding or mathematical reasoning. Also adapt the complexity of your prompts to the quantization level: a heavily quantized model will respond better to simple, direct instructions.
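As a starting point for that choice, the sketch below picks the highest bit width whose weights fit within a memory budget. The function name, the candidate bit widths, and the 70% usable-memory fraction are all assumptions for illustration. Actual file sizes depend on the quantization format, and the context length adds further memory use.

```python
def pick_bit_width(n_params, ram_gb, usable_fraction=0.7, candidates=(8, 6, 5, 4, 3, 2)):
    """Return the largest candidate bit width whose weights fit the budget, or None.

    `usable_fraction` is a hypothetical headroom factor that leaves room for
    the operating system, the KV cache, and other runtime overhead.
    """
    budget_gb = ram_gb * usable_fraction
    for bits in candidates:              # candidates are ordered from high to low precision
        weights_gb = n_params * bits / 8 / 1e9
        if weights_gb <= budget_gb:
            return bits
    return None


# A 7B-parameter model on a machine with 16 GB of RAM.
print(pick_bit_width(7_000_000_000, 16))

# A 70B-parameter model on a machine with 48 GB of RAM.
print(pick_bit_width(70_000_000_000, 48))
```

Treat the result as an upper bound on precision rather than a recommendation: a lower bit width may still be preferable if speed matters more than quality.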