Grouped Query Attention: Definition and Examples

Attention mechanism that groups multiple query heads to share the same keys and values, thereby reducing memory and computational cost during inference of large language models.

Full definition

Grouped Query Attention (GQA) is a Transformer architecture optimization technique introduced by Google researchers in 2023. It sits halfway between classic Multi-Head Attention (MHA), where each query head has its own keys and values, and Multi-Query Attention (MQA), where all heads share a single set of key-value pairs. In GQA, query heads are divided into groups, and each group shares the same set of keys and values.
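To make the grouping concrete, here is a minimal sketch of a single decoding step written in plain Python. The function name gqa_decode_step, the list-based tensor layout, and the toy numbers are assumptions made for illustration, not code from the GQA paper or from any library; real implementations use batched tensor operations.

```python
import math

def gqa_decode_step(q, k_cache, v_cache):
    """One decoding step of grouped-query attention for a single new token.

    q:       [num_q_heads][head_dim]            query vectors of the new token
    k_cache: [num_kv_heads][seq_len][head_dim]  cached keys
    v_cache: [num_kv_heads][seq_len][head_dim]  cached values
    Returns  [num_q_heads][head_dim].
    """
    num_q_heads, num_kv_heads = len(q), len(k_cache)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    head_dim = len(q[0])

    outputs = []
    for h, q_h in enumerate(q):
        kv = h // group_size  # query heads in the same group read the same KV head
        keys, values = k_cache[kv], v_cache[kv]

        # Scaled dot-product score against every cached position.
        scores = [
            sum(a * b for a, b in zip(q_h, k_t)) / math.sqrt(head_dim)
            for k_t in keys
        ]

        # Numerically stable softmax over the cached positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Weighted sum of the cached values.
        outputs.append([
            sum(w * v_t[d] for w, v_t in zip(weights, values))
            for d in range(head_dim)
        ])
    return outputs

# Toy example (made-up numbers): 4 query heads share 2 KV heads, 2 cached tokens.
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
k_cache = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
v_cache = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = gqa_decode_step(q, k_cache, v_cache)
```

Only two key-value heads are cached here, yet four query heads produce outputs. Setting the number of KV heads equal to the number of query heads gives MHA, and setting it to 1 gives MQA.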

The main benefit of GQA is a significant reduction in the memory needed to store the KV cache during inference. In a classic MHA model with 32 attention heads, the KV cache stores keys and values for 32 heads per token in every layer. With GQA using 8 groups, only 8 key-value heads are stored, which cuts KV cache memory by 75% while keeping generation quality very close to MHA.
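The 75% figure can be checked with a short calculation. The sketch below assumes a hypothetical configuration (32 layers, head dimension 128, 4096 tokens, 16-bit values); only the 32 heads and 8 groups come from the text above, and kv_cache_bytes is an illustrative helper, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer, token and sequence."""
    # Factor 2: one key vector and one value vector per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed configuration: 32 layers, head_dim 128, 4096 tokens, 16-bit values.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
saving = 1 - gqa / mha

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, saving: {saving:.0%}")
# prints "MHA: 2.00 GiB, GQA: 0.50 GiB, saving: 75%"
```

The saving depends only on the ratio of KV heads to query heads, so it stays at 75% for any context length or batch size, while the absolute number of bytes saved grows with both.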

This technique has become a standard in modern language models. Meta's Llama 2 (70B) was one of the first large models to adopt GQA, followed by Mistral, Llama 3, and many others. GQA enables these models to handle longer contexts and serve more simultaneous requests with the same hardware, which is crucial for production deployment.

In practice, GQA speeds up the decoding phase (token-by-token generation) without significantly degrading response quality. Reported results generally show a small quality loss compared to full MHA, often cited as below 1%, while commonly quoted inference speed gains range from 30% to 50% depending on the configuration, batch size, and context length. This favorable trade-off explains its widespread adoption.

Etymology

The term comes from the paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" published by Ainslie et al. (Google Research) in 2023. "Grouped" refers to the grouping of query heads, "Query" designates the query vectors in the attention mechanism, and "Attention" refers to the attention mechanism of Transformers introduced in "Attention Is All You Need" (2017).

Concrete examples

Architecture choice for an LLM

I am designing a 13B parameter model. Compare the trade-offs between Multi-Head Attention, Grouped Query Attention with 4 groups and 8 groups, and Multi-Query Attention in terms of KV cache memory, inference speed, and generation quality.

Production inference optimization

My Llama 3 8B model uses GQA with 8 KV groups for 32 query heads. Calculate the KV cache size for a batch of 64 queries with a context of 8192 tokens, and propose strategies to further reduce memory.

Technical understanding for AI monitoring

Explain to me as if I were a backend developer why recent models like Mistral and Llama use Grouped Query Attention instead of classic Multi-Head Attention. What are the concrete impacts on deployment cost?

Practical usage

In prompt engineering, understanding GQA helps evaluate the capabilities and limitations of deployed models: a model with GQA can handle longer contexts and larger batches at equal hardware cost. This influences model selection for your use case, especially for applications requiring long context windows or high throughput. When comparing models, checking whether they use MHA, GQA, or MQA gives you a useful, though not sufficient, indicator of their production efficiency.

Related concepts

Multi-Head Attention, Multi-Query Attention, KV Cache, Transformer Architecture, Attention mechanism, LLM Inference

FAQ

What is the difference between Grouped Query Attention and Multi-Query Attention?

Multi-Query Attention (MQA) uses a single set of key-value pairs shared by all query heads, maximizing memory savings but potentially degrading quality. Grouped Query Attention (GQA) divides heads into several groups, each with its own set of key-value pairs. It is an intermediate compromise: more economical than classic MHA, but more expressive than MQA. In practice, GQA offers almost the same quality as MHA with a large portion of MQA's performance gains.
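One way to see that MHA and MQA are the two extremes of GQA is to list which key-value head each query head reads. This is a small illustrative sketch; the function name kv_head_for_query_head and the choice of 8 query heads are assumptions made for the example.

```python
def kv_head_for_query_head(num_q_heads, num_kv_heads):
    """Return, for each query head, the index of the KV head it shares."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return [h // group_size for h in range(num_q_heads)]

print(kv_head_for_query_head(8, 8))  # MHA: [0, 1, 2, 3, 4, 5, 6, 7]
print(kv_head_for_query_head(8, 2))  # GQA: [0, 0, 0, 0, 1, 1, 1, 1]
print(kv_head_for_query_head(8, 1))  # MQA: [0, 0, 0, 0, 0, 0, 0, 0]
```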

Does Grouped Query Attention affect the quality of LLM responses?

Published results indicate that the impact on quality is small. On standard benchmarks, GQA models perform very close to equivalent MHA models, with degradation typically reported below 1%. Some researchers have also suggested that GQA can act as a form of regularization, slightly improving generalization in some cases. This is why many of the largest current open-weight models have adopted it.

Which models use Grouped Query Attention?

GQA is used by many major models: Llama 2 (70B), Llama 3 (all sizes), Mistral 7B and Mixtral, Google's Gemma 2, Qwen 2, and many others. It has become the default choice for new large-scale models, gradually replacing classic Multi-Head Attention in modern Transformer architectures.
