P

Gradient Descent: Definition and Examples

Gradient Descent is an iterative optimization algorithm used to minimize a cost function by gradually adjusting the parameters of a model in the direction opposite to the gradient.

Full definition

Gradient Descent is the fundamental algorithm that allows artificial intelligence models to learn. Its principle is remarkably intuitive: imagine you are lost in a thick fog at the top of a mountain and you are trying to descend into the valley. At each step, you feel the ground around you and move in the direction where the slope descends most steeply. This is exactly what Gradient Descent does with a model's parameters.

Specifically, the algorithm computes the gradient (the partial derivative) of the cost function with respect to each parameter of the model. This gradient indicates the direction in which the error increases most rapidly. By moving in the opposite direction, the error is gradually reduced. The learning rate controls the size of each step: too large, you risk overshooting the minimum; too small, training will be extremely slow.

There are several variants of this algorithm. Batch Gradient Descent uses the entire dataset to compute each update, which is precise but computationally expensive. Stochastic Gradient Descent (SGD) uses only one example at a time, making it faster but noisier. Mini-batch Gradient Descent, the most commonly used in practice, strikes a balance by using small batches of data. Modern optimizers like Adam, RMSProp, or AdaGrad add adaptive mechanisms to automatically adjust the learning rate.

Gradient Descent is at the heart of training all neural networks, including large language models (LLMs) like GPT or Claude. Without this algorithm, it would be impossible to adjust the billions of parameters that enable these models to understand and generate text. Understanding how it works helps to better grasp why some models converge with difficulty, why fine-tuning works, and how hyperparameters influence the final quality of a model.

Etymology

The term comes from Latin 'gradiens' (walking, progressing) and 'descensus' (descent). The mathematical concept of gradient refers to the vector of partial derivatives of a function, indicating the direction of steepest variation. The combination of the two words literally describes the action of 'descending by following the slope.' The algorithm was formalized by Augustin-Louis Cauchy in 1847, long before the era of artificial intelligence.

Concrete examples

Understanding why a model is not converging

My image classification model stops improving after a few epochs. The learning rate is 0.1. Can you explain how gradient descent might be stuck and what adjustments to try?

Choosing the right optimizer for a project

I need to train a small neural network for spam detection. What are the practical differences between SGD, Adam, and RMSProp for my use case? Which one do you recommend and why?

Simplifying the concept for a presentation

Explain gradient descent to a non-technical audience using an everyday analogy. I am preparing a presentation for decision-makers who want to understand how AI learns.

Practical usage

In prompt engineering, understanding Gradient Descent allows you to better formulate questions about training and fine-tuning models. You can ask an LLM to explain why training diverges, to recommend suitable hyperparameters, or to diagnose convergence problems. This knowledge is also essential for writing precise technical prompts when working on machine learning projects.

Related concepts

BackpropagationLearning RateLoss FunctionStochastic Gradient Descent

FAQ

What is the difference between Gradient Descent and Stochastic Gradient Descent?
Classic Gradient Descent (batch) computes the gradient over the entire dataset before each parameter update, which gives a precise direction but is computationally expensive. Stochastic Gradient Descent (SGD) computes the gradient on a single random example at each iteration, which is much faster but introduces noise into the optimization path. In practice, we most often use mini-batch SGD, which processes small batches of data, offering a good compromise between precision and speed.
Why is the learning rate so important in Gradient Descent?
The learning rate determines the step size at each iteration. A learning rate that is too high causes the algorithm to oscillate around the minimum without ever reaching it, or even diverge completely. A learning rate that is too low makes training extremely slow and can trap the model in a suboptimal local minimum. That is why techniques such as learning rate scheduling (gradual reduction) or adaptive optimizers (Adam) are widely used to automatically adjust this parameter during training.
Is Gradient Descent used to train large language models like ChatGPT or Claude?
Yes, Gradient Descent (more precisely variants like Adam or AdamW) is the fundamental algorithm used to train all large language models. Combined with backpropagation, it allows adjusting the billions of parameters of these models by minimizing the prediction error on vast text corpora. Training is distributed across thousands of GPUs and uses advanced techniques like gradient accumulation and mixed precision training to handle the colossal scale of these models.

See also

How to use this prompt

  1. Copy the prompt with the button above.
  2. Paste it into ChatGPT, Claude or your favorite AI assistant.
  3. Replace the bracketed variables with your details, then refine the result.

About Prompt Guide

Prompt Guide is a free library of 2500+ ready-to-use prompts for ChatGPT, Claude and other AIs, with guides to learn prompting and tools to build and optimize your own prompts.

More definitions

Get new prompts every week

Join our newsletter.