Suppose you want to find the maximum or minimum of a function \( f(x, y) \) but you’re not allowed to explore all of \( \mathbb{R}^2 \) — instead, you’re restricted to points \( (x, y) \) that satisfy a constraint \( g(x, y) = c. \)
A concrete example:
- \( f(x, y) = x^2 y \), a bumpy surface, nonlinear in both variables.
- \( g(x, y) = x^2 + y^2 = 1 \), the unit circle.
We’re looking for the point(s) \( (x, y) \) on the unit circle where \( f(x, y) \) is as big or small as possible.
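Before any geometry, it can help to see the answer by brute force. The sketch below is my own illustration (not part of the Lagrange method, and it assumes numpy is available): it parametrises the unit circle and simply scans \( f \) along it.

```python
# Brute-force check: parametrise the constraint circle and scan f along it.
import numpy as np

def f(x, y):
    return x**2 * y

t = np.linspace(0, 2 * np.pi, 10_000)   # parameter along the circle
x, y = np.cos(t), np.sin(t)             # points satisfying x^2 + y^2 = 1
vals = f(x, y)

i_max, i_min = vals.argmax(), vals.argmin()
print("approx max", vals[i_max], "at", (x[i_max], y[i_max]))
print("approx min", vals[i_min], "at", (x[i_min], y[i_min]))
```

This reports a maximum of roughly 0.385 near \( (\pm 0.82, 0.58) \) and a minimum of roughly \( -0.385 \) near \( (\pm 0.82, -0.58) \); the Lagrange machinery below recovers these points exactly.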
Contour Intuition
The contour lines of the function \( f \) are curves where \( f \) takes on a constant value. If we plot the values of \( f \) on the \( z \) axis, we can visualise the function as a surface in a 3d plot. We can think of the contours as lines running along the surface of \( f \) itself (at constant heights), or we can plot them in the \( x, y \) plane (projecting them down, so to speak). For our purposes here, it will be more useful to think of the contours as living in the \( x, y \) plane. The figure below has both.
The constraint \( g(x, y) = 1 \) is a simple circle, centered at the origin. Now, imagine walking along the circle and watching how the value of \( f \) changes (this walk would be the red line in the chart above).
Claim: At the highest and lowest points (relative to \( f \)), the red circle will just kiss a contour of \( f \); these are the blue points. In other words (remembering that the circle is just a contour of \( g \)), the contours of \( f \) and \( g \) are tangent to one another at these points. That’s the key geometric insight behind the Lagrange multiplier method.
Sketch proof/argument: Suppose (for contradiction) that at one of these highest or lowest points the contours are not tangent, i.e. that the path along the constraint (the red line) crosses a contour of \( f \). Then moving slightly forward or backward along the red line would give a larger or smaller value of \( f \), contradicting the fact that we are at a highest or lowest point.
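Here is a small numerical version of that argument for our example (again just a sketch of my own, assuming numpy): restrict \( f \) to the circle via \( t \mapsto (\cos t, \sin t) \) and find the places where the walk’s rate of change passes through zero. Those are exactly the points where the circle kisses a contour of \( f \).

```python
# Restrict f to the circle and locate where its rate of change along the walk vanishes.
import numpy as np

t = np.linspace(0, 2 * np.pi, 100_000)
h = np.cos(t)**2 * np.sin(t)        # f(cos t, sin t): the value of f along the red circle
dh = np.gradient(h, t)              # numerical derivative of the walk

# sign changes of the derivative mark the stationary points of the walk
idx = np.where(np.diff(np.sign(dh)) != 0)[0]
for i in idx:
    print(f"t = {t[i]:.3f}, point = ({np.cos(t[i]):.3f}, {np.sin(t[i]):.3f}), f = {h[i]:.4f}")
```

Up to the grid resolution, six points come out: the two maxima and two minima from the scan above, plus \( (0, \pm 1) \), where the walk is momentarily flat even though those points are neither the constrained maximum nor minimum.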
Gradients and the Lagrange Condition
Recall the gradient \( \nabla f(x, y) \) of a function \( f(x,y) \) is a vector:
$$
\left(
\begin{array}{c}
\frac{\partial f}{\partial x} \\
\frac{\partial f}{\partial y}
\end{array}
\right)
$$
It is worth remembering that this vector lives in the \( x, y \) plane, not in the 3d space which includes the \( z \) axis.
Often, in visualisations, the gradient is shown pointing up or down along the surface itself. This is a helpful way to think about the gradient, but it can be misleading: strictly speaking, the gradient lives in the input space (the \( x, y \) plane) alone.
Recall also that at any point, the gradient vector of a function \( f \) is perpendicular to the contours of \( f \), with both the vector and the contours lying in the input space (in our example, the \( x, y \) plane) – see this post for the intuition of why the gradient is perpendicular to the contours.
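As a tiny sanity check of that fact for our constraint (my own aside, assuming numpy): at any point of the unit circle the tangent direction is \( (-\sin t, \cos t) \), and its dot product with \( \nabla g = (2x, 2y) \) is zero.

```python
# Check that grad g is perpendicular to the contour of g (the unit circle) at a point.
import numpy as np

t = 0.7                                             # an arbitrary parameter value on the circle
grad_g = np.array([2 * np.cos(t), 2 * np.sin(t)])   # grad g = (2x, 2y) at (cos t, sin t)
tangent = np.array([-np.sin(t), np.cos(t)])         # direction along the circle

print("grad_g . tangent =", grad_g @ tangent)       # ~ 0: gradient is perpendicular to the contour
```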
Hence, the contours of \( f \) and \( g \) being tangent is the same as saying their gradients are parallel:
$$\nabla f(x, y) = \lambda \nabla g(x, y), \quad \text{for some scalar } \lambda $$
This gives us two equations: one for the gradients and one for the constraint:
$$\begin{cases} \nabla f(x, y) = \lambda \nabla g(x, y) \\ g(x, y) = c \end{cases}$$
In our example:
$$f(x, y) = x^2 y \Rightarrow \nabla f = (2x y, x^2) \\
g(x, y) = x^2 + y^2 \Rightarrow \nabla g = (2x, 2y) $$
So we solve:
$$\begin{aligned} 2x y &= \lambda \cdot 2x \\ x^2 &= \lambda \cdot 2y \\ x^2 + y^2 &= 1 \end{aligned}$$
This system gives candidate max/min points on the unit circle.
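If you’d rather not do the case analysis by hand, a computer algebra system will do it. A minimal sketch using sympy (assuming it is installed; the variable names are my own):

```python
# Solve the Lagrange system for f = x^2 y on the unit circle.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)

eqs = [
    sp.Eq(2 * x * y, lam * 2 * x),   # df/dx = lambda * dg/dx
    sp.Eq(x**2, lam * 2 * y),        # df/dy = lambda * dg/dy
    sp.Eq(x**2 + y**2, 1),           # the constraint g(x, y) = 1
]

f = x**2 * y
for sol in sp.solve(eqs, [x, y, lam], dict=True):
    print(sol, " f =", sp.simplify(f.subs(sol)))
```

Six candidates come out: \( (\pm\sqrt{2/3}, 1/\sqrt{3}) \) with \( f = 2/(3\sqrt{3}) \) (the maximum), \( (\pm\sqrt{2/3}, -1/\sqrt{3}) \) with \( f = -2/(3\sqrt{3}) \) (the minimum), and \( (0, \pm 1) \) with \( f = 0 \) and \( \lambda = 0 \). Note that at \( (0, \pm 1) \) we have \( \nabla f = (0, 0) \), which is exactly the edge case discussed next.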
Edge Case: What if \( \nabla f = 0 \)?
The logic above depends on \( \nabla f \neq 0 \). If \( \nabla f = 0 \), then there’s no “direction” in which \( f \) increases or decreases: we’re at a stationary point of \( f \) itself, and the tangency picture no longer makes sense there.
It’s an important caveat in general: when \( \nabla f = 0 \), the condition \( \nabla f = \lambda \nabla g \) is automatically satisfied with \( \lambda = 0 \), so it can’t tell us anything further about such a point. You need to check whether the point lies on the constraint set and handle it separately.
Finally, the Lagrangian
We can repackage the two equation system:
$$\begin{cases} \nabla f(x, y) = \lambda \nabla g(x, y) \\ g(x, y) = c \end{cases}$$
using the Lagrangian function. We define the Lagrangian:
$$L(x, y, \lambda) = f(x, y) - \lambda \, (g(x, y) - c)$$
and then we note that the two equation system is equivalent to
$$\nabla L(x, y, \lambda) = 0 $$
This works because the Lagrangian \( L(x, y, \lambda) \) is a function of \( x, y \) and \( \boldsymbol{\lambda} \). Setting the derivatives with respect to \( x \) and \( y \) to zero recovers the first equation, \( \nabla f(x, y) = \lambda \nabla g(x, y) \), while setting the derivative with respect to \( \lambda \) to zero recovers the constraint equation.
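To see this in the running example, here is the same computation phrased through the Lagrangian (again a sympy sketch, under the same assumptions as before):

```python
# Build the Lagrangian for f = x^2 y, g = x^2 + y^2, c = 1, and set its gradient to zero.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 * y
g = x**2 + y**2
c = 1

L = f - lam * (g - c)                          # the Lagrangian
grad_L = [sp.diff(L, v) for v in (x, y, lam)]  # partials: 2xy - 2*lam*x, x^2 - 2*lam*y, -(x^2 + y^2 - 1)
print(grad_L)

# setting all three partials to zero reproduces the two-equation system's candidates
print(sp.solve(grad_L, [x, y, lam], dict=True))
```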
Special case still to think about
The function \( f(x, y) = x^2 + y^2 \) has a global minimum of 0 at the origin. If the constraint is \( g(x, y) = x + y = 0 \), then the constrained minimum is also at the origin, where the contour of \( f \) is just a dot (and \( \nabla f = 0 \), so we are in the edge case above).
Another version is \( f(x, y) = x^2 \), i.e. a parabola \( z = x^2 \) swept along the \( y \) axis. In this case, \( f \) attains its minimum value all along the \( y \) axis (the line \( x = 0 \)). If we have the same constraint \( g(x, y) = x + y = 0 \), then the constrained minimum is at the origin. However, if we compute the gradients, we get
$$f(x, y) = x^2 \Rightarrow \nabla f = (2x, 0) \\
g(x,y) = x + y \Rightarrow \nabla g = (1 , 1) $$
These are never parallel when \( x \neq 0 \), so away from the \( y \) axis the contours of \( f \) and \( g \) are indeed never tangent. At \( x = 0 \), however, \( \nabla f = (0, 0) \), and the condition \( \nabla f = \lambda \nabla g \) holds with \( \lambda = 0 \); together with the constraint this picks out the origin. So the Lagrange system still finds the minimum here; what breaks down is the tangency picture, which is exactly the \( \nabla f = 0 \) edge case above (the whole line \( x = 0 \) of minima is precisely where \( \nabla f = 0 \)).
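A quick sympy check of that claim (same assumptions as the earlier sketches):

```python
# Lagrangian for f = x^2 with constraint g(x, y) = x + y = 0 (so c = 0).
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)

L = x**2 - lam * (x + y)
grad_L = [sp.diff(L, v) for v in (x, y, lam)]    # 2x - lam, -lam, -(x + y)
print(sp.solve(grad_L, [x, y, lam], dict=True))  # the single candidate: x = 0, y = 0, lam = 0
```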
YouTube video from which I took inspiration: