Partial Derivatives and the Gradient

From One Variable to Several

A single-variable function \(f(x)\) takes one input and returns one output, and its derivative \(f'(x)\) measures how the output changes as the lone input changes. Most quantities in engineering and science, however, depend on more than one thing at once. The temperature in a room varies with all three space coordinates; the cost of a manufactured part depends on material, labor, and machine time; the loss of a machine-learning model depends on thousands of parameters. Such quantities are modeled by functions of several variables, for example \(f(x,y)\), which assign a number to each point \((x,y)\) in the plane. The graph of \(f(x,y)\) is a surface sitting above the plane, and “the slope” is no longer a single number — it depends on which direction you choose to move ¹.

The Partial Derivative

The simplest directions to move are along the coordinate axes, and that is exactly what a partial derivative captures. The partial derivative of \(f\) with respect to \(x\), written \(\frac{\partial f}{\partial x}\) or \(f_x\), is found by treating \(y\) as a fixed constant and differentiating in \(x\) using the ordinary single-variable rules. Geometrically, you slice the surface with a plane of constant \(y\) and measure the slope of the resulting curve. Formally,

\[\frac{\partial f}{\partial x}(x,y)=\lim_{h\to 0}\frac{f(x+h,y)-f(x,y)}{h},\]

with the symmetric definition for \(\frac{\partial f}{\partial y}\) holding \(x\) fixed ¹.

As a worked example, take \(f(x,y)=x^2 y + 3y^2\). To compute \(f_x\), treat \(y\) as a constant: the term \(x^2 y\) differentiates to \(2xy\), and \(3y^2\) is constant in \(x\), so it vanishes. Thus \(f_x = 2xy\). To compute \(f_y\), treat \(x\) as a constant: \(x^2 y\) differentiates to \(x^2\), and \(3y^2\) differentiates to \(6y\), giving \(f_y = x^2 + 6y\). Evaluated at the point \((1,2)\), these are \(f_x(1,2)=4\) and \(f_y(1,2)=13\): near that point the surface rises at a rate of four units of height per unit step in \(x\), and thirteen per unit step in \(y\).
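The limit definition can be checked numerically. The sketch below is only an illustration: it uses a central difference (a symmetric variant of the one-sided quotient in the definition, chosen here for accuracy), and the helper names `partial_x` and `partial_y` and the step `h` are assumptions made for this example.

```python
def f(x, y):
    """The worked example f(x, y) = x^2 * y + 3 * y^2."""
    return x**2 * y + 3 * y**2


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx, holding y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy, holding x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


fx = partial_x(f, 1.0, 2.0)
fy = partial_y(f, 1.0, 2.0)
print(fx, fy)  # close to the hand-computed values 4 and 13
```

The estimates should agree with \(f_x(1,2)=4\) and \(f_y(1,2)=13\) up to floating-point rounding.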

Higher-Order and Mixed Partials

Because each partial derivative is itself a function of \(x\) and \(y\), it can be differentiated again. This produces second-order partials: \(f_{xx}=\frac{\partial^2 f}{\partial x^2}\), \(f_{yy}=\frac{\partial^2 f}{\partial y^2}\), and the two mixed partials \(f_{xy}\) (differentiate in \(x\), then \(y\)) and \(f_{yx}\) (the reverse order). For the example above, \(f_x=2xy\) gives \(f_{xy}=2x\), while \(f_y=x^2+6y\) gives \(f_{yx}=2x\). The two mixed partials agree, and this is no accident.

Clairaut’s theorem states that if \(f_{xy}\) and \(f_{yx}\) are both continuous on a region containing the point, then they are equal there: the order of differentiation does not matter ¹. Any function whose second partials are continuous satisfies this hypothesis, and that covers polynomials and most other functions encountered in practice, so mixed partials can usually be taken in whichever order is most convenient.
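The equality of the mixed partials can also be observed numerically. The following is a rough sketch, assuming nested central differences with a hypothetical step `h`; the helpers `d_dx` and `d_dy` are names invented for this illustration.

```python
def f(x, y):
    """The worked example f(x, y) = x^2 * y + 3 * y^2."""
    return x**2 * y + 3 * y**2


def d_dx(g, h=1e-4):
    """Return a function that approximates the partial of g in x."""
    return lambda x, y: (g(x + h, y) - g(x - h, y)) / (2 * h)


def d_dy(g, h=1e-4):
    """Return a function that approximates the partial of g in y."""
    return lambda x, y: (g(x, y + h) - g(x, y - h)) / (2 * h)


f_xy = d_dy(d_dx(f))  # differentiate in x first, then in y
f_yx = d_dx(d_dy(f))  # differentiate in y first, then in x

# Both values should be close to 2 * x = 2 at the point (1, 2).
print(f_xy(1.0, 2.0), f_yx(1.0, 2.0))
```

Second differences amplify rounding error, so the two numbers match each other and the exact value \(2x\) only to several decimal places.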

The Gradient

Collecting the first-order partials into a single vector gives the gradient of \(f\), written \(\nabla f\):

\[\nabla f=\left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \dots\right).\]

For \(f(x,y)=x^2 y + 3y^2\), the gradient is \(\nabla f=(2xy,\ x^2+6y)\), which at \((1,2)\) equals \((4,13)\). The gradient is not just bookkeeping: it is the central object of multivariable differential calculus, because it encodes how \(f\) responds to motion in every direction at once.

The key geometric fact is that the gradient points in the direction of steepest increase of \(f\), and its magnitude \(\lVert\nabla f\rVert\) is the rate of increase in that direction ¹. If you stood on the surface and wanted to climb as fast as possible, you would step in the direction of \(\nabla f\). Moving opposite to the gradient gives the steepest decrease, and moving perpendicular to it keeps \(f\) momentarily constant (along a level curve).
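One way to see the steepest-ascent property is to try many directions and keep the best one. The sketch below does this for the example function at \((1,2)\); the grid of 3600 angles, the step `h`, and the helper `rate_along` are assumptions made for the illustration.

```python
import math


def f(x, y):
    """The worked example f(x, y) = x^2 * y + 3 * y^2."""
    return x**2 * y + 3 * y**2


def grad_f(x, y):
    """Analytic gradient (2xy, x^2 + 6y) from the text."""
    return (2 * x * y, x**2 + 6 * y)


def rate_along(angle, x, y, h=1e-6):
    """Finite-difference rate of change of f along the unit vector at `angle`."""
    ux, uy = math.cos(angle), math.sin(angle)
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)


x0, y0 = 1.0, 2.0
angles = [2 * math.pi * k / 3600 for k in range(3600)]  # 0.1 degree grid
best_angle = max(angles, key=lambda a: rate_along(a, x0, y0))
best_rate = rate_along(best_angle, x0, y0)

gx, gy = grad_f(x0, y0)
grad_angle = math.atan2(gy, gx)  # direction of the gradient
grad_norm = math.hypot(gx, gy)   # magnitude of the gradient

print(best_angle, grad_angle)  # agree to within the grid spacing
print(best_rate, grad_norm)    # nearly equal
```

The best sampled direction should coincide with the direction of \(\nabla f\) up to the grid resolution, and the rate found there should be close to \(\lVert\nabla f\rVert\).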

Directional Derivatives and the Tangent Plane

To find the rate of change in an arbitrary direction given by a unit vector \(\mathbf u\), take the directional derivative

\[D_{\mathbf u}f=\nabla f\cdot \mathbf u,\]

the dot product of the gradient with \(\mathbf u\) ¹. Because \(\nabla f\cdot\mathbf u=\lVert\nabla f\rVert\cos\theta\), where \(\theta\) is the angle between \(\nabla f\) and the unit vector \(\mathbf u\), this is largest when \(\mathbf u\) aligns with \(\nabla f\) (\(\theta=0\)), confirming that the gradient gives the direction of steepest ascent. The partial derivatives are simply the directional derivatives along the coordinate unit vectors.
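The dot-product formula is short enough to compute directly. Here is a minimal sketch for the example function; the function name `directional_derivative` and the choice of direction \((3,4)\) are assumptions made for illustration. The direction is normalized first because the formula requires a unit vector.

```python
import math


def grad_f(x, y):
    """Analytic gradient of f(x, y) = x^2 * y + 3 * y^2."""
    return (2 * x * y, x**2 + 6 * y)


def directional_derivative(x, y, direction):
    """Return grad f . u, where u is `direction` scaled to unit length."""
    dx, dy = direction
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm  # the formula needs a unit vector
    gx, gy = grad_f(x, y)
    return gx * ux + gy * uy


# Direction (3, 4) normalizes to (0.6, 0.8): 4 * 0.6 + 13 * 0.8 is about 12.8.
d = directional_derivative(1.0, 2.0, (3.0, 4.0))
print(d)

# Along the coordinate axes the result reduces to the partial derivatives.
d_x = directional_derivative(1.0, 2.0, (1.0, 0.0))
d_y = directional_derivative(1.0, 2.0, (0.0, 1.0))
print(d_x, d_y)
```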

The gradient also gives the best linear approximation of \(f\) near a point \((a,b)\) at which \(f\) is differentiable. The tangent plane to the surface at that point is \(z=L(x,y)\), where

\[L(x,y)=f(a,b)+f_x(a,b)\,(x-a)+f_y(a,b)\,(y-b),\]

which plays the same role for surfaces that the tangent line plays for curves ¹. For small steps away from \((a,b)\), \(L(x,y)\) closely matches \(f(x,y)\), making it the foundation of error estimation and numerical methods.
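To get a feel for how good the approximation is, the sketch below builds \(L\) for the example function at \((1,2)\) and compares it with \(f\) at a few nearby points. The helper `linearization` and the particular step sizes are assumptions chosen for this illustration.

```python
def f(x, y):
    """The worked example f(x, y) = x^2 * y + 3 * y^2."""
    return x**2 * y + 3 * y**2


def linearization(a, b):
    """Build L(x, y) = f(a,b) + f_x(a,b)(x - a) + f_y(a,b)(y - b)."""
    fa = f(a, b)
    fx = 2 * a * b       # f_x from the text
    fy = a**2 + 6 * b    # f_y from the text

    def L(x, y):
        return fa + fx * (x - a) + fy * (y - b)

    return L


L = linearization(1.0, 2.0)
steps = (0.1, 0.01, 0.001)
errors = [abs(f(1.0 + s, 2.0 + s) - L(1.0 + s, 2.0 + s)) for s in steps]

for s, err in zip(steps, errors):
    print(s, err)
```

Each tenfold reduction in the step should shrink the error by roughly a factor of one hundred, which is the behavior expected of a first-order approximation.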

Optimization and Gradient Descent

At a smooth interior maximum or minimum, the surface is momentarily flat in every direction, so all first-order partials vanish. These critical points are found by solving \(\nabla f=\mathbf 0\), the multivariable analogue of setting \(f'(x)=0\) ¹. The condition is necessary but not sufficient: saddle points also satisfy it, so each critical point must still be classified. This single condition drives optimization throughout engineering and machine learning.

When a problem is too large or complex to solve \(\nabla f=\mathbf 0\) directly, gradient descent iterates instead: starting from a guess, it repeatedly takes a small step in the direction \(-\nabla f\), the direction of steepest decrease, until the gradient is nearly zero. Training a neural network is, at its core, gradient descent on a loss function of millions of variables, which is why the gradient introduced here underlies much of modern computation.
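The iteration can be sketched in a few lines. The example below is only illustrative: it minimizes a hypothetical bowl-shaped function \(g(x,y)=(x-1)^2+2(y+2)^2\), chosen because its minimum at \((1,-2)\) is known in advance, and the step size, tolerance, and iteration cap are assumed values rather than recommendations.

```python
def grad_g(x, y):
    """Gradient of the hypothetical bowl g(x, y) = (x - 1)^2 + 2 * (y + 2)^2."""
    return (2 * (x - 1), 4 * (y + 2))


def gradient_descent(grad, start, step_size=0.1, tol=1e-8, max_iters=10_000):
    """Step against the gradient until its norm falls below `tol`."""
    x, y = start
    for _ in range(max_iters):
        gx, gy = grad(x, y)
        if (gx * gx + gy * gy) ** 0.5 < tol:
            break  # gradient is nearly zero: approximately a critical point
        x -= step_size * gx  # move in the direction of -grad
        y -= step_size * gy
    return x, y


x_min, y_min = gradient_descent(grad_g, start=(5.0, 5.0))
print(x_min, y_min)  # close to the known minimizer (1, -2)
```

A fixed step size works here because the bowl is simple; on harder problems the step size usually needs tuning, and too large a value can make the iteration diverge.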