Backpropagation

Backpropagation is the ubiquitous method for computing the gradients needed for gradient descent in artificial neural networks. Ultimately it allows us to compute the derivative of a loss function with respect to every free parameter in the network (i.e. every weight and bias). It does so layer by layer. At the end of the day, it’s just the chain rule, but there’s a decent amount of bookkeeping, so I want to write it out explicitly.

Suppose we have an artificial neural network with $L$ layers. Let the number of neurons in layer $\ell$ be $n_\ell$. Let layer $\ell$ have weights $W^\ell$ (an $n_\ell$ by $n_{\ell-1}$ matrix) and biases $b^\ell$ (a vector with $n_\ell$ components). Call its output $a^\ell$. Define the vector

$$z^\ell = W^\ell a^{\ell-1} + b^\ell.$$

This is just the output of layer $\ell$ before the activation function $\sigma$ is applied. So

$$a^\ell = \sigma(z^\ell)$$

(where the function application is performed on each element independently).
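To make the notation concrete, here’s a minimal NumPy sketch of the forward pass. The names (`forward`, `Ws`, `bs`, `sigma`) are my own, and I use the logistic sigmoid purely as an example activation.

```python
import numpy as np

def sigma(z):
    # Example activation: logistic sigmoid, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, Ws, bs):
    """Forward pass. a0 is the input vector a^0; Ws = [W^1, ..., W^L]
    with W^l of shape (n_l, n_{l-1}); bs = [b^1, ..., b^L]."""
    zs, activations = [], [a0]
    a = a0
    for W, b in zip(Ws, bs):
        z = W @ a + b        # z^l = W^l a^{l-1} + b^l
        a = sigma(z)         # a^l = sigma(z^l), element-wise
        zs.append(z)
        activations.append(a)
    return zs, activations   # saved for the backward pass below
```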

We’ll treat layer zero as the inputs. So $a^0$ is defined, but not $W^0$, $b^0$, or $z^0$. Then $a^1$ through $a^L$ are the outputs of our layers. Finally, we have some loss function $C$, which I assume is a function only of the last layer’s outputs ($a^L$).

At the end of the day, to update the parameters of the network we need to know the partial derivative of $C$ with respect to all entries of all $W^\ell$ and $b^\ell$. To get there, it’s helpful to consider as a stepping stone the partial derivatives of $C$ with respect to the entries of a particular $z^\ell$. I’ll write this as $\delta^\ell \equiv \nabla_{z^\ell} C$ (a vector with $n_\ell$ elements).

In component form,

$$z^\ell_j = \sum_k W^\ell_{jk} a^{\ell-1}_k + b^\ell_j.$$

So $\partial z^\ell_j / \partial W^\ell_{jk} = a^{\ell-1}_k$, and by the chain rule,

$$\frac{\partial C}{\partial W^\ell_{jk}} = \delta^\ell_j \, a^{\ell-1}_k.$$

Thus

$$\nabla_{W^\ell} C = \delta^\ell \left( a^{\ell-1} \right)^T.$$

Similarly, $\partial z^\ell_j / \partial b^\ell_j = 1$. So

$$\nabla_{b^\ell} C = \delta^\ell.$$

So it’s easy to go from $\delta^\ell$ to $\nabla_{W^\ell} C$ and $\nabla_{b^\ell} C$.
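In code, both gradients fall out of $\delta^\ell$ directly. A sketch, assuming `delta` holds $\delta^\ell$ and `a_prev` holds $a^{\ell-1}$ from the forward pass (NumPy imported as above):

```python
def layer_grads(delta, a_prev):
    grad_W = np.outer(delta, a_prev)  # nabla_{W^l} C = delta^l (a^{l-1})^T
    grad_b = delta.copy()             # nabla_{b^l} C = delta^l (copied, since
    return grad_W, grad_b             # delta's buffer may be reused later)
```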

It’s also easy to get from one $\delta$ to the next. In particular,

$$z^{\ell+1} = W^{\ell+1} a^\ell + b^{\ell+1}.$$

So

$$\frac{\partial z^{\ell+1}_j}{\partial a^\ell_k} = W^{\ell+1}_{jk}.$$

Finally, since $a^\ell = \sigma(z^\ell)$,

$$\frac{\partial a^\ell_k}{\partial z^\ell_k} = \sigma'(z^\ell_k),$$

and so, chaining these together,

$$\delta^\ell_k = \sum_j \delta^{\ell+1}_j \frac{\partial z^{\ell+1}_j}{\partial a^\ell_k} \frac{\partial a^\ell_k}{\partial z^\ell_k} = \sum_j \delta^{\ell+1}_j W^{\ell+1}_{jk} \sigma'(z^\ell_k),$$

or in matrix form,

$$\delta^\ell = \left( \left( W^{\ell+1} \right)^T \delta^{\ell+1} \right) \odot \sigma'(z^\ell).$$

Here $\sigma'(z^\ell)$ means the derivative of $\sigma$ evaluated at each $z^\ell_j$ (so strictly speaking it’s more of a Jacobian than a gradient), and $\odot$ indicates the Hadamard product (i.e. element-wise product).
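One step of this recursion is a single line in NumPy. `sigma_prime` is the element-wise derivative of the activation; the version below matches the sigmoid used in the forward-pass sketch.

```python
def sigma_prime(z):
    # Derivative of the logistic sigmoid, element-wise.
    s = sigma(z)
    return s * (1.0 - s)

def backprop_step(delta_next, W_next, z):
    # delta^l = ((W^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
    return (W_next.T @ delta_next) * sigma_prime(z)
```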

So starting from the output of the network,

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L).$$

And from here we just apply the equation above repeatedly to compute $\delta^{L-1}$, $\delta^{L-2}$, etc. At each step we can easily compute $\nabla_{W^\ell} C$ and $\nabla_{b^\ell} C$ as well. When we get to the first layer, note that $\nabla_{W^1} C$ depends on the inputs of the network ($a^0$), rather than the outputs of some other layer.
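Putting the pieces together, here’s a sketch of the whole backward pass, built on the helpers above. The gradient of the loss with respect to $a^L$ depends on which loss you pick, so I leave it as the parameter `grad_aL`.

```python
def backward(grad_aL, zs, activations, Ws):
    """Return (grad_Ws, grad_bs), ordered from layer 1 to layer L.
    zs and activations are as returned by forward(); note that
    activations[l] holds a^l, so activations[0] is the input a^0."""
    L = len(Ws)
    delta = grad_aL * sigma_prime(zs[-1])   # delta^L
    grad_Ws, grad_bs = [None] * L, [None] * L
    for l in reversed(range(L)):            # 0-indexed: Ws[l] is W^{l+1}
        grad_Ws[l] = np.outer(delta, activations[l])  # activations[l] is the
        grad_bs[l] = delta                            # previous layer's output
        if l > 0:
            delta = backprop_step(delta, Ws[l], zs[l - 1])
    return grad_Ws, grad_bs
```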

To implement this efficiently, note that we don’t need to store all the $\delta^\ell$ we’ve computed so far. We just need to keep the most recent one, and have some memory in which to calculate the next. So if you allocate two arrays with as many elements as the largest layer has nodes, then you can keep reusing these for the whole computation. (We do still need the activations $a^\ell$ saved from the forward pass, since $\nabla_{W^\ell} C$ depends on $a^{\ell-1}$.) For standard gradient descent, updates to the weights and biases can be done in-place, so computing those gradients requires no additional storage.
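As a sketch of that scheme with plain SGD (the learning rate `lr` is my own parameter here): only the current and next $\delta$ are ever held, and the weights and biases are updated in place as the backward pass proceeds.

```python
def sgd_step_inplace(grad_aL, zs, activations, Ws, bs, lr):
    delta = grad_aL * sigma_prime(zs[-1])   # delta^L
    for l in reversed(range(len(Ws))):
        # Propagate delta through W^{l+1} *before* overwriting it in place.
        delta_prev = backprop_step(delta, Ws[l], zs[l - 1]) if l > 0 else None
        Ws[l] -= lr * np.outer(delta, activations[l])  # in-place weight update
        bs[l] -= lr * delta                            # in-place bias update
        delta = delta_prev
```

The ordering matters: the recursion for $\delta^{\ell}$ uses $W^{\ell+1}$, so each layer’s delta must be propagated before that layer’s weights are overwritten.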