Backpropagation

Backpropagation is the ubiquitous method for computing the gradients needed for gradient descent in artificial neural networks. Ultimately it allows us to compute the derivative of a loss function with respect to every free parameter in the network (i.e. every weight and bias). It does so layer by layer. At the end of the day, it’s just the chain rule, but there’s a decent amount of bookkeeping, so I want to write it out explicitly.

Suppose we have an artificial neural network with $L$ layers. Let the number of neurons in layer $\ell$ be $n_\ell$. Let layer $\ell$ have weights $W^\ell$ (an $n_\ell$ by $n_{\ell-1}$ matrix) and biases $b^\ell$ (a vector with $n_\ell$ components). Call its output $a^\ell$. Define the vector

$$z^\ell = W^\ell a^{\ell-1} + b^\ell.$$

This is just the output of layer $\ell$ before the activation function $\sigma$ is applied. So

$$a^\ell = \sigma(z^\ell)$$

(where the function application is performed on each element independently).
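To make the notation concrete, here’s a minimal NumPy sketch of the forward pass. The names (`forward`, `Ws`, `bs`, `sigma`) are my own, and I use the logistic sigmoid purely as an example activation.

```python
import numpy as np

def sigma(z):
    # Example activation: logistic sigmoid, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(a0, Ws, bs):
    """Forward pass. a0 is the input vector a^0; Ws = [W^1, ..., W^L]
    with W^l of shape (n_l, n_{l-1}); bs = [b^1, ..., b^L]."""
    zs, activations = [], [a0]
    a = a0
    for W, b in zip(Ws, bs):
        z = W @ a + b        # z^l = W^l a^{l-1} + b^l
        a = sigma(z)         # a^l = sigma(z^l), element-wise
        zs.append(z)
        activations.append(a)
    return zs, activations   # saved for the backward pass below
```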

We’ll treat layer zero as the inputs. So $a^0$ is defined, but not $W^0$, $b^0$, or $z^0$. Then $a^1$ through $a^L$ are the outputs of our layers. Finally, we have some loss function $C$, which I assume is a function only of the last layer’s outputs ($a^L$).

At the end of the day, to update the parameters of the network we need to know the partial derivative of $C$ with respect to all entries of all $W^\ell$ and $b^\ell$. To get there, it’s helpful to consider as a stepping stone the partial derivatives of $C$ with respect to the entries of a particular $z^\ell$. I’ll write this as $\delta^\ell \equiv \nabla_{z^\ell} C$ (a vector with $n_\ell$ elements).

In component form,

$$z^\ell_j = \sum_k W^\ell_{jk} a^{\ell-1}_k + b^\ell_j.$$

So $\partial z^\ell_j / \partial W^\ell_{jk} = a^{\ell-1}_k$, and by the chain rule,

$$\frac{\partial C}{\partial W^\ell_{jk}} = \delta^\ell_j \, a^{\ell-1}_k.$$

Thus

$$\nabla_{W^\ell} C = \delta^\ell \left( a^{\ell-1} \right)^T.$$

Similarly, $\partial z^\ell_j / \partial b^\ell_j = 1$. So

$$\nabla_{b^\ell} C = \delta^\ell.$$

So it’s easy to go from $\delta^\ell$ to $\nabla_{W^\ell} C$ and $\nabla_{b^\ell} C$.
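In code, both gradients fall out of $\delta^\ell$ directly. A sketch, assuming `delta` holds $\delta^\ell$ and `a_prev` holds $a^{\ell-1}$ from the forward pass (NumPy imported as above):

```python
def layer_grads(delta, a_prev):
    grad_W = np.outer(delta, a_prev)  # nabla_{W^l} C = delta^l (a^{l-1})^T
    grad_b = delta.copy()             # nabla_{b^l} C = delta^l (copied, since
    return grad_W, grad_b             # delta's buffer may be reused later)
```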

It’s also easy to get from one $\delta$ to the next. In particular,

$$z^{\ell+1} = W^{\ell+1} a^\ell + b^{\ell+1}.$$

So

$$\frac{\partial z^{\ell+1}_j}{\partial a^\ell_k} = W^{\ell+1}_{jk}.$$

Finally, since $a^\ell = \sigma(z^\ell)$,

$$\frac{\partial a^\ell_k}{\partial z^\ell_k} = \sigma'(z^\ell_k),$$

and so, chaining these together,

$$\delta^\ell_k = \sum_j \delta^{\ell+1}_j \frac{\partial z^{\ell+1}_j}{\partial a^\ell_k} \frac{\partial a^\ell_k}{\partial z^\ell_k} = \sum_j \delta^{\ell+1}_j W^{\ell+1}_{jk} \sigma'(z^\ell_k),$$

or in matrix form,

$$\delta^\ell = \left( \left( W^{\ell+1} \right)^T \delta^{\ell+1} \right) \odot \sigma'(z^\ell).$$

Here $\sigma'(z^\ell)$ means the derivative of $\sigma$ evaluated at each $z^\ell_j$ (so strictly speaking it’s more of a Jacobian than a gradient), and $\odot$ indicates the Hadamard product (i.e. element-wise product).
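One step of this recursion is a single line in NumPy. `sigma_prime` is the element-wise derivative of the activation; the version below matches the sigmoid used in the forward-pass sketch.

```python
def sigma_prime(z):
    # Derivative of the logistic sigmoid, element-wise.
    s = sigma(z)
    return s * (1.0 - s)

def backprop_step(delta_next, W_next, z):
    # delta^l = ((W^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
    return (W_next.T @ delta_next) * sigma_prime(z)
```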

So starting from the output of the network,

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L).$$

And from here we just apply the equation above repeatedly to compute $\delta^{L-1}$, $\delta^{L-2}$, etc. At each step we can easily compute $\nabla_{W^\ell} C$ and $\nabla_{b^\ell} C$ as well. When we get to the first layer, note that $\nabla_{W^1} C$ depends on the inputs of the network ($a^0$), rather than the outputs of some other layer.
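Putting the pieces together, here’s a sketch of the whole backward pass, built on the helpers above. The gradient of the loss with respect to $a^L$ depends on which loss you pick, so I leave it as the parameter `grad_aL`.

```python
def backward(grad_aL, zs, activations, Ws):
    """Return (grad_Ws, grad_bs), ordered from layer 1 to layer L.
    zs and activations are as returned by forward(); note that
    activations[l] holds a^l, so activations[0] is the input a^0."""
    L = len(Ws)
    delta = grad_aL * sigma_prime(zs[-1])   # delta^L
    grad_Ws, grad_bs = [None] * L, [None] * L
    for l in reversed(range(L)):            # 0-indexed: Ws[l] is W^{l+1}
        grad_Ws[l] = np.outer(delta, activations[l])  # activations[l] is the
        grad_bs[l] = delta                            # previous layer's output
        if l > 0:
            delta = backprop_step(delta, Ws[l], zs[l - 1])
    return grad_Ws, grad_bs
```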

To implement this efficiently, note that we don’t need to store all the $\delta^\ell$ we’ve computed so far. We just need to keep the most recent one, and have some memory in which to calculate the next. So if you allocate two arrays with as many elements as the largest layer has nodes, then you can keep reusing these for the whole computation. (We do still need the activations $a^\ell$ saved from the forward pass, since $\nabla_{W^\ell} C$ depends on $a^{\ell-1}$.) For standard gradient descent, updates to the weights and biases can be done in-place, so computing those gradients requires no additional storage.
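As a sketch of that scheme with plain SGD (the learning rate `lr` is my own parameter here): only the current and next $\delta$ are ever held, and the weights and biases are updated in place as the backward pass proceeds.

```python
def sgd_step_inplace(grad_aL, zs, activations, Ws, bs, lr):
    delta = grad_aL * sigma_prime(zs[-1])   # delta^L
    for l in reversed(range(len(Ws))):
        # Propagate delta through W^{l+1} *before* overwriting it in place.
        delta_prev = backprop_step(delta, Ws[l], zs[l - 1]) if l > 0 else None
        Ws[l] -= lr * np.outer(delta, activations[l])  # in-place weight update
        bs[l] -= lr * delta                            # in-place bias update
        delta = delta_prev
```

The ordering matters: the recursion for $\delta^{\ell}$ uses $W^{\ell+1}$, so each layer’s delta must be propagated before that layer’s weights are overwritten.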