Convolutional Neural Networks backpropagation: from intuition to derivation

Disclaimer: It is assumed that the reader is familiar with terms such as Multilayer Perceptron, delta errors or backpropagation. If not,  it is recommended to read for example a chapter 2 of free online book ‘Neural Networks and Deep Learning’ by Michael Nielsen.   

Convolutional Neural Networks (CNN) are now a standard way of image classification – there are publicly accessible deep learning frameworks, trained models and services. It’s more time consuming to install stuff like caffe than to perform state-of-the-art object classification or detection. We also have many methods of getting knowledge -there is a large number of deep learning courses/MOOCs, free e-books or even direct ways of accessing to the strongest Deep/Machine Learning minds such as Yoshua Bengio, Andrew NG or Yann Lecun by Quora, Facebook or G+.

Nevertheless, when I wanted to get deeper insight in CNN, I could not find a “CNN backpropagation for dummies”. Notoriously I met with statements like:  “If you understand backpropagation in standard neural networks, there should not be a problem with understanding it in CNN” or “All things are nearly the same, except matrix multiplications are replaced by convolutions”. And of course I saw tons of ready equations.

It was a little consoling, when I found out that I am not alone, for example: Hello, when computing the gradients CNN,  the weights need to be rotated, Why ? 


The answer on above question, that concerns the need of rotation on weights in gradient computing, will be a result of this long post.

We start from multilayer perceptron and counting delta errors on fingers:

Kopia Kolejna prezentacja o sieciach neuronowych (16)

We see on above picture that \delta^1_1 is proportional to deltas from next layer that are scaled by weights.

But how do we connect concept of MLP with Convolutional Neural Network? Let’s play with MLP:

Transforming Multilayer Perceptron to Convolutional Neural Network

If you are not sure that after connections cutting and weights sharing we get one layer Convolutional Neural Network, I hope that below picture will convince you:

Feedforward in CNN is identical with convolution operation

The idea behind this figure is to show, that such neural network configuration  is identical with a 2D convolution operation and weights are just filters (also called kernels, convolution matrices, or masks).

Now we can come back to gradient computing by counting on fingers, but from now we will be only focused on CNN. Let’s begin:

Backpropagation also results with convolution

No magic here, we have just summed in “blue layer” scaled by weights gradients from “orange” layer. Same process as in MLP’s backpropagation. However, in the standard approach we talk about dot products and here we have … yup, again convolution:


Screenshot from 2016-04-17 21:20:43

Yeah, it is a bit different convolution than in previous (forward) case. There we did so called valid convolution, while here we do a full convolution (more about nomenclature here). What is more, we rotate our kernel by 180 degrees. But still, we are talking about convolution!

Now, I have some good news and some bad news:

  1. you see (BTW, sorry for pictures aesthetics 🙂 ), that matrix dot products are replaced by convolution operations both in feed forward and backpropagation.
  2. you know that seeing something and understanding something … yup, we are going now to get our hands dirty and prove above statement 🙂 before getting next, I recommend to read, mentioned already in the disclaimer, chapter 2 of M. Nielsen book. I tried to make all quantities to be consistent with work of Michael. 

In the standard MLP, we can define an error of neuron j as:

 \delta^l_j = \frac{\partial C}{\partial z^l_j}

where z^l_j is just:

 z^l_j = \sum\limits_{k} w^l_{jk} a^{l-1}_k + b^l_j

and for clarity, a_j^l = \sigma(z_j^l) , where \sigma is an activation function such as sigmoid, hyperbolic tangent or relu.

But here, we do not have MLP but CNN and matrix multiplications are replaced by convolutions as we discussed before. So instead of z_j  we do have a z_{x,y}:

 z_{x,y}^{l+1} = w^{l+1} * \sigma(z_{x,y}^l) + b_{x,y}^{l+1} = \sum \limits_{a} \sum \limits_{b} w_{a,b}^{l+1}\sigma(z_{x-a,y-b}^l)+ b_{x,y}^{l+1}

Above equation is just a convolution operation during feedforward phase illustrated in the above picture titled ‘Feedforward in CNN is identical with convolution operation’

Now we can get to the point and answer the question Hello, when computing the gradients CNN,  the weights need to be rotated, Why ? 

We start from statement:

 \delta_{x,y}^l = \frac{\partial C}{\partial z_{x,y}^l} =\sum \limits_{x'} \sum \limits_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}}\frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l}

We know that z_{x,y}^l is in relation to z_{x',y'}^{l+1} which is indirectly showed in  the above picture titled ‘Backpropagation also results with convolution’. So sums are the result of chain rule. Let’s move on:

 \frac{\partial C}{\partial z_{x,y}^l} =\sum \limits_{x'} \sum \limits_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}}\frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l} = \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum\limits_{a}\sum\limits_{b}w_{a,b}^{l+1}\sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})}{\partial z_{x,y}^l}

First term is replaced by definition of error, while second has become large because we put it here expression on z_{x',y'}^{l+1}. However, we do not have to fear of this big monster – all components of sums equal 0, except these ones that are indexed: x=x'-a and y=y'-b. So:

 \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum\limits_{a}\sum\limits_{b}w_{a,b}^{l+1}\sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})}{\partial z_{x,y}^l} = \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} w_{a,b}^{l+1} \sigma'(z_{x,y}^l)

If x=x'-a and y=y'-b then it is obvious that a=x'-x and b=y'-y so we can reformulate above equation to:

  \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} w_{a,b}^{l+1} \sigma'(z_{x,y}^l) =\sum \limits_{x'}\sum \limits_{y'} \delta_{x',y'}^{l+1} w_{x'-x,y'-y}^{l+1} \sigma'(z_{x,y}^l) 

OK, our last equation is just …

  \sum \limits_{x'}\sum \limits_{y'} \delta_{x',y'}^{l+1} w_{x'-x,y'-y}^{l+1} \sigma'(z_{x,y}^l)= \delta^{l+1} * w_{-x,-y}^{l+1} \sigma'(z_{x,y}^l)

Where is the rotation of weights? Actually ROT180(w_{x,y}^{l+1}) = w_{-x, -y}^{l+1}.

So the answer on question Hello, when computing the gradients CNN,  the weights need to be rotated, Why ?  is simple: the rotation of the weights just results from derivation of delta error in Convolution Neural Network.

OK, we are really close to the end. One more ingredient of backpropagation algorithm is update of weights \frac{\partial C}{\partial w_{a,b}^l} :

 \frac{\partial C}{\partial w_{a,b}^l} = \sum \limits_{x} \sum\limits_{y} \frac{\partial C}{\partial z_{x,y}^l}\frac{\partial z_{x,y}^l}{\partial w_{a,b}^l} = \sum \limits_{x}\sum \limits_{y}\delta_{x,y}^l  \frac{\partial(\sum\limits_{a'}\sum\limits_{b'}w_{a',b'}^l\sigma(z_{x-a', y-b'}^l) + b_{x,y}^l)}{\partial w_{a,b}^l} =\sum \limits_{x}\sum \limits_{y} \delta_{x,y}^l \sigma(z_{x-a,y-b}^{l-1}) = \delta_{a,b}^l * \sigma(z_{-a,-b}^{l-1}) =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1})) 

So paraphrasing the backpropagation algorithm  for CNN:

  1. Input x: set the corresponding activation a^1 for the input layer.
  2. Feedforward: for each l = 2,3, …,L compute  z_{x,y}^l = w^l * \sigma(z_{x,y}^{l-1}) + b_{x,y}^l and  a_{x,y}^l = \sigma(z_{x,y}^l)
  3. Output error  \delta^L: Compute the vector  \delta^L = \nabla_a C \odot \sigma'(z^L)
  4. Backpropagate the error: For each l=L-1,L-2,…,2 compute \delta_{x,y}^l =\delta^{l+1} * ROT180(w_{x,y}^{l+1}) \sigma'(z_{x,y}^l)
  5. Output: The gradient of the cost function is given by  \frac{\partial C}{\partial w_{a,b}^l} =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1}))

The end 🙂