# Convolutional Autoencoder for Dummies

Each day, I become a bigger fan of Lasagne. Recently, after seeing some cool stuff done with a Variational Autoencoder trained on Blade Runner, I tried to implement a much simpler Convolutional Autoencoder, trained on a much simpler dataset: MNIST. The task turned out to be really easy, thanks to two layers that already exist in Lasagne: Deconv2DLayer and Upscale2DLayer. My Convolutional Autoencoder consists of two stages:

1. Coding consists of convolutions and maxpoolings
2. Decoding consists of upscalings and deconvolutions.

The thought experiment you need to go through to see how easy this is: deconvolutions are just convolutions! What is more, if you have read my post Convolutional Neural Networks backpropagation: from intuition to derivation, you have already seen this concept in the backpropagation phase!

Citing myself (I feel really embarrassed now for this didactic tone …):

Yeah, it is a bit different convolution than in previous (forward) case. There we did so called valid convolution, while here we do a full convolution (more about nomenclature here). What is more, we rotate our kernel by 180 degrees. But still, we are talking about convolution!
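To make the quoted claim concrete, here is a minimal pure-Python sketch (it is not taken from Lasagne; all helper names are hypothetical and chosen only for this illustration). Assuming that a "deconvolution" is the transpose of a valid convolution, it checks on a tiny example that this transpose gives the same result as a full convolution with the kernel rotated by 180 degrees.

```python
def rot180(k):
    """Rotate a 2D kernel (list of lists) by 180 degrees."""
    return [row[::-1] for row in k[::-1]]


def conv2d_valid(x, k):
    """Valid convolution: the (flipped) kernel always stays inside x."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    out[i][j] += k[a][b] * x[i + kh - 1 - a][j + kw - 1 - b]
    return out


def conv2d_full(x, k):
    """Full convolution: every partial overlap of x and k contributes."""
    kh, kw = len(k), len(k[0])
    out = [[0] * (len(x[0]) + kw - 1) for _ in range(len(x) + kh - 1)]
    for i in range(len(x)):
        for j in range(len(x[0])):
            for a in range(kh):
                for b in range(kw):
                    out[i + a][j + b] += x[i][j] * k[a][b]
    return out


def conv2d_valid_transposed(g, k):
    """Transpose of conv2d_valid: scatter each g[i][j] back through k."""
    kh, kw = len(k), len(k[0])
    out = [[0] * (len(g[0]) + kw - 1) for _ in range(len(g) + kh - 1)]
    for i in range(len(g)):
        for j in range(len(g[0])):
            for a in range(kh):
                for b in range(kw):
                    out[i + kh - 1 - a][j + kw - 1 - b] += g[i][j] * k[a][b]
    return out


def dot(p, q):
    """Sum of element-wise products of two equally shaped 2D lists."""
    return sum(u * v for rp, rq in zip(p, q) for u, v in zip(rp, rq))


kernel = [[1, 2], [3, 4]]
small_map = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]  # a 3x3 map to be "deconvolved"
image = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 0, 2]]

via_transpose = conv2d_valid_transposed(small_map, kernel)
via_full_conv = conv2d_full(small_map, rot180(kernel))

# The "deconvolution" is an ordinary (full) convolution with a rotated kernel.
print(via_transpose == via_full_conv)
# It really is the transpose: <conv(x), g> equals <x, conv_transposed(g)>.
print(dot(conv2d_valid(image, kernel), small_map) == dot(image, via_transpose))
```

Integer inputs are used on purpose, so both comparisons are exact and no floating-point tolerance is needed.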

The Upscale operation seems to be obvious, so I think there is no magic now and we can go into code. As you can see, the Autoencoder is a very symmetric beast. I tried to show it in the snippet:

    import lasagne


    def build_convolutional_autoencoder(input_var=None):
        # Encoder: two (convolution -> max-pooling) stages on 1x28x28 MNIST images.
        l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                         input_var=input_var)

        auto_conv1A = lasagne.layers.Conv2DLayer(
            l_in, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.tanh,
            W=lasagne.init.GlorotUniform())

        auto_maxpool1A = lasagne.layers.MaxPool2DLayer(auto_conv1A, pool_size=(2, 2))

        auto_conv2A = lasagne.layers.Conv2DLayer(
            auto_maxpool1A, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.tanh)

        auto_maxpool2A = lasagne.layers.MaxPool2DLayer(auto_conv2A, pool_size=(2, 2))

        auto_dense1A = lasagne.layers.FlattenLayer(auto_maxpool2A)  # 32 * 4 * 4 = 512 neurons

        # Bottleneck: 512 -> 256 -> 512.
        auto_dense2 = lasagne.layers.DenseLayer(
            lasagne.layers.dropout(auto_dense1A, p=.5),
            num_units=256,
            nonlinearity=lasagne.nonlinearities.tanh)

        auto_dense1B = lasagne.layers.DenseLayer(
            lasagne.layers.dropout(auto_dense2, p=.5),
            num_units=512,
            nonlinearity=lasagne.nonlinearities.tanh)

        # Decoder: mirror of the encoder. [0] keeps the batch dimension of the input.
        auto_dense1B = lasagne.layers.ReshapeLayer(auto_dense1B, ([0], 32, 4, 4))

        auto_maxpool2B = lasagne.layers.Upscale2DLayer(auto_dense1B, 2)  # 2 - scale factor

        # Transposed convolution with weights tied to auto_conv2A;
        # num_filters is the channel count of the mirrored layer's input.
        auto_conv2B = lasagne.layers.Deconv2DLayer(
            auto_maxpool2B, auto_conv2A.input_shape[1], auto_conv2A.filter_size,
            stride=auto_conv2A.stride, crop=auto_conv2A.pad,
            W=auto_conv2A.W, flip_filters=not auto_conv2A.flip_filters)

        auto_maxpool1B = lasagne.layers.Upscale2DLayer(auto_conv2B, 2)  # 2 - scale factor

        auto_conv1B = lasagne.layers.Deconv2DLayer(
            auto_maxpool1B, auto_conv1A.input_shape[1], auto_conv1A.filter_size,
            stride=auto_conv1A.stride, crop=auto_conv1A.pad,
            W=auto_conv1A.W, flip_filters=not auto_conv1A.flip_filters)

        return auto_conv1B
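The symmetry of the network is easiest to see by following the spatial size of the feature maps. The sketch below is plain Python, not Lasagne code. It assumes the layer defaults I believe Lasagne uses here (no padding and stride 1 in the convolutions, pooling stride equal to the pool size, crop 0 in the deconvolutions), and the helper names are made up for the illustration.

```python
def conv_valid(size, filter_size):
    """Convolution with no padding and stride 1 shrinks each side."""
    return size - filter_size + 1


def max_pool(size, pool_size):
    """Non-overlapping max-pooling (stride equal to pool_size)."""
    return size // pool_size


def upscale(size, factor):
    """Upscaling repeats every pixel `factor` times along each side."""
    return size * factor


def deconv(size, filter_size):
    """Transposed convolution with crop 0 and stride 1 grows each side."""
    return size + filter_size - 1


def trace_autoencoder(size=28):
    """Return (layer name, side length) pairs for the layers of the snippet."""
    steps = [("l_in", size)]
    for name, op, arg in [
        ("auto_conv1A", conv_valid, 5),
        ("auto_maxpool1A", max_pool, 2),
        ("auto_conv2A", conv_valid, 5),
        ("auto_maxpool2A", max_pool, 2),  # flattened to 32 * 4 * 4 = 512
        ("auto_maxpool2B", upscale, 2),   # after reshaping back to 32 x 4 x 4
        ("auto_conv2B", deconv, 5),
        ("auto_maxpool1B", upscale, 2),
        ("auto_conv1B", deconv, 5),
    ]:
        size = op(size, arg)
        steps.append((name, size))
    return steps


for name, size in trace_autoencoder():
    print(name, size)
```

The sizes go 28, 24, 12, 8, 4 on the way down and 8, 12, 24, 28 on the way up, so the output has the same 28x28 shape as the input.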


And here are some reconstructions (right pictures):

I hope you like it. I know it is not a Variational Autoencoder (work in progress), and it is not even a Denoising Autoencoder, but the results seem to be quite fine. What’s more, it took only 75 minutes on a laptop to get them 🙂

# The simple example of Theano and Lasagne super power

I mentioned in my initial post “Deep Learning Frameworks Overview” that my Deep Learning library of choice is (at least for now) the Theano and Lasagne combination. However, in none of my posts have I used one of the most important words: experiment. So, let’s assume that you have some idea and want to test it quickly. For example, what if we add to a standard CNN (max-pooling omitted for clarity) an extra “convolutional branch” that is concatenated with the last-but-one layer? This experiment is really easy to do in the Theano-based Lasagne. I have just added a build_modified_cnn method to the MNIST example (the layers with the A suffix, together with l_concat, form my “convolutional branch”; the rest is the same as in the standard build_cnn method):

    import lasagne


    def build_modified_cnn(input_var=None):
        l_in = lasagne.layers.InputLayer(shape=(None, 1, 28, 28),
                                         input_var=input_var)

        l_conv1 = lasagne.layers.Conv2DLayer(
            l_in, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.rectify,
            W=lasagne.init.GlorotUniform())

        # Extra "convolutional branch": large filters applied directly to the input.
        l_conv1A = lasagne.layers.Conv2DLayer(
            l_in, num_filters=32, filter_size=(10, 10),
            nonlinearity=lasagne.nonlinearities.rectify,
            W=lasagne.init.GlorotUniform())

        l_maxpool1A = lasagne.layers.MaxPool2DLayer(l_conv1A,
                                                    pool_size=(5, 5))

        l_dense1A = lasagne.layers.FlattenLayer(l_maxpool1A)

        # Standard path, identical to build_cnn.
        l_maxpool1 = lasagne.layers.MaxPool2DLayer(l_conv1,
                                                   pool_size=(2, 2))

        l_conv2 = lasagne.layers.Conv2DLayer(
            l_maxpool1, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.rectify)

        l_maxpool2 = lasagne.layers.MaxPool2DLayer(l_conv2,
                                                   pool_size=(2, 2))

        l_dense1 = lasagne.layers.DenseLayer(
            lasagne.layers.dropout(l_maxpool2, p=.5),
            num_units=256,
            nonlinearity=lasagne.nonlinearities.rectify)

        # Join the branch with the last-but-one layer of the standard path.
        l_concat = lasagne.layers.ConcatLayer([l_dense1,
                                               l_dense1A])

        l_dense2 = lasagne.layers.DenseLayer(
            lasagne.layers.dropout(l_concat, p=.5),
            num_units=10,
            nonlinearity=lasagne.nonlinearities.softmax)

        return l_dense2
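A quick sanity check of what ConcatLayer actually joins here. This is a plain-Python sketch, not Lasagne code, with hypothetical helper names. It assumes unpadded, stride-1 convolutions and pooling that drops an incomplete border, which is what I believe the Lasagne defaults do.

```python
def conv_valid(size, filter_size):
    """Side length after a convolution with no padding and stride 1."""
    return size - filter_size + 1


def max_pool(size, pool_size):
    """Side length after non-overlapping pooling; a partial border is dropped."""
    return size // pool_size


NUM_FILTERS = 32

# Standard path: conv 5x5 -> pool 2 -> conv 5x5 -> pool 2 -> dense with 256 units.
main_side = max_pool(conv_valid(max_pool(conv_valid(28, 5), 2), 5), 2)
main_units = 256  # num_units of l_dense1

# Branch: conv 10x10 -> pool 5 -> flatten.
branch_side = max_pool(conv_valid(28, 10), 5)
branch_units = NUM_FILTERS * branch_side * branch_side

# ConcatLayer joins both vectors along the feature axis.
concat_units = main_units + branch_units
print(main_side, branch_side, branch_units, concat_units)
```

Under these assumptions the branch contributes 288 features (32 maps of 3x3), so the softmax layer sees 544 inputs instead of 256.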


We see here some standard layers, such as Conv2DLayer and MaxPool2DLayer, and some less standard but self-explanatory ones: FlattenLayer and ConcatLayer. Some results of the test on the MNIST dataset:

| Type | Epoch | Accuracy | Time |
| --- | --- | --- | --- |
| Standard CNN | 1 | 56.75 % | 6.767s |
| Standard CNN | 30 | 96.76 % | 6.923s |
| Modified CNN | 1 | 72.51 % | 9.552s |
| Modified CNN | 30 | 96.03 % | 9.592s |

Not this time. My modification makes significant progress if we look only at the first epoch. However, in general, it learns more slowly than the standard one. But to see that, I needed only a few minutes of coding, and this is the true power of Theano and Theano-based libs!

# Convolutional Neural Networks backpropagation: from intuition to derivation

Disclaimer: It is assumed that the reader is familiar with terms such as Multilayer Perceptron, delta errors and backpropagation. If not, it is recommended to read, for example, chapter 2 of the free online book ‘Neural Networks and Deep Learning’ by Michael Nielsen.

Convolutional Neural Networks (CNN) are now a standard way of doing image classification: there are publicly accessible deep learning frameworks, trained models and services. It takes more time to install stuff like caffe than to perform state-of-the-art object classification or detection. We also have many ways of gaining knowledge: there is a large number of deep learning courses/MOOCs, free e-books, and even direct ways of reaching the strongest Deep/Machine Learning minds, such as Yoshua Bengio, Andrew Ng or Yann LeCun, via Quora, Facebook or G+.

Nevertheless, when I wanted to get a deeper insight into CNNs, I could not find a “CNN backpropagation for dummies”. I kept running into statements like: “If you understand backpropagation in standard neural networks, there should not be a problem with understanding it in CNN” or “All things are nearly the same, except matrix multiplications are replaced by convolutions”. And of course I saw tons of ready-made equations.

It was a little consoling to find out that I am not alone, for example: “Hello, when computing the gradients CNN, the weights need to be rotated, Why ?” The answer to this question, which concerns the need to rotate the weights when computing gradients, will be the result of this long post.

We start from a multilayer perceptron and count the delta errors on our fingers: We see in the above picture that $\delta^1_1$ is proportional to the deltas from the next layer, scaled by the weights.
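The same finger-counting can be written as a few lines of plain Python. This is only a sketch under my own assumptions: a sigmoid activation, the indexing convention of Nielsen's book (w_next[k][j] connects neuron j of the current layer with neuron k of the next one), and numbers invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagate_delta(w_next, delta_next, z):
    """delta^l_j = sigma'(z^l_j) * sum_k w^{l+1}_{kj} * delta^{l+1}_k."""
    return [
        sigmoid_prime(z[j])
        * sum(w_next[k][j] * delta_next[k] for k in range(len(delta_next)))
        for j in range(len(z))
    ]


# Two neurons in the current layer, two in the next one (made-up values).
w_next = [[1.0, 2.0], [3.0, 4.0]]
delta_next = [0.5, -1.0]
z = [0.0, 0.0]  # sigma'(0) = 0.25

delta = backpropagate_delta(w_next, delta_next, z)
print(delta)  # [-0.625, -0.75]
```

Each delta is the weighted sum of the next layer's deltas, here -2.5 and -3.0, multiplied by the derivative of the activation.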

But how do we connect the concept of MLP with a Convolutional Neural Network? Let’s play with the MLP:

If you are not sure that, after cutting connections and sharing weights, we get a one-layer Convolutional Neural Network, I hope that the picture below will convince you:

The idea behind this figure is to show that such a neural network configuration is identical to a 2D convolution operation, and the weights are just filters (also called kernels, convolution matrices, or masks).
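For readers who prefer numbers to pictures, here is a small pure-Python sketch of the same idea (the helper names and the 3x3 example are mine, not part of the original figures). It builds the weight matrix of a fully connected layer in which most connections are cut (set to zero) and the remaining ones share the four kernel values, and then checks that this layer computes exactly a valid 2D convolution.

```python
def conv2d_valid(x, k):
    """Valid 2D convolution of a list-of-lists image with a kernel."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    out[i][j] += k[a][b] * x[i + kh - 1 - a][j + kw - 1 - b]
    return out


def conv_as_mlp_weights(k, height, width):
    """Dense weight matrix (outputs x inputs) equivalent to conv2d_valid."""
    kh, kw = len(k), len(k[0])
    oh, ow = height - kh + 1, width - kw + 1
    weights = [[0] * (height * width) for _ in range(oh * ow)]
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    row = i * ow + j
                    col = (i + kh - 1 - a) * width + (j + kw - 1 - b)
                    weights[row][col] = k[a][b]  # shared weight
    return weights


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 2], [3, 4]]

weights = conv_as_mlp_weights(kernel, 3, 3)
flat_image = [v for row in image for v in row]

mlp_output = [sum(w * v for w, v in zip(row, flat_image)) for row in weights]
conv_output = [v for row in conv2d_valid(image, kernel) for v in row]
kept = sum(1 for row in weights for w in row if w != 0)

print(mlp_output, conv_output)
print(kept, "of", len(weights) * len(weights[0]), "connections kept")
```

Only 16 of the 36 possible connections survive, and they carry just four distinct weights, which is all that "cutting connections and sharing weights" means.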

Now we can come back to computing gradients by counting on fingers, but from now on we will focus only on CNN. Let’s begin:

No magic here: in the “blue layer” we have just summed the gradients from the “orange” layer, scaled by the weights. It is the same process as in the MLP’s backpropagation. However, in the standard approach we talk about dot products, and here we have … yup, convolution again: Yeah, it is a bit different convolution than in the previous (forward) case. There we did a so-called valid convolution, while here we do a full convolution (more about nomenclature here). What is more, we rotate our kernel by 180 degrees. But still, we are talking about convolution!

Now, I have some good news and some bad news:

1. you see (BTW, sorry for the pictures’ aesthetics 🙂 ) that matrix dot products are replaced by convolution operations both in the feedforward and in the backpropagation phase.
2. you know that seeing something and understanding something … yup, we are now going to get our hands dirty and prove the above statement 🙂 Before going further, I recommend reading chapter 2 of M. Nielsen’s book, already mentioned in the disclaimer. I have tried to keep all quantities consistent with Michael’s work.

In the standard MLP, we can define an error of neuron j as: $\delta^l_j = \frac{\partial C}{\partial z^l_j}$

where $z^l_j$ is just: $z^l_j = \sum\limits_{k} w^l_{jk} a^{l-1}_k + b^l_j$

and for clarity, $a_j^l = \sigma(z_j^l)$ , where $\sigma$ is an activation function such as sigmoid, hyperbolic tangent or relu.

But here we do not have an MLP but a CNN, and matrix multiplications are replaced by convolutions, as we discussed before. So instead of $z_j$ we have $z_{x,y}$: $z_{x,y}^{l+1} = w^{l+1} * \sigma(z_{x,y}^l) + b_{x,y}^{l+1} = \sum \limits_{a} \sum \limits_{b} w_{a,b}^{l+1}\sigma(z_{x-a,y-b}^l)+ b_{x,y}^{l+1}$

The above equation is just the convolution operation of the feedforward phase, illustrated in the above picture titled ‘Feedforward in CNN is identical with convolution operation’

Now we can get to the point and answer the question Hello, when computing the gradients CNN,  the weights need to be rotated, Why ?

We start from statement: $\delta_{x,y}^l = \frac{\partial C}{\partial z_{x,y}^l} =\sum \limits_{x'} \sum \limits_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}}\frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l}$

We know that $z_{x,y}^l$ is related to $z_{x',y'}^{l+1}$, which is indirectly shown in the above picture titled ‘Backpropagation also results with convolution’. So the sums are the result of the chain rule. Let’s move on: $\frac{\partial C}{\partial z_{x,y}^l} =\sum \limits_{x'} \sum \limits_{y'}\frac{\partial C}{\partial z_{x',y'}^{l+1}}\frac{\partial z_{x',y'}^{l+1}}{\partial z_{x,y}^l} = \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum\limits_{a}\sum\limits_{b}w_{a,b}^{l+1}\sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})}{\partial z_{x,y}^l}$

The first term is replaced by the definition of the error, while the second has become large because we have substituted the expression for $z_{x',y'}^{l+1}$. However, we do not have to fear this big monster: all components of the sums equal 0, except the ones that are indexed by $x=x'-a$ and $y=y'-b$. So: $\sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} \frac{\partial(\sum\limits_{a}\sum\limits_{b}w_{a,b}^{l+1}\sigma(z_{x'-a, y'-b}^l) + b_{x',y'}^{l+1})}{\partial z_{x,y}^l} = \sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} w_{a,b}^{l+1} \sigma'(z_{x,y}^l)$

If $x=x'-a$ and $y=y'-b$ then it is obvious that $a=x'-x$ and $b=y'-y$ so we can reformulate above equation to: $\sum \limits_{x'} \sum \limits_{y'} \delta_{x',y'}^{l+1} w_{a,b}^{l+1} \sigma'(z_{x,y}^l) =\sum \limits_{x'}\sum \limits_{y'} \delta_{x',y'}^{l+1} w_{x'-x,y'-y}^{l+1} \sigma'(z_{x,y}^l)$

OK, our last equation is just … $\sum \limits_{x'}\sum \limits_{y'} \delta_{x',y'}^{l+1} w_{x'-x,y'-y}^{l+1} \sigma'(z_{x,y}^l)= \delta^{l+1} * w_{-x,-y}^{l+1} \sigma'(z_{x,y}^l)$

Where is the rotation of weights? Actually $ROT180(w_{x,y}^{l+1}) = w_{-x, -y}^{l+1}$.

So the answer to the question “Hello, when computing the gradients CNN, the weights need to be rotated, Why ?” is simple: the rotation of the weights simply results from the derivation of the delta error in a Convolutional Neural Network.
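If you do not trust the algebra, the result can be checked numerically. The sketch below is pure Python with hypothetical helper names, and it rests on several assumptions of mine: a sigmoid activation, a single scalar bias instead of $b_{x,y}$, and a toy cost $C = \frac{1}{2}\sum (z^{l+1})^2$ chosen so that $\delta^{l+1} = z^{l+1}$. It compares the delta obtained from a full convolution with the rotated kernel against a finite-difference estimate of $\partial C / \partial z^l$.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def rot180(k):
    return [row[::-1] for row in k[::-1]]


def conv2d_valid(x, k):
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    out[i][j] += k[a][b] * x[i + kh - 1 - a][j + kw - 1 - b]
    return out


def conv2d_full(x, k):
    kh, kw = len(k), len(k[0])
    out = [[0.0] * (len(x[0]) + kw - 1) for _ in range(len(x) + kh - 1)]
    for i in range(len(x)):
        for j in range(len(x[0])):
            for a in range(kh):
                for b in range(kw):
                    out[i + a][j + b] += x[i][j] * k[a][b]
    return out


def forward(z, w, bias):
    """z^{l+1} = w * sigma(z^l) + b (scalar bias: an assumption for brevity)."""
    a = [[sigmoid(v) for v in row] for row in z]
    return [[v + bias for v in row] for row in conv2d_valid(a, w)]


def cost(z, w, bias):
    """Toy cost C = 0.5 * sum((z^{l+1})^2), so that delta^{l+1} = z^{l+1}."""
    return 0.5 * sum(v * v for row in forward(z, w, bias) for v in row)


def analytic_delta(z, w, bias):
    """delta^l = (delta^{l+1} full-convolved with ROT180(w)) * sigma'(z^l)."""
    back = conv2d_full(forward(z, w, bias), rot180(w))
    return [[back[i][j] * sigmoid(z[i][j]) * (1.0 - sigmoid(z[i][j]))
             for j in range(len(z[0]))] for i in range(len(z))]


def numeric_delta(z, w, bias, eps=1e-5):
    """Central finite differences of the cost with respect to every z^l_{x,y}."""
    grad = [[0.0] * len(z[0]) for _ in z]
    for i in range(len(z)):
        for j in range(len(z[0])):
            original = z[i][j]
            z[i][j] = original + eps
            c_plus = cost(z, w, bias)
            z[i][j] = original - eps
            c_minus = cost(z, w, bias)
            z[i][j] = original
            grad[i][j] = (c_plus - c_minus) / (2.0 * eps)
    return grad


z = [[0.1, -0.4, 0.3, 0.8], [0.5, 0.2, -0.7, 0.0],
     [-0.3, 0.6, 0.9, -0.1], [0.4, -0.2, 0.1, 0.7]]
w = [[0.2, -0.5], [0.3, 0.1]]
bias = 0.05

max_error = max(
    abs(p - q)
    for row_p, row_q in zip(analytic_delta(z, w, bias), numeric_delta(z, w, bias))
    for p, q in zip(row_p, row_q)
)
print(max_error < 1e-6)
```

Replacing `rot180(w)` with plain `w` in `analytic_delta` makes the check fail for this asymmetric kernel, which is a practical way to convince yourself that the rotation is not optional.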

OK, we are really close to the end. One more ingredient of the backpropagation algorithm is the update of the weights $\frac{\partial C}{\partial w_{a,b}^l}$: $\frac{\partial C}{\partial w_{a,b}^l} = \sum \limits_{x} \sum\limits_{y} \frac{\partial C}{\partial z_{x,y}^l}\frac{\partial z_{x,y}^l}{\partial w_{a,b}^l} = \sum \limits_{x}\sum \limits_{y}\delta_{x,y}^l \frac{\partial(\sum\limits_{a'}\sum\limits_{b'}w_{a',b'}^l\sigma(z_{x-a', y-b'}^{l-1}) + b_{x,y}^l)}{\partial w_{a,b}^l} =\sum \limits_{x}\sum \limits_{y} \delta_{x,y}^l \sigma(z_{x-a,y-b}^{l-1}) = \delta_{a,b}^l * \sigma(z_{-a,-b}^{l-1}) =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1}))$

So paraphrasing the backpropagation algorithm  for CNN:

1. Input x: set the corresponding activation $a^1$ for the input layer.
2. Feedforward: for each l = 2,3, …,L compute $z_{x,y}^l = w^l * \sigma(z_{x,y}^{l-1}) + b_{x,y}^l$ and $a_{x,y}^l = \sigma(z_{x,y}^l)$
3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$
4. Backpropagate the error: For each l=L-1,L-2,…,2 compute $\delta_{x,y}^l =\delta^{l+1} * ROT180(w_{x,y}^{l+1}) \sigma'(z_{x,y}^l)$
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w_{a,b}^l} =\delta_{a,b}^l * \sigma(ROT180(z_{a,b}^{l-1}))$
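The last step can be verified in the same spirit. The following pure-Python sketch (hypothetical helper names again) treats a single convolutional layer as the output layer and uses a toy cost $C = \frac{1}{2}\sum (z^l)^2$ with a scalar bias, so that $\delta^l = z^l$. Both choices are simplifying assumptions of mine. It computes the weight gradient as a valid convolution of the rotated activations with the deltas and compares it with finite differences.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def rot180(k):
    return [row[::-1] for row in k[::-1]]


def conv2d_valid(x, k):
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    out[i][j] += k[a][b] * x[i + kh - 1 - a][j + kw - 1 - b]
    return out


def feedforward(a_prev, w, bias):
    """Step 2: z^l = w^l * a^{l-1} + b^l (scalar bias assumed)."""
    return [[v + bias for v in row] for row in conv2d_valid(a_prev, w)]


def cost(a_prev, w, bias):
    """Toy cost C = 0.5 * sum((z^l)^2); an assumption made for this check."""
    return 0.5 * sum(v * v for row in feedforward(a_prev, w, bias) for v in row)


def analytic_weight_gradient(a_prev, w, bias):
    """Step 5: dC/dw = delta^l convolved with ROT180(a^{l-1})."""
    delta = feedforward(a_prev, w, bias)  # step 3 for the toy cost
    return conv2d_valid(rot180(a_prev), delta)


def numeric_weight_gradient(a_prev, w, bias, eps=1e-5):
    """Central finite differences of the cost with respect to every weight."""
    grad = [[0.0] * len(w[0]) for _ in w]
    for a in range(len(w)):
        for b in range(len(w[0])):
            original = w[a][b]
            w[a][b] = original + eps
            c_plus = cost(a_prev, w, bias)
            w[a][b] = original - eps
            c_minus = cost(a_prev, w, bias)
            w[a][b] = original
            grad[a][b] = (c_plus - c_minus) / (2.0 * eps)
    return grad


z_prev = [[0.1, -0.4, 0.3, 0.8], [0.5, 0.2, -0.7, 0.0],
          [-0.3, 0.6, 0.9, -0.1], [0.4, -0.2, 0.1, 0.7]]
a_prev = [[sigmoid(v) for v in row] for row in z_prev]  # step 1: activations
w = [[0.2, -0.5], [0.3, 0.1]]
bias = 0.05

analytic = analytic_weight_gradient(a_prev, w, bias)
numeric = numeric_weight_gradient(a_prev, w, bias)
max_error = max(
    abs(p - q)
    for row_p, row_q in zip(analytic, numeric)
    for p, q in zip(row_p, row_q)
)
print(max_error < 1e-6)
```

The gradient has the same 2x2 shape as the kernel: a 4x4 activation map convolved (valid) with a 3x3 delta map leaves exactly one value per weight.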

The end 🙂

# Deep Learning Frameworks Overview

I have some experience with caffe, and it was my main tool for research in the area of Music Information Retrieval. However, Deep Learning is not limited to Convolutional Neural Networks, and caffe is not suitable for fast, prototype implementations. So I was faced with the question: What is the best Deep Learning framework?

Before we google it, let’s quora it. We can easily find a related question: Which is the best deep learning framework Theano Torch7 or Caffe ? I recommend reading the whole thread, but here I copy-paste some interesting parts:

If one wants to code up the entire algorithm for specific problem Theano is the quickest to get started with. It gives a comprehensive control over Neural Network formation . The reason we use Theano at ParallelDots is that the Neural Networks we make had no standard implementations and hence Theano was the best way to prototype them .

if you want to do more fundamental work like changing loss function or introducing some optimization constraint, you have to go to Theano (…) but I would like to warn you about complexities of Theano. It might happen that you waste 3 months just to understand the nity gritty of codes, by the time research has moved ahead.

Theano is very easy and quick to build back propagation. Torch7 is more transparent.

However, neither the question nor the answers mention Google’s piece of the cake, TensorFlow. So let’s quora it again. Now we get queries like: Is TensorFlow better than other leading libraries such as Torch Theano? What is unique about Tensorflow from the other existing Deep Learning Libraries?

TensorFlow is both an R&D and deployment framework. It can be deployed on phones too. For rest of the features, it is more or less like Theano.  So yes, it more or less subsumes Theano and Torch’s features.
For new projects I can see a rapid adoption of TensorFlow somewhat beating others. Old practitioners who have been working on Theano/Torch will continue to use these frameworks. At least for me, Theano fulfils pretty much all requirements and I dont have anything which I need and I already know how to program it.

TensorFlow performs non trivially worse than its competitors, in both speed and memory usage, Google is working on fixing this. I wouldn’t be surprised at performance parity in a couple of releases.
Benchmark TensorFlow · Issue #66 · soumith/convnet-benchmarks

The Hacker’s Machine Intelligence Platform just trolls Tensorflow.

I think there are two main differences at the moment, comparing it to the more mainstream libraries:
1. The visualization module (TensorBoard): One of the main lacking areas of almost all open source Machine Learning packages, was the ability to visually model and follow the computation pipeline.
2. The all-in-one hardware implementation approach: The libraries can be deployed in all kinds of hardware, from mobile devices to more powerful heterogeneous computing setups.

Lack of Symbolic loops (“scan” in Theano). Googles white paper mentions several control flow operations, but they are not ready yet.

Subgraph Execution is awesome. Being able to introduce and retrieve the results of discretionary data on any edge of the graph introduces considerable debugging potential into TensorFlow. I truthfully cannot undersell how useful this is, I can see on the fly execution of sub components making its way into my workflow nicely.

In summary:

• TensorFlow is not the winner on speed.
• TensorFlow is not only an R&D framework, but a deployment platform too.
• It has a great visualization module, TensorBoard.

A similar observation can be made by watching Justin Johnson’s lecture:

My own impressions of Justin’s lecture:

1. Justin does not mention a prosaic TensorFlow constraint: if you do not have a good GPU, you will not be able to install TensorFlow (my GeForce GT 755M is not enough)
2. I am disappointed by Torch. Lua appears to be evil (I do not know what is worse: global variables or indexing from 1 !?). Torch is also not recommended for Recurrent Neural Networks.
3. Andrei shows that debugging is cumbersome, however
a) the debugging options of Theano are not shown
b) I believe Theano’s problems with debugging are related to the computation graph paradigm that it shares with TensorFlow (‘graphs-code independence’).
4. At the end (the Batch-norm use case), Andrei recommends using Torch when we want efficient backprop, and Theano or TensorFlow when we do not want to derive analytic equations. However, in Matrix factorization with Theano it is shown that:

there doesn’t seems to be any gain in analytically deriving a function with respect to using the automatic derivation capabilities of Theano.

Nevertheless, I like this lecture very much. Here you can find the slides; on slide 145 there is a table with an overview.

Coming back to Theano, there are many Theano-based libs. Probably the easiest way to start training a CNN or RNN is to use keras (which can also use TensorFlow as a backend!). One of the main features of keras is abstraction: it hides the backend (Theano/TensorFlow). However, sometimes we want to get our hands dirty and have easier access to Theano. In this case, Lasagne is probably a better choice.

So after all, my choice of Deep Learning framework (after a longer journey with caffe and a shorter one with keras) will probably be Theano + Lasagne + nolearn (helper functions around Lasagne). It is very probable that later I will switch to TensorFlow, which is really tempting even today, but the Theano-based toolbox is convincing for the following three reasons:

1. Theano has been public since 2010 (so googling it is more fruitful)
2. I am also interested in the Bayesian approach and pymc3
3. I believe that the ‘graph computation paradigm’ is something more than code, so switching to TensorFlow is easier for a Theano expert (and I do not exclude such a switch).

At the end, if you are interested in Theano/Lasagne, here is a list of links to interesting educational materials:

• Theano, a short practical guide (presentation by Emmanuel Bengio)
• From multiplication to convolutional networks (Theano presentation, codes)
• Using convolutional neural nets to detect facial keypoints tutorial (blog post by Daniel Nouri, Lasagne+nolearn)
• Recurrent Neural Networks Tutorial (great series about making RNNs in numpy and Theano by Denny Britz)
• Neural networks with Theano and Lasagne (Theano/Lasagne tutorial by Eben Olson)

Have fun!

Edit: A really good answer has appeared on Quora just now (6.05.2016) so I feel forced to place it here: Is TensorFlow better than other leading libraries such as Torch/Theano?