Quick guide to maxout networks

Dear {{reader_name}},
You wanted to know more about maxout networks and how to code one in TensorFlow. So I will tell you what I learned about them after reading the paper by Goodfellow et al. [1].

A simple definition of a maxout network is a feedforward neural network that uses maxout units as its activation function.
The main difference with other activation functions like tanh or ReLU is that a maxout unit learns its own nonlinear function. The idea is inspired by the Stone-Weierstrass approximation theorem: any continuous function can be approximated arbitrarily well by a piecewise-linear function, i.e. by a set of linear pieces. For example, if we use three straight lines to approximate the quadratic function, we obtain this figure:

3 lines approximating a quadratic function defined on [-30, 30]

As you can see, with enough linear pieces a maxout unit can approximate more complex convex functions, and this carries over to inputs with more than one dimension.
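
To make that concrete, here is a small sketch (the tangent points are my own choice, not necessarily the lines used in the figure above): taking the tangents of \(x^2\) at \(x_0 = -20, 0, 20\) and keeping their pointwise maximum gives a piecewise-linear approximation from below.

```python
import numpy as np

# Tangent line of f(x) = x^2 at x0 is y = 2*x0*x - x0^2 (the points -20, 0, 20 are
# illustrative choices, not necessarily the lines used in the figure).
x = np.linspace(-30, 30, 7)
lines = np.stack([2 * x0 * x - x0 ** 2 for x0 in (-20.0, 0.0, 20.0)])

approx = lines.max(axis=0)   # pointwise maximum of the three lines = a maxout of 3 pieces
print(approx)                # [800. 400.   0.   0.   0. 400. 800.]
print(x ** 2)                # [900. 400. 100.   0. 100. 400. 900.]
```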

Now let’s see the maxout architecture.
Given an input \(x \in R^d\) and an output of dimension \(m\), a maxout layer implements the function

\(h_i(x) = \max_{j \in [1,k]} z_{ij}\) where \( z_{ij} = x^TW_{:ij} + b_{ij} \)

\(W \in R^{d \times m \times k}\) and \(b \in R^{m \times k}\) are the learnable parameters, and \(k\) is the number of linear pieces we want to use per unit.
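
As a sanity check, here is a direct NumPy translation of that formula with toy sizes (the values of d, m, and k are arbitrary):

```python
import numpy as np

# Direct translation of the maxout formula with toy sizes.
d, m, k = 4, 2, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(d,))
W = rng.normal(size=(d, m, k))          # W in R^{d x m x k}
b = rng.normal(size=(m, k))             # b in R^{m x k}

z = np.einsum('d,dmk->mk', x, W) + b    # z_ij = x^T W_:ij + b_ij
h = z.max(axis=1)                       # h_i(x) = max_j z_ij, one value per output unit
print(h.shape)                          # (2,)
```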

So each unit simply outputs the maximum over its \(k\) linear functions. In practice we can merge all the linear maps into a single matrix multiplication by defining \(W\) with shape \([d, m*k]\), then reshaping the result to \([m, k]\) before taking the max (as in the graph sketch below).

Below you can find my TensorFlow implementation of a simple approximation of the quadratic function; it is just the computational graph. The full script and the PyTorch version can be found here.
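
A minimal sketch of that graph, assuming a single maxout layer with k = 3 pieces trained with mean squared error (the sizes and learning rate here are illustrative choices, not necessarily those of the full script):

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph API

d, m, k = 1, 1, 3  # input dim, output dim, number of linear pieces

x = tf.placeholder(tf.float32, shape=[None, d])
y = tf.placeholder(tf.float32, shape=[None, m])

# All k linear pieces are packed into one weight matrix of shape [d, m*k].
W = tf.Variable(tf.random_normal([d, m * k]))
b = tf.Variable(tf.zeros([m * k]))

z = tf.matmul(x, W) + b                  # shape [batch, m*k]
z = tf.reshape(z, [-1, m, k])            # shape [batch, m, k]
h = tf.reduce_max(z, axis=2)             # maxout: take the max over the k pieces

loss = tf.reduce_mean(tf.square(h - y))  # mean squared error against the target
train_op = tf.train.AdamOptimizer(1e-2).minimize(loss)

# Fit the quadratic function on [-30, 30]
xs = np.linspace(-30, 30, 200).reshape(-1, 1).astype(np.float32)
ys = (xs ** 2).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        _, l = sess.run([train_op, loss], feed_dict={x: xs, y: ys})
    print('final loss:', l)
```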

The next thing to look at is the relation between maxout and dropout. As a reminder, dropout simulates a bag of neural networks by dropping a random subset of units from the original network at each training iteration. Each iteration therefore trains a different sub-network, and all these sub-networks share the same weights.

During the test phase we would like to average the predictions of all these networks, but unlike a real bagging ensemble we only have one model, and it is not feasible to explicitly average all the variants. The trick is to divide the weights by 2 (for a dropout probability of 0.5) when running the full model. (The justification is in [2], but I didn't read it.)
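
A quick way to convince yourself of this rule (my own toy check, not from the papers): for a purely linear layer with a keep probability of 0.5, averaging many dropped sub-networks gives the same result as running the full layer once with halved weights.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(100, 64))   # hypothetical hidden activations of the full network
W = rng.normal(size=(64, 1))     # outgoing weights

# Monte-Carlo average over many dropped sub-networks (each unit kept with probability 0.5)
masks = rng.binomial(1, 0.5, size=(10000, 64))
mc_average = np.mean([(h * m) @ W for m in masks], axis=0)

# Weight-scaling approximation: run the full layer once with the weights divided by 2
scaled = h @ (0.5 * W)

print(np.abs(mc_average - scaled).max())  # small: the two agree in expectation for a linear layer
```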

However, this weight-scaling method is not accurate for models using nonlinear activation functions with significant curvature, like tanh. Networks built from linear operations and max(.) are best placed to exploit dropout's approximate model averaging.

Now, I suppose you would like to know the advantage of maxout over ReLU. The main difference is that rectifier units include a constant zero in the max(.), and this zero affects the optimization process. Without dropout there is no problem: the rectifier units saturate less than 5% of the time. With dropout training, the saturation rate increases by 55%. The zero in \(\max(0, z)\) blocks the gradient: it is difficult for a saturated unit to become active again, because zero is a constant. In a maxout unit, the role of that zero is played by another learned function, which can be adjusted.
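
Here is a tiny illustration of that point (my own example, not from the paper): for an input that puts the rectifier on its zero side, the gradient with respect to its weights is exactly zero, while the maxout unit still sends a gradient to whichever piece wins the max.

```python
import tensorflow as tf  # TF 1.x graph API

x = tf.constant([[-2.0]])

# Rectifier unit: max(0, x*w). With x = -2 and w = 1 the pre-activation is negative,
# so the output is the constant 0 and the gradient w.r.t. w vanishes.
w_relu = tf.Variable([[1.0]])
relu_out = tf.nn.relu(tf.matmul(x, w_relu))
grad_relu = tf.gradients(relu_out, w_relu)[0]

# Maxout unit with k = 2 pieces: the output is always one of the learned affine pieces,
# so the winning piece always receives a nonzero gradient.
w_maxout = tf.Variable([[1.0, -1.0]])
maxout_out = tf.reduce_max(tf.matmul(x, w_maxout), axis=1)
grad_maxout = tf.gradients(maxout_out, w_maxout)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_relu))    # [[0.]]       -- saturated, no learning signal
    print(sess.run(grad_maxout))  # [[ 0. -2.]]  -- the active piece can still be adjusted
```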

I guess your next question is about the difference between training with and without dropout. The key is the learning rate. With ordinary SGD we use a small learning rate that decreases the objective function gradually, whereas dropout training works best with a large learning rate. Remember: dropout is a bagging-like procedure, and it requires the gradient to change noticeably between sub-networks; these large updates are what push rectifier units into saturation.

Bibliography

[1] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio. Maxout Networks. arXiv:1302.4389, 2013.

[2] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv:1207.0580, 2012.