Let’s create Magic cards with a DCGAN

Lately I have been interested in Generative Adversarial Networks (GANs) and wanted to create something fun: use a Deep Convolutional Generative Adversarial Network (DCGAN) to generate Magic cards. Basically, this algorithm has two neural networks: the Discriminator, which must tell fake cards from real ones, and the Generator, which must create fake cards to fool the Discriminator. The magic part is that the Generator receives only a random vector as input and turns it into a convincing image.

GAN Architecture

First of all, we need data. Magic: The Gathering has tons of fans, so it was easy to find this site and scrape it to get good-resolution images. Then we must sort the cards a bit because there are some special templates, like the ones for Planeswalkers or old editions.

Most common card template

Old card template

Planeswalker template

After training on the most common Magic: The Gathering card template, I got the following results.

Magic The Gathering cards created by DCGAN

If you look carefully, you can see that some cards have mana costs. We can almost read Creature or Instant on the type line. We also get the power and toughness associated with a creature card, and an unclear symbol for the expansion set.


For the neural networks, I resized all the images to 224x224 px. Square images are easier to work with. Moreover, this size is small enough to fit in my GPU memory and to blur the text. I also scaled the images to the range [-1, 1] because the Generator uses a Tanh activation function on its output.
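To make the scaling concrete, here is a minimal sketch in plain Python. It assumes 8-bit pixel values in [0, 255]. The function names are made up for illustration and this is not the TensorFlow code from the repository.

```python
def to_tanh_range(pixel: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to [-1, 1] (assumed input range)."""
    return pixel / 127.5 - 1.0


def from_tanh_range(value: float) -> int:
    """Map a Tanh output in [-1, 1] back to an 8-bit pixel value."""
    return int(round((value + 1.0) * 127.5))


# One row of example pixel values (illustrative data, not from the dataset).
row = [0, 64, 128, 255]
scaled = [to_tanh_range(p) for p in row]
restored = [from_tanh_range(v) for v in scaled]
```

The inverse mapping is what you would apply to the Generator's output before saving or displaying an image.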


Now, I will show you the main parts of my DCGAN implementation in TensorFlow. We will start with the Discriminator because it is like implementing an image classifier.

Comments on the Discriminator:

  • I use five convolutional layers.
  • For GANs, we need the LeakyReLU activation in all layers to avoid sparse gradients. This is essential for training the Generator.
  • I also use batch normalization.
  • Dropout is needed to keep the Discriminator from overfitting the data.
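The point about sparse gradients can be illustrated with a small sketch in plain Python. The slope alpha = 0.2 is an assumption made for this example, since the post does not state the value used in the Discriminator.

```python
def leaky_relu(z: float, alpha: float = 0.2) -> float:
    """LeakyReLU: identity for positive inputs, small slope alpha otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z: float, alpha: float = 0.2) -> float:
    """Derivative of LeakyReLU with respect to its input."""
    return 1.0 if z > 0 else alpha


def relu_grad(z: float) -> float:
    """Derivative of plain ReLU, shown for comparison."""
    return 1.0 if z > 0 else 0.0


# Illustrative pre-activation values, half of them negative.
pre_activations = [-2.0, -0.5, 0.3, 1.5]
relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]
```

With ReLU, every negative pre-activation contributes a zero gradient. With LeakyReLU, a small gradient still flows back through the Discriminator towards the Generator.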


The goal of the Generator is to produce counterfeit images that are similar to real images. The input \(Z\) is a vector sampled from a normal distribution, and the number of output channels is the depth of the final image: three for an RGB image or one for a grayscale image.

Comments on the Generator:

  • I apply noise before the “deconvolutional” (transposed convolution) layers.
  • Then I apply batch normalisation before the activation function.
  • LeakyReLU activation is used with a small alpha (0.1), except for the last layer, which has a Tanh activation.
  • Dropout helps produce more realistic images.


Next we will define the loss function. We need labels, but they are very simple to define. For the Discriminator, we set 1 (real) for all inputs that come from our dataset and 0 (fake) for those that come from the Generator. For the Generator, the labels are set to 1 because its goal is to fool the Discriminator.

After that, I use the cross entropy between the Discriminator’s predictions and the labels.
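The labelling scheme can be sketched in plain Python as follows. Two assumptions: the Discriminator outputs probabilities rather than logits, and its real and fake losses are simply summed. The prediction values are made up for illustration.

```python
import math
from typing import List


def binary_cross_entropy(predictions: List[float], labels: List[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross entropy between probabilities and target labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predictions)


# Hypothetical Discriminator outputs: probability that the input is real.
d_on_real = [0.9, 0.8, 0.7]   # inputs taken from the dataset
d_on_fake = [0.2, 0.1, 0.4]   # inputs produced by the Generator

# Discriminator: label 1 for real inputs, label 0 for generated inputs.
d_loss = (binary_cross_entropy(d_on_real, [1.0] * 3)
          + binary_cross_entropy(d_on_fake, [0.0] * 3))

# Generator: label 1 on its own fakes, because it wants them judged real.
g_loss = binary_cross_entropy(d_on_fake, [1.0] * 3)
```

Here the Discriminator is doing well, so its loss is small while the Generator's loss is large.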

Comment on the losses:

  • The variable smooth1 (resp. smooth0) simply holds random values that are subtracted from (resp. added to) the labels. This technique is called label smoothing and is used to prevent very large gradient signals.
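A minimal sketch of this kind of label smoothing in plain Python. The uniform distribution and the maximum noise of 0.1 are assumptions for the example; the post does not say how smooth1 and smooth0 are sampled.

```python
import random
from typing import List


def smoothed_labels(batch_size: int, real: bool, rng: random.Random,
                    max_noise: float = 0.1) -> List[float]:
    """Return noisy labels: slightly below 1 for real, slightly above 0 for fake."""
    labels = []
    for _ in range(batch_size):
        noise = rng.uniform(0.0, max_noise)  # assumed noise distribution
        labels.append(1.0 - noise if real else 0.0 + noise)
    return labels


rng = random.Random(0)  # fixed seed so the example is reproducible
real_labels = smoothed_labels(4, real=True, rng=rng)
fake_labels = smoothed_labels(4, real=False, rng=rng)
```

These smoothed values would replace the hard 1 and 0 targets in the cross entropy.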


For training, I use the Adam optimizer. The training loop looks like most machine learning training loops, except that I unbalanced the training between the Discriminator and the Generator. For each batch, I train the Discriminator until its loss is smaller than 2. I observed that if the Discriminator’s loss is too big, the Generator makes no effort to create realistic images. To force the Generator to create more realistic images, for each batch I also train the Generator until its loss is smaller than the Discriminator’s loss.

It is an equilibrium: if one player “wins” too often, the other doesn’t want to play.
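The balancing rule for a single batch can be sketched like this. The two step functions are hypothetical stand-ins for one optimizer step that returns the current loss. The max_steps cap and the "at least one step" behaviour are assumptions added so the loop always terminates; they are not part of the description above.

```python
from typing import Callable, Tuple


def balanced_batch_step(train_discriminator: Callable[[], float],
                        train_generator: Callable[[], float],
                        d_loss_threshold: float = 2.0,
                        max_steps: int = 10) -> Tuple[float, float, int, int]:
    """Train D until its loss is below the threshold, then G until it beats D."""
    d_steps = 1
    d_loss = train_discriminator()
    while d_loss >= d_loss_threshold and d_steps < max_steps:
        d_loss = train_discriminator()
        d_steps += 1

    g_steps = 1
    g_loss = train_generator()
    while g_loss >= d_loss and g_steps < max_steps:
        g_loss = train_generator()
        g_steps += 1
    return d_loss, g_loss, d_steps, g_steps


def make_toy_step(start: float, decay: float) -> Callable[[], float]:
    """Toy stand-in for a training step: the loss shrinks by `decay` per call."""
    state = {"loss": start}

    def step() -> float:
        state["loss"] *= decay
        return state["loss"]

    return step


d_loss, g_loss, d_steps, g_steps = balanced_batch_step(
    make_toy_step(start=4.0, decay=0.5),
    make_toy_step(start=3.0, decay=0.5),
)
```

In a real run, each step function would execute the corresponding Adam update on the current batch and return the loss value.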

Comments on the training:

  • To generate the images above, I trained for 100 epochs, then continued for 50 more.
  • Beyond 150 epochs, the system loses its stability.

Additional notes to improve the training:

  • Some hacks recommend using an average pooling layer in the Discriminator instead of a fully connected layer.
  • Use a bigger learning rate for the Discriminator.
  • Figuring out the correct training parameters is hard. Taking inspiration from the latest papers can be a very good start.
  • Vary the dropout rate within [0.3, 0.5].
  • Vary the momentum within [0.2, 0.5].

To go further

The complete implementation can be found on my GitHub.

This post is concise because there are tons of tutorials and resources out there to get a better understanding of GANs or to find practical hacks.


Magic The Gathering cards: https://scryfall.com/

GAN Architecture picture: https://twitter.com/ch402/status/793911806494261248

Quick guide to maxout networks

Dear reader,
You wanted to know more about the maxout network and how to code one in TensorFlow. So I will tell you what I learned from reading the paper by Goodfellow et al. [1].

A simple definition of a maxout network is a feedforward neural network that uses maxout units as its activation function.
The main difference from other activation functions like Tanh or ReLU is that this unit can learn its own nonlinear function. In fact, maxout builds on the Stone-Weierstrass approximation theorem, from which it follows that any continuous function on a compact domain can be approximated arbitrarily well by a piecewise linear function. For example, if we use three straight lines to approximate the quadratic function, we obtain this figure:

3 lines approximating a quadratic function defined on [-30, 30]

As you can see, a single maxout unit gives a convex piecewise linear approximation. With more linear pieces and more input dimensions, it can approximate more complex convex functions.
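The three-line approximation of the quadratic can be reproduced by hand in plain Python. In this sketch the three lines are fixed to the tangents of \(x^2\) at -20, 0 and 20, which is an assumption for illustration: in a trained maxout unit the slopes and intercepts are learned.

```python
from typing import List, Tuple


def tangent_line(a: float) -> Tuple[float, float]:
    """Return (slope, intercept) of the tangent to f(x) = x**2 at x = a."""
    return 2.0 * a, -a * a


# Three hand-picked lines (assumed tangent points, not learned parameters).
LINES: List[Tuple[float, float]] = [tangent_line(a) for a in (-20.0, 0.0, 20.0)]


def maxout_unit(x: float) -> float:
    """A single maxout unit: the maximum over its linear pieces."""
    return max(slope * x + intercept for slope, intercept in LINES)


xs = [float(x) for x in range(-30, 31, 10)]
approx = [maxout_unit(x) for x in xs]
exact = [x * x for x in xs]
```

Because \(x^2\) is convex, the maximum of its tangents never exceeds the true value and touches it exactly at the tangent points.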

Now let’s see the maxout architecture.
Given an input \(x \in R^d\) and an output \(h(x) \in R^m\), a maxout layer implements the function

\(h_i(x) = \max_{j \in [1,k]} z_{ij}\) where \( z_{ij} = x^TW_{:ij} + b_{ij} \)

\(W \in R^{d \times m \times k}\) and \(b \in R^{m \times k}\) are the learnable parameters, and \(k\) is the number of linear pieces we want to use.

So each output is simply the maximum over its \(k\) linear functions. We can merge all the linear functions into a single matrix multiplication by defining \(W\) with shape \([d, m \cdot k]\).
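As a framework-free sketch of the formula (not the TensorFlow graph from this post), the layer can be written in plain Python with the merged weight matrix. The column layout is an assumption: column i*k + j holds piece j of output unit i. The numbers are made up for illustration.

```python
from typing import List


def maxout_layer(x: List[float], W: List[List[float]], b: List[float],
                 m: int, k: int) -> List[float]:
    """Maxout forward pass.

    x: input of length d.
    W: d rows by m*k columns (all linear pieces merged into one matrix).
    b: biases of length m*k.
    Returns h of length m, where h[i] is the max over the k pieces of unit i.
    """
    d = len(x)
    # One matrix-vector product computes every linear piece z_ij at once.
    z = [sum(x[r] * W[r][c] for r in range(d)) + b[c] for c in range(m * k)]
    # Take the maximum inside each group of k consecutive columns.
    return [max(z[i * k:(i + 1) * k]) for i in range(m)]


# Example with d = 2 inputs, m = 2 outputs and k = 2 pieces per output.
x = [1.0, 2.0]
W = [[1.0, -1.0, 0.0, 2.0],
     [0.0, 2.0, 1.0, -1.0]]
b = [0.0, 0.0, 0.5, 0.0]
h = maxout_layer(x, W, b, m=2, k=2)
```

In a tensor library, the same thing is usually done with one matrix multiplication, a reshape to \([m, k]\) and a max over the last axis.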

Below you can find my implementation in TensorFlow to get a simple approximation of the quadratic function. It’s just the computational graph. The full script and the PyTorch version can be found here.

The next thing to look at is the relation between maxout and dropout. As a reminder, dropout simulates a bag of neural networks by dropping a random subset of neurons from the original network at each training iteration. Thus a different network is trained at each iteration, and all the networks share the same weights.

At test time, we would need to average the predictions of all these networks but, unlike a bagged ensemble, we have just one model, and it is not feasible to explicitly average all the variants. So the trick is to divide the weights by 2 when running the full model. (The justification is in [2] but I didn’t read it.)

However, this scaling method is not accurate for models using nonlinear activation functions with significant curvature, like Tanh. Networks built from linear operations and max(.) are best suited to exploit dropout’s approximate model averaging.
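This claim can be checked on a toy unit in plain Python. The sketch assumes a keep probability of 0.5 and takes the arithmetic average over every possible dropout mask of a three-input unit. The inputs and weights are made up for illustration.

```python
import itertools
import math
from typing import Callable, List, Sequence


def identity(z: float) -> float:
    """Linear (identity) activation."""
    return z


def unit_output(x: List[float], w: List[float], mask: Sequence[int],
                activation: Callable[[float], float]) -> float:
    """Output of one unit when the inputs are masked by a dropout mask."""
    return activation(sum(xi * wi * mi for xi, wi, mi in zip(x, w, mask)))


def average_over_masks(x: List[float], w: List[float],
                       activation: Callable[[float], float]) -> float:
    """Explicit average over all 2**d equally likely dropout masks."""
    masks = list(itertools.product([0, 1], repeat=len(x)))
    return sum(unit_output(x, w, m, activation) for m in masks) / len(masks)


def halved_weights(x: List[float], w: List[float],
                   activation: Callable[[float], float]) -> float:
    """The shortcut: run the full unit once with the weights divided by 2."""
    return activation(sum(xi * (wi / 2.0) for xi, wi in zip(x, w)))


x = [1.0, -2.0, 3.0]
w = [0.5, 0.25, 1.0]

linear_gap = abs(average_over_masks(x, w, identity) - halved_weights(x, w, identity))
tanh_gap = abs(average_over_masks(x, w, math.tanh) - halved_weights(x, w, math.tanh))
```

For the linear unit the shortcut matches the explicit average up to floating-point error, while for Tanh the two values clearly differ.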

Now, I suppose you would like to know the advantage of maxout over ReLU. The main difference is that rectifier units include a zero in the max(.), and this zero affects the optimization process. Without dropout, there is no problem: the rectifier units saturate less than 5% of the time. With dropout training, the saturation rate increases by roughly 55 percentage points. The zero in \(\max(0,z)\) blocks the gradient, and it is difficult to make the unit active again because zero is a constant. For a maxout unit, a zero output is a function of the parameters and can be adjusted.
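The difference in gradient flow can be seen on one-dimensional units in plain Python. This is a hand-written sketch with made-up parameters, and the function names are only for illustration.

```python
from typing import List, Tuple


def relu_grad_wrt_w(x: float, w: float, b: float) -> float:
    """Gradient of max(0, w*x + b) with respect to w."""
    return x if w * x + b > 0 else 0.0


def maxout_grads_wrt_w(x: float, pieces: List[Tuple[float, float]]) -> List[float]:
    """Gradient of max_j (w_j*x + b_j) with respect to each w_j.

    Only the winning piece receives a gradient, but there is always a winner.
    """
    values = [w * x + b for w, b in pieces]
    winner = values.index(max(values))
    return [x if j == winner else 0.0 for j in range(len(pieces))]


x = 2.0

# Saturated rectifier: the pre-activation is negative, so the constant 0 wins.
relu_grad = relu_grad_wrt_w(x, w=-1.0, b=0.0)

# Maxout unit whose output is also 0, but produced by a learnable piece.
maxout_grads = maxout_grads_wrt_w(x, pieces=[(0.0, 0.0), (-1.0, 0.0)])
```

Both units output zero for this input, yet only the maxout unit passes a gradient to one of its parameters, so training can still move it.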

I guess your next question is about the difference between training with and without dropout. The key is the learning rate. With ordinary SGD, we use a small learning rate to decrease the objective function gradually, whereas dropout training uses a large learning rate. Remember: dropout is a bagging-like method, and it requires the gradient to change noticeably, which is what causes the rectifier units to saturate.


[1] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio. Maxout Networks. arXiv:1302.4389, 2013

[2] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012