## Let’s create Magic cards with a DCGAN

These days, I was interested in Generative Adversarial Network (GAN) and wanted to create something fun: use a Deep Convolutional Generative Adversarial Network (DCGAN) to generate Magic cards. Basically, in this algorithm we have two neural networks: the Discriminator which must distinguish the fake and real cards; and the Generator which must create fake cards to fool the Discriminator. The magic part in this algorithm is that we give a random vector to Generator to create a persuasive image.

First of all, we will need data. There are tons of fans of Magic The Gathering so it was easy to find this site and scrape it to get images of good resolution. Then we must sort the cards a bit because there are some special template like the ones for the Planeswalkers or old edition.

 Most common card template Old card template Planeswalker template

After training on the most common cards template of Magic The Gathering, I get the following results.

If you look carefully, you can see that some cards have mana costs. We can almost read Creature or Instant on the type line. We also have the power and toughness associated to a creature card and an unclear symbol for the expansion sets.

### Preprocessing

For the neural networks, I resized all the images to 224x224px. It is easier to work with square images. Moreover, it’s small enough to fit in my GPU and to blur the text. I also scale the images to the range [-1, 1] because of the Tanh activation function used for the Generator.

### Discriminator

Now, I will show you the main part of my DCGAN implementation in Tensorflow. We will start with the Discriminator implementation because it is like implementing an image classifier.

• I use five convolutional layers.
• For GAN, we need the LeakyReLU activation for all layers to avoid sparse gradient. It is primordial to train the Generator.
• I also use batch normalization.
• Dropout is needed to avoid the Discriminator to overfit the data.

### Generator

The goal of the Generator is to produce counterfeit images that is similar to real images. The input $$Z$$ will be a vector generated from a normal distribution and the output channel will be the depth of the final image : three for RGB image or one for grayscale image.

• I am applying noise before the “deconvolutional” layers
• Then I am applying batch normalisation before the activation function
• LeakyReLU activation is used with a tiny alpha (0.1) except for the last layer which have a Tanh activation
• Dropout allow better realistic image.

### Losses

Next we will define the loss function. We need labels but these labels are very simple to define. For the Discriminator, we will set 1 (real) for all inputs which come from our dataset and 0 (fake) for those which come from the Generator. For the Generator, the labels is set to 1 because its goal is to fool the Discriminator.

After that, I am using the cross entropy between the Discriminator’s predictions and the labels.

Comment on the losses:

• The variable smooth1 (resp. smooth0) generates just random values which will be subtracted (resp. added) to the labels. This technique is called Label Smoothing and is used to prevent the very large gradient signal.

### Training

For the training, we use Adam optimizer. The training loop is almost like all the machine learning training part except I unbalanced the training between the Discriminator and the Generator. For each batches I am training the Discriminator until its smaller than 2. I observed than if the Discriminator’s loss is too big, then the Generator doesn’t make effort to create realistic image. To force the Generator to create better realistic images, for each batches I am also training the Generator until his loss is smaller than the Discriminator’s loss.

It is an equilibrium : if one player “win” too often, the other doesn’t want to play.

• To generate the above images I start with 100 epochs then continue for 50 epochs.
• Beyond 150 epochs, the system loose its stability.

Additional notes to improve the training:

• Some hacks recommend to use an average pooling layer for the Discriminator instead of fully connected layer.
• Make the learning rate bigger for the discriminator.
• Figuring out the correct training parameters is harsh. Inspiration from the latest papers can be a very good start.
• Move dropout between [0.3, 0.5]
• Move Momentum between [0.2, 0.5]

### To go further

The complete implementation can be found on my github.

This post is concise because there are tons of tutorials and resources to get a better understanding of GANs or find practical hacks :

##### Sources:

Magic The Gathering cards : https://scryfall.com/

## Thesis proposal on video game playing

This year, I applied for the IGGI programme and submitted a PhD proposal titled Transfer in Deep Reinforcement Learning for General Video Game Playing. Although I was in the short-list with other good candidates, I didn’t get a scholarship to work on my project.

Nevertheless, my work can still inspire others. So for those who are curious, you can find it here.

## Essential math definitions in reinforcement learning

Markov decision process, value functions, Bellman equation…
This post will present all we need to understand the mathematical foundations of reinforcement learning. Don’t worry, it is still easy!

As we have seen, three signals are passing back and forth between an agent and its environment :

• the actions which represent the choices made by the agent
• the states which refer to whatever available information for making decisions
• the rewards which define the agent’s goal

You also have two types of tasks: episodic one and continuing one. An episodic task merely is characterized by the presence of a terminal state, such as a game over. When the agent is in a terminal state, it is the end of an episode; then we must start the agent-environment interaction again to get a new episode. Continuing tasks are tasks that go on without limit.

So far, we know the agent’s goal is to maximize the reward it receives in the long run. Formally what it means?

In an episodic task, we know that the interaction will stop at a time $$T$$. Then we can define the expected return as the sum of future reward: $$G_t = R_{t+1}+ R_{t+2}+\dots+R_{T}$$. However, in continuing task, $$T=\infty$$; so the previously expected return is problematic. The solution implies an additional concept called discounting. We will discount distant rewards until zero. Thus, the agent will maximize the expected discounted return.

$$G_t = R_{t+1}+ \gamma R_{t+2}+\gamma^2 R_{t+3}+\dots = \sum_{k=0}^{\infty} \gamma^kR_{t+k+1}$$

where $$\gamma$$, the discount rate, is a parameter and $$0 \leq \gamma \leq 1$$. Discounting deals with the “eyesight” of the agent. If $$\gamma$$ is close to 0, it will be only concerned by the immediate reward, whereas for a gamma close to 1, the agent becomes farsighted!

As you can expect, we will use the notion later. It is just one piece of the math puzzle. Here comes another called Markov decision process.

First of all, we will consider a recycling robot task to illustrate.

Its goal is to search paper balls in an office. At each time step, it must decide whether it should search paper balls on the floor, wait for someone to bring it a paper ball or go on the charging base. However, it must also consider the state of the battery.

The following transition graph summarizes the dynamic of the problem.

Now how to read it? The agent starts in a state $$s$$ and takes an action $$a$$. So it follows one of the green lines from state node $$s$$ to action state $$(s,a)$$. Then the environment reacts with a transition to a next state’s node via one of the orange lines leaving action node $$(s,a)$$.

In reinforcement learning, we suppose that our environment has the Markov property. It means that the future is independent of all the previous events and depends only on the current state. Formally, a state $$S_t$$ is Markov if and only if $$Pr\{S_{t+1}|S_t\} = Pr\{S_{t+1}|S_0,S_1,\dots, S_t\}$$.

Most of the RL tasks are defined as finite Markov decision process, and because of that, we have two other mathematical definitions:

• Given a state $$s$$ and an action $$a$$, the transition probability of each possible next state is $$p(s’|s,a)=Pr\{S_{t+1}=s’| S_t=s, A_t=a\}$$
• Similarly, given a state $$s$$ and an action $$a$$ and a next state $$s’$$, the expected value of the next reward is $$r(s,a,s’)=\mathbb{E}[ R_{t+1} | S_t=s, A_t=a, S_{t+1}=s’]$$

Presently, I labeled the orange arrows with the transition probabilities and the expected reward for each transaction.

To make the point, the agent must maximize the return on the long run, can expect a reward for being in the next state, and we can have the transition probabilities from the state-action node. Ok! But how the agent know what a good state is?
For example, if I play to golf and my ball is in the sand, it is an undesirable situation because I will need a lot of moves to return the ball to the grass.

In RL, we use two value functions to express this concept. We can decompose them in terms of expected return $$G_t$$ with respect to the policy $$\pi$$ (function that maps state into action).

• The value of a state $$s$$, $$v_{\pi}(s)$$, is the expected return when starting in $$s$$ and following the policy thereafter: $$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t | S_t = s]$$. Note that for a terminal state, the value is zero.
• We define similarly the action-value function as $$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t | S_t = s, A_t=a]$$

These two functions can be estimated from experience because of the Bellman equation! By using the previous definitions, the Bellman equation for $$v_{\pi}$$ expresses the relationship between the value of a state and the values of its successor states.

$$\displaystyle v_{\pi}(s) = \sum_a \pi(a|s)\sum_{s’}p(s’|s,a)[r(s,a,s’) + \gamma v_{\pi}(s’)$$

In fact, the Bellman equation averages all the future possibilities and weights each by its transition probability. This formula is useful because it is the basis of a several ways to compute, approximate and learn $$v_{\pi}$$.

To conclude, we have seen all the fundamental mathematical definitions to move forward. To recall, we defined the expected return $$G_t$$, the transition probabilities $$p(s’|s,a)$$, the expected value of the next reward $$r(s,a,s’)$$, the value functions $$v_{\pi}$$ and $$q_{\pi}$$ and introduced the Bellman equation.

See you for the next post !

###### Sources

[1] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction.

## An elementary post about reinforcement learning

Today, I wanted to present you the reinforcement learning. This framework is meant to resolve goal-oriented problem by interaction with an environment. Thus, the learner must discover which actions will allow it to achieve a goal.

I will introduce you a simple example where you will be the learner. Have you ever tried this hidden game in the Chrome browser?

If not, turn off your Wi-Fi, try a web page and when the dinosaur appears, press the spacebar. It is a platform game, and you will immediately try to get your best score. For this, you will observe the dinosaur running, and press the spacebar to avoid the cactus. If you do so, your score continues to increase. It is your reward for choosing the right action at the right time.

As you see, if we replace you by an artificial intelligence, there is no supervisor in reinforcement learning! Instead, the agent (our A.I.) has the reward signal from what it can judge the progress towards its goal. In this game, we have the reward quickly, but in many situations, this feedback is delayed (think about a puzzle game). So, the time matters in two cases: to predict the reward the agent will receive and to decide the right action at the right time. It implies to take into account the delayed consequences of actions.

So much to digest! I will summarize with this schema :

At time t, the agent observes the environment which at a state $$s_t$$. It gets an observation $$o_t$$. Then it executes an action $$a_t$$ in its environment and gets a reward $$r_t$$. This reward can be negative. Its role is to say what the good and bad events are for the agent. But, the agent rarely reaches its goal the first try. Hence, it will use its experience to adjust its actions during the next try. We let it do this trial-and-error process until it comes to its goal.

You see! Easy! Now I will go one step further into a reinforcement learning system. In the following, we suppose that the agent can totally observe its environment; thus $$o_t=s_t$$. Besides the elements aforementioned, there are three main components: the policy, the value function and the model.

• The policy defines how the agent behaves. It maps each state to action. It is the core of a reinforcement learning agent. I will write more about it in another post.
• The value function defines how good each perceived state is. It estimates the total amount of rewards the agent can expect in the future. Note that the agent will reevaluate this value all its lifetime.
• The model is optional and is used for planning. It will predict the next state of the environment and the next immediate reward. However, we can also design the model to decide a course of actions by considering possible future situations.

I hope that all this is clear! You will need it for the following! With these three components, we can define several kinds of agents:

• The value-based agent uses only the value function for learning. The policy is implicit; it chooses the action that gives the more estimated reward.
• The policy-based agent uses only the policy component. You can think about selecting a direction to leave a maze as an example of behavior.
• The actor-critic agent uses both policy and value function.

The type mentioned above is model-free; no model is involved. Model-based agent work with the three components.

At this moment, you have all the knowledge required for being introduced to the exploitation-exploration dilemma! Imagine you are out in the city and you are hungry. You have the choice of several restaurants. If you choose your favorite restaurant, you are using your own experience. You know the food is delicious at this restaurant, and you will get what you expect. Yet, if you want something new, you will try a new restaurant. Maybe this new restaurant will become your new favorite! As you have probably guessed, the first situation is called exploitation because it uses known information to maximize reward. The second one is called exploration because it collects experience and knowledge of the surroundings.

In summary, all reinforcement learning agents have explicit goals, can sense all or partially their environments, and can choose actions to influence their world. It emphasis interaction without relaying an exemplary supervision and this why it is a fascinating paradigm.

See you for the next post !

###### Sources

[1] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction.