This time I took an interest in the MineRL competition on "Sample Efficient Reinforcement Learning using Human Priors". It is a competition based on Minecraft where the agent has to obtain a diamond. As a reminder, to get one, the agent has to craft several items, such as sticks and pickaxes, and be able to navigate in a mine. Two key challenges here are long-term planning and efficient exploration: we don't want the agent to try every crafting recipe!

To let participants experiment gradually, the organizers prepared sub-task environments, such as navigating to a destination or chopping several tree blocks, and provided a dataset of human plays for each sub-task. After looking a bit at the data, I found that each human demonstration consists of a **video**, a **JSON** file, and an **NPZ** (NumPy zipped archive) file.

Before training an agent, it is worth exploring the dataset, since we will rely on it heavily. Let's see what we got!

Drawing to illustrate the tasks available in the MineRL competition

There are five kinds of tasks: Navigate, Navigate Extreme (which spawns the agent in an Extreme Hills biome), Obtain Pickaxe, Obtain Diamond, and Treechop.

All except Treechop come in two variants: Sparse and Dense. In the Dense variant, the agent receives a reward at every tick based on how much closer it got to the target.

Note: The environment spawns the agent in a survival map.

We have 1525 videos split among these environments.

To explore the video files, I started by looking at the metadata, which I extracted with the hachoir library. Below you can see the distribution of resolutions and durations across all the videos. Most videos have a 64x64 resolution, which is convenient for a neural network, but some are higher. For the 256x192 videos, I will decide later whether to resize them, discard them, or do something else.

Distributions of the metadata found for all the videos.

hachoir can extract more fields, such as the frame rate, but in this case it wasn't available in the metadata.

Now, I need to know a bit more about the content of the videos. For that, I **extracted all the frames** of the videos to get back to an image-analysis setting. Then, to visualize the distribution, I made an **embedding projection**. More precisely, I sampled 1024 images, retrieved the output of a pre-trained ResNet18, and projected it onto 2D using UMAP (Uniform Manifold Approximation and Projection). We get the following result for the video demonstrations of the ObtainDiamond environment.

UMAP projection of sampled frames of the ObtainDiamond environment.

That's not bad, but the pictures overlap, and we cannot inspect the hidden ones. After a quick search, I found out how to fit the projection onto a square grid using the **Jonker-Volgenant algorithm** (a linear assignment problem solver).

UMAP projection on a grid layout.
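To sketch the idea, here is a minimal example (my own, not the competition code) that uses `scipy.optimize.linear_sum_assignment`, a Jonker-Volgenant-style linear assignment solver, to snap 2D embedding points onto the cells of a square grid:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def embed_to_grid(points, side):
    """Assign each 2D embedding point to a unique cell of a side x side grid
    by solving a linear assignment problem on squared distances."""
    xs, ys = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (side*side, 2) cell centres
    # Normalize the embedding into the unit square.
    pts = (points - points.min(axis=0)) / (np.ptp(points, axis=0) + 1e-9)
    # cost[i, j] = squared distance from point i to cell j.
    cost = ((pts[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    _, cols = linear_sum_assignment(cost)
    return cols                                         # cols[i] = cell of point i

rng = np.random.default_rng(0)
cells = embed_to_grid(rng.random((64, 2)), side=8)      # a permutation of 0..63
```

Each image is then drawn at the grid cell it was assigned to, which removes the overlap while roughly preserving the neighborhood structure.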

There is no light in a mine, and the visualization reminds me of this: part of the dataset is really too dark to see anything. The agent will have to craft torches and place them to spot a diamond block in the dark!

Plotting the content of the JSON files is interesting because they record whether the human succeeded at the task or not (failure can happen because of a bug or otherwise).

Histogram of the boolean variable “success” on the entire dataset.

As we can see, some data can be left out: around 20% of the demonstrations didn't accomplish the task. I will filter them out to avoid misleading demonstrations.
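The cleaning step itself is simple; here is a minimal sketch with made-up records (the real dataset ships one JSON file per demonstration, with, among other fields, a boolean "success" flag):

```python
import json

# Made-up metadata lines standing in for the per-demonstration JSON files.
raw = [
    '{"demo": "run_a", "success": true}',
    '{"demo": "run_b", "success": false}',
    '{"demo": "run_c", "success": true}',
]
records = [json.loads(line) for line in raw]

# Keep only demonstrations where the human actually finished the task.
kept = [r["demo"] for r in records if r["success"]]
# kept -> ["run_a", "run_c"]
```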

An NPZ file is a NumPy archive containing several NumPy arrays. Here, it holds sequence data: the action, inventory, and reward saved for each frame.

I plotted some charts for each, but they are more **informative combined with the associated frame**. I used the Holoviews library with Bokeh to navigate frame by frame. The main advantage of this library is that I can add a <div> HTML block and put in anything that can be copy-pasted later.

Observation-action pair sequence visualization of a human play

This visualization doesn't give an overall view of the dataset, but it will allow me to inspect a sequence of action-observation pairs more carefully if needed.

The state space depends on the environment considered. For example, in the Navigate environment, a compass angle is returned; in the Obtain Pickaxe environment, the object held in the right hand appears in the observation.

Regarding the action space, we find the directional movements along with jump, sprint, sneak, the camera, and so on. All these controls are discrete except the camera, which expects continuous values.

Most of these visualizations can be reused for other machine learning projects, so I gathered them into a repository I called Joligraf.

In this competition, the problem is posed as an imitation learning problem, but as a baseline I will first pose it as a supervised learning problem. Since we have data, I will start by finding a good state representation or augmenting it.

We will see more in imitation learning and what I found soon! To be continued…

Found an article interesting and/or useful? Please consider following me to be alerted when I post new content!

You can also support me with a retweet!

Let's start with how the idea of distributional reinforcement learning can be illustrated with the Monopoly game. In the figure below, I show you a piece of the board where your **token is on Luigi**. Somebody else owns the "Mario" and "Luigi" squares and has put hotels on these properties.

Two possible things can happen: you land on Mario, and you have to pay $2,000 in rent, or you pass the GO space and earn some extra cash.

After rolling the dice, you can end up paying a rent or earning some cash.

In terms of reinforcement learning, we'll look at the expected reward in this situation.

$$\begin{split}
\mathbb{E}[R(x)] & = \frac{1}{36} \times (-2000) + \frac{35}{36} \times 200 \\
& \approx 139
\end{split}$$

The odds of landing on the Mario space are pretty slim, so we end up with a positive expected gain of about $139.
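The computation above fits in a few lines:

```python
# Expected immediate reward of the Monopoly situation:
# a 1/36 chance of landing on Mario (-$2000 rent),
# otherwise passing GO for +$200.
p_mario = 1 / 36
expected = p_mario * (-2000) + (1 - p_mario) * 200
# expected = (-2000 + 35 * 200) / 36 = 5000 / 36 ≈ 138.9
```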

Of course, the RL problem is more complex than just predicting the immediate reward. We are going to look at future interactions, then discount and sum the rewards as we encounter them.

$$R(x_0) + \gamma R(x_1) + \gamma^2 R(x_2) + \cdots$$

In the traditional approach, we compute the expected return for a state \(x\) as the expected sum of discounted rewards accumulated along the way.

$$V^\pi(x) = \mathbb{E} \left[\sum_{t=0}^{\infty} \gamma^t R(x_t) \: \middle| \: x_0 = x\right]$$

However, if we think back to our Monopoly situation above, the expected return of $139 will never actually occur. There is some randomness in these returns, and by modeling only the expectation, we hide this intrinsic randomness.

The work of Bellemare, Dabney, et al. gets rid of the expectation in the Bellman equation to work with the full distribution of the return, which means defining a random variable \(Z\). The distributional Bellman equation is almost the same, except that it involves three random variables: the reward \(R\), the next state \(X'\), and its random return \(Z(X')\).

$$Z^\pi(x) \stackrel{D}{=} R(x) + \gamma Z^\pi (X') \quad \quad X' \sim P^\pi( \cdot \mid x)$$

Three random variables: the reward \(R\), the next state (here \(d\) or \(e\)) and the return distribution of each of the next state.


Since we will work with distributions, we have to choose how to represent them. Since 2017, Bellemare, Dabney, et al. have released three papers on three different distribution representations: they started with a categorical representation, then used the inverse cumulative distribution function at specific quantiles, and went further by making the quantile fractions themselves a parametric input of the model.

Categorical distribution representation (left) and cumulative distribution function (right)

Let's look quickly at categorical distributional deep Q-learning. Here, we define a fixed support for our distribution \(Z\), such as \(\{z_0, z_1, \cdots, z_{N-1}\}\) with \(N\) atoms. In DQN, we computed a value for each action; it is much the same here, except we compute a discrete distribution over this support for each action. However, this means we have to choose a probability distance for the loss. In the paper, they use the KL divergence, since the support is fixed. They obtained their best results on Atari using 51 atoms, hence the name of the algorithm: C51.

Categorical DQN, the support is fixed and we learn a distribution for each action.

I was first intrigued by quantile regression when I saw the name on the OpenAI Spinning Up website, and it is what motivated me to write this post. Here we do the opposite of categorical DQN: the support is variable, and we fix the probabilities. More precisely, we cut the distribution into \(N\) portions of equal probability mass, or quantiles, and we define the quantile target \(\hat\tau\) as the median of each quantile \(\tau\). In the illustration below, I divided the distribution into four slices and placed each \(\hat\tau\) at the median of its slice.

Cumulative distribution function (CDF) of the predicted and target distribution

We train the neural network using the Huber quantile regression loss, defined as follows:

$$\rho_\tau(\delta_t) = | \tau - \mathbb{I}\{\delta_t< 0\}| \cdot \underbrace{\mathcal{L}(\delta_t)}_{\text{Huber loss}}$$

$$\mathcal{L}(\delta_t) = \begin{cases} \frac{1}{2}\delta_t^2 &\quad \text{if } |\delta_t| < 1 \\|\delta_t| - \frac{1}{2} &\quad \text{otherwise } \end{cases}$$

$$ \delta_t = r + \gamma Z'(x_{t+1}, a^*) - Z(x_t,a_t) $$

The Huber loss is a smooth L1 loss. You can spot the difference in the figure below.

L1 loss (orange) and the Huber loss (blue)

When I started implementing quantile regression DQN, I chose to test it on the Lunar Lander environment, available in OpenAI Gym. The goal is simple, and the state representation is a vector containing the velocity of the lander, its angle and angular velocity, whether its legs are touching the ground, etc.

The Lunar Lander environment

One issue I faced is that, in the loss, I was taking a simple element-wise difference between the output of my network and that of the target network. Let's say that \(\theta_0, \dots, \theta_n\) are the atoms returned by the network for an action \(a\) and \(\theta'_0, \dots, \theta'_n\) those of the target network.

\(\theta'_i - \theta_i\) doesn't work because there is no guarantee that the outputs are ordered, i.e., that index \(i\) corresponds to the same quantile in both networks; this is especially true at the start of training. Instead, we must compare each atom against every other: for each \(j\) from 0 to \(N\), compute \(\theta'_j - \theta_i\).

During training, the atoms are moved to the right place, guided by the \(\tau\) weighting present in the loss.
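As a sketch (mine, not the paper's reference code), the pairwise quantile Huber loss described above can be written in NumPy as:

```python
import numpy as np

def quantile_huber_loss(theta, theta_target, kappa=1.0):
    """Pairwise quantile Huber loss between predicted atoms `theta`
    and target atoms `theta_target` (both of shape (N,))."""
    n = len(theta)
    # Quantile midpoints: tau_hat_i = (2i + 1) / (2N).
    tau_hat = (np.arange(n) + 0.5) / n
    # Compare every target atom against every predicted atom.
    delta = theta_target[None, :] - theta[:, None]     # delta[i, j] = theta'_j - theta_i
    huber = np.where(np.abs(delta) <= kappa,
                     0.5 * delta ** 2,
                     kappa * (np.abs(delta) - 0.5 * kappa))
    # Asymmetric weight |tau - 1{delta < 0}| for each predicted quantile.
    weight = np.abs(tau_hat[:, None] - (delta < 0))
    return (weight * huber).mean()

zero = quantile_huber_loss(np.zeros(4), np.zeros(4))   # identical atoms -> 0
pos = quantile_huber_loss(np.array([0.0, 1.0]), np.array([0.0, 2.0]))
```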

Learned distribution for each action in two different situations.

In the figure above, you can see the CDF for each action in two situations. I used five points to describe the distribution over returns. Some distributions have a very tight support on the return space, which could mean the agent is more confident about the possible outcomes.

After seeing the results on Lunar Lander, I thought my implementation was good enough to use on Atari. Papers benchmark on this platform, so it would be an excellent way to check whether I was right. At first, I thought I would have only a few modifications to make to test it on Pong or Space Invaders. It didn't work out as I expected. In fact, after hours of debugging, I learned about the wrappers RL researchers apply on top of the Atari games when experimenting on them.

For example:

* **EpisodicLifeEnv**: converts individual lives in the game into separate episodes.
* **FireResetEnv**: presses FIRE at the beginning of the game.
* **ProcessFrame84**: converts the input observation to 84x84 grayscale.
* **ClipRewardEnv**: clips the reward to -1, 0, or 1, because the scale of scores varies a lot among games.

I also ran into memory issues, mostly because I was storing much more data in the replay buffer. In Lunar Lander, the state vector has eight values, which I stack with several past states to augment the state representation. Moving to Atari games, the observations are now images, which take far more space, especially when I expand the state representation with past frames.

The hyperparameters had to change too, because the setting was completely different.

After all these issues, I decided to start over on a better base: this time, I used a reinforcement learning toolkit named ptan, to focus more on the quantile regression concept and less on core implementation. I also started to use W&B to save experiment progress. I used the Pong environment to try out this new implementation. Finally, I didn't want to wait too long for a run to finish, so I took an interest in DQN extensions that speed up learning.

Screenshot of my W&B dashboard with the summary plots and the hyperparameters

For the first speed-up, I followed the first suggestion of this blog. The trick is to set a train frequency: a batch is sampled every \(N\) frames instead of every frame. However, I had to tweak the hyperparameters carefully: to get the same effect as training at every frame, I multiplied the batch size, the learning rate, and the other optimizer hyperparameters by the train frequency I had set.

Then I used an extension called N-step, where we look at the rewards over \(N\) consecutive steps. In classical DQN, we update with only the immediate reward; here we use the rewards from two or three steps ahead.

$$\begin{split} Q(s_t, a_t) & = r_t + \gamma \max_a Q(s_{t+1}, a) \\ & = r_t + \gamma \max_a \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+2}, a') \right] \\ & = r_t + \gamma r_{t+1} + \gamma^2 \max_{a'} Q(s_{t+2}, a') \end{split}$$
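An n-step return can be computed by folding the rewards backward from a bootstrap value; a small sketch (names are mine):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step return: r_t + g*r_{t+1} + ... + g^n * max_a Q(s_{t+n}, a)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Two rewards, then a bootstrap value of 10, with gamma = 0.5:
# 1 + 0.5 * (2 + 0.5 * 10) = 4.5
g = n_step_return([1.0, 2.0], bootstrap_value=10.0, gamma=0.5)
```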

The last extension I used is Double DQN. The change is that we choose the action for the next state using the trained network, but still take its value from the target network.

$$ \begin{split} Q(s_t, a_t) & = r_t + \gamma \max_a Q'(s_{t+1}, a) \\ & = r_t + \gamma \, Q'(s_{t+1}, \mathrm{arg}\max_a Q(s_{t+1}, a)) \\ \end{split}$$
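A minimal NumPy sketch of the Double DQN target (function and variable names are my own):

```python
import numpy as np

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: select the next action with the online network,
    but evaluate it with the target network."""
    if done:
        return reward
    a_star = int(np.argmax(q_online_next))       # action chosen by the online net
    return reward + gamma * q_target_next[a_star]

# Online net prefers action 1, which the target net values at 2.0:
t = double_dqn_target(1.0, np.array([0.1, 0.9]), np.array([5.0, 2.0]), gamma=0.5)
# t = 1.0 + 0.5 * 2.0 = 2.0
```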

Speed-up methods are good, but the hyperparameters have a big impact on convergence time. For those, I used the ones from this paper, which writes down all the hyperparameters for DQN, Rainbow, C51, and others in its appendix.

Now that everything is set, let's visualize the Freeway environment. Freeway is a game where the agent controls a chicken that wants to cross a highway. The chicken can only move up or down, and we get a point each time it reaches the other side of the road. You can see the network in action in the following video.

The code can be found on my GitHub if you are interested.

I didn't implement this one, but the idea extends the quantile regression network described above. In the previous algorithm, \(\tau\) was fixed, whereas the implicit quantile network (IQN) takes \(\tau\) as an input to the network, sampled from a uniform distribution between 0 and 1. We also come back to an output vector of the action-space size, as in DQN, instead of a matrix as in QR-DQN. Both during training and when selecting an action, we sample \(\tau\) several times to recover the shape of the state-action return distribution.

IQN network architecture with parametric \(\tau\)

The thing I like in this paper is the way they create an embedding for \(\tau\). Rather than using an MLP embedding, they expand the \(\tau\) value with a cosine basis:

$$ \phi(\tau) = \text{ReLU}(\sum_{i=0}^{n-1} \cos(\pi i \tau)w_i + b) $$

with \(n\) the embedding size.
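A NumPy sketch of this embedding (the shapes and names are my own choices, not the paper's reference implementation):

```python
import numpy as np

def tau_embedding(tau, w, b, n=64):
    """Cosine embedding of a sampled quantile fraction tau:
    phi(tau) = ReLU(sum_i cos(pi * i * tau) * w_i + b)."""
    i = np.arange(n)
    basis = np.cos(np.pi * i * tau)          # shape (n,)
    return np.maximum(0.0, basis @ w + b)    # same shape as b

rng = np.random.default_rng(0)
n, d = 64, 32                                # embedding size n, feature width d (arbitrary)
w, b = rng.normal(size=(n, d)), rng.normal(size=d)
phi = tau_embedding(0.3, w, b, n=n)          # non-negative feature vector of size d
```

In the network, this \(\phi(\tau)\) is then combined (typically by element-wise product) with the state features before the final layers.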

In this project, we have seen a lot: how DQN was extended to distributional reinforcement learning with different distribution representations. If you use the Atari environments, don't forget the wrappers; I wasted hours debugging what was really just the environment setup.

Below I share a few useful links I gathered; you can also find the code I used here.

Marc G. Bellemare, Will Dabney: “A Distributional Perspective on Reinforcement Learning”, 2017; arXiv:1707.06887.

Will Dabney, Mark Rowland, Marc G. Bellemare: “Distributional Reinforcement Learning with Quantile Regression”, 2017; arXiv:1710.10044.

Will Dabney, Georg Ostrovski, David Silver, Rémi Munos: “Implicit Quantile Networks for Distributional Reinforcement Learning”, 2018; arXiv:1806.06923.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar: “Dopamine: A Research Framework for Deep Reinforcement Learning”, 2018; arXiv:1812.06110.

YouTube video: A Distributional Perspective on Reinforcement Learning - Marc Bellemare

Banner image: Atari 2600 Gems by Kreg Steppe / CC BY-SA 2.0


In this post, we will see how to build this learning environment and use it with Python. The Banana Collector environment is an example from the ml-agents repository that I modified to make it more straightforward. I will show you how to do it from start to end.

Tutorial made with **Unity 2017.4.xx LTS** and **mlagents 0.7.0**.

For this tutorial, we will need the Unity Hub.

Once it is installed and opened, go to **Installs > Official Releases** to install the Unity 2017.4.xx LTS release. Be careful to add "**Linux Build Support**" in the installer wizard.

Unity Hub - Installing 2017.4.xx LTS release

Unity Hub - Checking the Linux Build Support

For the following, we will use the example environments available in the ml-agents repository.

We will also need to download the TensorFlow Sharp plugin.

After creating a new project, you should get something similar to the following image.

Unity interface with annotations

We will now set some values in the **Project Settings**.

Go to **Edit > Project Settings > Player**.

* In the **Inspector** panel, under **Resolution and Presentation**, check the *Run in Background* option.
* In the **Other Settings** tab, add *ENABLE_TENSORFLOW* to the **Scripting Define Symbols** and set the **Scripting Runtime Version** to *.NET 4.6* (it will ask you to restart Unity).


Going to Player Settings

Checking "Run in Background"

ENABLE_TENSORFLOW

Setting .NET 4.6

That's it for the project settings. Now we will open an example and do some change before exporting our learning environment.

From the ml-agents repository that you downloaded, drag and drop the content of **UnitySDK/Assets/** into the **Project** panel. You will also need to install *TFSharpPlugin.unitypackage*; you can drag and drop it as well.

Now we can open and modify the Banana Collector example.

Open the *Banana* scene from **Assets > ML-Agents > Examples > BananaCollectors > Scenes**. You should now have content in the **Hierarchy** panel.

Installing the Unity examples

Installing the TFSharp plugin

Selecting Banana Collector scene

We can simplify the project by removing, in the **Hierarchy** panel, the extra *RLArea(x)* objects and all but one *Agent(x)* under the remaining RLArea, so as to keep just one agent.

To test the environment, select the *Agent* object in the **Hierarchy** panel. You should see some information in the **Inspector**. In this panel, under **Banana Agent (Script)**, set the **Brain** value to *BananaPlayer*.

Finding Banana Agent in the Inspector

Changing Brain for BananaPlayer

Now we will look at the information contained in *BananaPlayer*. For that, in the **Project** panel, open *BananaPlayer* from **Assets > ML-Agents > Examples > BananaCollectors > Brains**.

Under **Discrete Player Actions**, you can explore the key mapping or change it if you like. Now try playing the environment yourself by clicking on **Play**.

Keys to play the environment

According to the description of the Banana Collector environment in the Unity example docs, the vector action space has 11 possible actions organized in four branches. We can find this definition in the **Inspector** of *BananaPlayer*, under *Brain Parameters*.

Vector Action details

However, we want something manageable like a vector with five possible actions: Forward, Backward, Rotate Left, Rotate Right and No Action.

We will see how to modify the code.

In **Assets > ML-Agents > Examples > BananaCollectors > Scripts**, open *BananaAgent.cs*. It will launch MonoDevelop.

Opening the *BananaAgent.cs* script

From line 100 to 142, there is the implementation of the vector action. We will modify this block to get one branch and five actions: No Action (0), Forward (1), Backward (2), Rotate Clockwise (3), and Rotate Counter-Clockwise (4).

Changing the vector action definition and key-mapping

Now that we have modified the code, we must also change the *Vector Action* definition in **BananaPlayer**. Don't forget to edit the **Discrete Player Actions** accordingly if you want to test the environment again.

/!\ Once we are satisfied, we must replicate the vector action definition in the **BananaLearning** brain object. /!\

We will now hand control of the agent to *BananaLearning*. For that, select the **Agent** in the **Hierarchy** panel and, in the **Inspector**, set the **Brain** value to *BananaLearning*.

We will also allow an external program to take control of the agent. Select **Academy** in the **Hierarchy** panel, then check the *Control* checkbox in the **Banana Academy** section.


Changing from BananaPlayer to BananaLearning

Allowing Control from external program

Now we will produce an executable program to use with Python. Go to **File > Build Settings…**

If there is nothing in **Scenes In Build**, click on *Add Open Scenes*.

Select your **Target Platform** and then click on *Build*.

You should get a `<name_build>_Data/` folder and a `<name_build>.<architecture>` executable.

Build setting window

We will need to install the Unity ML-Agents Toolkit.

`pip install mlagents`

When I wrote this post, the latest version was 0.7.0.

Change the environment filename to the filename of your build executable.

If you run `python main.py`, you should see your agent moving randomly in the environment.

Banana Collector environment

If you get the following error:

`AttributeError: 'UnityEnvironment' object has no attribute 'communicator'`

it means the previous environment instance is not closed yet. You can wait a bit or change the *worker_id* if you are in a hurry.

**That’s it! Have fun with Unity and training your agent(s)!**

Let's start with the definition. Reinforcement learning is a type of machine learning where an agent learns how to behave in an environment to maximize its performance at a task. Thus the agent interacts at each time step by receiving a state or observation of its environment and acting in it.

It also gets feedback on its action by a reward.

Reinforcement learning framework

The particularities of this framework are:

* The agent learns an optimal policy \(\pi\) (aka its behavior) by trial and error.

* It needs feedback on its actions.

* Its actions affect the future states it receives.

That said, let's see a grid world example.

Dinosaur wants to go to the beach

Here is a dinosaur that wants to go to the beach. Our dinosaur starts in state 1 and must reach state 4. Unfortunately, there is a wall between states 1 and 4, so it must figure out how to reach its goal.

It would be too easy if the world didn't have its own dynamics. For each direction chosen by our dinosaur, there is a 70% chance the environment moves it in that direction and a 10% chance it ends up in each of the other directions.

Keep in mind that the environment's dynamics stay hidden from the agent.

World’s dynamics

As said earlier, our dinosaur also needs feedback, so we set the reward to 10 for the beach and -1 for all the other states. We use negative rewards to push the agent to reach the terminal state as quickly as possible.

Grid world reward

Ok, now we can start to talk about Monte Carlo methods.

We begin with a random policy and collect episodes: complete sequences of interactions, from start to finish.

What do we do with them? Looking at the first collected episode, we can compute the cumulative reward G obtained from each state-action pair.

Then we can start to fill a table of relationships between the states and the actions.

With all the collected episodes, we can fill the rest by averaging the cumulative rewards. In the end, a new policy \(\pi^{\prime}\) emerges by taking, in each state, the action that maximizes the return.

More formally, we estimated the action-value function \(q_\pi\) in a table called the Q-Table, where each cell contains an expected return.

Until now, we computed the Q-Table after collecting many episodes. However, we can collect one episode, update the table, and reuse the improved policy for the next episodes.

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \frac{1}{N(S_t,A_t)}(G_t - Q(S_t, A_t))$$

\(Q(S_t, A_t)\) is the expected return at time \(t\), \(N(S_t,A_t)\) is the number of updates for the \((S_t,A_t)\) pair, and \(G_t\) is the cumulative reward.

We have now seen two policies: one random and the other derived from the Q-Table. Which one to use? If we always choose the best-known action, we cannot be sure that it is really the best. We are facing the exploration-exploitation dilemma!

The role of exploitation is to maximize the reward based on what the agent already knows, while the purpose of exploration is to improve the agent's knowledge. We cannot exploit all the time; we cannot explore all the time.

To deal with it, we use an epsilon-greedy policy.

We set an epsilon \(\epsilon\) and, with probability \(\epsilon\), follow the random policy; with probability \(1-\epsilon\), the greedy policy. At the start of learning, \(\epsilon\) is set to 1 and then decays until it reaches a small value.
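A minimal sketch of an epsilon-greedy policy with a linear decay schedule (the decay constants here are arbitrary):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_at(step, decay_steps=10_000, eps_min=0.05):
    """Linear decay from 1.0 down to a floor of eps_min."""
    return max(eps_min, 1.0 - (1.0 - eps_min) * step / decay_steps)

action = epsilon_greedy([0.0, 5.0, 2.0], epsilon=epsilon_at(20_000))
```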

As a reminder, we use the incremental mean to update our Q-Table gradually with the following equation:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \frac{1}{N(S_t,A_t)}(G_t - Q(S_t, A_t))$$

The problem is that the 1000th update doesn't count for much. Remember, the environment's dynamics are unknown, and we would rather stay up-to-date with our most recent interactions. Thus we change the update factor to a constant \(\alpha\) instead of \(\frac{1}{N(S, A)}\).

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (G_t - Q(S_t, A_t))$$

Unfortunately, Monte Carlo methods can only update the Q-Table once an episode is complete. Moreover, this implies they cannot be used in non-terminating environments.

This is where Temporal-Difference methods come in. In TD algorithms, the Q-Table is updated after each step. Because of that, we cannot compute the cumulative reward G; it is replaced as in the following equation from the SARSA algorithm.

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$$

Here \(\gamma\) is called the discount rate. In the previous example, we implicitly used a \(\gamma\) of 1, because we fully care about the future. If \(\gamma\) is 0, the agent only cares about the immediate reward.

We can do better: update the Q-Table before choosing the next action. This gives us the Q-learning equation! The idea is that the next action will be selected by the greedy policy anyway, unless the random policy kicks in.

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t))$$
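A tabular sketch of that update (the function and variable names are mine):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)                      # tabular Q, zero-initialised
actions = ["up", "down", "left", "right"]
q_learning_update(Q, s=1, a="right", r=-1.0, s_next=2, actions=actions)
# Q[(1, "right")] moved from 0 toward -1 + max_a' Q(2, a') = -1, i.e. to -0.1
```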

However, a Q-Table does not scale. That's why we use a function approximator, such as a neural network, to estimate the expected return of each action from a state. (Note: for the following, I assume you know how supervised deep learning works.)

Our update equation becomes:

$$\Delta w = \overbrace{\alpha}^{\text{learning rate}} (\overbrace{R_{t+1} + \gamma \max_a Q(S_{t+1}, a)}^{\text{TD target}} - \overbrace{Q(S_t, A_t)}^{\text{current predicted Q-val}}) \overbrace{\nabla_w Q(S_t, A_t)}^{\text{gradient of the current Q-val}}$$

Sadly, because of the interactive nature of our framework, the neural network suffers some instability while learning. This can be countered with two well-known tricks: experience replay and fixed Q-targets.

The idea of **experience replay** is straightforward. Instead of forgetting every interaction our agent has with the environment, we store them in a queue and sample from it when optimizing the neural network. We use a queue so that only the most recent interactions are kept: they contain more valuable information than old random interactions from the start. Think of it as our evolution from childhood to adulthood.
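A minimal replay buffer can be built on `collections.deque` (this is a sketch of mine, not a production implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer: old transitions are dropped as new ones arrive."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # push 150 transitions; only the last 100 survive
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```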

The second trick is to fix the **Q-targets**. Let's look at the equation again.

$$\Delta w = \alpha (R_{t+1} + \gamma \max_a \underbrace{\hat{Q}(S_{t+1}, a)}_{\text{Fixing target by another network}} - Q(S_t, A_t)) \nabla_w Q(S_t, A_t)$$

Our target \(\hat{Q}(S_{t+1}, a)\) is evolving at the same time we try to optimize toward it. The solution is to use another network that doesn't change during the learning step.

In practice, we use two separate networks with identical architecture.

The primary Q-Network is the one optimized as we saw. The second, also called the target Q-Network, is used to fix the Q-targets.

The latter is updated less often, or more slowly, than the primary Q-Network, with the following formula.

$$\hat{Q}(s,a) = \tau \cdot \hat{Q}(s,a) + (1-\tau) \cdot Q(s,a)$$
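As a one-line sketch in NumPy (applied per weight tensor in practice):

```python
import numpy as np

def soft_update(target_w, online_w, tau=0.995):
    """Polyak averaging: keep a fraction tau of the target weights and
    blend in (1 - tau) of the online weights."""
    return tau * target_w + (1.0 - tau) * online_w

target = np.zeros(3)
online = np.ones(3)
target = soft_update(target, online, tau=0.9)   # -> [0.1, 0.1, 0.1]
```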

We have seen the basics of Deep Q-Learning, but there are many ways to improve the algorithm. Here are some examples:

* Replay Network: Prioritized Experience Replay

* Exploration: Noisy Net

* Optimization: Multi-steps learning or Double Q-learning

* Architecture: Dueling Network

* Distributed learning

To finish, I would like to share with you a demo of a DQN agent trained on a Unity environment I compiled. The agent's goal is to eat as many yellow bananas as possible in 300 steps. It observes the environment through ray casting, and for each banana eaten, a new one appears at a random place in the environment.

I hope you liked this introduction! Next time, I will show you how to compile a Unity environment and use it with Python.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. (2015).

**Human-level control through deep reinforcement learning.** *Nature*, 518, 529--533.

Deep Reinforcement Learning Nanodegree, Udacity

It all starts with geologists who want to find vast deposits of salt below the Earth's surface. For that, they use seismic reflection, a method of exploring the Earth's crust by generating an ultrasound image of the subsurface.

The dataset for this competition consists of 101x101-pixel images, with each pixel classified as salt or not. The goal of the contest is thus to segment the regions that contain salt.

Seismic images. The salt regions are in green

That said, I looked at my images: some have just a tiny portion of salt that you could not notice at a quick glance, some have salt regions following a curve in the image, and some have no salt at all, or are entirely salt. I also only get 4,000 images for my training set and must predict 18,000: another challenge, since my training set is much smaller than my test set.

Now, how are the participants ranked? According to the description, the score is the intersection over union (IoU) metric. It considers a set of pixels (the ones I propose as salt) and compares it to another set of pixels (the ones that truly are salt).

IoU for bounding boxes showing a poor, good and excellent prediction.

Source: By Adrian Rosebrock

It is a meaningful metric to assess performance on a segmentation task, and it would be useful to compute a loss with it. Nevertheless, this metric is non-differentiable, and our optimization method needs a differentiable objective to learn from.

In this competition, I studied and tested several loss functions to indicate to my model where it was wrong. I started with a log loss (or binary cross-entropy).

Negative log loss

Classic, but I would like to emphasize the mispredicted pixels a bit more. Indeed, if my model predicts the correct class with a probability of 0.7, it doesn't need to push the likelihood to 0.99. Thus I considered the focal loss, which allows controlling the slope of the log loss.

Focal loss with different values of gamma

Source: *Focal Loss for Dense Object Detection*
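The post doesn't inline the formula, so here is a minimal NumPy sketch of the binary focal loss from the paper; the `gamma=2.0` default is an illustrative choice:

```python
import numpy as np

def focal_loss(y_true, y_prob, gamma=2.0, eps=1e-7):
    """Binary focal loss: mean of -(1 - p_t)^gamma * log(p_t).
    gamma = 0 recovers the plain log loss; a larger gamma flattens
    the loss for already well-classified pixels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)  # prob of the true class
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))
```

With gamma = 2, a pixel predicted correctly at 0.7 already contributes much less than under the log loss, which is exactly the behavior I wanted.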

These losses are good pixel-wise but are a poor indicator of the quality of the segmentation. Thus I looked for more IoU-based objective functions.

With the truth labels and predicted labels, the IoU is defined as

IoU produces a value between 0 and 1. For this equation to work, the mask set and the prediction set need to contain only 1's and 0's. Mask contains these values, but Prediction holds probabilities between 0 and 1. To convert it into a binary value, we need to use the argmax function or a threshold, and neither of these is differentiable.
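As a small NumPy sketch of that definition, including the thresholding step that breaks differentiability:

```python
import numpy as np

def iou(mask, prob, threshold=0.5):
    """IoU = |intersection| / |union| on binary masks.
    The hard threshold on the probabilities is the non-differentiable step."""
    pred = (prob > threshold).astype(np.uint8)  # hard 0/1 decision
    inter = np.logical_and(mask, pred).sum()
    union = np.logical_or(mask, pred).sum()
    return inter / union if union > 0 else 1.0  # two empty masks match perfectly

mask = np.array([[1, 1], [0, 0]])
prob = np.array([[0.9, 0.4], [0.2, 0.1]])
# intersection = 1 pixel, union = 2 pixels -> IoU = 0.5
```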

So I got interested in the Dice coefficient, which is similar to IoU but can be differentiated. The Dice loss is defined by

It was used in medical applications to deal with unbalanced classes. You can get more info by reading "*V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation*".
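A minimal NumPy sketch of the soft Dice loss, assuming the usual smoothing term to handle empty masks; it stays differentiable because the raw probabilities are used directly, without thresholding:

```python
import numpy as np

def dice_loss(y_true, y_prob, smooth=1.0):
    """Soft Dice loss: 1 - 2|A.B| / (|A| + |B|).
    `smooth` avoids a division by zero when both masks are empty."""
    inter = np.sum(y_true * y_prob)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(y_true) + np.sum(y_prob) + smooth)
```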

Finally, some discussions during the competition mentioned the Lovász-hinge loss. By reading the referenced paper, "*The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks*", I didn't fully understand how it works, mostly because I think I lack knowledge about the mathematical properties of convex functions/geometry. However, I used it, and in practice, I got better results.

During this challenge, I mostly combined losses. For a while, I used the focal loss with the Dice loss to get a pixel-wise loss together with a segmentation loss. Then, when I discovered the Lovász-hinge, I combined the log loss with the Lovász-hinge as suggested in the paper.

Before diving into the architecture of the neural net, I will briefly mention the data augmentation I used. I tried several ways to augment my 3,600 images (the 400 images left were for validation). In the end, simpler was better, so I used horizontal flips and translation + reflection to triple my training set.
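A minimal sketch of those two augmentations, assuming single-channel images and an illustrative shift of a few pixels (the exposed border left by the translation is filled with mirrored pixels):

```python
import numpy as np

def augment(img, shift=10):
    """Two simple augmentations: a horizontal flip, and a horizontal
    translation whose exposed border is filled by reflection."""
    flipped = img[:, ::-1]
    # pad the left edge with mirrored pixels, then crop back to size,
    # which shifts the content to the right by `shift` pixels
    padded = np.pad(img, ((0, 0), (shift, 0)), mode="reflect")
    translated = padded[:, :img.shape[1]]
    return flipped, translated
```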

Concerning the architecture of the neural net, it is based on the U-Net architecture introduced in "*U-Net: Convolutional Networks for Biomedical Image Segmentation*". It is a fully convolutional network composed of two main parts: a downsampling path and an upsampling path. [IMAGE] The particularity is that U-Net is symmetric and creates 'skip connections' between the downsampling path and the upsampling path. The goal of these skip connections is to provide local information while upsampling.

Classic U-Net architecture

I quickly customized the block of layers (purple) at each level to improve my model.

A ResNet block

I used residual blocks in my model architecture. For recall, when the network gets deeper, we get vanishing gradient and degradation problems. The residual block resolves this problem by creating a shortcut that allows:

pushing the residual branch to 0 if the input form is already optimal;

bringing a better gradient value back to the input.

A colleague told me about another idea named Squeeze and Excitation. The main idea is to produce channel-wise weights for a convolutional layer. For example, in an RGB image, if we are looking for something yellow, we will weight the red and green channels more than the blue channel. The same idea is applied to the output of a convolutional layer.
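A NumPy sketch of the Squeeze-and-Excitation mechanism, with random weights standing in for the two small learned layers (the reduction ratio of 4 is illustrative):

```python
import numpy as np

def squeeze_excite(fmap, w1, w2):
    """Squeeze-and-Excitation on an (H, W, C) feature map.
    Squeeze: global average pool -> one statistic per channel.
    Excite: a small two-layer net maps it to a 0..1 weight per channel,
    which then rescales the channels of the input."""
    s = fmap.mean(axis=(0, 1))            # squeeze: shape (C,)
    e = np.maximum(0.0, s @ w1)           # reduction layer + ReLU
    z = 1.0 / (1.0 + np.exp(-(e @ w2)))   # expansion layer + sigmoid
    return fmap * z                       # channel-wise reweighting

# Random weights stand in for the learned excitation layers.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(16, 4))
w2 = rng.normal(size=(4, 16))
out = squeeze_excite(fmap, w1, w2)  # same shape, channels rescaled
```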

That said, I combined the two concepts and got the following block.

Scheme on how I combined Residual block with Squeeze + Excitation concepts

This block is used before reducing the image shape by max pooling in the encoder part and before upscaling the image in the decoder part.

In my architecture, I also initialized the weights with **He normal**, which worked better than **Xavier uniform** initialization. After some reading, I found that **He normal** was designed with ReLU activations in mind, whereas Xavier was thought for Tanh and sigmoid activation functions.

I would also like to note that I modified the U-Net to keep the original 101x101 px size.

I wanted to monitor the progress made by my model with the IoU score, so at the end of each epoch, the score was computed on the whole validation set (and not progressively on each batch of data). For a while, I trained a model from scratch with the Adam optimizer, reduced the learning rate to LR * 0.7 when the metric did not improve on the validation set, changed one thing, and started this process again.
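This schedule resembles what Keras's ReduceLROnPlateau callback does; as a standalone sketch (the patience value here is an assumption):

```python
def reduce_on_plateau(val_scores, lr=1e-3, factor=0.7, patience=2):
    """Multiply the learning rate by `factor` each time the validation
    score fails to improve for `patience` consecutive epochs."""
    best, wait = float("-inf"), 0
    for score in val_scores:
        if score > best:
            best, wait = score, 0    # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor         # plateau detected: shrink the LR
                wait = 0
    return lr
```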

Ensemble methods have already proved their efficiency on machine learning problems. However, training takes a long time, so how can I get a better model quicker? I got inspired by this paper: "*Snapshot Ensembles: Train 1, get M for free*".

Illustration of classic SGD vs Snapshot Ensembling

So for a given model, I train it again and again: once it arrives at a local minimum with no improvement, I save the model, reinitialize the learning rate, and start over from the weights of the last model. It is slightly different from the algorithm described in the paper, which saves a model every N epochs and reduces the learning rate with a cyclic schedule. This method didn't work with the Adam optimizer, and this blog post will explain why.
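A rough sketch of this loop, with a stub `train_epoch` standing in for real SGD epochs and a fixed cycle length instead of the "no improvement" stopping criterion described above (all values illustrative):

```python
def snapshot_train(train_epoch, n_cycles=3, epochs_per_cycle=5, lr0=0.01):
    """Snapshot-style training: run a cycle, save the weights as one
    ensemble member, reset the learning rate, and keep training from
    the same weights. `train_epoch(weights, lr)` must return the
    updated weights after one epoch."""
    snapshots, weights = [], None
    for _ in range(n_cycles):
        lr = lr0                          # reinitialize the learning rate
        for _ in range(epochs_per_cycle):
            weights = train_epoch(weights, lr)
            lr *= 0.7                     # decay within the cycle
        snapshots.append(weights)         # save one ensemble member
    return snapshots
```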

Before this competition, I just retrieved the predictions of my model and then submitted them to the Kaggle platform. In the meantime, I learned that I could do some simple post-processing at no extra cost: test-time augmentation. The principle is simple: I take my image, flip it, feed it to my model, and then flip the result back.

Test Time Augmentation used for the TGS Salt Identification challenge

Now, for one picture, I can get several predictions and average this ensemble to improve my final score.
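That flip-predict-flip-back trick takes a couple of lines; a sketch assuming `model` takes and returns 2-D arrays:

```python
import numpy as np

def predict_tta(model, img):
    """Horizontal-flip test-time augmentation: predict on the original
    and on the flipped image, flip the second prediction back, average."""
    p1 = model(img)
    p2 = model(img[:, ::-1])[:, ::-1]  # flip in, flip the output back
    return (p1 + p2) / 2
```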

I went further: with snapshot ensembling, I could average together the predictions of my models. Nevertheless, some models are better than others, so how can I get the most out of them? By letting a linear model weight the first-level models for me.

Stacking models: technique that combines multiple models via a meta-learner
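One simple way to let a linear model pick the weights is ordinary least squares on held-out predictions; a hypothetical NumPy sketch:

```python
import numpy as np

def fit_blend_weights(model_preds, target):
    """Least-squares meta-learner: find the weights w minimizing
    ||model_preds.T @ w - target||, i.e. how much to trust each model."""
    w, *_ = np.linalg.lstsq(model_preds.T, target, rcond=None)
    return w

# Hypothetical check: model 0 reproduces the target, model 1 is noise,
# so the meta-learner should put (almost) all its weight on model 0.
rng = np.random.default_rng(0)
target = rng.random(100)
preds = np.stack([target, rng.random(100)])
w = fit_blend_weights(preds, target)
```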

After all the things I introduced, I will summarize with a scheme of the entire pipeline.

Data pipeline for the TGS Salt Identification challenge

At the end of this competition, I have several remarks to make.

My first comment is about the Adam optimizer. An adaptive learning rate should give an edge over SGD, but it doesn't work with snapshot ensembling, and it seems that in practice, judging by the other competitors, SGD with momentum works better.

My second remark goes to Keras. In the end, I got the impression it adds a black box over my model training. When I use the fit method, I cannot monitor what's going on with a specific variable without adding callback functions. Moreover, the latest implementations from researchers use the low-level API of TensorFlow or PyTorch. I think next time I will use the low-level API of TensorFlow to get better control of the training and my variables.

My next point concerns a random seed problem. I wanted my model to be deterministic to see the influence of a parameter. So I set the NumPy random seed with np.random.seed(1) and the TensorFlow seed with tf.set_random_seed(2). Nevertheless, it didn't work. Before my next competition, I must find a way to get a deterministic model to be able to reproduce results and also to add unit tests.
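Seeding the Python and NumPy generators does make CPU-side code repeatable, as the sketch below shows; one likely reason the full model still wasn't deterministic is non-deterministic GPU kernels (e.g. cuDNN), which these seeds don't control:

```python
import random
import numpy as np

def seeded_draw(seed):
    """Seed the CPU-side generators and draw one number from each.
    Same seed -> same numbers, every run."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())
```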

My last notes are on my workflow.

I am using Git to keep a source code history, but I didn't do it for models, which can be too big and/or too numerous to push to GitHub.

My experiments take time, and sometimes I forget what I have already done. I should take the time to write better documentation and track my work a bit more closely.

Crashes happen, and in the long run, they can waste a lot of time. I should find a way to unit test my code to improve my trust and confidence in it.

The visualization wasn't enough. For my next machine learning challenge, I should think about what I need to visualize to understand what is going on.

The ideal workflow would imply having a snapshot before and after each experiment.

Ideal data workflow

People who use online GPUs have all these features, so why can't I have them on my machine? As you can see, there is a lot to improve.

At the end of this challenge, the final ranking was calculated on 66% of the test data. I got 316th place out of 3,234 in this competition.

Final ranking for the TGS Salt Identification Challenge

"

*The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks*", https://arxiv.org/abs/1705.08790v2"

*U-Net: Convolutional Networks for Biomedical Image Segmentation*", https://arxiv.org/abs/1505.04597"

*Focal Loss for Dense Object Detection*", https://arxiv.org/abs/1708.02002"

*V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation*", https://arxiv.org/abs/1606.04797"

*Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*", https://arxiv.org/abs/1502.01852"

*Snapshot Ensembles: Train 1, get M for free*", https://arxiv.org/abs/1704.00109

GAN architecture

First of all, we need data. There are tons of fans of Magic The Gathering, so it was easy to find this site and scrape it to get images in good resolution. Then we must sort the cards a bit because there are some special templates, like the ones for the Planeswalkers or old editions.

Most common card template

Old card template

Planeswalker

After training on the most common card template of Magic The Gathering, I got the following results.

Magic The Gathering cards created by DCGAN

If you look carefully, you can see that some cards have mana costs. We can almost read *Creature* or *Instant* on the type line. We also have the power and toughness associated with a creature card and an unclear symbol for the expansion sets.

For the neural networks, I resized all the images to 224x224 px. It is easier to work with square images. Moreover, it's small enough to fit in my GPU and to blur the text. I also scale the images to the range [-1, 1] because of the Tanh activation function used in the Generator.
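The scaling itself is a one-liner; a sketch assuming 8-bit input pixels:

```python
import numpy as np

def preprocess(img):
    """Map 0..255 pixel values to [-1, 1], the range of the Generator's
    Tanh output, so real and generated images are comparable."""
    return img.astype(np.float32) / 127.5 - 1.0
```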

Now, I will show you the main parts of my DCGAN implementation in TensorFlow. We will start with the Discriminator because implementing it is like implementing an image classifier.

Comments on the Discriminator:

I use five convolutional layers.

For GANs, we need the LeakyReLU activation on all layers to avoid sparse gradients. It is essential for training the Generator.

I also use batch normalization.

Dropout is needed to prevent the Discriminator from overfitting the data.

The goal of the Generator is to produce counterfeit images that are similar to real images. The input **Z** will be a vector generated from a normal distribution, and the output channel will be the depth of the final image: three for an RGB image or one for a grayscale image.

Comments on the Generator:

I am applying noise before the "deconvolutional" layers

Then I apply batch normalization before the activation function

LeakyReLU activation is used with a tiny alpha (0.1), except for the last layer, which has a Tanh activation

Dropout allows more realistic images.

Next, we will define the loss function. We need labels, but these labels are very simple to define. For the Discriminator, we will set 1 (real) for all inputs that come from our dataset and 0 (fake) for those that come from the Generator. For the Generator, the label is set to 1 because its goal is to fool the Discriminator.

After that, I use the cross-entropy between the Discriminator's predictions and the labels.

Comment on the losses:

The variables *smooth1* (resp. *smooth0*) generate random values which are subtracted from (resp. added to) the labels. This technique is called *label smoothing* and is used to prevent a very large gradient signal.
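A NumPy sketch of those smoothed labels and the sigmoid cross-entropy, with the 0.1 smoothing range as an illustrative assumption:

```python
import numpy as np

def smoothed_labels(batch_size, real, rng=None):
    """Label smoothing: real labels land in [0.9, 1.0), fake labels in
    [0.0, 0.1), instead of hard 1s and 0s. The 0.1 range is illustrative."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.uniform(0.0, 0.1, size=batch_size)
    return 1.0 - noise if real else noise

def bce(labels, logits):
    """Sigmoid cross-entropy between (smoothed) labels and raw logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean(-labels * np.log(p) - (1 - labels) * np.log(1 - p)))
```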

For the training, we use the Adam optimizer. The training loop is almost like any machine learning training loop, except that I unbalanced the training between the Discriminator and the Generator. For each batch, I train the Discriminator until its loss is smaller than 2. I observed that if the Discriminator's loss is too big, the Generator doesn't make an effort to create realistic images. To force the Generator to create more realistic images, for each batch I also train the Generator until its loss is smaller than the Discriminator's loss.

It is an equilibrium: if one player "wins" too often, the other doesn't want to play.

Comments on the training:

To generate the above images, I started with 100 epochs and then continued for 50 more.

Beyond 150 epochs, the system loses its stability.

Additional notes to improve the training:

Some hacks recommend using an average pooling layer for the Discriminator instead of a fully connected layer.

Make the learning rate bigger for the discriminator.

Figuring out the correct training parameters is harsh. Inspiration from the latest papers can be a very good start.

Keep the dropout rate in [0.3, 0.5]

Keep the momentum in [0.2, 0.5]

The complete implementation can be found on my GitHub.

This post is concise because there are tons of tutorials and resources to get a better understanding of GANs or to find practical hacks:

Magic The Gathering cards : https://scryfall.com/

GAN Architecture picture : https://twitter.com/ch402/status/793911806494261248
