Exploring Distributional RL on Atari games

I was browsing the "Key Papers in Deep RL" page on the "Spinning Up" website when I came across the concept of distributional RL. I wanted to learn a bit more about it and eventually try it. I also had never used an Atari game for training until now. That made two good reasons to start a new project and share my feedback on it.

Let's start with how the idea of distributional reinforcement learning can be motivated with the Monopoly game. In the figure below, I show a piece of the board where your token sits on Luigi. Somebody else owns the "Mario" and "Luigi" squares and has put hotels on these properties.

[Figure: the Monopoly Nintendo Edition board]

Two possible things can happen: you land on Mario, and you have to pay $2,000 in rent, or you pass the GO space and earn some extra cash.

After rolling the dice, you can end up paying rent or earning some cash.

In terms of reinforcement learning, we look at the expected reward in this situation.

$$\begin{split}
\mathbb{E}[R(x)] & = \frac{1}{36} \times (-2000) + \frac{35}{36} \times 200 \\
& \approx 139
\end{split}$$

The odds of landing on the Mario space are pretty slim, so we end up with a positive expected gain of about $139.
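As a quick sanity check, here is the same expectation computed in a few lines of Python (the probabilities and amounts are the ones from the example above):

```python
# Expected reward of the Monopoly situation above:
# 1/36 chance of landing on Mario (pay $2,000),
# 35/36 chance of passing GO instead (earn $200).
outcomes = [(-2000, 1 / 36), (200, 35 / 36)]

expected_reward = sum(reward * prob for reward, prob in outcomes)
print(f"Expected reward: ${expected_reward:.2f}")  # ~ $138.89
```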

Of course, the RL problem is more complex than just predicting the immediate reward. We also look at future interactions, discounting and summing the rewards as we encounter them.

[Figure: possible rollouts over the next turns on the Monopoly board]

$$R(x_0) + \gamma R(x_1) + \gamma^2 R(x_2)$$

In the traditional approach, we formalize this by computing the expected return for the state \(x\): the expected sum of discounted rewards accumulated along the way.

$$
V^\pi(x)  = \mathbb{E} \left[\sum_{t=0}^{\infty} \gamma^t R(x_t) \: \middle| \: x_0 = x\right]
$$

However, if we think back to the Monopoly situation above, the expected return of $139 will never actually occur. There is randomness in these returns, and by modeling only their expectation, we hide this intrinsic randomness.
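To make this concrete, here is a tiny simulation sketch of the Monopoly situation (the reward values are the illustrative ones from above): the sampled returns are either large and negative or small and positive, and the mean of about $139 is a value that no single rollout ever produces.

```python
import random

def sample_return():
    """Sample one outcome of the dice roll above: pay rent or pass GO."""
    return -2000 if random.random() < 1 / 36 else 200

samples = [sample_return() for _ in range(10_000)]
print("Mean return:", sum(samples) / len(samples))   # close to ~139
print("Distinct outcomes:", sorted(set(samples)))    # [-2000, 200], never 139
```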

# The idea of distributional RL

The idea of Bellemare, Dabney et al. was to get rid of the expectation in the Bellman equation and work on the full distribution of the return, which means defining a random variable \(Z\). The distributional Bellman equation is almost the same, except that it involves three random variables: the reward \(R\), the next state \(X'\) and its random return \(Z(X')\).

$$Z^\pi(x) \stackrel{D}{=} R(x) + \gamma Z^\pi (X') \quad \quad X' \sim P^\pi( \cdot \mid x)$$

Three random variables: the reward \(R\), the next state (here \(d\) or \(e\)) and the return distribution of each possible next state.

Note: For the following, I will use the random variable \(Z(x, a)\), which is the return distribution obtained by being in state \(x\) and performing the action \(a\). It implies that each action has its own distribution that we will optimize.
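To make the recursion a bit more tangible, here is a minimal sketch in plain Python of how one sample of the distributional Bellman target could be drawn. The hooks `sample_transition`, `policy` and `sample_return` are hypothetical placeholders; in practice the transition comes from a replay buffer and \(Z\) is approximated by a neural network.

```python
GAMMA = 0.99

def sample_distributional_target(x, a, sample_transition, policy, sample_return):
    """Draw one sample of R + gamma * Z(X', pi(X')).

    sample_transition, policy and sample_return are hypothetical hooks used
    only for illustration.
    """
    r, x_next = sample_transition(x, a)     # samples of the reward R and the next state X'
    a_next = policy(x_next)                 # action chosen in the next state
    z_next = sample_return(x_next, a_next)  # sample of the random return Z(X', a')
    return r + GAMMA * z_next
```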

# Characterizing a distribution

Since we will work with distributions, we have to choose how to represent them. Bellemare, Dabney et al. have released three papers since 2017 about three different distribution representations: they started with a categorical description, then used the inverse cumulative distribution function with fixed quantiles, and finally made the quantile fraction itself a parametric input to the network.

Categorical distribution representation (left) and cumulative density function (right)

# Categorical distributional Deep Q-Network

Let's quickly go over categorical distributional deep Q-learning. Here, we define a fixed support for our distribution \(Z\), such as \( \{z_0, z_1, \cdots, z_{N-1}\} \) with \(N\) atoms. In DQN, we computed one Q-value for each action. It is quite similar here: we compute a discrete distribution over the support for each action. However, it means we have to choose a probability distance for the loss. In the paper, they use the KL divergence, since the support is fixed. They obtained their best results on Atari using 51 atoms, hence the name of the algorithm, C51.

Categorical DQN: the support is fixed and we learn a distribution for each action.
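As a rough sketch of what this looks like in code, here is a minimal PyTorch head (not the paper's or my exact implementation; the atom bounds and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0  # illustrative support bounds

class CategoricalHead(nn.Module):
    """Outputs one discrete distribution over a fixed support per action."""

    def __init__(self, in_features: int, n_actions: int):
        super().__init__()
        self.n_actions = n_actions
        self.fc = nn.Linear(in_features, n_actions * N_ATOMS)
        # Fixed support z_0, ..., z_{N-1}, shared by all actions.
        self.register_buffer("support", torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        logits = self.fc(features).view(-1, self.n_actions, N_ATOMS)
        return torch.softmax(logits, dim=-1)   # one distribution per action

    def q_values(self, probs: torch.Tensor) -> torch.Tensor:
        # Q(x, a) is the expectation of the distribution: sum_i p_i * z_i.
        return (probs * self.support).sum(dim=-1)
```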

# Quantile Regression DQN

I was first intrigued by quantile regression when I saw the name on the OpenAI Spinning Up website, and it is what motivated me to write this post. Here we do the opposite of the categorical DQN: the support is variable, and we fix the probabilities. More precisely, we cut the distribution into \(N\) portions of equal probability mass, or quantiles, and we define the quantile target \(\hat\tau_i\) as the midpoint (median) of each slice. In the illustration below, I divided the distribution into four slices and placed each \(\hat\tau_i\) at the median of its slice.

Cumulative distribution function (CDF) of the predicted and target distribution
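For reference, a minimal sketch of how these quantile midpoints can be computed (here with \(N = 4\), as in the illustration):

```python
import torch

N = 4  # number of quantiles, as in the illustration above

# Quantile fractions tau_0 = 0, tau_1, ..., tau_N = 1 ...
tau = torch.arange(N + 1, dtype=torch.float32) / N   # [0.00, 0.25, 0.50, 0.75, 1.00]
# ... and their midpoints tau_hat_i = (tau_i + tau_{i+1}) / 2.
tau_hat = (tau[:-1] + tau[1:]) / 2                    # [0.125, 0.375, 0.625, 0.875]
print(tau_hat)
```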

We train the neural network using the Huber quantile regression loss, defined as follows:

$$\rho_\tau(\delta_t) = | \tau - \mathbb{I}\{\delta_t< 0\}| \cdot \underbrace{\mathcal{L}(\delta_t)}_{\text{Huber loss}}$$

$$\mathcal{L}(\delta_t) = \begin{cases} \frac{1}{2}\delta_t^2 &\quad \text{if } |\delta_t| < 1 \\|\delta_t| - \frac{1}{2} &\quad \text{otherwise } \end{cases}$$

$$ \delta_t = r + \gamma Z'(x_{t+1}, a^*) - Z(x_t, a_t) $$

The Huber loss is a smooth L1 loss. You can spot the difference in the figure below.

L1 loss (orange) and the Huber loss (blue)
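Here is a minimal PyTorch sketch of this loss. It is a simplified version, not my exact code, and it assumes the pairwise comparison discussed in the next section has already produced the TD-error tensor `delta`:

```python
import torch

def huber_quantile_loss(delta: torch.Tensor, tau_hat: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """Huber quantile regression loss rho_tau(delta).

    delta:   pairwise TD errors of shape (batch, N, N), where dimension 1
             indexes the predicted quantiles (see the implementation section
             below for how this tensor is built).
    tau_hat: quantile midpoints of shape (N,).
    """
    abs_delta = delta.abs()
    # Huber part: quadratic near zero, linear elsewhere (a smooth L1).
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta ** 2,
                        kappa * (abs_delta - 0.5 * kappa))
    # Asymmetric quantile weight |tau - 1{delta < 0}|, aligned with the
    # predicted-quantile dimension.
    weight = torch.abs(tau_hat.view(1, -1, 1) - (delta.detach() < 0).float())
    return (weight * huber).mean()
```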

# Implementation of QR-DQN

When I started implementing the quantile regression DQN, I chose to test it on the Lunar Lander environment, available in OpenAI Gym. The goal is simple, and the state representation is a vector containing the lander's velocity, its angle and angular velocity, whether its legs are touching the ground, etc.

The Lunar Lander environment

One issue I faced is that, in the loss, I was taking a simple element-wise difference between the output of my network and that of the target network. Let's say that \(\theta_0, \dots, \theta_n\) are the atoms returned by the network for an action \(a\) and \(\theta'_0, \dots, \theta'_n\) those of the target network.

Computing \(\theta'_i - \theta_i\) doesn't work because there is no guarantee that the outputs are ordered, i.e., that the atoms at index \(i\) correspond to the same quantile. This is especially true at the start of training. Instead, we must compare every atom against every other one: for each \(i\) and \(j\) from 0 to \(N\), compute \(\theta'_i - \theta_j\).

During training, the atoms will be moved to the right place by following the quantile weights \(\hat\tau\) present in the loss.
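In PyTorch, this all-pairs comparison is just a broadcasting trick. A minimal sketch (the variable names are mine, the tensors are placeholders, and `huber_quantile_loss` is the helper sketched earlier):

```python
import torch

N = 5  # number of quantiles used on Lunar Lander

# theta:        (batch, N) atoms predicted by the online network for the taken action
# theta_target: (batch, N) atoms of the distributional Bellman target
#               (r + gamma * target-network atoms for the best next action)
theta = torch.randn(32, N)          # placeholders for the sketch
theta_target = torch.randn(32, N)

# Pairwise differences: delta[b, i, j] = theta_target[b, j] - theta[b, i]
delta = theta_target.unsqueeze(1) - theta.unsqueeze(2)   # shape (batch, N, N)

tau_hat = (torch.arange(N, dtype=torch.float32) + 0.5) / N
loss = huber_quantile_loss(delta, tau_hat)
```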

Learned distribution for each action in two different situations.

In the figure above, you can visualize the CDF for each action in two situations. I used five points to describe the distribution over returns. Some distributions have a very tight support on the return space, which could mean that the agent is more confident about the possible outcomes.

# Doing the same on an Atari game

After seeing the results on Lunar Lander, I thought my implementation was good enough to use on Atari. The papers benchmark on this platform, so it would be an excellent way to check whether I was right. At first, I thought I would only have a few modifications to make to test it on Pong or Space Invaders. It didn't work out as I expected. In fact, after hours of debugging, I learned about the wrappers that RL researchers apply over the Atari games when experimenting on them.

For example (see the chaining sketch after this list):

  • EpisodicLifeEnv: Convert individual lives in the game into separate episodes.

  • FireResetEnv: Press FIRE at the beginning of the game (some games need it to start).

  • ProcessFrame84: Convert the input observation to an 84x84 grayscale image.

  • ClipRewardEnv: Clip the reward to -1, 0 and 1. Why? Because the score scale can vary a lot between games.
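Chaining them looks roughly like this. This is only a sketch: I am assuming the wrapper classes listed above are importable from a local atari_wrappers module, as in the commonly shared baselines-style implementations, so the exact import path depends on your setup.

```python
import gym

# Assumed to come from a baselines-style atari_wrappers module;
# the import path below is illustrative.
from atari_wrappers import EpisodicLifeEnv, FireResetEnv, ProcessFrame84, ClipRewardEnv

def make_atari_env(env_id: str = "PongNoFrameskip-v4") -> gym.Env:
    env = gym.make(env_id)
    env = EpisodicLifeEnv(env)    # one life = one episode
    env = FireResetEnv(env)       # press FIRE after reset
    env = ProcessFrame84(env)     # 84x84 grayscale observations
    env = ClipRewardEnv(env)      # rewards clipped to {-1, 0, 1}
    return env
```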

I also ran into memory issues, mostly because I was storing a lot more data in the replay buffer. In Lunar Lander, the state vector has eight values, which I stack with several past states to augment the state representation. Moving to Atari games, the states are now images and take much more space, especially when the state representation is expanded with past frames.

The hyperparameters had to change too, because the setting is quite different.

# Starting over on a better base

After all these issues, I decided to start over on a better base: this time, I used a reinforcement learning toolkit named ptan, to focus more on the quantile regression concept and less on the core implementation. I also started to use W&B to track the progress of my experiments. I used the Pong environment to try out this new implementation. Finally, I didn't want to wait too long for the end of a run, so I took an interest in DQN extensions that speed up learning.

Screenshot of my W&B dashboard with the summary plots and the hyperparameters

# Speed up and DQN extensions

For the first speed-up, I followed the first suggestion of this blog. The trick is to set a train frequency: a batch is trained on every \(N\) frames instead of at each frame. However, the hyperparameters have to be tweaked carefully. To get roughly the same behavior as training at every step, I multiplied the batch size, the learning rate, and the other optimizer hyperparameters by the train frequency I had set.
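In code, that scaling is just a couple of multiplications (the base values below are placeholders, not the ones I actually used):

```python
TRAIN_FREQ = 4  # train on one batch every 4 frames

# Placeholder base values, scaled by the train frequency to roughly
# compensate for training 4x less often.
base_batch_size = 32
base_lr = 1e-4

batch_size = base_batch_size * TRAIN_FREQ   # 128
learning_rate = base_lr * TRAIN_FREQ        # 4e-4
```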

Then I used an extension called N-steps, where we look at the reward over \(N\) consecutive steps. In classical DQN, we update with the immediate reward only, while here we use the rewards of two or three steps ahead.

$$\begin{split} Q(s_t, a_t) & = r_t + \gamma \max_a Q(s_{t+1}, a) \\ & = r_t + \gamma \max_a \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+2}, a') \right] \\ & = r_t + \gamma r_{t+1} + \gamma^2 \max_{a'} Q(s_{t+2}, a') \end{split}$$
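A minimal sketch of how such an N-step target can be computed from a short list of transitions (plain Python, the names are mine):

```python
GAMMA = 0.99

def n_step_target(rewards, bootstrap_value, gamma: float = GAMMA) -> float:
    """Discounted sum of the first N rewards plus a bootstrapped tail value.

    rewards:         [r_t, r_{t+1}, ..., r_{t+N-1}]
    bootstrap_value: max_a Q(s_{t+N}, a) estimated by the target network.
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_value

# Example with N = 2 steps:
print(n_step_target([1.0, 0.0], bootstrap_value=3.0))  # 1.0 + 0.99*0.0 + 0.99^2*3.0
```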

The last extension I used is Double DQN. What changes is that we choose the action for the next state using the trained (online) network, but still take its value from the target network.

$$ \begin{split} Q(s_t, a_t) & = r_t + \gamma \max_a Q'(s_{t+1}, a) \\ & = r_t + \gamma \, Q'(s_{t+1}, \mathrm{arg}\max_a Q(s_{t+1}, a)) \\ \end{split}$$
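In PyTorch this amounts to one `argmax` on the online network's output and one `gather` on the target network's output. A minimal sketch for the scalar-Q case (the distributional version would pick \(a^*\) from the mean of each action's distribution):

```python
import torch

def double_dqn_target(reward, done, next_states, online_net, target_net, gamma=0.99):
    """Compute r + gamma * Q'(s', argmax_a Q(s', a)) for a batch of transitions.

    reward, done: float tensors of shape (batch,), done holding 0/1 flags.
    """
    with torch.no_grad():
        # Action selection with the online network ...
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluation with the target network.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```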

Speed-up methods are good, but the hyperparameters have a big impact on convergence time. For that, I used the ones from this paper, which lists all the hyperparameters for DQN, Rainbow, C51 and others in its appendix.

# Result on Freeway environment

Now that everything is set up, let's look at the Freeway environment. Freeway is a game where the agent controls a chicken that wants to cross the highway. The chicken can only move up or down, and we get a point each time the chicken reaches the other side of the road. You can see the network in action in the following video.

The code can be found on my GitHub if you are interested.

# Going further: Implicit Quantile Network

I didn't implement this one, but the idea extends the quantile regression network described above. In the previous algorithm, \(\tau\) was fixed, whereas in the implicit quantile network (IQN), \(\tau\) becomes an input of the network, sampled from a uniform distribution between 0 and 1. The output also goes back to a vector of the action size, as in DQN, instead of a matrix as in QR-DQN. During training and for action selection, we sample \(\tau\) several times to get the shape of the state-action return distribution.

IQN network architecture with parametric \(\tau\)

The thing I like in this paper is the way they create an embedding for \(\tau\). Rather than using a plain MLP embedding, they expand the \(\tau\) value with a cosine basis:

$$ \phi(\tau) = \text{ReLU}(\sum_{i=0}^{n-1} \cos(\pi i \tau)w_i + b) $$

with \(n\) the embedding size.
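A minimal PyTorch sketch of this embedding (the cosine dimension of 64 matches the paper; `out_features` is illustrative here and should match the size of the state feature vector it is combined with):

```python
import math
import torch
import torch.nn as nn

class CosineTauEmbedding(nn.Module):
    """Embed a quantile fraction tau in [0, 1] with a cosine basis, then a linear layer + ReLU."""

    def __init__(self, n: int = 64, out_features: int = 512):
        super().__init__()
        self.register_buffer("i", torch.arange(n, dtype=torch.float32))  # i = 0, ..., n-1
        self.linear = nn.Linear(n, out_features)

    def forward(self, tau: torch.Tensor) -> torch.Tensor:
        # tau: shape (batch, num_taus), values sampled from U(0, 1).
        basis = torch.cos(math.pi * self.i * tau.unsqueeze(-1))  # (batch, num_taus, n)
        return torch.relu(self.linear(basis))                    # phi(tau)
```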

# Conclusion

In this project, we have seen a lot: how DQN was extended to distributional reinforcement learning with different distribution representations. If you use the Atari environments, don't forget the wrappers. I wasted a lot of time debugging, and the problem was mainly the setup of the environment.

Below, I share a few useful links I gathered, and you can also find the code I used here.

# Bibliography

# Credit

Banner image: Atari 2600 Gems by Kreg Steppe / CC BY-SA 2.0

Found this article interesting and/or useful? Please consider following me to be alerted when I post new content!

You can also support me with a retweet!