Reinforcement learning, from scratch to Deep Q-Network

In this post, I will gently introduce reinforcement learning, from traditional algorithms to deep Q-learning. The goal is mostly to give you an overview of the concepts behind reinforcement learning so that you can then dig deeper into the state of the art.

Let's start with a definition. Reinforcement learning is a type of machine learning where an agent learns how to behave in an environment in order to maximize its performance at a task. At each time step, the agent receives a state (or observation) of its environment and acts in it.

It also gets feedback on its action through a reward.

Reinforcement learning framework

The particularities of this framework are:

* The agent learns an optimal policy \(\pi\) (aka behavior) by trial and error.

* It needs feedback on its actions.

* Its actions affect the future states it receives.

That said, let's look at a grid world example.

Dinosaur wants to go to the beach

Here is a dinosaur that wants to go to the beach. Our dinosaur starts in state 1 and must go to state 4. Unfortunately, there is a wall between states 1 and 4, so it must figure out how to reach its goal.

It would be too easy if the world had no dynamics of its own. For each direction chosen by our dinosaur, there is a 70% chance the environment moves it in that direction and a 10% chance, for each of the other directions, that it ends up there instead.

Keep in mind that the environment's dynamics stay hidden from the agent.

World’s dynamics

As said earlier, our dinosaur also needs feedback, so we have set the reward to 10 for the beach and -1 for all the other states. We use negative rewards to ensure that the agent will go to the terminal state as quickly as possible.

Grid world reward

Ok, now we can start to talk about Monte Carlo methods.

# Starting with Monte Carlo methods

We will begin with a random policy to collect episodes: complete sequences of interactions, from start to finish.

Rollouts.png

What do we do with them? If we look at the first collected episode, we can compute the cumulative reward G obtained for each state-action pair.

Compute cumulative rewards.png
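To make the computation concrete, here is a minimal sketch of how the return can be derived from one collected episode. The episode format, a list of (state, action, reward) tuples, and the trajectory itself are assumptions for illustration.

```python
# A hypothetical episode collected with the random policy, as a list of
# (state, action, reward) tuples. The exact trajectory is made up.
episode = [(1, "right", -1), (2, "down", -1), (3, "right", 10)]

def compute_returns(episode, gamma=1.0):
    """Compute the cumulative reward G for every (state, action) of an episode."""
    returns = []
    g = 0.0
    # Walk backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, action, g))
    returns.reverse()
    return returns

print(compute_returns(episode))
# [(1, 'right', 8.0), (2, 'down', 9.0), (3, 'right', 10.0)]
```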

Then we can start to fill a table of relationships between the states and the actions.

CreateTable.png

With all the collected episodes, we can fill the rest by averaging the cumulative rewards. In the end, a new policy \(\pi^{\prime}\) will emerge by taking the action that maximizes the return.

Q-Table filled.png

More formally, we estimated the action-value function \(q_\pi\) in a table called the Q-Table, where each cell contains an expected return.
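As a rough sketch of that estimation (reusing the `compute_returns` helper from the earlier snippet, not the exact code behind the figures), averaging the returns over many episodes and then acting greedily could look like this:

```python
from collections import defaultdict

# Q[state][action] holds the average return observed for that pair,
# counts[state][action] holds how many returns were averaged.
Q = defaultdict(lambda: defaultdict(float))
counts = defaultdict(lambda: defaultdict(int))

def fill_q_table(episodes, gamma=1.0):
    """Fill the Q-Table by averaging the cumulative rewards of every episode."""
    for episode in episodes:
        for state, action, g in compute_returns(episode, gamma):
            counts[state][action] += 1
            # Running average of the returns seen for (state, action)
            Q[state][action] += (g - Q[state][action]) / counts[state][action]

def greedy_policy(state):
    """The improved policy pi' picks the action with the highest estimated return."""
    return max(Q[state], key=Q[state].get)
```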

Until now, we computed the Q-Table only after collecting many episodes. However, we can also collect a single episode, update the table, and reuse the improved policy for the next episodes.

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \frac{1}{N(S_t,A_t)}(G_t - Q(S_t, A_t))$$

\(Q(S_t, A_t)\) is the expected return at time \(t\). \(N(S_t,A_t)\) is the number of updates for the \((S_t,A_t)\) pair. \(G_t\) is the cumulative reward.

So far, we have seen two policies: one random and the other estimated from the Q-Table. Which one should we use? If we always choose the best-known action, we can not be sure that it is really the best one. We are facing the exploration-exploitation dilemma!

The Exploration-Exploitation dilemma

The role of exploitation is to maximize the reward based on what the agent already knows. The purpose of exploration is to improve the agent's knowledge. We can neither exploit all the time nor explore all the time.

To deal with it, we use an epsilon-greedy policy.

We set an epsilon \(\epsilon\) and choose the random policy with probability \(\epsilon\) and the greedy policy with probability \(1-\epsilon\). At the start of learning, \(\epsilon\) is set to 1, then decays until it reaches a small value.
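A minimal sketch of epsilon-greedy action selection with decay; the action set, the floor, and the decay rate are assumptions:

```python
import random

ACTIONS = ["up", "down", "left", "right"]  # assumed action set of the grid world

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon explore at random, otherwise exploit the Q-Table."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

# Start fully exploratory, then decay towards a small floor after each episode.
EPSILON_MIN, EPSILON_DECAY = 0.01, 0.995

def decay_epsilon(epsilon):
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```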

Back to our Monte Carlo methods

As a reminder, we use the incremental mean to update our Q-Table gradually with the following equation:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \frac{1}{N(S_t,A_t)}(G_t - Q(S_t, A_t))$$

The problem is that the 1000th update barely counts. Remember that the environment's dynamics are unknown, so we would rather stay up to date with our most recent interactions. Thus we replace the update factor \(\frac{1}{N(S_t, A_t)}\) with a constant \(\alpha\).

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (G_t - Q(S_t, A_t))$$
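In code, this is a one-line change compared to the incremental mean; a sketch with \(\alpha\) as a hyperparameter to tune:

```python
ALPHA = 0.1  # assumed constant step size

def constant_alpha_update(Q, episode, gamma=1.0):
    """Monte Carlo update with a fixed step size instead of 1/N(S, A)."""
    for state, action, g in compute_returns(episode, gamma):
        Q[state][action] += ALPHA * (g - Q[state][action])
```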

# From Monte Carlo to Temporal-Difference methods

Unfortunately, Monte Carlo methods need a complete episode before they can update the Q-Table. That also implies they can not be used in a non-terminating environment.

This is where Temporal-Difference methods come in. In TD algorithms, the Q-Table is updated after each step. Because of that, we can not compute the cumulative reward G; it is replaced by a bootstrapped estimate, as in the following update from the SARSA algorithm.

SARSA_illustation.png

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$$

Here \(\gamma\) is called the discount rate. In the previous example, we implicitly used a \(\gamma\) of 1, because we care about the future. If \(\gamma\) is 0, the agent only cares about the immediate reward.
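A sketch of that SARSA update, applied after every (state, action, reward, next state, next action) step:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """On-policy TD update: the target uses the action actually taken next."""
    # For a terminal next_state, the bootstrapped term would simply be dropped.
    td_target = reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (td_target - Q[state][action])
```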

We can do better: update the Q-Table before choosing the next action. This gives us the Q-learning equation! The idea behind it is that the next move would be selected by the greedy policy anyway, unless the random policy kicks in.

SARSAmax_illu.png

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+ \alpha (R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t))$$
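And the corresponding Q-learning update, where the target bootstraps on the best action of the next state regardless of what the behavior policy picks:

```python
def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps on the greedy action."""
    best_next = max(Q[next_state].values(), default=0.0)
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
```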

# The deep learning attraction

However, the Q-Table does not scale. That's why we use a function approximator such as a neural network to estimate the expected return of each action from a state. (Note: for the following, I assume you know how supervised deep learning works.)

State_to_action_NN.png

Our update equation becomes:

$$\Delta w = \overbrace{\alpha}^{\text{learning rate}} (\overbrace{R_{t+1} + \gamma \max_a Q(S_{t+1}, a)}^{\text{TD target}} - \overbrace{Q(S_t, A_t)}^{\text{current predicted Q-val}}) \overbrace{\nabla_w Q(S_t, A_t)}^{\text{gradient of the current Q-val}}$$
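As a minimal sketch of this update, assuming PyTorch and a small fully connected network; the architecture and hyperparameters are illustrative, and this version still bootstraps on the same network (the fixed Q-targets trick below changes that):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a state to the estimated return of every action."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, action_size),
        )

    def forward(self, state):
        return self.net(state)

def learn_step(q_network, optimizer, batch, gamma=0.99):
    """One gradient step on a batch of (states, actions, rewards, next_states, dones)."""
    states, actions, rewards, next_states, dones = batch
    # Current predicted Q-values for the actions that were actually taken
    q_current = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target: no gradient flows through it
    with torch.no_grad():
        q_next = q_network(next_states).max(1)[0]
        td_target = rewards + gamma * q_next * (1 - dones)
    loss = F.mse_loss(q_current, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Typical setup: optimizer = torch.optim.Adam(q_network.parameters(), lr=5e-4)
```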

Sadly, because of the interactive nature of our framework, the neural network is unstable while learning. This can be countered with two well-known tricks: experience replay and fixed Q-targets.

The idea of experience replay is straightforward. Instead of forgetting all the interactions our agent has with the environment, we store them in a queue and sample from it when optimizing the neural network. We use a queue so that only the most recent interactions are kept: they contain more valuable information than an old random interaction from the start of training. Think of it as our evolution from childhood to adulthood.

Experience replay.png
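A replay buffer can be as simple as a bounded queue from which we draw random mini-batches; a sketch using only the Python standard library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded queue: the oldest interactions are dropped once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        """Sample past interactions uniformly at random for the next optimization step."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```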

The second trick is to fix the Q-targets. Let's look at the equation again.

$$\Delta w = \alpha (R_{t+1} + \gamma \max_a \underbrace{\hat{Q}(S_{t+1}, a)}_{\text{Fixing target by another network}} - Q(S_t, A_t)) \nabla_w Q(S_t, A_t)$$

Our target \(\hat{Q}(S_{t+1}, a)\) is evolving at the same time we try to optimize toward it. The solution is to use another network that doesn't change during the learning step.

In practice, we use two separate networks with identical architecture.

The primary Q-Network is the one we optimize, as we saw. The second, also called the target Q-Network, is used to fix the Q-targets.

The latter is updated less often, or more slowly, than the primary Q-Network with the following formula.

$$\hat{Q}(s,a) = \tau \cdot \hat{Q}(s,a) + (1-\tau) \cdot Q(s,a)$$
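With the PyTorch networks sketched above, this soft update is an interpolation of the parameters of the two networks; the target network starts as a copy of the primary one:

```python
def soft_update(target_network, q_network, tau=0.999):
    """Move the target Q-Network slowly towards the primary Q-Network."""
    for target_param, param in zip(target_network.parameters(),
                                   q_network.parameters()):
        target_param.data.copy_(tau * target_param.data + (1.0 - tau) * param.data)

# Assumed initialization: the target network is a copy of the primary network.
# target_network = QNetwork(state_size, action_size)
# target_network.load_state_dict(q_network.state_dict())
```

In the `learn_step` sketch above, `q_next` would then be computed with `target_network` instead of `q_network`.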

# Some ways to improve Deep Q-Network

We have seen the basics of Deep Q-Learning, but there are many ways to improve the algorithm. Here are some examples:

* Replay memory: Prioritized Experience Replay

* Exploration: Noisy Net

* Optimization: Multi-step learning or Double Q-learning

* Architecture: Dueling Network

* Distributed learning

# Demo with Unity

To finish, I would like to share with you a demo of a DQN agent trained on a Unity environment I have compiled. The agent's goal is to eat as many yellow bananas as possible in 300 steps. It observes the environment through ray casting, and for each banana eaten, a new one appears randomly in the environment.

I hope you liked this introduction! Next time, I will show you how to compile a Unity environment and use it with Python.

# Some resources