Markov decision process, value functions, Bellman equation…

This post will present all we need to understand the mathematical foundations of reinforcement learning. Don’t worry, it is still easy!

As we have seen, three signals pass back and forth between an agent and its environment:

- the actions which represent the choices made by the agent
- the states which refer to whatever available information for making decisions
- the rewards which define the agent’s goal

You also have two types of tasks: *episodic* ones and *continuing* ones. An episodic task is characterized by the presence of a terminal state, such as a game over. When the agent reaches a terminal state, the episode ends; we must then restart the agent-environment interaction to get a new episode. Continuing tasks are tasks that go on without limit.

So far, we know the agent’s goal is to maximize the reward it receives in the long run. What does this mean formally?

In an episodic task, we know that the interaction will stop at a final time step \(T\). We can then define the *return* as the sum of future rewards: \(G_t = R_{t+1}+ R_{t+2}+\dots+R_{T}\). However, in a continuing task, \(T=\infty\), so this sum may diverge. The solution involves an additional concept called *discounting*: distant rewards are weighted less and less, shrinking toward zero. The agent will thus maximize the expected discounted return.

\(G_t = R_{t+1}+ \gamma R_{t+2}+\gamma^2 R_{t+3}+\dots = \sum_{k=0}^{\infty} \gamma^kR_{t+k+1}\)

where \(\gamma\), the *discount rate*, is a parameter with \(0 \leq \gamma \leq 1\). Discounting controls the “eyesight” of the agent. If \(\gamma\) is close to 0, the agent is only concerned with the immediate reward, whereas with \(\gamma\) close to 1, the agent becomes farsighted!
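As a quick illustration (a minimal sketch, not from the post), the discounted return for a finite reward sequence can be computed with the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A single reward of 1 arriving two steps in the future:
rewards = [0.0, 0.0, 1.0]
g_far = discounted_return(rewards, 0.9)   # farsighted: the distant reward still counts (0.9^2)
g_near = discounted_return(rewards, 0.1)  # myopic: the distant reward nearly vanishes (0.1^2)
```

Notice how the same reward sequence is valued very differently depending on \(\gamma\).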

As you might expect, we will use this notion later. It is just one piece of the math puzzle. Here comes another one, called the *Markov decision process*.

First of all, we will consider a recycling robot task to illustrate.

Its goal is to search for paper balls in an office. At each time step, it must decide whether to search for paper balls on the floor, wait for someone to bring it one, or go to the charging base. However, it must also take the state of its battery into account.

The following transition graph summarizes the dynamics of the problem.

Now, how do we read it? The agent starts in a state \(s\) and takes an action \(a\), following one of the green lines from the state node \(s\) to the action node \((s,a)\). Then the environment reacts with a transition to a next state’s node via one of the orange lines leaving the action node \((s,a)\).

In reinforcement learning, we assume that our environment has the *Markov property*: the future is independent of all previous events and depends only on the current state. Formally, a state \(S_t\) is Markov if and only if \(Pr\{S_{t+1}|S_t\} = Pr\{S_{t+1}|S_0,S_1,\dots, S_t\}\).

Most RL tasks are defined as finite Markov decision processes, and because of that, we have two more mathematical definitions:

- Given a state \(s\) and an action \(a\), the *transition probability* of each possible next state \(s’\) is \(p(s’|s,a)=Pr\{S_{t+1}=s’| S_t=s, A_t=a\}\)
- Similarly, given a state \(s\), an action \(a\) and a next state \(s’\), the *expected value of the next reward* is \(r(s,a,s’)=\mathbb{E}[ R_{t+1} | S_t=s, A_t=a, S_{t+1}=s’]\)
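These two quantities are enough to describe a finite MDP as a table. Here is a minimal sketch for a recycling robot; the state names, probabilities and rewards below are made up for illustration, not the actual labels from the transition graph:

```python
# Hypothetical tabular model of p(s'|s,a) and r(s,a,s') for a recycling robot.
# All numbers here are illustrative assumptions, not values from the post's graph.
ALPHA, BETA = 0.9, 0.4          # assumed probabilities of keeping a charged battery
R_SEARCH, R_WAIT = 2.0, 1.0     # assumed expected rewards for searching / waiting

# dynamics[(s, a)] -> list of (next_state, transition_probability, expected_reward)
dynamics = {
    ("high", "search"):   [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
    ("high", "wait"):     [("high", 1.0, R_WAIT)],
    ("low", "search"):    [("low", BETA, R_SEARCH), ("high", 1 - BETA, -3.0)],
    ("low", "wait"):      [("low", 1.0, R_WAIT)],
    ("low", "recharge"):  [("high", 1.0, 0.0)],
}

# Sanity check: for every state-action pair, p(s'|s,a) must sum to 1.
for (s, a), transitions in dynamics.items():
    assert abs(sum(prob for _, prob, _ in transitions) - 1.0) < 1e-9
```

The final loop is the key property of a transition model: from every action node, the outgoing probabilities must sum to one.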

On the graph, I labeled the orange arrows with the transition probability and the expected reward of each transition.

To sum up: the agent must maximize the return in the long run, it can expect a reward for reaching the next state, and we have the transition probabilities from each state-action node. Ok! But how does the agent know what a good state is?

For example, if I play golf and my ball lands in the sand, it is an undesirable situation, because I will need many strokes to get the ball back onto the grass.

In RL, we use two *value functions* to express this concept. Both are defined in terms of the expected return \(G_t\) under a policy \(\pi\) (a function that maps states to actions).

- The value of a state \(s\), \(v_{\pi}(s)\), is the expected return when starting in \(s\) and following the policy \(\pi\) thereafter: \(v_{\pi}(s)=\mathbb{E}_{\pi}[G_t | S_t = s]\). Note that the value of a terminal state is zero.
- We define the *action-value function* similarly, as the expected return when starting in \(s\), taking action \(a\), and following \(\pi\) thereafter: \(q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t | S_t = s, A_t=a]\)
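Because \(v_{\pi}(s)\) is an expectation, one simple way to estimate it is to average the sampled returns observed from state \(s\) over many episodes. A minimal Monte Carlo sketch on a tiny made-up chain (the environment and policy below are assumptions for illustration):

```python
import random

random.seed(0)

GAMMA = 0.9

def run_episode(start):
    """Assumed toy chain: states 0 -> 1 -> 2, where 2 is terminal (value zero).
    Each step the agent advances with probability 0.5; reward 1 on reaching 2."""
    state, rewards = start, []
    while state != 2:
        state = state + 1 if random.random() < 0.5 else state
        rewards.append(1.0 if state == 2 else 0.0)
    return rewards

def monte_carlo_value(start, episodes=5000):
    """Estimate v_pi(start) by averaging discounted returns over many episodes."""
    total = 0.0
    for _ in range(episodes):
        g, discount = 0.0, 1.0
        for r in run_episode(start):
            g += discount * r
            discount *= GAMMA
        total += g
    return total / episodes

v1 = monte_carlo_value(1)  # state 1 is closer to the reward, so its value is higher
v0 = monte_carlo_value(0)
```

States closer to the reward get higher estimated values, which is exactly the "good state" intuition the value function formalizes.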

These two functions can be estimated from experience, and they satisfy a fundamental recursive property: the *Bellman equation*! Using the previous definitions, the Bellman equation for \(v_{\pi}\) expresses the relationship between the value of a state and the values of its successor states.

\(\displaystyle v_{\pi}(s) = \sum_a \pi(a|s)\sum_{s’}p(s’|s,a)[r(s,a,s’) + \gamma v_{\pi}(s’)]\)

In fact, the Bellman equation averages over all the future possibilities, weighting each by its transition probability. This formula is useful because it is the basis of several ways to compute, approximate and learn \(v_{\pi}\).

To conclude, we have seen all the fundamental mathematical definitions we need to move forward. To recap, we defined the return \(G_t\), the transition probabilities \(p(s’|s,a)\), the expected value of the next reward \(r(s,a,s’)\), and the value functions \(v_{\pi}\) and \(q_{\pi}\), and we introduced the Bellman equation.

See you in the next post!

###### Sources

[1] Richard S. Sutton, Andrew G. Barto. *Reinforcement Learning: An Introduction.*