This year, I applied for the IGGI programme and submitted a PhD proposal titled Transfer in Deep Reinforcement Learning for General Video Game Playing. Although I was shortlisted alongside other strong candidates, I didn't get a scholarship to work on my project.

Nevertheless, my work can still inspire others. So for those who are curious, you can find it here.

Markov decision process, value functions, Bellman equation…
This post presents everything we need to understand the mathematical foundations of reinforcement learning. Don't worry, it is still easy!

As we have seen, three signals pass back and forth between an agent and its environment:

the actions, which represent the choices made by the agent

the states, which refer to whatever information is available for making decisions

the rewards, which define the agent's goal

There are also two types of tasks: episodic and continuing. An episodic task is characterized by the presence of a terminal state, such as a game over. When the agent reaches a terminal state, the episode ends; we must then restart the agent-environment interaction to get a new episode. Continuing tasks, on the other hand, go on without limit.

So far, we know the agent's goal is to maximize the reward it receives in the long run. What does that mean formally?

In an episodic task, we know that the interaction stops at some time \(T\). We can then define the expected return as the sum of future rewards: \(G_t = R_{t+1}+ R_{t+2}+\dots+R_{T}\). However, in a continuing task, \(T=\infty\), so this definition of the return is problematic. The solution involves an additional concept called discounting: distant rewards are discounted toward zero, and the agent maximizes the expected discounted return \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\).

Here \(\gamma\), the discount rate, is a parameter with \(0 \leq \gamma \leq 1\). Discounting determines the "eyesight" of the agent: if \(\gamma\) is close to 0, the agent is only concerned with the immediate reward, whereas with a \(\gamma\) close to 1, the agent becomes farsighted!
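As a small sketch in Python, here is the discounted return computed for a made-up reward sequence (the rewards and the values of \(\gamma\) are arbitrary):

```python
# A sketch of the discounted return G_t for a finite reward sequence.
# The rewards below are made-up numbers for illustration.
def discounted_return(rewards, gamma):
    """Sum of gamma^k * R_{t+k+1} over a sequence R_{t+1}, R_{t+2}, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]

print(discounted_return(rewards, 1.0))  # 4.0: no discounting, plain sum
print(discounted_return(rewards, 0.0))  # 1.0: myopic, only the immediate reward counts
print(discounted_return(rewards, 0.9))  # in between: farsighted but discounted
```

With \(\gamma\) strictly below 1, the infinite sum of a continuing task stays finite as long as the rewards are bounded.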

As you can expect, we will use this notion later. It is just one piece of the math puzzle. Here comes another, called the Markov decision process.

First of all, we will consider a recycling robot task as an illustration.

Its goal is to search for paper balls in an office. At each time step, it must decide whether to search for paper balls on the floor, wait for someone to bring it one, or go to the charging base. It must also take the state of its battery into account.

The following transition graph summarizes the dynamics of the problem.

Now, how do we read it? The agent starts in a state \(s\) and takes an action \(a\), following one of the green lines from the state node \(s\) to the action node \((s,a)\). The environment then reacts with a transition to a next-state node via one of the orange lines leaving the action node \((s,a)\).

In reinforcement learning, we assume that our environment has the Markov property: the future is independent of all past events and depends only on the current state. Formally, a state \(S_t\) is Markov if and only if \(Pr\{S_{t+1}|S_t\} = Pr\{S_{t+1}|S_0,S_1,\dots, S_t\}\).

Most RL tasks are defined as finite Markov decision processes, which gives us two more mathematical definitions:

Given a state \(s\) and an action \(a\), the transition probability of each possible next state is \(p(s'|s,a)=Pr\{S_{t+1}=s'| S_t=s, A_t=a\}\)

Similarly, given a state \(s\), an action \(a\), and a next state \(s'\), the expected value of the next reward is \(r(s,a,s')=\mathbb{E}[ R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']\)

In the transition graph, I labeled the orange arrows with the transition probability and the expected reward of each transition.
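To make these definitions concrete, here is a sketch of the recycling robot as a finite MDP in Python. The states, actions, probabilities, and rewards below are hypothetical numbers, chosen only to illustrate \(p(s'|s,a)\) and \(r(s,a,s')\):

```python
# A hypothetical encoding of the recycling robot's dynamics.
# dynamics[(s, a)] -> list of (next state, probability, expected reward)
dynamics = {
    ("high", "search"):  [("high", 0.7, 2.0), ("low", 0.3, 2.0)],
    ("high", "wait"):    [("high", 1.0, 1.0)],
    ("low", "search"):   [("low", 0.6, 2.0), ("high", 0.4, -3.0)],  # -3: battery ran flat, robot rescued
    ("low", "wait"):     [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

def p(s_next, s, a):
    """Transition probability p(s'|s,a)."""
    return sum(prob for s2, prob, _ in dynamics[(s, a)] if s2 == s_next)

def r(s, a, s_next):
    """Expected next reward r(s,a,s')."""
    return next(rew for s2, _, rew in dynamics[(s, a)] if s2 == s_next)

# The orange lines leaving each action node must carry probabilities summing to 1.
assert all(abs(sum(prob for _, prob, _ in out) - 1.0) < 1e-9
           for out in dynamics.values())

print(p("low", "high", "search"))  # 0.3
print(r("low", "search", "high"))  # -3.0
```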

To recap: the agent must maximize its return in the long run, it can expect a reward for reaching the next state, and we have the transition probabilities from each state-action node. OK! But how does the agent know what a good state is?
For example, if I play golf and my ball lands in the sand, it is an undesirable situation because I will need several strokes to get the ball back onto the grass.

In RL, we use two value functions to express this concept. Both are defined in terms of the expected return \(G_t\) with respect to a policy \(\pi\) (the function that maps states to actions).

The value of a state \(s\), \(v_{\pi}(s)\), is the expected return when starting in \(s\) and following the policy thereafter: \(v_{\pi}(s)=\mathbb{E}_{\pi}[G_t | S_t = s]\). Note that for a terminal state, the value is zero.

We similarly define the action-value function as \(q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t | S_t = s, A_t=a]\).

These two functions can be estimated from experience thanks to the Bellman equation! Using the previous definitions, the Bellman equation for \(v_{\pi}\) expresses the relationship between the value of a state and the values of its successor states: \(v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} p(s'|s,a) \left[ r(s,a,s') + \gamma v_{\pi}(s') \right]\).

In fact, the Bellman equation averages over all the future possibilities, weighting each by its transition probability. This formula is useful because it is the basis of several ways to compute, approximate, and learn \(v_{\pi}\).
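One of those ways is iterative policy evaluation: we apply the Bellman equation as an update rule until the values stop changing. Here is a sketch on a hypothetical version of the recycling robot (the dynamics, the policy, and all the numbers are made up for illustration):

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation
# for v_pi until the values settle. All numbers are hypothetical.

# dynamics[(s, a)] -> list of (next state, probability, expected reward)
dynamics = {
    ("high", "search"):  [("high", 0.7, 2.0), ("low", 0.3, 2.0)],
    ("high", "wait"):    [("high", 1.0, 1.0)],
    ("low", "wait"):     [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

# pi(a|s): a made-up stochastic policy
policy = {
    "high": {"search": 0.5, "wait": 0.5},
    "low":  {"recharge": 0.5, "wait": 0.5},
}

gamma = 0.9
v = {"high": 0.0, "low": 0.0}   # initial guess; a terminal state would stay at 0

for _ in range(200):            # enough sweeps to converge for gamma = 0.9
    v = {
        s: sum(
            pi_a * sum(prob * (rew + gamma * v[s2])
                       for s2, prob, rew in dynamics[(s, a)])
            for a, pi_a in policy[s].items()
        )
        for s in v
    }

print(v)  # the high-battery state ends up more valuable than the low one
```

Each sweep is exactly the Bellman equation: an average over the actions chosen by \(\pi\), then over the environment's transitions, of the immediate reward plus the discounted value of the successor state.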

To conclude, we have seen all the fundamental mathematical definitions needed to move forward. To recap, we defined the expected return \(G_t\), the transition probabilities \(p(s'|s,a)\), the expected value of the next reward \(r(s,a,s')\), and the value functions \(v_{\pi}\) and \(q_{\pi}\), and we introduced the Bellman equation.

See you for the next post!

Sources

[1] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction.

Today, I want to introduce you to reinforcement learning. This framework is meant to solve goal-oriented problems through interaction with an environment. The learner must thus discover which actions will allow it to achieve a goal.

Let me introduce a simple example where you are the learner. Have you ever tried the hidden game in the Chrome browser?

If not, turn off your Wi-Fi, try to load a web page, and when the dinosaur appears, press the spacebar. It is a platform game, and you will immediately try to beat your best score. To do so, you observe the running dinosaur and press the spacebar to avoid the cacti. Each time you succeed, your score keeps increasing: that is your reward for choosing the right action at the right time.

As you can see, if we replace you with an artificial intelligence, there is no supervisor in reinforcement learning! Instead, the agent (our A.I.) has a reward signal from which it can judge its progress towards the goal. In this game, the reward arrives quickly, but in many situations this feedback is delayed (think of a puzzle game). So time matters in two ways: to predict the reward the agent will receive, and to decide the right action at the right time. This implies taking into account the delayed consequences of actions.

So much to digest! Let me summarize with this schema:

At time \(t\), the agent observes the environment, which is in a state \(s_t\), and gets an observation \(o_t\). It then executes an action \(a_t\) in its environment and gets a reward \(r_t\). This reward can be negative; its role is to say which events are good or bad for the agent. But the agent rarely reaches its goal on the first try. Hence, it uses its experience to adjust its actions on the next try. We let it repeat this trial-and-error process until it reaches its goal.
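This trial-and-error loop can be sketched in Python. The tiny environment and the random "policy" below are hypothetical stand-ins, only meant to show the observe-act-reward cycle:

```python
import random

# A toy sketch of the agent-environment loop: observe, act, get a reward, repeat.

class ToyEnv:
    """Hypothetical world: the agent walks on positions -3..3 and must reach +3."""
    def reset(self):
        self.pos = 0
        return self.pos                         # observation o_t (= s_t here)

    def step(self, action):                     # action is -1 or +1
        self.pos = max(-3, min(3, self.pos + action))
        done = self.pos == 3                    # terminal state: end of the episode
        reward = 1.0 if done else -0.1          # rewards can be negative
        return self.pos, reward, done

env = ToyEnv()
state, total, done = env.reset(), 0.0, False
while not done:
    action = random.choice([-1, +1])            # pure trial and error, no learning yet
    state, reward, done = env.step(action)
    total += reward
print(total)
```

A learning agent would use the stream of rewards to prefer +1 over -1; this random one simply stumbles to the goal eventually.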

You see! Easy! Now I will go one step further into a reinforcement learning system. In the following, we suppose that the agent can fully observe its environment; thus \(o_t=s_t\). Besides the elements mentioned above, there are three main components: the policy, the value function, and the model.

The policy defines how the agent behaves. It maps each state to an action. It is the core of a reinforcement learning agent, and I will write more about it in another post.

The value function defines how good each perceived state is. It estimates the total amount of reward the agent can expect in the future. Note that the agent will keep re-evaluating this estimate throughout its lifetime.

The model is optional and is used for planning. It predicts the next state of the environment and the next immediate reward. We can also design the model to decide on a course of actions by considering possible future situations.

I hope that all this is clear! You will need it for the following! With these three components, we can define several kinds of agents:

The value-based agent uses only the value function for learning. The policy is implicit: it chooses the action that gives the highest estimated reward.

The policy-based agent uses only the policy component. You can think of selecting a direction to leave a maze as an example of such behavior.

The actor-critic agent uses both policy and value function.

The types mentioned above are model-free; no model is involved. Model-based agents work with all three components.

At this moment, you have all the knowledge required for being introduced to the exploitation-exploration dilemma! Imagine you are out in the city and you are hungry. You have the choice of several restaurants. If you choose your favorite restaurant, you are using your own experience. You know the food is delicious at this restaurant, and you will get what you expect. Yet, if you want something new, you will try a new restaurant. Maybe this new restaurant will become your new favorite! As you have probably guessed, the first situation is called exploitation because it uses known information to maximize reward. The second one is called exploration because it collects experience and knowledge of the surroundings.
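A classic way to balance the two is the epsilon-greedy rule: exploit the best-known option most of the time, but explore a random one with some small probability. Here is a sketch with made-up restaurant ratings:

```python
import random

# Epsilon-greedy choice between exploitation and exploration.
# The restaurants and their estimated ratings are made up.
estimated_value = {"favorite": 8.5, "new_thai": 0.0, "new_pizza": 0.0}

def epsilon_greedy(values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known option."""
    if random.random() < epsilon:
        return random.choice(list(values))   # exploration: pick any option
    return max(values, key=values.get)       # exploitation: pick the best estimate

print(epsilon_greedy(estimated_value, 0.0))  # always exploits: 'favorite'
choice = epsilon_greedy(estimated_value, 0.1)  # explores 10% of the time
```

A common trick is to decrease epsilon over time: explore a lot at first, then exploit more and more as the estimates become reliable.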

In summary, all reinforcement learning agents have explicit goals, can sense their environments fully or partially, and can choose actions to influence their world. Reinforcement learning emphasizes interaction without relying on exemplary supervision, and this is why it is a fascinating paradigm.

See you for the next post!

Sources

[1] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction.