Today, I want to introduce you to reinforcement learning. This framework is meant to solve goal-oriented problems through interaction with an environment. Thus, the learner must discover which actions will allow it to achieve its goal.
Let me start with a simple example where you will be the learner. Have you ever tried the hidden game in the Chrome browser?
If not, turn off your Wi-Fi, load a web page, and when the dinosaur appears, press the spacebar. It is a platform game, and you will immediately try to beat your best score. To do so, you watch the dinosaur running and press the spacebar to jump over the cacti. As long as you succeed, your score keeps increasing. That is your reward for choosing the right action at the right time.
As you see, if we replace you with an artificial intelligence, there is no supervisor in reinforcement learning! Instead, the agent (our A.I.) has a reward signal from which it can judge its progress towards the goal. In this game, the reward arrives quickly, but in many situations this feedback is delayed (think of a puzzle game). So time matters in two ways: the agent must predict the reward it will receive, and it must take the right action at the right time. This implies taking into account the delayed consequences of actions.
So much to digest! Let me summarize with this diagram:
At time t, the agent observes the environment, which is in a state \(s_t\), and gets an observation \(o_t\). Then it executes an action \(a_t\) in its environment and receives a reward \(r_t\). This reward can be negative; its role is to tell the agent which events are good and which are bad. However, the agent rarely reaches its goal on the first try. Hence, it uses its experience to adjust its actions on the next try. We let it repeat this trial-and-error process until it reaches its goal.
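To make this loop concrete, here is a minimal sketch in Python. The `LineWorld` environment and its reward values are made up for illustration, and the agent here simply acts at random rather than learning, but the observe-act-reward cycle is the one described above:

```python
import random

class LineWorld:
    """A toy, made-up environment: the agent starts at position 0 and
    must reach position 3 by moving left (-1) or right (+1)."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Keep the agent on a small bounded line so episodes stay short.
        self.state = max(-5, min(5, self.state + action))
        reward = 1.0 if self.state == 3 else -0.1  # small penalty for wasted steps
        done = self.state == 3
        return self.state, reward, done

env = LineWorld()
state, done, total_reward = 0, False, 0.0
while not done:
    action = random.choice([-1, 1])          # a_t: here a random, not-yet-learned choice
    state, reward, done = env.step(action)   # the environment yields s_{t+1} and r_t
    total_reward += reward
```

A learning agent would replace the random choice with a policy that it improves from the rewards it accumulates across tries.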
You see! Easy! Now I will go one step further into a reinforcement learning system. In the following, we suppose that the agent can fully observe its environment; thus \(o_t = s_t\). Besides the elements mentioned above, there are three main components: the policy, the value function and the model.
- The policy defines how the agent behaves. It maps each state to an action. It is the core of a reinforcement learning agent; I will write more about it in another post.
- The value function defines how good each perceived state is. It estimates the total amount of reward the agent can expect in the future. Note that the agent keeps re-evaluating this estimate throughout its lifetime.
- The model is optional and is used for planning. It predicts the next state of the environment and the next immediate reward. We can also use the model to decide on a course of actions by considering possible future situations.
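To give a feel for what the value function estimates, here is a small sketch. The reward sequence and the discount factor are invented for illustration; it computes the discounted return, i.e. the total amount of future reward that a state's value tries to predict:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum future rewards, discounting each step by gamma.

    gamma < 1 makes rewards received sooner worth more than
    rewards received later.
    """
    g = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# A made-up episode: a small immediate reward, then a big delayed one.
g = discounted_return([1, 0, 0, 10], gamma=0.9)  # 1 + 0.9**3 * 10 = 8.29
```

The delayed reward of 10 counts for only 7.29 here, which is exactly how the discount captures the "delayed consequences" mentioned earlier.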
I hope that all this is clear! You will need it for the following! With these three components, we can define several kinds of agents:
- The value-based agent learns only a value function. Its policy is implicit: it chooses the action with the highest estimated reward.
- The policy-based agent uses only the policy component. Think of selecting a direction to leave a maze as an example of such a behavior.
- The actor-critic agent uses both policy and value function.
The types mentioned above are model-free: no model is involved. Model-based agents work with all three components.
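The implicit policy of a value-based agent fits in one line. In this sketch the action-value estimates (a small, hypothetical Q-table) are made up; the agent just picks the action whose estimated value is highest:

```python
# Hypothetical estimated values of each action in the current state.
q_values = {"left": 0.2, "right": 0.8, "jump": 0.5}

# The implicit policy of a value-based agent: act greedily
# with respect to the estimated values.
best_action = max(q_values, key=q_values.get)  # "right"
```

A policy-based agent would instead learn and store the state-to-action mapping directly, without maintaining such value estimates.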
At this moment, you have all the knowledge required to be introduced to the exploitation-exploration dilemma! Imagine you are out in the city and you are hungry. You have a choice of several restaurants. If you choose your favorite restaurant, you are using your own experience. You know the food is delicious at this restaurant, and you will get what you expect. Yet, if you want something new, you will try a new restaurant. Maybe this new restaurant will become your new favorite! As you have probably guessed, the first situation is called exploitation because it uses known information to maximize reward. The second one is called exploration because it collects experience and knowledge of the surroundings.
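A classic way to balance the two is the epsilon-greedy strategy: most of the time exploit the best-known option, but with a small probability epsilon explore a random one. Here is a minimal sketch of it, using the restaurant story (the value estimates are invented for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random option;
    otherwise exploit the option with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration: try something new
    return max(q_values, key=q_values.get)     # exploitation: your favorite

# Hypothetical estimates of how much you enjoy each restaurant.
restaurants = {"favorite": 0.9, "new_place": 0.5}
choice = epsilon_greedy(restaurants, epsilon=0.1)
```

With epsilon = 0.1, roughly one meal in ten is spent exploring; tuning epsilon is exactly the dilemma described above.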
In summary, all reinforcement learning agents have explicit goals, can sense their environments fully or partially, and can choose actions to influence their world. Reinforcement learning emphasizes interaction without relying on exemplary supervision, and this is why it is such a fascinating paradigm.
See you for the next post !
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.