A Gentle Introduction to Reinforcement Learning

Discover how Reinforcement Learning works through a Rock–paper–scissors player

By Mao Feng Data Scientist Consultant at Margo


Before reading this article, I invite you to have some fun with our Rock-paper-scissors agent, Malphago, in the iFrame below, and to see whether he gets stronger as he plays with you. In the rest of the article, we’ll illustrate the fundamentals of reinforcement learning (RL) with the help of Malphago.

  1. Overview of RL

Reinforcement learning is often described as the science of decision-making: it tries to understand the optimal way to make decisions. Today it is applied in many different fields and has produced some striking results. It has been used to:

  • Defeat the world champion at the board game Go.
  • Fly stunt maneuvers in a helicopter.
  • Make a humanoid robot walk.
  • Manage an investment portfolio.


RL is situated at the intersection of multiple fields of science as the image below shows:


The diagram illustrates that the general decision-making problem is studied in many of these fields, as a fundamental question shared across all of these branches.

This article is concerned with RL as a branch of computer science and machine learning.



Reinforcement learning is an essential component of machine learning. It sits alongside supervised learning and unsupervised learning and overlaps with both. In an RL problem there is no supervisor, only a reward signal. Our agent takes actions according to the reward feedback, and those actions affect the subsequent data it receives. Note that the data are therefore not i.i.d. (independent and identically distributed), unlike in a typical supervised learning problem. The feedback can also be delayed, so time really matters.


  2. What is the problem setting?

An RL problem is a sequential decision-making problem: an agent interacts with an environment step by step in order to achieve some goal.

Let’s take Malphago as an example. Here the agent is Malphago’s brain and the environment is the game played against a human player, Bob. At each step, Malphago executes an action: rock, paper or scissors. The environment receives the action and generates an observation, which is one of the nine possible situations (rock vs. paper, scissors vs. rock, and so on). The corresponding reward is generated at the same time, because the observation tells us who won or whether the game was tied.
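The step described above can be sketched in a few lines of Python. This is only an illustration, not Malphago’s actual code: the names ACTIONS, BEATS, reward and step are hypothetical, and the reward scale (+1 for a win, -1 for a loss, 0 for a tie) is an assumption.

```python
# Hypothetical encoding of the game; names and reward scale are assumptions.
ACTIONS = ("rock", "paper", "scissors")

# BEATS[a] is the move that a defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def reward(agent_action, opponent_action):
    """Return +1 if the agent wins, -1 if it loses, 0 for a tie."""
    if agent_action == opponent_action:
        return 0
    return 1 if BEATS[agent_action] == opponent_action else -1


def step(agent_action, opponent_action):
    """One interaction step: the observation is the pair of moves played,
    and the reward follows directly from that observation."""
    observation = (agent_action, opponent_action)
    return observation, reward(agent_action, opponent_action)
```

For example, step("paper", "rock") yields the observation ("paper", "rock") together with a reward of 1 for the agent.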


Reinforcement learning is based on the reward hypothesis: the goal can be described as the maximization of expected cumulative reward. In short, Malphago is designed to win as many games as possible in the long run, rather than to win any single game.


Let’s go deeper into some elements of RL:

  • A reward is a feedback signal that indicates how well the agent is doing at a given step. The agent’s job is to select actions that maximize cumulative reward. In other kinds of agent/environment interaction, actions may have long-term consequences and the reward may be delayed, so it can be better to sacrifice immediate reward to gain more long-term reward.
  • The state is the information used to determine what happens next. In our Rock-paper-scissors game, the environment state is fully observable by both Bob and Malphago. Such a fully observable problem is formalized as a Markov decision process, in which we assume that the future is independent of the past given the present. More precisely, we use the current observation as the current state (the state could be designed to be more complicated). Malphago makes his decision according to the state and continuously improves his strategy for making this decision.
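To make the idea of cumulative reward concrete, here is a minimal sketch of how the total reward of a sequence of steps could be computed. The discount factor gamma is not mentioned in this article; it is a common convention, introduced here as an assumption, that weights later rewards less than immediate ones. With gamma = 1.0 the function is a plain sum.

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative reward of a sequence, r0 + gamma*r1 + gamma^2*r2 + ...

    gamma is an assumed discount factor; gamma=1.0 gives the plain sum.
    """
    total = 0.0
    # Walk backwards so each step adds its reward to the discounted future.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical reward sequences: the first grabs a reward immediately,
# the second accepts an early loss in exchange for a larger reward later.
greedy_now = [1, 0, 0]
patient = [-1, 0, 3]
print(discounted_return(greedy_now, gamma=0.9))
print(discounted_return(patient, gamma=0.9))
```

With these made-up numbers, the patient sequence ends up with the larger return, which is the sense in which sacrificing immediate reward can pay off.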


  3. Solution methods


What is inside an RL agent? One or more of the following components could play a role:

  • Policy: Malphago’s behavior function. That is, given the current state, which action to choose next.
  • Value function: How good is each state and/or action? Is (rock vs. paper) a good state when playing with Bob, or is paper a good action given this state? This function estimates how much total reward we will get by following a particular policy.
  • Model: Malphago’s representation of Bob’s strategy, used to predict the next state and reward.
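The relationship between a value function and a policy can be sketched as follows. This is an illustrative assumption about how such components might look, not Malphago’s actual implementation: the action values are stored in a table keyed by state, and the policy is epsilon-greedy, a common choice that is not named in this article. All function names are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = ("rock", "paper", "scissors")


def make_q_table():
    """Value function as a table: q_table[state][action], initialised to 0.0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def greedy_action(q_table, state):
    """Policy that picks the action with the highest estimated value."""
    values = q_table[state]
    return max(ACTIONS, key=lambda a: values[a])


def epsilon_greedy(q_table, state, epsilon=0.1, rng=random):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return greedy_action(q_table, state)
```

Here the state would be the last observation, for example ("rock", "paper"). The occasional random move is what lets the agent keep trying actions whose value it has not yet estimated well.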


Malphago uses a method called Q-learning to improve its playing strategy. This is a model-free approach that uses both a value function and a policy to build a learning procedure. In other words, Malphago does not try to model how Bob thinks or to predict Bob’s next action. Malphago only evaluates its value function and selects the best choice according to its policy. As it plays, it learns this function and improves the policy.
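The standard tabular Q-learning update can be sketched as below. This is a generic textbook form under stated assumptions, not Malphago’s actual code: the learning rate alpha, the discount factor gamma, and the table layout q_table[state][action] are all hypothetical choices made for the example.

```python
from collections import defaultdict

ACTIONS = ("rock", "paper", "scissors")


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    # Value of the best action available in the next state.
    best_next = max(q_table[next_state].values())
    # Target: observed reward plus discounted estimate of future reward.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Usage with an empty table: states are (agent move, opponent move) pairs.
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
q_learning_update(q_table, ("rock", "rock"), "paper", 1, ("paper", "rock"))
```

Note that the update uses only the observed transition (state, action, reward, next state); no model of the opponent appears anywhere, which is what model-free means here.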


This raises the next question: how closely can we approximate the value function? What if the function is much more complex than we imagined, so that we cannot evaluate a state’s value properly? Deep reinforcement learning (DRL) is then a good option, because deep neural networks have strong representational power for approximating the value function. Our Deep Malphago is an example of a DRL implementation.


  4. Conclusion


In conclusion, reinforcement learning addresses a fundamental problem in sequential decision-making: the environment is initially unknown, and the agent interacts with it to improve its policy.

It is like trial-and-error learning. The agent should discover a good policy from its experience of interacting with the environment, without losing too much reward along the way.

I hope you enjoy the game with Malphago, and thanks for reading.

