MARGO

News

A Gentle Introduction to Reinforcement Learning

Discover how Reinforcement Learning works through a Rock–paper–scissors player

By Mao Feng, Data Scientist Consultant at Margo

20/03/2018

Before reading this article, I invite you to have some fun with our Rock-paper-scissors agent Malphago in the iFrame below and see whether he gets stronger as he plays against you. In the rest of this article, we'll illustrate the fundamentals of reinforcement learning (RL) with the help of Malphago.

  1. Overview of RL

Reinforcement learning is considered the science of decision-making: trying to understand the optimal way to make decisions. Nowadays, it is applied in many different fields and has achieved notable accomplishments. It has been able to:

  • Defeat the world champion at the board game Go.
  • Fly stunt maneuvers in a helicopter.
  • Make a humanoid robot walk.
  • Manage an investment portfolio.

 

RL is situated at the intersection of multiple fields of science as the image below shows:

 

The diagram illustrates that the general decision-making problem is studied in many different fields; it is a fundamental science that cuts across those branches.

In this article, we focus on reinforcement learning as a branch of computer science and machine learning.

 

 

Reinforcement learning is an essential component of machine learning that overlaps with both supervised learning and unsupervised learning. In an RL problem, there is no supervisor, only a reward signal. The agent takes actions based on the reward feedback, and those actions affect the subsequent data it receives. Note that the data is not i.i.d. (independent and identically distributed) as in a typical supervised learning problem; the feedback can also be delayed, so time really matters.

 

  2. What is the problem setting?

An RL problem is a sequential decision-making problem: controlling an agent that interacts with an environment step by step in order to achieve some goal.

Let’s take Malphago as an example. Here the agent is Malphago’s brain and the environment is the human player, Bob. At each step, Malphago executes an action: rock, paper, or scissors. Bob responds, which produces an observation, one of the possible situations (rock vs. paper, scissors vs. rock, etc.). The corresponding reward is generated at the same time, since the observation tells us who wins or whether the game is tied.
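This action-observation-reward loop can be sketched in a few lines of Python. The function and variable names below are illustrative, not Malphago's actual code, and the agent here plays uniformly at random as a placeholder policy:

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def reward(agent_action, opponent_action):
    """+1 if the agent wins, -1 if it loses, 0 for a tie."""
    if agent_action == opponent_action:
        return 0
    return 1 if BEATS[agent_action] == opponent_action else -1

def play_episode(n_steps=5, seed=0):
    """One agent/environment loop: act, observe, receive a reward."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)       # Malphago's move (placeholder random policy)
        observation = rng.choice(ACTIONS)  # Bob's move, seen after acting
        total += reward(action, observation)
    return total
```

A real agent would replace the random choice with a learned policy; the loop structure stays the same.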

 

Reinforcement learning is based on the reward hypothesis: the goal can be described as the maximization of expected cumulative reward. In short, Malphago is designed to win as many games as possible in the long run, not just a single game.
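Cumulative reward is usually formalized as a discounted return, where a discount factor gamma weights future rewards less than immediate ones. A minimal sketch (gamma is standard RL notation, not something specific to Malphago):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward: G = r0 + gamma*r1 + gamma^2*r2 + ...

    Computed backwards so each reward is multiplied by gamma once per
    step of delay.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma close to 1 the agent cares about the long run; with gamma close to 0 it becomes greedy about the next step.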

 

Let’s go deeper into some elements of RL:

  • A reward is a feedback signal that indicates how well the agent is doing at a given step. The agent’s job is to select actions that maximize cumulative reward. Actions may have long-term consequences and rewards may be delayed, so it can be better to sacrifice immediate reward to gain more long-term reward.
  • The state is the information used to determine what happens next. In our Rock-paper-scissors game, the environment state is fully observable by both Bob and Malphago. In this case we can model the game as a Markov decision process, where we assume that the future is independent of the past given the present. More precisely, we use the current observation as the current state (the state can be designed to be richer), and Malphago makes its decision according to that state while continuously improving its decision-making strategy.

 

  3. Solution methods

 

What is inside an RL agent? One or more of the following components may play a role:

  • Policy: Malphago’s behavior function. That is, given the current state, what to choose as the next action.
  • Value function: How good is each state and/or action? Is (Rock vs. paper) a good state when playing with Bob or is paper a good action given this state? This function is dedicated to evaluating how much total reward we will get if following a particular policy.
  • Model: Malphago’s representation of the environment, that is, of Bob’s strategy: how to predict the next state and reward.
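The policy component can be as simple as picking the action with the highest estimated value in the current state, occasionally exploring at random. A minimal epsilon-greedy sketch (the names are illustrative, not Malphago's actual code):

```python
import random

def greedy_policy(q_values, epsilon=0.1, rng=random):
    """Pick the highest-valued action for the current state.

    q_values maps action -> estimated value. With probability epsilon,
    explore by choosing a random action instead.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)
```

The small epsilon keeps the agent trying apparently inferior actions from time to time, which is how it discovers whether its value estimates are wrong.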

 

Malphago applies the so-called Q-learning method to improve its playing strategy. Q-learning is a model-free approach that uses both a value function and a policy to build the learning procedure. In other words, Malphago doesn’t care how Bob thinks or what Bob’s next action will be. Malphago simply evaluates its value function and selects the best choice according to its policy. As it plays, it learns this function and improves its policy.
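The core of tabular Q-learning is a single update rule: after each round, move the estimate Q(s, a) toward the observed reward plus the discounted value of the best action in the next state. A sketch of the textbook rule (alpha and gamma are generic hyperparameters; this is not Malphago's exact implementation):

```python
def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))

    Q is a dict mapping (state, action) -> value; missing entries are 0.
    """
    actions = ["rock", "paper", "scissors"]
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (target - old)
    return Q[(state, action)]
```

Repeating this update over many rounds, while acting mostly greedily with respect to Q, is what lets the agent improve without ever modeling the opponent explicitly.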

 

This raises the next question: how closely can we approximate the value function? What if the function is far more complex than we imagined, so that we cannot evaluate a state’s value properly? Deep reinforcement learning (DRL) is then a good option for approximating the value function, thanks to the strong representational power of neural networks. Our Deep Malphago is an example of a DRL implementation.
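When the state space is too large for a table, the Q-values can be produced by a neural network instead. A toy forward pass with one hidden layer, mapping a state vector to one Q-value per action (the weights are random and the shapes are illustrative; Deep Malphago's actual network is not shown here):

```python
import numpy as np

def q_network(state_vec, w1, b1, w2, b2):
    """Map a state vector to one Q-value per action via one hidden layer."""
    hidden = np.maximum(0.0, state_vec @ w1 + b1)  # ReLU activation
    return hidden @ w2 + b2                        # one output per action

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(9, 16)), np.zeros(16)  # 9-dim one-hot state -> 16 hidden units
w2, b2 = rng.normal(size=(16, 3)), np.zeros(3)   # 16 hidden units -> 3 actions
```

Training such a network amounts to nudging the weights so that its outputs satisfy the same Q-learning target as the tabular case.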

 

  4. Conclusion

 

To conclude, reinforcement learning addresses a fundamental problem in sequential decision-making: the environment is initially unknown, and the agent interacts with it to improve its policy.

It is like trial-and-error learning: the agent should discover a good policy through its experience of interacting with the environment, without losing too much reward along the way.

We hope you enjoy playing against Malphago, and thanks for reading.

 

