Tutorial 1: Basic Reinforcement Learning¶

Week 3, Day 4: Basic Reinforcement Learning

By Neuromatch Academy

Content creators: Pablo Samuel Castro

Content reviewers: Shaonan Wang, Xiaomei Mi, Julia Costacurta, Dora Zhiyu Yang, Adrita Das

Content editors: Shaonan Wang

Production editors: Spiros Chavlis

Tutorial Objectives¶

Reinforcement Learning (RL) is a powerful framework for defining and solving problems where an agent learns to take actions that maximize its reward. Essentially, an agent observes the current state of the world, selects an action, receives a reward, and uses this feedback to improve its future actions. RL provides a formal, optimal way of describing this learning process, which was initially derived from studies of animal behavior and later validated by observations of the brain in humans and animals.

This tutorial will introduce you to the basic concepts of RL using a simple example. By the end, you’ll have a better understanding of how RL works and how it can be applied to solve a wide range of problems.

Setup¶

This is a GPU free notebook!

⚠ Experimental LLM-enhanced tutorial ⚠

This notebook includes Neuromatch’s experimental Chatify 🤖 functionality. The Chatify notebook extension adds support for a large language model-based “coding tutor” to the materials. The tutor provides automatically generated text to help explain any code cell in this notebook.

Note that using Chatify may cause breaking changes and/or provide incorrect or misleading information. If you wish to proceed by installing and enabling the Chatify extension, you should run the next two code blocks (hidden by default). If you do not want to use this experimental version of the Neuromatch materials, please use the stable materials instead.

To use the Chatify helper, insert the %%explain magic command at the start of any code cell and then run it (shift + enter) to access an interface for receiving LLM-based assitance. You can then select different options from the dropdown menus depending on what sort of assitance you want. To disable Chatify and run the code block as usual, simply delete the %%explain command and re-run the cell.

Note that, by default, all of Chatify’s responses are generated locally. This often takes several minutes per response. Once you click the “Submit request” button, just be patient– stuff is happening even if you can’t see it right away!

Thanks for giving Chatify a try! Love it? Hate it? Either way, we’d love to hear from you about your Chatify experience! Please consider filling out our brief survey to provide feedback and help us make Chatify more awesome!

Run the next two cells to install and configure Chatify…

%pip install -q davos
import davos
davos.config.suppress_stdout = True

Note: you may need to restart the kernel to use updated packages.

smuggle chatify      # pip: git+https://github.com/ContextLab/chatify.git
%load_ext chatify

Using default configuration!
Downloading the 'cache' file.

Install and import feedback gadget¶

# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "neuromatch_dl",
            "user_key": "f379rz8y",
        },
    ).render()


feedback_prefix = "W3D4_T1"

# Imports
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, Tuple

Section 1: A history of RL¶

In this section, we will briefly overview the history of reinforcement learning (RL) in reverse chronological order. This will help motivate why RL is an interesting topic to study!

Video 1: Intro to RL¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Intro_to_RL_Video")

Section 2: What is RL¶

Using a very simple problem, we will give a high-level overview of what RL is and what are the main components that define the problem formulation.

Extra: If you’d like to read more, the canonical reference for RL is Sutton & Barto’s Reinforcement Learning book.

Section 2.1: Grid World¶

GridWorlds are very simple “navigation” problems that can be very useful for motivating RL problems and solutions. They are commonly used in RL research, so it’s a good idea to get familiar with them!

We will use a simple GridWorld problem throughout this tutorial: an empty room with a reward at one corner.

An example below defines a second GridWorld that is a little more difficult. Feel free to create your own!

Extra: If you’d like to play with RL in GridWorlds on the web, you can check out this GridWorld playground web app.

Video 2: Grid World¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Grid_world_Video")

Coding Exercise 1: Code a shortest-path planner for GridWorld¶

Create the GridWorldPlanner object (defaults to simple example)¶

# @title Create the GridWorldPlanner object (defaults to simple example)

ASCII_TO_EMOJI = {
    ' ': '⬜',
    '*': '⬛',
    'g': '⭐',
    '<': '◀️',
    '>': '▶️',
    'v': '🔽',
    '^': '🔼',
}

ACTIONS = ['<', '>', 'v', '^']
ACTION_EFFECTS = {  # Position effects of each action.
    '<': (0, -1),
    '>': (0, 1),
    'v': (1, 0),
    '^': (-1, 0),
}


def get_emoji(c, policy=None):
  assert c in ASCII_TO_EMOJI
  if policy is not None and c != 'g':
    assert policy in ASCII_TO_EMOJI
    if policy != ' ':  # If there is a policy, use this instead.
      c = policy
  return ASCII_TO_EMOJI[c]


class GridWorldBase(object):
  """Defines a GridWorldPlanner object."""

  def __init__(self, world_spec: Optional[np.ndarray] = None):
    """Creates a GridWorld object with an empty policy.

    Args:
      world_spec: Optional array specification of GridWorld. If None, will
                  use default square room.
    """
    if world_spec is None:
      self.world_spec = np.array(
          [['*', '*', '*', '*', '*', '*'],
           ['*', ' ', ' ', ' ', ' ', '*'],
           ['*', ' ', ' ', ' ', ' ', '*'],
           ['*', ' ', ' ', ' ', ' ', '*'],
           ['*', ' ', ' ', ' ', 'g', '*'],
           ['*', '*', '*', '*', '*', '*']]
      )
    else:
      assert len(world_spec.shape) == 2
      self.world_spec = world_spec

    assert len(np.where(self.world_spec == 'g')[0]) == 1  # Only one goal.
    self.policy = np.full_like(self.world_spec, ' ')
    # **Note**: These may be useful for your planner!
    self.goal_cell = [x[0] for x in np.where(self.world_spec == 'g')]

  def get_neighbours(self, cell: Tuple[int, int]):
    """Get the neighbours of a cell.

    **Note**: You should use this when writing your planner!

    Args:
      cell: cell position.

    Returns:
      Dict containing neighbouring cells for each of the 4 possible directions.
    """
    height, width = self.world_spec.shape
    i, j = cell
    if i < 1 or i >= height or j < 1 or j >= width:
      raise ValueError(f'Invalid cell position: {cell}')
    neighbours = {}
    for a in ACTIONS:
      delta = ACTION_EFFECTS[a]
      neighbour_pos = [i + delta[0], j + delta[1]]
      if (neighbour_pos[0] < 0 or neighbour_pos[1] < 0 or
          neighbour_pos[0] >= height or neighbour_pos[1] >= width or
          self.world_spec[neighbour_pos[0], neighbour_pos[1]] == '*'):
        # Remain in same cell
        neighbours[a] = cell
      else:
        neighbours[a] = neighbour_pos
    return neighbours

  def plan(self):
    """Constructs a random policy.

    **Note**: you will make something better further down!
    """
    for i in range(self.policy.shape[0]):
      for j in range(self.policy.shape[1]):
        if self.world_spec[i, j] == '*':  # Nothing to do for walls.
          continue
        self.policy[i, j] = ACTIONS[np.random.choice(len(ACTIONS))]

  def draw(self, include_policy: bool = False):
    """Draw the grid, and (optionally) include the policy."""
    for i in range(len(self.world_spec)):
      row_range = range(len(self.world_spec[i]))
      if include_policy:
        row_chars = [get_emoji(self.world_spec[i, j], self.policy[i, j]) for j in row_range]
      else:
        row_chars = [get_emoji(self.world_spec[i, j], None) for j in row_range]
      print(''.join(row_chars))


gwb = GridWorldBase()
print('Simple GridWorld:')
gwb.draw()
gwb.plan()
print('Random policy:')
gwb.draw(True)

Simple GridWorld:
⬛⬛⬛⬛⬛⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⭐⬛
⬛⬛⬛⬛⬛⬛
Random policy:
⬛⬛⬛⬛⬛⬛
⬛🔼🔽🔽▶️⬛
⬛▶️🔽🔽▶️⬛
⬛◀️◀️🔼🔼⬛
⬛🔽▶️▶️⭐⬛
⬛⬛⬛⬛⬛⬛

class GridWorldPlanner(GridWorldBase):
  """A GridWorld that finds a better policy."""

  def plan(self):
    """Define a better planner!

    This gives you a starting point by setting the proper actions in the cells
    surrounding the goal cell.

    **Assignment:** Do the rest!
    """
    super().plan()
    goal_queue = [self.goal_cell]
    goals_done = set()
    goal = goal_queue.pop(0)  # pop from front of list
    goal_neighbours = self.get_neighbours(goal)
    goals_done.add(tuple(goal))

    for a in goal_neighbours:
      nbr = tuple(goal_neighbours[a])
      if nbr == goal:
        continue
      if nbr not in goals_done:
        if a == '<':
          self.policy[nbr[0], nbr[1]] = '>'
        elif a == '>':
          self.policy[nbr[0], nbr[1]] = '<'
        elif a == '^':
          self.policy[nbr[0], nbr[1]] = 'v'
        else:
          self.policy[nbr[0], nbr[1]] = '^'
        goal_queue.append(nbr)


gwp = GridWorldPlanner()
print('Simple GridWorld:')
gwp.draw()
gwp.plan()
print('Better policy:')
gwp.draw(True)

Simple GridWorld:
⬛⬛⬛⬛⬛⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⬜⬛
⬛⬜⬜⬜⭐⬛
⬛⬛⬛⬛⬛⬛
Better policy:
⬛⬛⬛⬛⬛⬛
⬛◀️🔼🔽🔽⬛
⬛🔼🔽🔽▶️⬛
⬛🔽🔽◀️🔽⬛
⬛🔼▶️▶️⭐⬛
⬛⬛⬛⬛⬛⬛

Make a better planner!

class GridWorldPlanner(GridWorldBase):
  """A GridWorld that finds a better policy."""

  def plan(self):
    """Define a better planner!

    This gives you a starting point by setting the proper actions in the cells
    surrounding the goal cell.

    **Assignment:** Do the rest!
    """
    super().plan()
    goal_queue = [self.goal_cell]
    goals_done = set()
    #################################################
    # Implement a better planer
    raise NotImplementedError("Define a better planner!`")
    #################################################
    while ...:
      goal = goal_queue.pop(0)  # pop from front of list
      goal_neighbours = self.get_neighbours(goal)
      goals_done.add(tuple(goal))
      for a in goal_neighbours:
        nbr = tuple(goal_neighbours[a])
        if nbr == goal:
          continue
        if nbr not in goals_done:
          if a == '<':
            self.policy[nbr[0], nbr[1]] = '>'
          elif a == '>':
            self.policy[nbr[0], nbr[1]] = '<'
          elif a == '^':
            self.policy[nbr[0], nbr[1]] = 'v'
          else:
            self.policy[nbr[0], nbr[1]] = '^'
          goal_queue.append(nbr)

Click for solution

gwp = GridWorldPlanner()
print('Simple GridWorld:')
gwp.draw()
gwp.plan()
print('Better policy:')
gwp.draw(True)

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Make_a_better_planner_Exercise")

Try it out in a harder problem.¶

# @title Try it out in a harder problem.
harder_grid = np.array(
    [['*', '*', '*', '*', '*', '*', '*', '*', '*'],
     ['*', ' ', ' ', ' ', '*', ' ', ' ', 'g', '*'],
     ['*', ' ', ' ', ' ', '*', ' ', ' ', ' ', '*'],
     ['*', ' ', ' ', ' ', '*', ' ', ' ', ' ', '*'],
     ['*', ' ', ' ', ' ', '*', ' ', ' ', ' ', '*'],
     ['*', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '*'],
     ['*', '*', '*', '*', '*', '*', '*', '*', '*'],
    ]
)
gwb_2 = GridWorldBase(harder_grid)
gwp_2 = GridWorldPlanner(harder_grid)
print('Harder GridWorld:')
gwb_2.draw()
gwb_2.plan()
print('Random policy:')
gwb_2.draw(True)
print('Better policy:')
gwp_2.plan()
gwp_2.draw(True)

Section 2.2: Markov Decision Process (MDP)¶

Formulating RL problems traditionally happens via a Markov decision process (MDP). In this section, we will introduce all the necessary notation and write code to define the MDP corresponding to our simple GridWorld.

Extra: Martin Puterman’s book on Markov Decision Processes is an excellent reference if you’d like to read more.

Video 3: Markov Decision Process¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Markov_Decision_Process_Video")

Coding exercise 2: Create an MDP from the GridWorld specification¶

Create \(P\) and \(R\) matrices for MDP.

class MDPBase(object):
  """This object creates a proper MDP from a GridWorld object."""

  def __init__(self, grid_world: GridWorldBase):
    """Constructs an MDP from a GridWorldBase object.

    Args:
      grid_world: GridWorld specification.
    """
    # Determine how many valid states there are and create empty matrices.
    self.grid_world = grid_world
    self.num_states = np.sum(grid_world.world_spec != '*')
    self.num_actions = len(ACTIONS)
    self.P = np.zeros((self.num_states, self.num_actions, self.num_states))
    self.R = np.zeros((self.num_states, self.num_actions))
    self.pi = np.zeros(self.num_states, dtype=np.int32)
    # Create mapping from cell positions to state ID.
    state_idx = 0
    self.cell_to_state = np.ones(grid_world.world_spec.shape, dtype=np.int32) * -1  # Defaults to -1.
    self.state_to_cell = {}
    for i, row in enumerate(grid_world.world_spec):
      for j, cell in enumerate(row):
        if cell == '*':
          continue
        if cell == 'g':
          self.goal_state = state_idx
        self.cell_to_state[i, j] = state_idx
        self.state_to_cell[state_idx] = (i, j)
        state_idx += 1
    #################################################
    # States should be numbered from left-to-right and from top-to-bottom.
    raise NotImplementedError("Calculate P and R")
    #################################################
    # Assign transition probabilities and rewards accordingly.
    for s in range(...):
      neighbours = ...
      for a, action in enumerate(neighbours):
        nbr = ...
        s2 = self.cell_to_state[..., ...]
        self.P[s, a, s2] = 1.0  # Deterministic transitions
        if s2 == self.goal_state:
          self.R[s, a] = 1.0

  def draw(self, include_policy: bool = False):
    # First make sure we convert our MDP policy into the GridWorld policy.
    for s in range(self.num_states):
      r, c = self.state_to_cell[s]
      self.grid_world.policy[r, c] = ACTIONS[self.pi[s]]
    self.grid_world.draw(include_policy)

  def plan(self):
    """Define a planner
    """
    goal_queue = [self.goal_state]
    goals_done = set()
    #################################################
    # Set the proper actions
    raise NotImplementedError("Implement `plan` function!")
    #################################################
    while goal_queue:
      goal = goal_queue.pop(0)  # pop from front of list
      nbr_states, nbr_actions = ...
      goals_done.add(goal)
      for s, a in zip(..., ...):
        if s == goal:
          continue
        if s not in goals_done:
          self.pi[s] = ...
          goal_queue.append(s)

Click for solution

mdpb = MDPBase(gwb)

# Verify the transitions were properly created.
for i, row in enumerate(mdpb.grid_world.world_spec):
  for j, cell in enumerate(row):
    if cell == '*':
      continue
    neighbours = mdpb.grid_world.get_neighbours((i, j))
    s = mdpb.cell_to_state[i, j]
    for a, action in enumerate(neighbours):
      nbr = neighbours[action]
      s2 = mdpb.cell_to_state[nbr[0], nbr[1]]
      assert np.sum(mdpb.P[s, a, :]) == 1.0
      assert mdpb.P[s, a, s2] == 1.0
      if s2 == mdpb.goal_state:
        assert mdpb.R[s, a] == 1.0
      else:
        assert mdpb.R[s, a] == 0.0

print('P and R matrices successfully created!')
print('GridWorld:')
mdpb.draw()
print('Shortest path policy:')
mdpb.plan()
mdpb.draw(True)

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Create_an_MDP_Exercise")

Section 2.3: \(Q\)-values¶

\(Q\)-values are central to RL algorithms, as they quantify the desirability of performing an action given a particular state. The agent updates these values throughout training and can use its estimates to decide how to act.

Video 4: Q-values¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Q_values_Video")

Coding exercise 3: Create a steps-to-go solver¶

Create a new MDP class that holds steps-to-go as Q-values

class MDPToGo(MDPBase):

  def __init__(self, grid_world: GridWorldBase):
    """Constructs an MDP from a GridWorldBase object.

    States should be numbered from left-to-right and from top-to-bottom.

    Args:
      grid_world: GridWorld specification.
    """
    super().__init__(grid_world)
    self.Q = np.zeros((self.num_states, self.num_actions))

  def computeQ(self):
    """Store discounted steps-to-go in an SxA matrix called Q.

    This matrix will then be used to extract the optimal policy.
    """
    #################################################
    # Implement a function to compute Q
    raise NotImplementedError("Implement `ComputeQ` function!")
    #################################################
    goal_queue = [(self.goal_state, 0)]
    goals_done = set()
    while goal_queue:
      goal, steps_to_go = ...  # pop from front of list
      steps_to_go += ...  # Increase the number of steps to goal.
      nbr_states, nbr_actions = ...
      goals_done.add(...)
      for s, a in zip(..., ...):
        if goal == self.goal_state and s == self.goal_state:
          self.Q[s, a] = ...
        elif s == goal:
          # If (s, a) leads to itself then we have an infinite loop (since
          # we're assuming deterministic transitions).
          self.Q[s, a] = ...
        else:
          self.Q[s, a] = ...
        if s not in goals_done:
          goal_queue.append((..., ...))

  def plan(self):
    """Now planning is just doing an argmin over the Q-values!

    Note that this is a little different than standard Q-learning (where we do
    an argmax), since our Q-values currently store steps-to-go.
    """
    self.pi = np.argmin(self.Q, axis=-1)

Click for solution

mdpTg = MDPToGo(gwb)
print('GridWorld:')
mdpTg.draw()
# Compute Q, then extract policy from it.
mdpTg.computeQ()
mdpTg.plan()
print('Optimal policy:')
mdpTg.draw(True)

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Create_a_step_to_go_solver_Exercise")

Section 3: Value iteration¶

Value iteration is an iterative algorithm that continuously improves estimates of \(Q\) and \(V\) by performing the Bellman backup. This assumes access to \(P\) and \(R\) (not typically accessible in RL) but is the backbone of \(Q\)-learning, which we will discuss later.

Did you know? Richard Bellman developed dynamic programming (a core part of any computer science curriculum) precisely for value iteration.

Video 5: Value iteration¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Value_Iteration_Video")

Coding exercise 4: Implement value iteration¶

Create a new MDP class that does value iteration

class MDPValueIteration(MDPToGo):

  def __init__(self, grid_world: GridWorldBase, gamma: float = 0.99):
    """Constructs an MDP from a GridWorldBase object.

    States should be numbered from left-to-right and from top-to-bottom.

    Args:
      grid_world: GridWorld specification.
      gamma: Discount factor.
    """
    super().__init__(grid_world)
    self.gamma = gamma

  def computeQ(self, error_tolerance : float = 1e-5):
    """Compute Q and V vectors via value iteration.

    Args:
      error_tolerance: How much error we tolerate between successive Q updates.
    """
    self.Q = np.zeros((self.num_states, self.num_actions))
    num_iterations = 0
    error = np.inf
    #################################################
    # Write this method!
    # First find Q, and then extract V from Q.
    # Hint: Use matrix multiplication instead of for loops!
    raise NotImplementedError("Implement `computeQ` function!")
    #################################################
    while error > error_tolerance:
      new_Q = ...
      max_next_Q = ...
      for a in range(self.num_actions):
        new_Q[:, a] = ...
      error = ...
      self.Q = np.copy(new_Q)
      num_iterations += 1
    self.V = np.max(self.Q, axis=-1)
    print(f'Q and V found in {num_iterations} iterations with an error tolerance of {error_tolerance}.')

  def plan(self):
    """Now planning is just doing an argmax over the Q-values!
    """
    #################################################
    # Note that we're going back to argmax, since the Q-values now represent proper
    # "returns-to-go", so we want to maximize that.
    # Write this method! It should be a one-liner, and very similar to what you
    # used for extracting V from Q.
    raise NotImplementedError("Implement `plan` function!")
    #################################################
    self.pi = ...

  def _draw_v(self):
    """Draw the V values."""
    min_v = np.min(self.V)
    max_v = np.max(self.V)
    wall_v = 2 * min_v - max_v  # Creating a smaller value for walls.
    grid_values = np.ones_like(self.grid_world.world_spec, dtype=np.int32) * wall_v
    # Fill in the V values in grid cells.
    for s in range(self.num_states):
      cell = self.state_to_cell[s]
      grid_values[cell[0], cell[1]] = self.V[s]

    fig, ax = plt.subplots()
    ax.grid(False)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    grid = ax.matshow(grid_values)
    grid.set_clim(wall_v, max_v)
    fig.colorbar(grid)

  def draw(self, draw_mode: str = 'grid'):
    """Draw the GridWorld according to specified mode.

    Args:
      draw_mode: Specification of what mode to draw. Supported options:
                 'grid': Draw the base GridWorld.
                 'policy': Display the policy.
                 'values': Display the values for each state.
    """
    # First make sure we convert our MDP policy into the GridWorld policy.
    if draw_mode == 'values':
      self._draw_v()
    else:
      super().draw(draw_mode == 'policy')

Click for solution

mdpVi = MDPValueIteration(gwb)
print('GridWorld:')
mdpVi.draw()
# Compute Q, then extract policy from it.
mdpVi.computeQ()
mdpVi.plan()
print('Optimal policy:')
mdpVi.draw('policy')
mdpVi.draw('values')

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_Value_Iteration_Exercise")

Section 4: Policy iteration¶

Rather than iterating on estimates of \(Q\) and \(V\) until we’ve reached some form of convergence, why not iterate directly on the policy \(\pi\) instead?

Policy iteration does just that and can sometimes lead to solutions in fewer steps than value iteration.

Video 6: Policy iteration¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Policy_Iteration_Video")

Coding exercise 5: Implement policy iteration¶

Create a new MDP class that does policy iteration.

class MDPPolicyIteration(MDPToGo):

  def __init__(self, grid_world: GridWorldBase, gamma: float = 0.99):
    """Constructs an MDP from a GridWorldBase object.

    States should be numbered from left-to-right and from top-to-bottom.

    Args:
      grid_world: GridWorld specification.
      gamma: Discount factor.
    """
    super().__init__(grid_world)
    self.gamma = gamma

  def findPi(self):
    """Find π policy.
    """
    self.Q = np.zeros((self.num_states, self.num_actions))
    self.pi = np.zeros(self.num_states, dtype=np.int32)
    num_iterations = 0
    #################################################
    # Compute π, which involves computing Q.
    # Once you have π and Q, find V.
    # Hint: Your value iteration solution will be useful here.
    raise NotImplementedError("Implement `findPi` function!")
    #################################################
    new_pi = ...  # initialize to ones
    while np.any(new_pi != self.pi):
      new_pi = ...
      new_Q = ...  # initialize to zeros
      next_V = ...
      for a in range(self.num_actions):
        new_Q[:, a] = ...
      self.Q = np.copy(new_Q)
      self.pi = ...
      num_iterations += 1
    self.V = ...
    print(f'Q and V found in {num_iterations} iterations.')

  def _draw_v(self):
    """Draw the V values."""
    min_v = np.min(self.V)
    max_v = np.max(self.V)
    wall_v = 2 * min_v - max_v  # Creating a smaller value for walls.
    grid_values = np.ones_like(self.grid_world.world_spec, dtype=np.int32) * wall_v
    # Fill in the V values in grid cells.
    for s in range(self.num_states):
      cell = self.state_to_cell[s]
      grid_values[cell[0], cell[1]] = self.V[s]

    fig, ax = plt.subplots()
    ax.grid(False)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    grid = ax.matshow(grid_values)
    grid.set_clim(wall_v, max_v)
    fig.colorbar(grid)

  def draw(self, draw_mode: str = 'grid'):
    """Draw the GridWorld according to specified mode.

    Args:
      draw_mode: Specification of what mode to draw. Supported options:
                 'grid': Draw the base GridWorld.
                 'policy': Display the policy.
                 'values': Display the values for each state.
    """
    # First make sure we convert our MDP policy into the GridWorld policy.
    if draw_mode == 'values':
      self._draw_v()
    else:
      super().draw(draw_mode == 'policy')

Click for solution

mdpPi = MDPPolicyIteration(gwb)
print('GridWorld:')
mdpPi.draw()
# Compute Q, then extract policy from it.
mdpPi.findPi()
print('Optimal policy:')
mdpPi.draw('policy')
mdpPi.draw('values')

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_Policy_Iteration_Exercise")

Section 5: \(Q\)-learning algorithm¶

RL assumes we don’t have access to \(P\) nor \(R\), so we can use neither value nor policy iteration to find an optimal behavior for our agent.

\(Q\)-learning, however, incorporates the Bellman backup into an online learning algorithm: \(Q\)-learning, which can be shown to converge to the true \(Q\)-values (under mild conditions)!

Video 7: Q-learning¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Q_learning_Video")

Coding exercise 6: Implement Q-learning¶

Create a Q-learning class

class QLearner(MDPValueIteration):

  def __init__(self, grid_world: GridWorldBase, gamma: float = 0.99):
    """Constructs an MDP from a GridWorldBase object.

    States should be numbered from left-to-right and from top-to-bottom.

    Args:
      grid_world: GridWorld specification.
      gamma: Discount factor.
    """
    super().__init__(grid_world, gamma)
    self.Q = np.zeros((self.num_states, self.num_actions))
    # Pick an initial state randomly.
    self.current_state = np.random.choice(self.num_states)

  def step(self, action: int) -> Tuple[int, float]:
    """Take a step in MDP from self.current_state.

    Args:
      action: Action to take.

    Returns:
      Next state and reward received.
    """
    new_state = np.random.choice(self.num_states,
                                 p=self.P[self.current_state, action, :])
    return (new_state, self.R[self.current_state, action])

  def pickAction(self) -> int:
    """Pick the best action from the current state and Q-value estimates."""
    return np.argmax(self.Q[self.current_state, :])

  def maybeReset(self):
    """If current_state is goal, reset to a random state."""
    if self.current_state == self.goal_state:
      self.current_state = np.random.choice(self.num_states)

  def learnQ(self, alpha: float = 0.1, max_steps: int = 10_000):
    """Learn the Q-function by interacting with the environment.

    Args:
      alpha: Learning rate.
      max_steps: Maximum number of steps to take.
    """
    self.Q = np.zeros((self.num_states, self.num_actions))
    num_steps = 0
    #################################################
    # Hint: Use the step(), pickAction(), and maybeReset() functions above.
    # Note: The way you initialize the Q-values is crucial here. Try first with
    # an all-zeros initialization (as is currently coded below). If it doesn't
    # work, try a different initialization.
    # Hint: The maximum possible value (given the rewards are in [0, 1]) is
    #       1 / (1 - gamma).
    raise NotImplementedError("Write `learnQ` function!")
    #################################################
    while num_steps < max_steps:
      a = ...
      new_state, r = ...
      td = ...
      self.Q[self.current_state, a] += ...
      self.current_state = ...
      self.maybeReset()
      num_steps += 1
    self.V = ...


  def plan(self):
    """Now planning is just doing an argmin over the Q-values!

    Note that this is a little different than standard Q-learning (where we do
    an argmax), since our Q-values currently store steps-to-go.
    """
    self.pi = np.argmax(self.Q, axis=-1)
    self.V = np.max(self.Q, axis=-1)

Click for solution

base_q_learner = QLearner(gwb)
print('GridWorld:')
base_q_learner.draw()
# Compute Q, then extract policy from it.
base_q_learner.learnQ()
base_q_learner.plan()
print('Optimal policy:')
base_q_learner.draw('policy')
base_q_learner.draw('values')

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_Q_learning_Exercise")

Section 6: \(\epsilon\)-greedy exploitation¶

Should an agent exploit its current estimates and policy, or should it explore the environment in case better policies are out there? This exploration-exploitation dilemma is a central problem in RL.

In this section, we explore one of the simplest yet most effective methods for this tradeoff: the so-called \(\epsilon\)-greedy exploration.

Video 8: Epsilon-greedy exploration¶

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Epsilon_greedy_exploration_Video")

Coding exercise 7: Implement epsilon-greedy exploration¶

Create a Q-learning class with epsilon-greedy exploration

class QLearnerExplorer(QLearner):

  def __init__(self, grid_world: GridWorldBase, gamma: float = 0.99,
               epsilon: float = 0.1):
    """Constructs an MDP from a GridWorldBase object.

    States should be numbered from left-to-right and from top-to-bottom.

    Args:
      grid_world: GridWorld specification.
      gamma: Discount factor.
      epsilon: Exploration rate.
    """
    super().__init__(grid_world, gamma)
    self.epsilon = epsilon

  def pickAction(self):
    """Pick the next action from the current state.
    """
    #################################################
    # With probability epsilon will pick the next action randomly, otherwise will
    # pick based on the Q-value estimates.
    # Hint: It should only be a few lines of code!
    raise NotImplementedError("Write the `pickAction` function!")
    #################################################
    if ... < ...:
      return ...
    return ...

Click for solution

explorer = QLearnerExplorer(gwb)
print('GridWorld:')
explorer.draw()
# Compute Q, then extract policy from it.
explorer.learnQ()
explorer.plan()
print('Optimal policy:')
explorer.draw('policy')
explorer.draw('values')

Submit your feedback¶

# @title Submit your feedback
content_review(f"{feedback_prefix}_Implement_epsilon_greedy_exploration_Exercise")

Summary¶

Reinforcement learning is important in artificial intelligence because it allows an agent to learn how to make decisions in complex environments based on trial and error. Understanding how an agent learns and the various algorithms used to facilitate this learning is crucial for developing effective RL systems.

The Gridworld environment is a commonly used testbed for RL algorithms and provides a simple yet challenging environment for agents to learn in. By mastering the Gridworld environment, researchers can apply the same principles to more complex tasks.
\(Q\)-values are a fundamental concept in RL, representing the expected reward an agent will receive by taking a certain action in a given state. Computing \(Q\)-values is essential for many RL algorithms, including \(Q\)-learning.
The value and policy iteration algorithms are important because they provide a way to compute optimal policies for an agent in a given environment. These algorithms help an agent make decisions leading to the greatest cumulative reward over time.
\(Q\)-learning is a widely used RL algorithm known for its simplicity and effectiveness. It is essential because it allows an agent to learn how to make decisions in an environment without requiring a model of the environment and can be applied to a wide range of tasks.
Finally, \(\epsilon\)-greedy exploration is an important concept in RL because it helps to balance the exploration-exploitation tradeoff. Occasionally choosing a random action allows an agent to explore new areas of the environment and potentially discover better policies.

Daily survey¶

Don’t forget to complete your reflections and content check in the daily survey! Please be patient after logging in as there is a small delay before you will be redirected to the survey.

Neuromatch Academy: Deep Learning (Experimental)

Tutorial 1: Basic Reinforcement Learning

Contents

Tutorial 1: Basic Reinforcement Learning¶

Tutorial Objectives¶

Setup¶

Install and import feedback gadget¶

Section 1: A history of RL¶

Video 1: Intro to RL¶

Submit your feedback¶

Section 2: What is RL¶

Section 2.1: Grid World¶

Video 2: Grid World¶

Submit your feedback¶

Coding Exercise 1: Code a shortest-path planner for GridWorld¶

Create the GridWorldPlanner object (defaults to simple example)¶

Submit your feedback¶

Try it out in a harder problem.¶

Section 2.2: Markov Decision Process (MDP)¶

Video 3: Markov Decision Process¶

Submit your feedback¶

Coding exercise 2: Create an MDP from the GridWorld specification¶

Submit your feedback¶

Section 2.3: \(Q\)-values¶

Video 4: Q-values¶

Submit your feedback¶

Coding exercise 3: Create a steps-to-go solver¶

Submit your feedback¶

Section 3: Value iteration¶

Video 5: Value iteration¶

Submit your feedback¶

Coding exercise 4: Implement value iteration¶

Submit your feedback¶

Section 4: Policy iteration¶

Video 6: Policy iteration¶

Submit your feedback¶

Coding exercise 5: Implement policy iteration¶

Submit your feedback¶

Section 5: \(Q\)-learning algorithm¶

Video 7: Q-learning¶

Submit your feedback¶

Coding exercise 6: Implement Q-learning¶

Submit your feedback¶

Section 6: \(\epsilon\)-greedy exploitation¶

Video 8: Epsilon-greedy exploration¶

Submit your feedback¶

Coding exercise 7: Implement epsilon-greedy exploration¶

Submit your feedback¶

Summary¶

Daily survey¶