
Playing Cartpole with natural deep reinforcement learning

Introduction

There are two kinds of methods in reinforcement learning: tabular methods and approximate methods. The purpose of RL is to find an optimal policy, which tells you which action A to choose when you are in state S. If the state and action spaces are small enough, the value function can be represented as an array, or table. The problem with a large state space is not just the memory needed for large tables, but the time and data needed to fill them accurately. If the state and action spaces are too large, the value function has to be approximated with limited computational resources, and our goal instead is to find an approximate solution that is good enough compared to the optimal one. This post collects my reading notes on Playing Atari with Deep Reinforcement Learning, published by DeepMind (2013). Instead of playing Atari games, I play CartPole, whose simulation environment is provided by OpenAI Gym. I implement the algorithm proposed by DeepMind, known as DQN, to solve CartPole. A discrete Q-learning method is also discussed as a comparison to DQN.

CartPole is a classic control task with an infinite state space and a finite action space, which makes it difficult to solve with tabular methods. A neural network can learn a successful control policy for this problem. The network is trained with a variant of the Q-learning algorithm, using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism that randomly samples previous transitions, thereby smoothing the training distribution over many past behaviors.

Background

We consider tasks in which an agent interacts with an environment \(\mathcal{E}\), in this case the OpenAI Gym emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action \(a_t\) from the set of legal game actions, \(A=\){0, 1}. The action is passed to the emulator, which returns the next state, the reward, and a done flag.
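
As a concrete illustration, here is a minimal interaction loop with CartPole-v0 (a hypothetical warm-up snippet, assuming the classic Gym API in which env.step returns a (next_state, reward, done, info) tuple); the agent simply acts at random:

import gym

env = gym.make("CartPole-v0")
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()              # randomly choose 0 (push left) or 1 (push right)
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    state = next_state
print("episode return:", total_reward)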

We consider sequences of actions and observations, \(s_t=x_1, a_1, x_2,..., a_{t-1},x_t\), and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence \(s_t\) as the state representation at time \(t\).

The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We use the following definitions:

  • We make the standard assumption that future rewards are discounted by a factor of \(\gamma\) per time-step.
  • The future discounted return at time \(t\) is \(R_t=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\), where \(T\) is the time-step at which the game terminates (see the short sketch after this list).
  • The optimal action-value function \(Q^*(s, a)\) is the maximum expected return achievable by following any strategy, after seeing some sequence \(s\) and then taking some action \(a\): \(Q^*(s, a)=\max_\pi E[R_t|s_t=s,a_t=a,\pi]\), where \(\pi\) is a policy mapping sequences to actions (or distributions over actions).
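
As a quick sanity check of the return definition, the following hypothetical helper computes \(R_t\) for a finite list of rewards:

def discounted_return(rewards, gamma=0.99):
    # R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, evaluated backwards from the end
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# every CartPole step yields reward 1, so five steps give 1 + gamma + ... + gamma^4
print(discounted_return([1, 1, 1, 1, 1]))  # about 4.901 with gamma = 0.99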

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value \(Q^*(s', a')\) of the sequence \(s'\) at the next time-step were known for all possible actions \(a'\), then the optimal strategy is to select the action \(a'\) maximizing the expected value of \(r+\gamma Q^*(s', a')\):

\[Q^*(s, a)=E_{s' \sim \mathcal{E}}\left[r+\gamma \max_{a'}Q^*(s', a') \,\middle|\, s, a\right]\]

The basic idea behind many reinforcement learning algorithms is to use the Bellman equation as an iterative update, \(Q_{i+1}(s,a)=E[r+\gamma \max_{a'}Q_i(s',a')|s,a]\). Such value iteration algorithms converge to the optimal action-value function, \(Q_i \rightarrow Q^*\) as \(i \rightarrow \infty\).
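
To make the iterative update concrete, here is a minimal tabular sketch on a hypothetical two-state, two-action MDP (the transition table P and its rewards are made up purely for illustration):

import numpy as np

# hypothetical MDP: P[s][a] is a list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9
Q = np.zeros((2, 2))
for i in range(100):
    Q_new = np.zeros_like(Q)
    for s in P:
        for a in P[s]:
            # Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
            Q_new[s, a] = sum(p * (r + gamma * Q[s2].max()) for p, s2, r in P[s][a])
    Q = Q_new
print(Q)  # the iterates converge toward Q*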

In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the action-value function, \(Q(s, a; \theta)\approx Q^*(s, a)\). In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights \(\theta\) as a Q-network. A Q-network can be trained by minimizing a sequence of loss functions \(L_i(\theta_i)\) that changes at each iteration \(i\):

\[L_i(\theta_i)=E_{s,a \sim \rho(\cdot)}\left[(y_i-Q(s,a;\theta_i))^2\right]\]

where \(y_i=E_{s' \sim \mathcal{E}}[r+\gamma \max_{a'}Q(s', a';\theta_{i-1})|s, a]\) is the target for iteration \(i\), and \(\rho(s, a)\) is a probability distribution over sequences \(s\) and actions \(a\) that we refer to as the behaviour distribution.
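
Differentiating this loss with respect to the weights gives the gradient reported in the DQN paper, which stochastic gradient descent approximates with single samples from the behaviour distribution and the emulator:

\[\nabla_{\theta_i} L_i(\theta_i)=E_{s,a \sim \rho(\cdot);\,s' \sim \mathcal{E}}\Big[\big(r+\gamma \max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_i)\big)\nabla_{\theta_i}Q(s,a;\theta_i)\Big]\]

This is what compute_td_loss in the code below does on a sampled mini-batch, with the previous parameters \(\theta_{i-1}\) approximated by detaching the target from the computation graph.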

Deep Reinforcement Learning

DQN pseudocode

(Figure: pseudocode of deep Q-learning with experience replay, from the DQN paper)

Code

dqn.py

import math, random
import time
from collections import deque
from threading import Thread

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import GPUtil

action_space = 2
state_space = 4

# Replay buffer storing (state, action, reward, next_state, done) transitions
class ReplayBuffer(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        state = np.expand_dims(state, 0)
        next_state = np.expand_dims(next_state, 0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.concatenate(state), action, reward, np.concatenate(next_state), done

    def __len__(self):
        return len(self.buffer)

# Deep Q Network: a small fully connected network mapping states to Q-values
class DQN(nn.Module):
    def __init__(self, num_inputs, num_actions, device):
        super(DQN, self).__init__()
        self.device = device
        self.num_actions = num_actions
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions)
        ).to(self.device)

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        # epsilon-greedy action selection
        if random.random() > epsilon:
            state = torch.FloatTensor(state).to(self.device)
            q_value = self.forward(state)
            action = q_value.max(0)[1].item()
        else:
            action = random.randrange(self.num_actions)
        return action

# Computing the TD loss on a sampled mini-batch
def compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state = torch.FloatTensor(state).to(device)
    next_state = torch.FloatTensor(next_state).to(device)
    action = torch.LongTensor(action).to(device)
    reward = torch.FloatTensor(reward).to(device)
    done = torch.FloatTensor(done).to(device)

    q_values = model.forward(state)
    next_q_values = model.forward(next_state)
    # the value corresponding to the pair (s, a)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)
    loss = (q_value - expected_q_value.detach()).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Print the size of the model
def print_model_size(model):
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    print('Model {} : params: {:4f}M'.format(model._get_name(), para * 4 / 1000 / 1000))

# Print GPU usage from a background thread
class Monitor(Thread):
    def __init__(self, delay):
        super(Monitor, self).__init__()
        self.stopped = False
        self.delay = delay  # Time between calls to GPUtil
        self.start()

    def run(self):
        while not self.stopped:
            GPUtil.showUtilization()
            time.sleep(self.delay)

    def stop(self):
        self.stopped = True

# Decorator reporting the training time cost
def timeit(func):
    def inner():
        start_time = time.time()
        func()
        end_time = time.time()
        print("The training is finished. Cost", end_time - start_time, "s")
    return inner
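
As a quick smoke test of these pieces (a hypothetical snippet, not part of the training script, again assuming the classic Gym API), you can push a handful of random CartPole transitions into the buffer and run one loss computation:

env = gym.make("CartPole-v0")
device = torch.device("cpu")
model = DQN(state_space, action_space, device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
buffer = ReplayBuffer(100)

state = env.reset()
for _ in range(16):
    action = model.act(state, epsilon=1.0)            # act fully at random while warming up
    next_state, reward, done, _ = env.step(action)
    buffer.push(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

loss = compute_td_loss(model, 8, 0.99, buffer, optimizer, device)
print(loss.item())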

traindqn.py

from dqn import *
from torch.utils.tensorboard import SummaryWriter
import itertools

writer = SummaryWriter()

# Gym CartPole environment
env = gym.make("CartPole-v0")

# Create a DQN model, using CUDA if it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DQN(env.observation_space.shape[0], env.action_space.n, device)
if device.type == "cuda":
    print("Using GPU")
model.to(device)

# Parameters
## Epsilon greedy exploration
epsilon_start = 0.9
epsilon_final = 0.05
epsilon_decay = 200
epsilon_by_episode = lambda episode_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * episode_idx / epsilon_decay)
## Discount factor
gamma = 0.99
## Training parameters
episodes = 1000
batch_size = 4
losses = []
all_rewards = []
episode_reward = 0

# Replay memory used for computing the temporal difference loss
replay_buffer = ReplayBuffer(5000)

# Create an optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training
@timeit
def train():
    global gamma, batch_size, losses, all_rewards, episode_reward
    state = env.reset()
    for episode_idx in range(1, episodes + 1):
        epsilon = epsilon_by_episode(episode_idx)
        for i in itertools.count():
            action = model.act(state, epsilon)

            next_state, reward, done, _ = env.step(action)
            replay_buffer.push(state, action, reward, next_state, done)

            state = next_state
            episode_reward += reward

            if done:
                state = env.reset()
                all_rewards.append(episode_reward)
                print(episode_reward)
                episode_reward = 0
                break

            if len(replay_buffer) > batch_size:
                loss = compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device)
                losses.append(loss.item())

    plt.plot(all_rewards)
    plt.show()

# Visualize the learned policy over pole angle and pole velocity
def plot_policy():
    pole_angle = np.random.uniform(-env.observation_space.high[2], env.observation_space.high[2], 10000)
    pole_velocity = np.random.uniform(-math.radians(50), math.radians(50), 10000)
    action_list = []
    for i in range(len(pole_angle)):
        state = env.reset()
        state[2] = pole_angle[i]
        state[3] = pole_velocity[i]
        action = model.act(state, 0)
        action_list.append(action)
    colors = ['blue' if a == 0 else 'red' for a in action_list]  # blue = Left, red = Right
    fig = plt.figure(5, figsize=[7, 7])
    ax = fig.gca()
    plt.set_cmap('brg')
    ax.scatter(pole_angle, pole_velocity, c=colors)
    ax.set_xlabel('pole_angle')
    ax.set_ylabel('pole_velocity')
    ax.set_title('Policy')
    plt.show()

# Instantiate monitor with a 2-second delay between GPU usage updates
monitor = Monitor(2)

# Train, then plot the learned policy
train()
plot_policy()

# Close monitor
monitor.stop()

Result

After training for 1000 episodes, we get the following policy:

(Figure: the learned policy plotted over pole angle and pole velocity)

Code Comments

How to compute the TD loss

def compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state = torch.FloatTensor(state).to(device)
    next_state = torch.FloatTensor(next_state).to(device)
    action = torch.LongTensor(action).to(device)
    reward = torch.FloatTensor(reward).to(device)
    done = torch.FloatTensor(done).to(device)

    q_values = model.forward(state)
    next_q_values = model.forward(next_state)
    # the value corresponding to the pair (s, a)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)
    loss = (q_value - expected_q_value.detach()).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay, where we store the agent's experiences at each time-step, \(e_t=(s_t, a_t, r_t, s_{t+1})\), in a data-set \(D=e_1, e_2, ..., e_N\), pooled over many episodes into a replay memory.

Since we know the pair \((s_t, a_t)\), we can read off its Q-value with q_values.gather(1, action.unsqueeze(1)).squeeze(1). Then, from next_state, we compute next_q_value by taking the maximum over actions, i.e. choosing the \(a'\) that maximizes \(Q(s', a')\), and form the target reward + gamma * next_q_value * (1 - done).
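
To see what the gather call is doing, here is a tiny hypothetical example with a batch of two states and two actions:

import torch

q_values = torch.tensor([[1.0, 2.0],    # Q(s_0, a=0), Q(s_0, a=1)
                         [3.0, 4.0]])   # Q(s_1, a=0), Q(s_1, a=1)
action = torch.tensor([1, 0])           # actions actually taken for each state in the batch

# pick out Q(s_i, a_i) along dimension 1, one entry per row
q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
print(q_value)  # tensor([2., 3.])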

Reference

[1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).

[2] https://stackoverflow.com/questions/50999977/what-does-the-gather-function-do-in-pytorch-in-layman-terms

[3] https://stackoverflow.com/questions/57237352/what-does-unsqueeze-do-in-pytorch

[4] https://gist.github.com/ts1829/62b9ae7687d79279b9381bb6808bf6ab#file-mountain-car-v0-q-learning-modified-reward-ipynb

If you like my blog, please consider donating.