
Playing Cartpole with natural deep reinforcement learning

Introduction

There are two kinds of methods in reinforcement learning: tabular methods and approximate methods. The purpose of RL is to find an optimal policy, which tells you which action A to choose when you are in state S. If the state and action spaces are small enough, the value function can be represented as an array, or table. The problem with a large state space is not just the memory needed for large tables, but the time and data needed to fill them accurately. If the state and action spaces are too large, the value function has to be approximated with limited computational resources, and our goal instead is to find an approximate solution that is good enough compared to the optimal one. This post collects my reading notes on Playing Atari with Deep Reinforcement Learning, published by DeepMind (2013). Instead of playing Atari games, I play CartPole, whose simulation environment is provided by OpenAI Gym. I implement the algorithm proposed by DeepMind, known as DQN, to solve CartPole. A discrete Q-learning method is also discussed as a comparison to DQN.

CartPole is a classic control task with an infinite state space and a finite action space, which makes it difficult to solve with tabular methods. A neural network can learn a successful control policy for this problem. The network is trained with a variant of the Q-learning algorithm, using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism that randomly samples previous transitions, thereby smoothing the training distribution over many past behaviors.

Background

We consider tasks in which an agent interacts with an environment \(\mathcal{E}\), in this case the OpenAI Gym emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action \(a_t\) from the set of legal game actions, \(A=\){0, 1}. The action is passed to the emulator, which returns the next state, the reward, and a done flag.
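
As a concrete illustration, here is a minimal interaction loop with CartPole-v0 (a hypothetical warm-up snippet, assuming the classic Gym API in which env.step returns a (next_state, reward, done, info) tuple); the agent simply acts at random:

import gym

env = gym.make("CartPole-v0")
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()              # randomly choose 0 (push left) or 1 (push right)
    next_state, reward, done, info = env.step(action)
    total_reward += reward
    state = next_state
print("episode return:", total_reward)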

We consider sequences of actions and observations, \(s_t=x_1, a_1, x_2,..., a_{t-1},x_t\), and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence \(s_t\) as the state representation at time \(t\).

The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We use the following definitions:

  • We make the standard assumption that future rewards are discounted by a factor of \(\gamma\) per time-step.
  • The future discounted return at time \(t\) is \(R_t=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\), where \(T\) is the time-step at which the game terminates (see the short sketch after this list).
  • The optimal action-value function \(Q^*(s, a)\) is the maximum expected return achievable by following any strategy, after seeing some sequence \(s\) and then taking some action \(a\): \(Q^*(s, a)=\max_\pi E[R_t|s_t=s,a_t=a,\pi]\), where \(\pi\) is a policy mapping sequences to actions (or distributions over actions).
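
As a quick sanity check of the return definition, the following hypothetical helper computes \(R_t\) for a finite list of rewards:

def discounted_return(rewards, gamma=0.99):
    # R_t = sum over t' >= t of gamma^(t'-t) * r_{t'}, evaluated backwards from the end
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# every CartPole step yields reward 1, so five steps give 1 + gamma + ... + gamma^4
print(discounted_return([1, 1, 1, 1, 1]))  # about 4.901 with gamma = 0.99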

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value \(Q^*(s', a')\) of the sequence \(s'\) at the next time-step were known for all possible actions \(a'\), then the optimal strategy is to select the action \(a'\) maximizing the expected value of \(r+\gamma Q^*(s', a')\):

\[Q^*(s, a)=E_{s' \sim \mathcal{E}}\left[r+\gamma \max_{a'}Q^*(s', a') \,\middle|\, s, a\right]\]

The basic idea behind many reinforcement learning algorithms is to use the Bellman equation as an iterative update, \(Q_{i+1}(s,a)=E[r+\gamma \max_{a'}Q_i(s',a')|s,a]\). Such value iteration algorithms converge to the optimal action-value function, \(Q_i \rightarrow Q^*\) as \(i \rightarrow \infty\).
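
To make the iterative update concrete, here is a minimal tabular sketch on a hypothetical two-state, two-action MDP (the transition table P and its rewards are made up purely for illustration):

import numpy as np

# hypothetical MDP: P[s][a] is a list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9
Q = np.zeros((2, 2))
for i in range(100):
    Q_new = np.zeros_like(Q)
    for s in P:
        for a in P[s]:
            # Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
            Q_new[s, a] = sum(p * (r + gamma * Q[s2].max()) for p, s2, r in P[s][a])
    Q = Q_new
print(Q)  # the iterates converge toward Q*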

In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the action-value function, \(Q(s, a; \theta)\approx Q^*(s, a)\). In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights \(\theta\) as a Q-network. A Q-network can be trained by minimizing a sequence of loss functions \(L_i(\theta_i)\) that changes at each iteration \(i\):

\[L_i(\theta_i)=E_{s,a \sim \rho(\cdot)}\left[(y_i-Q(s,a;\theta_i))^2\right]\]

where \(y_i=E_{s' \sim \mathcal{E}}[r+\gamma \max_{a'}Q(s', a';\theta_{i-1})|s, a]\) is the target for iteration \(i\), and \(\rho(s, a)\) is a probability distribution over sequences \(s\) and actions \(a\) that we refer to as the behaviour distribution.
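
Differentiating this loss with respect to the weights gives the gradient reported in the DQN paper, which stochastic gradient descent approximates with single samples from the behaviour distribution and the emulator:

\[\nabla_{\theta_i} L_i(\theta_i)=E_{s,a \sim \rho(\cdot);\,s' \sim \mathcal{E}}\Big[\big(r+\gamma \max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_i)\big)\nabla_{\theta_i}Q(s,a;\theta_i)\Big]\]

This is what compute_td_loss in the code below does on a sampled mini-batch, with the previous parameters \(\theta_{i-1}\) approximated by detaching the target from the computation graph.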

Deep Reinforcement Learning

DQN pseudocode

(Figure: pseudocode of deep Q-learning with experience replay, from the DQN paper)

Code

dqn.py

import math, random
import time
from collections import deque
from threading import Thread

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import GPUtil

action_space = 2
state_space = 4

# Replay buffer storing (state, action, reward, next_state, done) transitions
class ReplayBuffer(object):
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        state = np.expand_dims(state, 0)
        next_state = np.expand_dims(next_state, 0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.concatenate(state), action, reward, np.concatenate(next_state), done

    def __len__(self):
        return len(self.buffer)

# Deep Q Network: a small fully connected network mapping states to Q-values
class DQN(nn.Module):
    def __init__(self, num_inputs, num_actions, device):
        super(DQN, self).__init__()
        self.device = device
        self.num_actions = num_actions
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions)
        ).to(self.device)

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        # epsilon-greedy action selection
        if random.random() > epsilon:
            state = torch.FloatTensor(state).to(self.device)
            q_value = self.forward(state)
            action = q_value.max(0)[1].item()
        else:
            action = random.randrange(self.num_actions)
        return action

# Computing the TD loss on a sampled mini-batch
def compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state = torch.FloatTensor(state).to(device)
    next_state = torch.FloatTensor(next_state).to(device)
    action = torch.LongTensor(action).to(device)
    reward = torch.FloatTensor(reward).to(device)
    done = torch.FloatTensor(done).to(device)

    q_values = model.forward(state)
    next_q_values = model.forward(next_state)
    # the value corresponding to the pair (s, a)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)
    loss = (q_value - expected_q_value.detach()).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Print the size of the model
def print_model_size(model):
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    print('Model {} : params: {:4f}M'.format(model._get_name(), para * 4 / 1000 / 1000))

# Print GPU usage from a background thread
class Monitor(Thread):
    def __init__(self, delay):
        super(Monitor, self).__init__()
        self.stopped = False
        self.delay = delay  # Time between calls to GPUtil
        self.start()

    def run(self):
        while not self.stopped:
            GPUtil.showUtilization()
            time.sleep(self.delay)

    def stop(self):
        self.stopped = True

# Decorator reporting the training time cost
def timeit(func):
    def inner():
        start_time = time.time()
        func()
        end_time = time.time()
        print("The training is finished. Cost", end_time - start_time, "s")
    return inner
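
As a quick smoke test of these pieces (a hypothetical snippet, not part of the training script, again assuming the classic Gym API), you can push a handful of random CartPole transitions into the buffer and run one loss computation:

env = gym.make("CartPole-v0")
device = torch.device("cpu")
model = DQN(state_space, action_space, device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
buffer = ReplayBuffer(100)

state = env.reset()
for _ in range(16):
    action = model.act(state, epsilon=1.0)            # act fully at random while warming up
    next_state, reward, done, _ = env.step(action)
    buffer.push(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

loss = compute_td_loss(model, 8, 0.99, buffer, optimizer, device)
print(loss.item())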

traindqn.py

from dqn import *
from torch.utils.tensorboard import SummaryWriter
import itertools

writer = SummaryWriter()

# Gym CartPole environment
env = gym.make("CartPole-v0")

# Create a DQN model, using CUDA if it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DQN(env.observation_space.shape[0], env.action_space.n, device)
if device.type == "cuda":
    print("Using GPU")
model.to(device)

# Parameters
## Epsilon greedy exploration
epsilon_start = 0.9
epsilon_final = 0.05
epsilon_decay = 200
epsilon_by_episode = lambda episode_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * episode_idx / epsilon_decay)
## Discount factor
gamma = 0.99
## Training parameters
episodes = 1000
batch_size = 4
losses = []
all_rewards = []
episode_reward = 0

# Replay memory used for computing the temporal difference loss
replay_buffer = ReplayBuffer(5000)

# Create an optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training
@timeit
def train():
    global gamma, batch_size, losses, all_rewards, episode_reward
    state = env.reset()
    for episode_idx in range(1, episodes + 1):
        epsilon = epsilon_by_episode(episode_idx)
        for i in itertools.count():
            action = model.act(state, epsilon)

            next_state, reward, done, _ = env.step(action)
            replay_buffer.push(state, action, reward, next_state, done)

            state = next_state
            episode_reward += reward

            if done:
                state = env.reset()
                all_rewards.append(episode_reward)
                print(episode_reward)
                episode_reward = 0
                break

            if len(replay_buffer) > batch_size:
                loss = compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device)
                losses.append(loss.item())

    plt.plot(all_rewards)
    plt.show()

# Visualize the learned policy over pole angle and pole velocity
def plot_policy():
    pole_angle = np.random.uniform(-env.observation_space.high[2], env.observation_space.high[2], 10000)
    pole_velocity = np.random.uniform(-math.radians(50), math.radians(50), 10000)
    action_list = []
    for i in range(len(pole_angle)):
        state = env.reset()
        state[2] = pole_angle[i]
        state[3] = pole_velocity[i]
        action = model.act(state, 0)
        action_list.append(action)
    colors = ['blue' if a == 0 else 'red' for a in action_list]  # blue = Left, red = Right
    fig = plt.figure(5, figsize=[7, 7])
    ax = fig.gca()
    plt.set_cmap('brg')
    ax.scatter(pole_angle, pole_velocity, c=colors)
    ax.set_xlabel('pole_angle')
    ax.set_ylabel('pole_velocity')
    ax.set_title('Policy')
    plt.show()

# Instantiate monitor with a 2-second delay between GPU usage updates
monitor = Monitor(2)

# Train, then plot the learned policy
train()
plot_policy()

# Close monitor
monitor.stop()

Result

After training for 1000 episodes, we get the following policy:

(Figure: the learned policy plotted over pole angle and pole velocity)

Code Comments

How to compute the TD loss

def compute_td_loss(model, batch_size, gamma, replay_buffer, optimizer, device):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state = torch.FloatTensor(state).to(device)
    next_state = torch.FloatTensor(next_state).to(device)
    action = torch.LongTensor(action).to(device)
    reward = torch.FloatTensor(reward).to(device)
    done = torch.FloatTensor(done).to(device)

    q_values = model.forward(state)
    next_q_values = model.forward(next_state)
    # the value corresponding to the pair (s, a)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)
    loss = (q_value - expected_q_value.detach()).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay, where we store the agent's experiences at each time-step, \(e_t=(s_t, a_t, r_t, s_{t+1})\), in a data-set \(D=e_1, e_2, ..., e_N\), pooled over many episodes into a replay memory.

Since we know the pair \((s_t, a_t)\), we can read off its Q-value with q_values.gather(1, action.unsqueeze(1)).squeeze(1). Then, from next_state, we compute next_q_value by taking the maximum over actions, i.e. choosing the \(a'\) that maximizes \(Q(s', a')\), and form the target reward + gamma * next_q_value * (1 - done).
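
To see what the gather call is doing, here is a tiny hypothetical example with a batch of two states and two actions:

import torch

q_values = torch.tensor([[1.0, 2.0],    # Q(s_0, a=0), Q(s_0, a=1)
                         [3.0, 4.0]])   # Q(s_1, a=0), Q(s_1, a=1)
action = torch.tensor([1, 0])           # actions actually taken for each state in the batch

# pick out Q(s_i, a_i) along dimension 1, one entry per row
q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
print(q_value)  # tensor([2., 3.])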

Reference

[1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).

[2] https://stackoverflow.com/questions/50999977/what-does-the-gather-function-do-in-pytorch-in-layman-terms

[3] https://stackoverflow.com/questions/57237352/what-does-unsqueeze-do-in-pytorch

[4] https://gist.github.com/ts1829/62b9ae7687d79279b9381bb6808bf6ab#file-mountain-car-v0-q-learning-modified-reward-ipynb

If you like my blog, please consider donating.