There are two kinds of methods in reinforcement learning: tabular methods and approximate methods. The purpose of RL is to obtain an optimal policy, which tells you which action A to choose when you are in state S. If the state and action spaces are small enough, value functions can be represented as arrays, or tables. The problem with a large state space is not just the memory needed for large tables, but the time and data needed to fill them accurately. If the state and action spaces are too large, value functions need to be approximated with limited computational resources, given the limitations of time and data. In this case, our goal instead is to find a good enough approximate solution compared to the optimal solution. This post contains some reading notes for Playing Atari with Deep Reinforcement Learning, published by DeepMind (2013) [1]. Instead of playing Atari games, I play the CartPole game, whose simulation environment is provided by OpenAI Gym. I implement the algorithm proposed by DeepMind, known as DQN, to solve CartPole. In addition, a discrete Q-learning method is discussed and compared to DQN.
CartPole is a classic control task with a continuous (effectively infinite) state space and a finite action space. It is difficult to solve with tabular methods. A neural network can solve this problem by learning successful control policies. The network is trained with a variant of the Q-learning algorithm, using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.
Background
We consider tasks in which an agent interacts with an environment, in this case the OpenAI Gym emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action from the set of legal game actions, {0, 1}. The action is passed to the emulator, which returns the next state, a reward, and a done flag.
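With the classic Gym API this interaction looks roughly like the sketch below; CartPole-v0 and the random policy are only for illustration, and env.step also returns an info dict that the post ignores (newer Gymnasium versions return extra values).

import gym

env = gym.make('CartPole-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()                 # a legal action from {0, 1}
    next_state, reward, done, info = env.step(action)  # emulator transition
    state = next_state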
We consider sequences of actions and observations, and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We use the following definitions:
We make the standard assumption that future rewards are discounted by a factor of $\gamma$ per time-step.
We define the future discounted return at time $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates.
We define the optimal action-value function $Q^*(s, a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a, \pi\right]$ as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$, where $\pi$ is a policy mapping sequences to actions (or distributions over actions).
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step was known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximizing the expected value of $r + \gamma Q^*(s', a')$, i.e. $Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$.
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$. Such value iteration algorithms converge to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$.
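As a concrete illustration of this basic tabular approach, and of the discrete Q-learning baseline mentioned in the introduction, the Bellman update can be applied directly to a table once the continuous CartPole observation is discretized into bins. The bin count, learning rate and helper name below are assumptions, not taken from the post.

import numpy as np

n_bins, n_actions = 10, 2
# one table entry per discretized state-action pair
Q = np.zeros((n_bins, n_bins, n_bins, n_bins, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, done):
    # s and s_next are 4-tuples of bin indices; one-step Bellman backup on the table
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])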
In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,
$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$, where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that we refer to as the behaviour distribution.
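Differentiating the loss with respect to the weights gives the gradient below; this is the expression from the DQN paper [1], and in practice it is optimized by stochastic gradient descent on sampled transitions rather than by computing the full expectations.

$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$

The code below forms exactly this squared TD error from a batch of transitions and lets PyTorch's autograd backpropagate it through the Q-network.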
import math, random
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
import matplotlib.pyplot as plt
import time
from collections import deque
import GPUtil
from threading import Thread
# The post shows this update step without a surrounding function; the wrapper name
# compute_td_loss is ours, and model, optimizer, gamma and device are assumed to be
# defined elsewhere (see below). The batch tensors come from the replay memory.
def compute_td_loss(state, action, reward, next_state, done):
    state = torch.FloatTensor(state).to(device)
    next_state = torch.FloatTensor(next_state).to(device)
    action = torch.LongTensor(action).to(device)
    reward = torch.FloatTensor(reward).to(device)
    done = torch.FloatTensor(done).to(device)

    q_values = model.forward(state).to(device)
    next_q_values = model.forward(next_state).to(device)

    # the value corresponding to the (s, a) pair
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)

    loss = (q_value - expected_q_value.data).pow(2).mean()
    print('-' * 50)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
# Print the size of the model (memory footprint in MB, assuming 4 bytes per parameter)
def print_model_size():
    para = sum([np.prod(list(p.size())) for p in model.parameters()])
    print('Model {} : params: {:4f}M'.format(model._get_name(), para * 4 / 1000 / 1000))
# Print GPU usage in a background thread. The run()/stop() methods are not in the
# post's excerpt; they follow GPUtil's standard monitoring pattern.
class Monitor(Thread):
    def __init__(self, delay):
        super(Monitor, self).__init__()
        self.stopped = False
        self.delay = delay  # Time between calls to GPUtil
        self.start()

    def run(self):
        while not self.stopped:
            GPUtil.showUtilization()
            time.sleep(self.delay)

    def stop(self):
        self.stopped = True
# Get the training time cost
def timeit(func):
    def inner():
        start_time = time.time()
        func()
        end_time = time.time()
        print("The Training is finished. Cost ", end_time - start_time, " S")
    return inner
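The DQN network itself is not defined in the excerpted code, so here is a minimal sketch of what a CartPole Q-network might look like; the hidden layer sizes and the epsilon-greedy act() helper are assumptions, not the post's actual architecture.

class DQN(nn.Module):
    def __init__(self, num_inputs, num_actions, device):
        super(DQN, self).__init__()
        self.device = device
        self.num_actions = num_actions
        # small fully connected network: CartPole state in, one Q-value per action out
        self.layers = nn.Sequential(
            nn.Linear(num_inputs, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        # epsilon-greedy action selection
        if random.random() > epsilon:
            with torch.no_grad():
                state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
                action = self.forward(state).max(1)[1].item()
        else:
            action = random.randrange(self.num_actions)
        return action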
# Create a DQN model, using CUDA if it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DQN(env.observation_space.shape[0], env.action_space.n, device)
if device.type == "cuda":  # compare the device type rather than the device object
    print("Using GPU")
model.to(device)
In contrast to TD-Gammon and similar online approaches, we utilize a technique known as experience replay where we store the agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data-set $\mathcal{D} = e_1, \dots, e_N$, pooled over many episodes into a replay memory.
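The replay memory itself is not shown in the post; a minimal sketch built on the deque imported above could look like the following, where the class name ReplayBuffer and its push/sample interface are assumptions.

class ReplayBuffer:
    def __init__(self, capacity):
        # old transitions are discarded automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*batch)
        return np.stack(state), action, reward, np.stack(next_state), done

    def __len__(self):
        return len(self.buffer)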
Since we know the pair $(s, a)$, we can read off the corresponding value with q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1). Then, given next_state, we compute next_q_values and keep its maximum over actions, next_q_value = next_q_values.max(1)[0], i.e. we choose the action $a'$ that maximizes the next Q-value; the target is then reward + gamma * next_q_value * (1 - done).
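Putting the pieces together, a training loop might look like the sketch below; the hyperparameters, the fixed exploration rate, and the reuse of the env, DQN, ReplayBuffer and compute_td_loss sketches above are all assumptions rather than the post's exact setup.

replay_buffer = ReplayBuffer(10000)
optimizer = optim.Adam(model.parameters())
gamma, batch_size, epsilon = 0.99, 32, 0.1
episode_rewards = []

for episode in range(500):
    state = env.reset()
    episode_reward, done = 0.0, False
    while not done:
        action = model.act(state, epsilon)               # epsilon-greedy action
        next_state, reward, done, _ = env.step(action)   # emulator transition
        replay_buffer.push(state, action, reward, next_state, done)
        state = next_state
        episode_reward += reward
        if len(replay_buffer) > batch_size:              # learn from replayed experience
            loss = compute_td_loss(*replay_buffer.sample(batch_size))
    episode_rewards.append(episode_reward)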
Reference
[1] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).