
# Play with CartPole

Gym is a toolkit for developing and comparing reinforcement learning algorithms. The Gym library is a collection of test problems - environments - that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

## CartPole-v1

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

OpenAI Gym CartPole-v1

Source:

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson.

Observation:

Type: Box(4)

| Num | Observation          | Min     | Max    |
|-----|----------------------|---------|--------|
| 0   | Cart Position        | -4.8    | 4.8    |
| 1   | Cart Velocity        | -Inf    | Inf    |
| 2   | Pole Angle           | -24 deg | 24 deg |
| 3   | Pole Velocity At Tip | -Inf    | Inf    |

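As a quick sketch of what the table above means in code, the observation is a length-4 array that can be unpacked in the same order. The variable names below are my own, not part of the Gym API:

```python
import gym

env = gym.make("CartPole-v1")
observation = env.reset()

# Unpack the 4 values in the order given in the table above
# (illustrative names, chosen here for readability).
cart_position, cart_velocity, pole_angle, pole_velocity_at_tip = observation
print(cart_position, cart_velocity, pole_angle, pole_velocity_at_tip)
```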
Actions:

Type: Discrete(2)

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

Note: The amount the velocity is reduced or increased is not fixed; it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.

Reward:

Reward is 1 for every step taken, including the termination step

Starting State:

All observations are assigned a uniform random value in [-0.05..0.05]

Episode Termination:

- Pole Angle is more than 12 degrees
- Cart Position is more than 2.4 (center of the cart reaches the edge of the display)
- Episode length is greater than 200

Solved Requirements:

Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
[OpenAI Gym CartPole-v1 GitHub](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)
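To make the solved requirement concrete, here is a small sketch (not part of the original post) that keeps the last 100 episode rewards of a random agent and checks whether their average reaches 195.0. A random agent will not get there, but the same bookkeeping works for any agent:

```python
import gym
from collections import deque

env = gym.make("CartPole-v0")
last_100_rewards = deque(maxlen=100)  # rolling window of the last 100 episode rewards

for episode in range(200):
    env.reset()
    episode_reward = 0.0
    done = False
    while not done:
        # Random agent: sample an action and accumulate the reward.
        _, reward, done, _ = env.step(env.action_space.sample())
        episode_reward += reward
    last_100_rewards.append(episode_reward)
    average = sum(last_100_rewards) / len(last_100_rewards)
    if len(last_100_rewards) == 100 and average >= 195.0:
        print("Solved after {} episodes!".format(episode + 1))
        break
env.close()
```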

## Environments

Here's a minimal example of getting something running. This will run an instance of the `CartPole-v0` environment for 1000 timesteps, rendering the environment at each step. You should see a window pop up rendering the classic cart-pole problem:


```python
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
```

## Observations

The environment's step function returns exactly what we need. In fact, step returns four values. These are:

- `observation` (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- `reward` (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- `done` (boolean): whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and `done` being `True` indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- `info` (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.

The following is an implementation of the classic "agent-environment loop". Each timestep, the agent chooses an action, and the environment returns an observation and a reward.

```python
import gym

env = gym.make("CartPole-v0")
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
env.close()
```

## Spaces

In the examples above, we've been sampling random actions from the environment's action space. But what actually are those actions? Every environment comes with an `action_space` and an `observation_space`. These attributes are of type `Space`, and they describe the format of valid actions and observations:

```python
import gym

env = gym.make("CartPole-v0")
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)

print(env.observation_space.high)
#> [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
print(env.observation_space.low)
#> [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
```
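`Space` objects also let you sample valid values and check membership. Below is a short sketch along the lines of the `Discrete(8)` example in the Gym documentation; the printed CartPole values shown in comments are what I would expect, not output quoted from the original post:

```python
import gym
from gym import spaces

space = spaces.Discrete(8)  # a set of 8 valid actions: {0, 1, ..., 7}
x = space.sample()          # draw a random valid action
assert space.contains(x)
assert space.n == 8

# The same methods work on CartPole's own spaces:
env = gym.make("CartPole-v0")
print(env.action_space.n)              # 2
print(env.observation_space.shape)     # (4,)
print(env.observation_space.sample())  # a random point inside the Box bounds
```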

## Play by yourself

```python
import gym
from pyglet.window import key
import time

bool_do_not_quit = True
restart = False
scores = []
a = 0  # current action, set by the keyboard handler

def key_press(k, mod):
    global bool_do_not_quit, a, restart
    if k == 0xff0d:        # Enter: restart the episode
        restart = True
    if k == key.ESCAPE:
        bool_do_not_quit = False
    if k == key.Q:
        bool_do_not_quit = False
    if k == key.LEFT:
        a = 0              # push cart to the left
    if k == key.RIGHT:
        a = 1              # push cart to the right

def play_CartPole_yourself():
    global restart
    env = gym.make("CartPole-v0")
    env.reset()
    env.render()
    env.viewer.window.on_key_press = key_press
    while bool_do_not_quit:
        env.reset()
        total_reward = 0.0
        steps = 0
        restart = False
        t1 = time.time()
        while bool_do_not_quit:
            observation, reward, done, info = env.step(a)
            time.sleep(0.1)
            total_reward += reward
            steps += 1
            env.render()
            # if done or restart:
            if restart:
                t1 = time.time() - t1
                scores.append(total_reward)
                print("Trial", len(scores), "| Score:", total_reward, "|", steps, "steps | %0.2fs." % t1)
                break
    env.close()

play_CartPole_yourself()
```
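If playing by hand gets tiring, even a tiny hand-written policy does reasonably well on CartPole. The rule below (push the cart toward the side the pole is leaning) is just an illustrative sketch of mine, not part of the environment or the original post:

```python
import gym

def heuristic_action(observation):
    """Push the cart toward the side the pole is leaning (a simple hand-made rule)."""
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0  # 1 = push right, 0 = push left

env = gym.make("CartPole-v0")
for episode in range(5):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        env.render()
        observation, reward, done, info = env.step(heuristic_action(observation))
        total_reward += reward
    print("Episode", episode + 1, "| Score:", total_reward)
env.close()
```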
If you like my blog, please consider donating.