Code: Here

1. Goal

The goal is to solve the Continuous MountainCar problem (MountainCarContinuous-v0) in OpenAI Gym.

[figure: test-run]

2. Environment

The mountain car has a continuous state space, described as follows (copied from the OpenAI Gym wiki):

The acceleration of the car is controlled via the application of a force which takes values in the range [-1, 1]. The states are the position of the car in the horizontal axis on the range [-1.2, 0.6] and its velocity on the range [-0.07, 0.07]. The goal is to get the car to accelerate up the hill and get to the flag.


State

| Num | Observation  | Min   | Max  |
|-----|--------------|-------|------|
| 0   | Car Position | -1.2  | 0.6  |
| 1   | Car Velocity | -0.07 | 0.07 |

Note that velocity has been constrained to facilitate exploration, but this constraint might be relaxed in a more challenging version.
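
For reference, here is a minimal snippet (assuming the classic `gym` API) that creates the environment and prints the state and action bounds described above:

```python
import gym

# Create the continuous MountainCar environment.
env = gym.make("MountainCarContinuous-v0")

# State bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
print("Observation low :", env.observation_space.low)
print("Observation high:", env.observation_space.high)

# Single continuous action (force) in [-1, 1].
print("Action low :", env.action_space.low)
print("Action high:", env.action_space.high)
```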


Actions

| Num | Action |
|-----|--------|
| 0   | Push car to the left (negative value) or to the right (positive value) |


Reward

Reward is 100 for reaching the target of the hill on the right hand side, minus the squared sum of actions from start to goal.
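
As a rough sketch, the episode return under this scheme can be written as follows (in the Gym implementation the per-step action penalty is scaled, by 0.1 in the versions I have seen, so the exact coefficient is an assumption here):

\[\Large R = 100 \cdot \mathbb{1}[\text{goal reached}] \;-\; \sum_{t} 0.1\, a_t^{2}\]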

This reward function raises an exploration challenge, because if the agent does not reach the target soon enough, it will figure out that it is better not to move, and won’t find the target anymore.

Note that this reward is unusual with respect to most published work, where the goal was to reach the target as fast as possible, hence favouring a bang-bang strategy.


Starting State

Position between -0.6 and -0.4, null velocity.


Episode Termination

Position equal to 0.5. A constraint on velocity might be added in a more challenging version.

The episode terminates either when the car reaches the goal or when the total number of time steps reaches 1000, regardless of whether the goal was reached.


Solved Requirements

Get a reward over 90. This value might be tuned.


3. Approach

The approach uses the policy gradient algorithm with a baseline to reduce variance. Even though the state space is continuous, in this attempt we will use a discrete softmax policy. In other words, the continuous state space is discretized into buckets, and the agent outputs one of three discrete actions, [-1, 0, 1], corresponding to [push left, no action, push right].


3.1 Discretization

Since the state space has two dimensions, position and velocity, we discretize them separately: 150 buckets for position and 120 buckets for velocity.
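
A minimal sketch of how such a discretization could look (bucket counts from Section 3.4; the helper names are illustrative, not the exact repository code):

```python
import numpy as np

N_POS_BUCKETS = 150   # position buckets
N_VEL_BUCKETS = 120   # velocity buckets

# Bucket edges spanning the state bounds of MountainCarContinuous-v0.
pos_bins = np.linspace(-1.2, 0.6, N_POS_BUCKETS + 1)[1:-1]
vel_bins = np.linspace(-0.07, 0.07, N_VEL_BUCKETS + 1)[1:-1]

def discretize(observation):
    """Map a continuous (position, velocity) pair to a pair of bucket indices."""
    position, velocity = observation
    return np.digitize(position, pos_bins), np.digitize(velocity, vel_bins)

# Example: the lowest starting position with zero velocity.
print(discretize((-0.6, 0.0)))
```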


3.2 Exploration vs Exploitation

To handle the exploration-exploitation dilemma, we use an epsilon-greedy approach and slowly decrease the randomization factor over time. This ensures that the agent sees a wide variety of state-action training samples early on, while in the later part of training it follows its own “trained strategy” rather than random actions.

Technically, in the code we use a temperature term to smooth the action probabilities, and an epsilon to decide whether to take a random action or the action predicted by the policy.
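
As an illustrative sketch (not the exact repository code), the action-selection step could look like this, with a tempered softmax over the per-state action weights plus an epsilon-greedy override:

```python
import numpy as np

ACTIONS = np.array([-1.0, 0.0, 1.0])  # push left, no action, push right

def select_action(policy_weights, state_idx, epsilon, tau):
    """Pick an action index using a tempered softmax policy with epsilon-greedy exploration."""
    # Tempered softmax over this state's action weights.
    logits = policy_weights[state_idx] / tau
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()

    if np.random.rand() < epsilon:
        # Explore: uniformly random action.
        return np.random.randint(len(ACTIONS))
    # Exploit: sample from the policy's action distribution.
    return np.random.choice(len(ACTIONS), p=probs)
```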


3.3 Training

Monte Carlo

The training process follows a Monte Carlo method: training only takes place after an entire episode has completed, replaying the accumulated state/action/reward/next-state tuples. This sits at one end of a spectrum whose other end is 1-step Temporal Difference (TD) learning, so Monte Carlo is essentially \(\infty\)-step TD learning [1]. There is a trade-off in when to update the agent: updating too early can produce very noisy targets and make convergence harder, while updating too late can prolong training. The right choice depends heavily on the problem itself.
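
For illustration, the discounted return \(G_t\) for every step of a completed episode can be computed by sweeping backwards over the stored rewards (a minimal sketch):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.999):
    """Compute the Monte Carlo target G_t for each time step of a completed episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three-step episode that ends with the +100 goal reward.
print(discounted_returns([0.0, 0.0, 100.0], gamma=0.999))
```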

[1] Richard S. Sutton and Andrew G. Barto. 2018. *Reinforcement Learning: An Introduction*. A Bradford Book, Cambridge, MA, USA.


Policy Gradient Weight Update

  • \(\alpha_v\): value function learning rate
  • \(\alpha_p\): policy function learning rate
  • \(\hat{v}\): estimated value for the given state
  • \(\theta_v\): value function parameterization weights
  • \(\theta_p\): policy function parameterization weights
  • \(\delta\): advantage
  • \(\gamma\): discount rate
  • \(G\): discounted rewards (return)

\[\Large \delta \leftarrow G - \hat{v}(S, \theta_{v})\\ \Large \theta_{v} \leftarrow \theta_{v} + \alpha_v \delta \triangledown \hat{v}(S, \theta_{v}) \\ \Large \theta_{p} \leftarrow \theta_{p} + \alpha_p \delta \gamma \triangledown \ln{\pi(A \mid S, \theta_p)}\\\]
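
A sketch of how these updates might look for one time step, assuming a tabular value function and tabular policy weights over the discretized states (the names are illustrative, and `grad_log_pi` is the softmax gradient derived next):

```python
def update_step(value_weights, policy_weights, state_idx, G, grad_log_pi,
                alpha_v=1e-2, alpha_p=1e-3, gamma=0.999):
    """One REINFORCE-with-baseline update for a single (state, return) sample.

    value_weights  : one value estimate per discretized state
    policy_weights : one weight per (discretized state, action)
    grad_log_pi    : gradient of log pi(A|S) w.r.t. this state's action weights
    """
    # Advantage: Monte Carlo return minus the state-value baseline.
    delta = G - value_weights[state_idx]

    # Value update: for a tabular value function, grad v_hat(S) is simply 1 at state S.
    value_weights[state_idx] += alpha_v * delta

    # Policy update: advantage-weighted gradient of the log-policy.
    policy_weights[state_idx] += alpha_p * delta * gamma * grad_log_pi
```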

Gradient Calculation for Softmax Policy

  • \(a\): action taken by the agent
  • \(a'\): index running over the actions (the weight being differentiated)
  • \(w\): policy weights
  • \(\tau\): temperature used for exploration
  • \(s\): current state
  • \(\pi(a \mid s)\): probability the policy assigns to action \(a\) in state \(s\)

\[\Large \pi(a \mid s) = \frac{e^{w_{s,a}/\tau}}{\sum_{a'} e^{w_{s,a'}/\tau}}\]

\[\Large \begin{align*} \text{(take logs)} \quad \log \pi(a \mid s) &= \frac{w_{s,a}}{\tau} - \log \sum_{a'} e^{w_{s,a'}/\tau} \\ \text{(chain and log rules)} \quad \frac{\partial \log \pi(a \mid s)}{\partial w_{s,a'}} &= \frac{\mathbb{1}[a' = a]}{\tau} - \frac{\pi(a' \mid s)}{\tau} \\ &= \frac{\mathbb{1}[a' = a] - \pi(a' \mid s)}{\tau} \\ \text{(if } a' = a \text{)} \quad &= \frac{1 - \pi(a \mid s)}{\tau} \\ \text{(if } a' \neq a \text{)} \quad &= -\frac{\pi(a' \mid s)}{\tau} \end{align*}\]
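
The same result in code, as a sketch that returns one gradient entry per action weight of the current state:

```python
import numpy as np

def softmax_probs(state_action_weights, tau):
    """Tempered softmax pi(.|s) over one state's action weights."""
    logits = state_action_weights / tau
    logits = logits - logits.max()          # numerical stability
    exps = np.exp(logits)
    return exps / exps.sum()

def grad_log_softmax(state_action_weights, action_idx, tau):
    """Gradient of log pi(a|s) with respect to every action weight of state s."""
    probs = softmax_probs(state_action_weights, tau)
    grad = -probs / tau                     # -pi(a'|s)/tau for a' != a
    grad[action_idx] += 1.0 / tau           # (1 - pi(a|s))/tau for the taken action
    return grad
```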

3.4 Hyperparameters

  • Discount rate (\(\gamma\)): 0.999
  • Value function learning rate (\(\alpha_v\)): \(10^{-2}\)
  • Policy function learning rate (\(\alpha_p\)): \(10^{-3}\)
  • Temperature (\(\tau\)): 1 decreasing to 0.5 linearly over all training episodes
  • Discretization for position state: 150 buckets
  • Discretization for velocity state: 120 buckets
  • Epsilon-greedy (\(\epsilon\)): 1 decreasing to 0.1 linearly over all training episodes
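
The two linear schedules can be computed as shown in this small sketch (the helper name is illustrative):

```python
def linear_decay(start, end, episode, n_episodes):
    """Linearly interpolate from start to end over the training run."""
    frac = episode / max(n_episodes - 1, 1)
    return start + (end - start) * frac

# Example: halfway through a 1000-episode run.
tau = linear_decay(1.0, 0.5, episode=500, n_episodes=1000)   # temperature
eps = linear_decay(1.0, 0.1, episode=500, n_episodes=1000)   # epsilon
```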


4. Experiment & Findings

| # Episodes | Training Time | Min. Reward | Max Reward |
|------------|---------------|-------------|------------|
| 2500       | 473 seconds   | 31.7        | 90.6       |
| 1500       | 327 seconds   | 31.5        | 89.9       |
| 1000       | 254 seconds   | 31.2        | 89.3       |
| 500        | 154 seconds   | 30.2        | 86.7       |
| 200        | 77 seconds    | 31.1        | 84.6       |


4.1 Introducing baseline to reduce variance

Without Baseline

[figure: without_baseline]

Training took: 280.78 seconds

With Baseline

[figure: with_baseline]

Training took: 253.72 seconds

The comparison above shows the effect of adding a baseline on the agent’s performance. The reward graphs show that the agent with a baseline has noticeably smaller variance in its rewards, which translates into a faster convergence rate.


4.2 Discrete vs Continuous Actions

This problem (MountainCarContinuous-v0) was intended to be solved with a continuous action policy. However, I did not use continuous actions because I wanted to see how well a discrete-action agent could perform on this simple task. The result is encouraging: the number of episodes until convergence is quite good, and the max reward reaches the solved threshold of 90 at around 1000 episodes.


4.3 Performance

Let’s look at the leaderboard of this problem and MountainCar-v0, which is a discrete version of the problem.

MountainCar-v0

| User | Episodes before solve |
|------|-----------------------|
| Zhiqing Xiao | 0 (use close-form preset policy) |
| Keavnn | 47 |
| Zhiqing Xiao | 75 |
| Mohith Sakthivel | 90 |
| Anas Mohamed | 341 |
| Harshit Singh Lodha | 643 |
| Colin M | 944 |
| jing582 | 1119 |
| DaveLeongSingapore | 1967 |
| Pechckin | 30 |
| Amit | 1000-1200 |
| Gleb I | 100 |

MountainCarContinuous-v0

| User | Episodes before solve |
|------|-----------------------|
| Ashioto | 1 |
| Keavnn | 11 |
| camigord | 18 |
| Tobias Steidle | 32 |
| lirnli | 33 |
| khev | 130 |
| Sanket Thakur | 140 |
| Pechckin | 1 |
| Nikhil Barhate | 200 (HAC) |

This shows that our discrete policy is a good baseline to start with before diving into continuous actions. With continuous actions, debugging might be harder because the policy function is more complicated, perhaps a neural network approximating the policy.


5. Next Steps

An obvious next step would be to try continuous actions on the same problem and see how much more “efficient” and “effective” training can be. Intuitively, continuous actions should allow more fine-grained control of the car, and thus less “bouncing” around. Think of driving with either the pedal or the brake pressed to the floor; it’s not a very effective way to drive, is it?

Another interesting experiment would be to vary the “n” in n-step Temporal Difference (TD) learning and observe the effect on performance, convergence rate, and reward variance. Right now we are using Monte Carlo, which is essentially an \(\infty\)-step TD method.

The current approach uses policy gradient to train the agent, and policy-based methods have some limitations. Other reinforcement learning algorithms, such as Deep Q-Learning and Actor-Critic, can also tackle this problem. I will be tackling another OpenAI Gym problem (Pendulum-v0) using actor-critic. Stay tuned for my next post.