Code: Here

1. Goal

The problem setting is the Inverted Pendulum problem in OpenAI Gym: try to keep a frictionless pendulum standing up.

[Figure: test run of the trained agent]

2. Environment

The pendulum environment has a continuous state and action space, specified as follows (copied from the OpenAI Gym wiki):


State

Type: Box(3)

| Num | State    | Observation | Min  | Max |
|-----|----------|-------------|------|-----|
| 0   | Angle1   | cos(theta)  | -1.0 | 1.0 |
| 1   | Angle2   | sin(theta)  | -1.0 | 1.0 |
| 2   | Velocity | theta dot   | -8.0 | 8.0 |


Actions

Type: Box(1)

| Num | Action       | Min  | Max |
|-----|--------------|------|-----|
| 0   | Joint effort | -2.0 | 2.0 |


Reward

The precise equation for reward:

\[-(\theta^2 + 0.1*\dot{\theta}^2 + 0.001*\text{action}^2)\]

Theta is normalized between -pi and pi. Therefore, the lowest reward is \(-(\pi^2 + 0.1*8^2 + 0.001*2^2) = -16.2736044\), and the highest reward is \(0\). In essence, the goal is to remain at zero angle (vertical), with the least rotational velocity, and the least effort.
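
As a sanity check, here is a minimal sketch of this reward in code, with the angle wrapped into \([-\pi, \pi]\) as described above:

```python
import numpy as np

def angle_normalize(theta):
    # Wrap the angle into [-pi, pi)
    return ((theta + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(theta, theta_dot, action):
    # -(theta^2 + 0.1 * theta_dot^2 + 0.001 * action^2)
    return -(angle_normalize(theta) ** 2
             + 0.1 * theta_dot ** 2
             + 0.001 * action ** 2)

# Worst case: hanging straight down at maximum speed under maximum torque
print(pendulum_reward(np.pi, 8.0, 2.0))  # ~ -16.2736
```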


Starting State

Random angle in \([-\pi, \pi]\), and random velocity in \([-1, 1]\).


Episode Termination

There is no specified termination. Adding a maximum number of steps might be a good idea.

NOTE: Your environment object will be wrapped by the TimeLimit wrapper if created using the gym.make method; in that case, episodes terminate after 200 steps.
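
For reference, a minimal sketch of creating the environment this way, assuming the classic Pendulum-v0 id (which matches the specification above) and the pre-0.26 Gym step API:

```python
import gym

# gym.make wraps the registered environment in a TimeLimit wrapper,
# so episodes end automatically after 200 steps.
env = gym.make("Pendulum-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```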


Solved Condition

There are no specific requirements; see the experiments section for a comparison of performance against the leaderboard.


3. Approach

The approach uses the one-step actor-critic (episodic) algorithm, with the temporal-difference (TD) error serving as the critic's estimate of the advantage. The policy is continuous, namely a Gaussian policy.


3.1 Discretization

Since the state space is continuous and is comprised of the angle of the pendulum and its velocity, we will discretize the dimensions using radial basis function (RBF) kernels as follows:

  • Angle1 and Angle2 will each be discretized into 15 radial basis kernels

  • Pendulum velocity will be discretized into 15 radial basis kernels.

  • Essentially, each “transformed” state is computed by measuring how far the raw state is from each kernel center, with the per-dimension centers combined into a \(15 \times 15 \times 15\) grid.

Therefore, in total, there will be \(15^3 = 3375\) RBF features per state.
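
A minimal sketch of such a featurizer, assuming the 15 centers per dimension are spaced uniformly over the observation ranges and combined into a full grid, and that the raw (unscaled) observation is used; the kernel width comes from section 3.6:

```python
from itertools import product

import numpy as np

N_KERNELS = 15          # kernels per state dimension
KERNEL_WIDTH = 0.1      # sigma_rbf from section 3.6

# One set of centers per dimension: cos(theta), sin(theta), theta_dot
centers_1d = [
    np.linspace(-1.0, 1.0, N_KERNELS),   # cos(theta)
    np.linspace(-1.0, 1.0, N_KERNELS),   # sin(theta)
    np.linspace(-8.0, 8.0, N_KERNELS),   # theta_dot
]
# Cartesian product -> 15^3 = 3375 kernel centers in the 3-D observation space
centers = np.array(list(product(*centers_1d)))

def featurize(observation):
    # RBF activation: how close the raw observation is to each kernel center
    sq_dist = np.sum((centers - observation) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * KERNEL_WIDTH ** 2))

x = featurize(np.array([1.0, 0.0, 0.0]))  # shape (3375,)
```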


3.2 Exploration vs Exploitation

To address the exploration-exploitation dilemma, we slowly decrease the standard deviation of the Gaussian policy. This lets the agent's actions vary widely at first, for exploration. Later in training, with a smaller standard deviation, actions stay close to the estimated Gaussian mean, exploiting the learned policy weights for more precise action selection.
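
A minimal sketch of this schedule (the start and end values come from section 3.6; the per-episode linear interpolation is how I read “decreasing linearly over all training episodes”):

```python
SIGMA_START, SIGMA_END = 0.5, 0.1

def sigma_for_episode(episode, total_episodes):
    # Linearly interpolate the policy's standard deviation from SIGMA_START
    # at episode 0 down to SIGMA_END at the final episode.
    frac = episode / max(total_episodes - 1, 1)
    return SIGMA_START + frac * (SIGMA_END - SIGMA_START)
```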


3.3 Gaussian Policy

To generate continuous actions for this problem, the Gaussian policy is a simple and solid baseline to start with.

\[\Large \pi(a \mid s, \theta) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(a - \mu(s, \theta))^2}{2\sigma^2}}\]

We can see that there are three quantities we need to provide for the policy to generate an action:

  • \(\mu\): the mean given the state and weights; this will be approximated by a function approximator
  • \(\sigma\): the standard deviation parameter for this policy
  • \(\theta\): the weights of the policy that will be trained
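
The actor update in section 3.5 needs \(\triangledown \ln \pi\). As a sketch, assuming the mean is linear in the RBF features, \(\mu(s, \theta) = \theta^\top x(s)\) (the same linear form used for the value function below), the log-policy and its gradient work out to:

\[\Large \begin{align*} \ln \pi(a \mid s, \theta) &= -\frac{(a - \mu(s, \theta))^2}{2\sigma^2} - \ln\left(\sigma\sqrt{2\pi}\right) \\ \triangledown_{\theta} \ln \pi(a \mid s, \theta) &= \frac{a - \mu(s, \theta)}{\sigma^2}\, x(s) \end{align*}\]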


3.4 Linear Value Function

To keep the approach simple, our value function is a linear function approximator with one parameter per RBF feature, i.e. 3375 weights. The predicted value of a state is just the dot product between the current weights and that state's feature vector.
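
A minimal sketch of this, where x is the 3375-dimensional feature vector from the RBF featurizer and theta_v are the critic weights (names are illustrative):

```python
import numpy as np

def value(x, theta_v):
    # Linear value estimate: dot product of critic weights and RBF features
    return np.dot(theta_v, x)

# Note: the gradient of this linear value function w.r.t. theta_v is simply x,
# which is exactly what the critic update in section 3.5 uses.
```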


3.5 Training

As the name suggests, the agent is trained one step at a time, meaning the learning happens after every timestep (at most 200 timesteps make up one episode).


Policy (Actor) and Value (Critic) Function Weight Update

\[\Large \begin{align*} \alpha_v &= \text{Value function learning rate}\\ \alpha_p &= \text{Policy function learning rate}\\ \hat{v} &= \text{Estimated value for the given state} \\ \theta_v &= \text{Value function parameterization weights}\\ \theta_p &= \text{Policy function parameterization weights}\\ \delta &= \text{Advantage}\\ \gamma &= \text{Discount rate}\\ I &= \text{Policy parameter scaling factor, initially 1} \\ R &= \text{Reward}\\ \\ \delta &\leftarrow R + \gamma \hat{v}(S', \theta_v) - \hat{v}(S, \theta_v)\\ \theta_v &\leftarrow \theta_v + \alpha_v \delta \triangledown \hat{v}(S, \theta_v) \\ \theta_p &\leftarrow \theta_p + \alpha_p I \delta \triangledown \ln{\pi(A \mid S, \theta_p)} \\ I &\leftarrow \gamma I \\ \end{align*}\]
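
A minimal sketch of one such update in code, assuming the linear value function above and a Gaussian policy with a linear mean over the RBF features (the function name, argument names, and the terminal-state handling are illustrative; the learning rates come from section 3.6):

```python
import numpy as np

def one_step_actor_critic_update(x, a, r, x_next, done,
                                 theta_p, theta_v, sigma, I,
                                 alpha_p, alpha_v, gamma=0.99):
    # TD error, used as the advantage estimate (bootstrap is zero on terminal states)
    v = np.dot(theta_v, x)
    v_next = 0.0 if done else np.dot(theta_v, x_next)
    delta = r + gamma * v_next - v

    # Critic update: the gradient of the linear value function is the feature vector x
    theta_v = theta_v + alpha_v * delta * x

    # Actor update: score function of the Gaussian policy with linear mean
    mu = np.dot(theta_p, x)
    grad_log_pi = (a - mu) / sigma**2 * x
    theta_p = theta_p + alpha_p * I * delta * grad_log_pi

    # I starts at 1 at the beginning of each episode and decays by gamma every step
    return theta_p, theta_v, gamma * I
```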


3.6 Hyperparameters

  • Discount rate (\(\gamma\)): 0.99
  • Actor learning rate (\(\alpha_p\)): \(1 \times 10^{-4}\)
  • Critic learning rate (\(\alpha_v\)): \(5 \times 10^{-3}\)
  • Gaussian Policy standard deviation (\(\sigma_p\)): 0.5 decreasing to 0.1 linearly over all training episodes
  • Radial Basis Function Kernel Width (\(\sigma_{rbf}\)): 0.1
  • Total number of RBF kernels: 15 for each dimension of the state.


4. Experiment & Findings

| # Episodes | Training Time | Avg Reward (last 100 episodes) |
|------------|---------------|--------------------------------|
| 2000       | 213 seconds   | -146                           |
| 1000       | 117 seconds   | -151                           |
| 500        | 76 seconds    | -215                           |
| 200        | 23.55 seconds | -1062                          |

[Figure: rewards over the 500-episode run]

[Figure: rewards over the 2000-episode run]

We can see a significant improvement in performance after training for 1000 episodes. Looking at the rewards graph for the 2000-episode run, it converged to better than -200 reward at around 500 episodes, which suggests that better initial and decay values for \(\sigma_p\) could be chosen to speed up convergence in shorter training sessions.

It also shows that 15 RBF kernels per state dimension are enough to discretize the state space. Since this hyperparameter directly affects training speed and memory usage, it is worth revisiting whenever it becomes a bottleneck.

Compared to the leaderboard below, our performance falls within the range of typical submissions.

| User       | Best 100-episode performance |
|------------|------------------------------|
| msinto93   | -123.11 ± 6.86               |
| msinto93   | -123.79 ± 6.90               |
| heerad     | -134.48 ± 9.07               |
| BS Haney   | -135                         |
| ThyrixYang | -136.16 ± 11.97              |
| lirnli     | -152.24 ± 10.87              |


5. Next Steps

Fine-tune the initial and decay values for \(\sigma_p\) and see their effect on the convergence rate and overall performance.

Remember that we’re still using a very simple linear function approximator for our value function. It works quite well even with this simple baseline setup, which makes me wonder how much improvement we would see with something more powerful.

Try to shrink the state discretization to fewer RBF kernels, and see whether there’s a significant reduction in training efficiency or effectiveness.

Overall, this was a very shallow dive into actor-critic, and there are far more intricacies to this method. I will try to explore other alternatives with the OpenAI gym. Stay tuned.