This is the first post in a series on teaching an agent to trade. I will evaluate different reinforcement learning (RL) approaches and share findings along the way. The goal of the series is to learn RL by applying it to an actual problem that I can relate to.

Policy Gradient

Policy gradient is a policy-based approach, where the goal of training is to develop a policy that maximizes the reward the agent receives over time. The other approach is value-based, which instead tries to develop a value function that outputs the value (goodness) of choosing a particular action in a given state.

Problem Setting

High-level overview

Agent

At the start of each time step in each episode, the agent is presented with a state (see the Environment section for details). The agent initially selects actions (buy, sell, hold) at random and observes the outcomes to determine which actions it should choose or avoid later. At the end of each episode, the agent is trained on the entire dataset to improve its policy.
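To make the loop concrete, here's a minimal sketch of a single episode in Python. The array shapes, the placeholder uniform policy, and the commented-out training call are assumptions for illustration, not the actual implementation.

```python
import numpy as np

ACTIONS = ("buy", "sell", "hold")

# Hypothetical placeholders -- names and shapes are assumptions, not the real code.
prices = np.random.rand(390, 3)            # one fake trading day of 1-minute (high, low, current) bars
episode = []                               # (state, action, reward) tuples collected over the episode

for t in range(len(prices)):
    state = prices[t]                      # the agent is presented with a state at each time step
    probs = np.full(len(ACTIONS), 1 / 3)   # placeholder policy output; initially close to uniform
    action = np.random.choice(len(ACTIONS), p=probs)  # sampling gives the random exploration early on
    reward = 0.0                           # placeholder: unrealized P/L, see the Reward section
    episode.append((state, action, reward))

# train_policy(episode)                    # the policy update happens once per episode (see Training)
```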

Environment

The environment consists of states and actions. Each state is a set of prices (high/low/current) at a given point in time. There's an option to use 10-second raw data or 1-minute data. I'll be sticking to 1-minute intervals, as there's less noise compared to the 10-second data, which bounces around like a ping-pong ball, and we're not going for high-frequency trading (HFT) here. There are three possible actions: buy, sell, or hold.
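The same idea as the loop above, packaged as a small environment class, could look like this sketch. The class and method names are mine, and I'm assuming the 1-minute bars are already loaded as rows of (high, low, current) prices.

```python
import numpy as np

BUY, SELL, HOLD = 0, 1, 2   # the three possible actions

class MinuteBarEnv:
    """Toy environment: each state is the (high, low, current) price of one 1-minute bar."""

    def __init__(self, bars: np.ndarray):
        self.bars = bars        # shape (num_minutes, 3)
        self.t = 0

    def reset(self) -> np.ndarray:
        self.t = 0
        return self.bars[self.t]

    def step(self, action: int):
        """Advance one minute and return (next_state, done). Reward is handled separately."""
        self.t += 1
        done = self.t >= len(self.bars) - 1
        return self.bars[self.t], done
```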

Reward

This is the hardest part to design. Naively, I used the unrealized profit/loss (market value) of the entire portfolio (including cash) as the reward. Note that there's no point in scaling the reward up or down, since its magnitude is already captured by the market value anyway. Later in the series, I will go into detail on the different reward functions I've designed for trading.
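In code, that naive reward looks roughly like the snippet below. The function name is mine, and I'm interpreting "unrealized profit/loss (market value)" as the portfolio's market value relative to the starting capital.

```python
def naive_reward(cash: float, shares: float, current_price: float,
                 starting_capital: float) -> float:
    """Unrealized P/L of the whole portfolio (including cash) at the current time step."""
    market_value = cash + shares * current_price   # mark the position to the current price
    return market_value - starting_capital         # profit/loss relative to what we started with
```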

Technical Details

Policy Network Design

In policy gradient, we are trying to approximate a good policy using a neural network. To that end, I implemented a simple 3-layer neural network.
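The post doesn't show the network itself, so here's a minimal Keras sketch of what a 3-layer policy network could look like, using the ELU activations, softmax output, and 16-unit second-to-last layer discussed under Key Considerations below. The 32-unit first layer and the 3-feature input (high/low/current price) are assumptions.

```python
import tensorflow as tf

# Minimal 3-layer policy network: state (high, low, current price) -> probabilities over buy/sell/hold.
policy_net = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),                      # state: (high, low, current) price
    tf.keras.layers.Dense(32, activation="elu"),     # hidden layer (width is an assumption)
    tf.keras.layers.Dense(16, activation="elu"),     # second-to-last layer: 16 units (see below)
    tf.keras.layers.Dense(3, activation="softmax"),  # one probability per action: buy / sell / hold
])
```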

Training

Note that in policy gradient there is no traditional loss for a given sample. Instead, we use the reward to construct a loss for each sample and use it to update the neural network weights.

E.g. if we chose action 1 in our forward propagation pass, the update rule will adjust the weights of the neural network so that action 1 becomes more/less likely to be chosen in the next iteration, depending on our reward. If the return \(v_t\) is high, the size of the update is large, and vice versa. See the formal update rule below:
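\(\Delta\theta = \alpha \, \nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t\)

where \(\pi_\theta(s_t, a_t)\) is the probability the policy network assigns to action \(a_t\) in state \(s_t\), \(\alpha\) is the learning rate, and \(v_t\) is the return from time step \(t\) (this is the REINFORCE update from the referenced slides).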

Reference: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf

Key Considerations

  • How many neurons per hidden layer. I noticed that using a very small number of neurons in the second-to-last layer resulted in nothing being learned, regardless of which activation function was used in between. Once I increased the number of neurons in the second-to-last layer to 16, the loss started to decrease.
  • Activation function for each layer of the NN. Since the policy is choosing between 3 different actions, we can treat this as a multi-class classification problem. As a result, I used the exponential linear unit (ELU) for the intermediate hidden layers and the softmax function in the output layer.
  • Discount factor for future rewards. The intuitive approach is to think of this from an investment/time perspective, in other words, the time value of money. Since we're dealing with 1-minute increments of rewards and the 1-year risk-free interest rate (treasury rate) is around 2.4%, the 1-minute rate works out to be:

    \(0.024 / 365 / 24 / 60 = 0.000000046 = 4.6 \times 10^{-8}\). Therefore, the discount factor should be \(1 - 4.6 \times 10^{-8} = 0.999999954\).

  • Learning rate. This is a hyperparameter that we should tune. However, for the purposes of this short tutorial on PG, I will use a learning rate of \(4 \times 10^{-5}\) that I found to work well, without going into the details. Both this value and the discount factor above show up in the training sketch after this list.
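To tie the hyperparameters together, here's a rough sketch of an episode-level REINFORCE update using the discount factor and learning rate above. The choice of TensorFlow, plain SGD, and the small epsilon inside the log are my assumptions for illustration, not necessarily what the actual code does.

```python
import numpy as np
import tensorflow as tf

GAMMA = 0.999999954    # discount factor derived from the 1-minute risk-free rate above
LEARNING_RATE = 4e-5   # learning rate from the bullet above

def discounted_returns(rewards, gamma=GAMMA):
    """v_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Same 3-layer policy network as the sketch in the Policy Network Design section.
policy_net = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(32, activation="elu"),
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)

def reinforce_update(states, actions, rewards):
    """One policy-gradient update over a full episode: loss = -mean(log pi(a_t|s_t) * v_t)."""
    states = tf.constant(np.asarray(states, dtype=np.float32))
    actions = tf.constant(np.asarray(actions, dtype=np.int32))
    returns = tf.constant(discounted_returns(rewards), dtype=tf.float32)
    with tf.GradientTape() as tape:
        probs = policy_net(states)                                  # pi(a|s) for all three actions
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, idx) + 1e-10)   # log pi(a_t|s_t)
        loss = -tf.reduce_mean(log_probs * returns)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return float(loss)
```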

Key Challenges

Challenge 1

During training, I noticed that the training loss was stuck at 0 for many episodes. After some investigation, I realized that the ReLU activation function I chose behaved in an interesting way when the learning rate was too high: some of the neurons became “permanently dead” after passing through the ReLU activation. This meant that learning stopped for those neurons, and thus the entire training session was useless. Formally, this is called the “Dying ReLU” problem, and I can’t believe I personally encountered it. I overcame the challenge by using the exponential linear unit (ELU) instead.

Source: https://medium.com/tinymind/a-practical-guide-to-relu-b83ca804f1f7

A “dead” ReLU always outputs the same value (zero as it happens, but that is not important) for any input. Probably this is arrived at by learning a large negative bias term for its weights.

In turn, that means that it takes no role in discriminating between inputs. For classification, you could visualise this as a decision plane outside of all possible input data.

Once a ReLU ends up in this state, it is unlikely to recover, because the function gradient at 0 is also 0, so gradient descent learning will not alter the weights. “Leaky” ReLUs with a small positive gradient for negative inputs (y=0.01x when x < 0 say) are one attempt to address this issue and give a chance to recover.

Reference: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks
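To make the difference concrete, here's a tiny numpy check (a sketch, not the training code): ReLU's gradient is exactly zero for negative pre-activations, so a dead unit receives no updates, while ELU's gradient stays positive and lets the unit keep learning.

```python
import numpy as np

def relu_grad(x):
    # Gradient of ReLU: exactly 0 for negative inputs, so a "dead" unit never recovers.
    return (x > 0).astype(float)

def elu_grad(x, alpha=1.0):
    # Gradient of ELU: alpha * exp(x) for negative inputs, so some learning can still happen.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(relu_grad(x))  # [0. 0. 1. 1.]          -> no gradient flows for negative pre-activations
print(elu_grad(x))   # [~0.05 ~0.37 1. 1.]    -> small but non-zero gradient
```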

Challenge 2

To measure the effectiveness of the agent, I constantly monitor the training loss as well as the “mean_reward” per episode to see if there’s an upward trend. However, the loss seems to fluctuate with high variance, with no clear upward or downward trend.

To understand why this is inherently a challenge for our problem, let’s revisit two things:

  1. How policy gradient learns the policy. The agent initially chooses actions at random, and based on the reward each action yields in a given state, it encourages or discourages that action so it becomes more/less likely to be chosen the next time the agent sees the same state.
  2. The state definition in our problem statement. We define each state to be a set of prices at a single point in time. Combining this with (1), that the agent learns by observing the reward at a given state, we can see that a state alone doesn’t tell the agent whether the price is in an upward or a downward trend.

    E.g. a price of 70 can be part of an upward trend from 65 to 75, or a downward trend from 80 to 60. Say the agent initially bought at 70 (at random). Since the price was in an upward trend from 65 to 75, buying at 70 yielded a reward of 5. This positive reward reinforced the agent to take the “buy” action whenever it sees a state containing the price 70. However, the next time the agent encounters the same state of 70, it applies the knowledge learned last time and buys again, this time resulting in a reward of -15, because the state of 70 is now part of a downward trend from 80 to 60.

    Perhaps if we utilized a recurrent neural network or a long short-term memory (LSTM) network, we could incorporate sequence information that could help the agent make better decisions. But that’s for another time.
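For reference, a recurrent policy could look roughly like the sketch below, consuming a window of recent 1-minute bars instead of a single bar. The window length and layer sizes are placeholders, and this is only an idea, not something wired into the agent.

```python
import tensorflow as tf

WINDOW = 30  # placeholder: feed the last 30 one-minute bars instead of a single bar

# Sketch of a recurrent policy: the LSTM can carry trend information across the window.
recurrent_policy = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, 3)),               # sequence of (high, low, current) prices
    tf.keras.layers.LSTM(16),                        # summarizes the recent price trajectory
    tf.keras.layers.Dense(3, activation="softmax"),  # buy / sell / hold probabilities
])
```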

I can’t believe I spent 5 days on this last challenge, because I thought there was a bug in the algorithm. But I eventually revisited the fundamentals and came to this realization.

Results

Due to the inherent nature of vanilla policy gradient, this problem setting wasn’t “solved,” so there are no fancy profit curves to show from the agent. However, the learning was invaluable to me. Feel free to check out the code here.

Next Steps

I’ll be continuing my journey of applying RL to trading. The next step is to try out Q-learning. Stay tuned…