First and foremost, I must say, perception is harder than planning.

In this part, I will attempt to apply reinforcement learning (RL) to create an agent that can play Brawlstars. The goal of the project is similar to part 1: be a decent player and exhibit human-like behaviors. My personal goal is to learn various reinforcement learning techniques and apply them to a practical problem.

This is a challenging problem because:

  1. I will not be using any Brawlstars APIs to retrieve information about the game state; everything we humans can see is everything the agent can see.
  2. I will not be telling the agent about the game rules, what each game element means (e.g. shooting at a wall), or about the objective of killing the opposing characters. I will only provide rewards, just as a human playing the game sees the stars above their character increase when they kill opponents.
  3. Training is done in real time, as there is no simulator that lets the agent train faster than actual time. It will therefore be significantly slower than training an agent to play chess or Go, where a simulator is available.

In any RL problem definition, there are 3 components:

  1. Action: Forward, backward, left, right, stand still (no-op), normal attack, super attack
  2. Environment: The fixed map consists of 6 players (including the agent; 2 of the others are allies, 3 are enemies). Each of the other 5 players is controlled by Brawlstars’ built-in game AI.
  3. Reward: There is a star icon above the player’s avatar denoting the player’s stars. This count increases when the agent kills opponents and resets to 2 stars when the agent is killed.

1. Perception

For perception, we are concerned with modeling the environment. In the context of Brawlstars, since we don’t have access to backend APIs to retrieve the positions, states, and actions of players, we need to go the human route and capture this information from the screen. We convert the raw pixels into a feature vector and quantify the stars (reward) and the player’s position.
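
To make this concrete, below is a minimal sketch of the kind of capture-and-featurize pipeline involved, assuming Python with mss for screen capture and a pretrained torchvision MobileNetV2 as the feature extractor; the window geometry, preprocessing, and pooling choices are illustrative rather than my exact setup.

```python
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.models import mobilenet_v2
from mss import mss  # cross-platform screen capture

# Pretrained MobileNet used purely as a frozen feature extractor.
backbone = mobilenet_v2(weights="IMAGENET1K_V1").features.eval()
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Game window geometry (illustrative values).
MONITOR = {"top": 0, "left": 0, "width": 1280, "height": 720}

def capture_frame():
    """Grab the game window as an RGB numpy array."""
    with mss() as sct:
        frame = np.array(sct.grab(MONITOR))[:, :, :3]  # drop the alpha channel
    return frame[:, :, ::-1].copy()                    # BGR -> RGB

def extract_features(frame):
    """Convert raw pixels into a fixed-length feature vector (the agent's state)."""
    with torch.no_grad():
        x = preprocess(frame).unsqueeze(0)         # (1, 3, 224, 224)
        fmap = backbone(x)                         # (1, 1280, 7, 7)
        return fmap.mean(dim=[2, 3]).squeeze(0)    # global average pool -> (1280,)
```

The resulting vector is what the planning side later consumes as the state.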

1.1 Current Player Position

Green Circle

Initially, I used the green circle beneath the player to detect its position. By performing supervised training on a set of labeled images covering all game modes, I was able to get a rough object-detection classifier working. However, the circle detector gets easily distracted by other background elements or even other players’ elements. Considering the number of labeled images I could manually create, I should have focused on only one game mode (and map) so the variation wouldn’t have been as high. Nevertheless, this approach was not very accurate.

Green Circle

Sorry for the green background, it must have been compression.

Green Circle 2

Player Name

Then I realized the player’s name is in front of every other element at least 90% of the time (the other 10% is when explosion effects take over the screen). I extracted my player’s name and used template matching to detect the player’s position.

Name_Template

Sorry for the green background, it must have been compression.

Name Detection
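
A rough sketch of this name-based detection, assuming OpenCV’s matchTemplate; the template path and matching threshold are illustrative.

```python
import cv2

# Template of the player's name cropped from a reference screenshot (path is illustrative).
name_template = cv2.imread("my_player_name.png", cv2.IMREAD_GRAYSCALE)

def locate_player(frame_bgr, threshold=0.7):
    """Return the (x, y) centre of the player's name tag, or None if it isn't visible."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    scores = cv2.matchTemplate(gray, name_template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    if max_val < threshold:  # e.g. the name is hidden behind explosion effects
        return None
    h, w = name_template.shape
    return (max_loc[0] + w // 2, max_loc[1] + h // 2)
```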

1.2 Stars (Reward)

Player Stars

This is the most direct form of reward: you kill one opponent, you gain one star. Stars are capped at 7. If you die, your stars are reset to 2.

For initial training, I used player stars as the sole reward (i.e. x stars = x reward).

Team Stars

This is a higher-abstraction reward, since not only the agent’s performance but also that of its 2 teammates directly affects the number of team stars. Note that dying does not decrease the number of team stars.

I created the reference digits [0-9] to be used for template matching.

RefDigits

Player Team Star Detection
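
A sketch of how such digit templates can be matched to read the counter, again assuming OpenCV; the file layout, threshold, and duplicate-suppression distance are illustrative.

```python
import cv2

# Reference digit crops 0-9, prepared once (filenames are illustrative).
digit_templates = {d: cv2.imread(f"digits/{d}.png", cv2.IMREAD_GRAYSCALE) for d in range(10)}

def read_team_stars(counter_region_gray, threshold=0.8):
    """Template-match every digit against the team-star counter region
    and read the matched digits from left to right."""
    hits = []
    for digit, template in digit_templates.items():
        scores = cv2.matchTemplate(counter_region_gray, template, cv2.TM_CCOEFF_NORMED)
        _, xs = (scores >= threshold).nonzero()
        hits.extend((int(x), digit) for x in xs)
    if not hits:
        return None
    hits.sort()  # order detections by horizontal position
    digits, last_x = [], -10
    for x, digit in hits:
        if x - last_x > 5:  # keep only one detection per digit position
            digits.append(str(digit))
            last_x = x
    return int("".join(digits))
```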

2. Planning

Having made some progress on the perception problem, we now know where our player is, as well as how many player and team stars we have. This section is dedicated to solving the planning problem to a certain degree.

I used a Double Deep Q-Network (DDQN) with Experience Replay to approximate value functions, i.e. to identify the value of performing a certain action in any given state. As for why not vanilla Q-learning, you can read up on Experience Replay and Double Q-Learning.

As for why Q-learning (or a value-based approach): the intuition is that since the game board is fixed and the objective is fairly straightforward, there will be many cases where the same state occurs (the same enemy at the same distance from the current player) and the same action (attack or super attack) needs to be performed to increase the reward (gain stars). Therefore, having a value for each state-action pair is helpful.
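
For reference, here is a minimal sketch of a Double-DQN update on a minibatch drawn from the experience buffer, written with PyTorch; the network, optimizer, and batch objects are placeholders for whichever head (movement or attack) is being trained, and the exact loss and batching details may differ from my implementation.

```python
import torch
import torch.nn.functional as F

def ddqn_update(batch, q_net, target_net, optimizer, gamma=0.99):
    """One Double-DQN step on a minibatch sampled from the experience buffer."""
    states, actions, rewards, next_states, dones = batch  # pre-collated tensors

    with torch.no_grad():
        # The online network *selects* the next action...
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # ...and the target network *evaluates* it, reducing Q-value overestimation.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * next_q * (1.0 - dones)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The select/evaluate split in the middle is what distinguishes Double DQN from vanilla DQN.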

2.1 Agent

The agent acts based on the output Q-values, where a Q-value represents the value of a particular state-action pair. Out of all possible actions, it picks the one with the highest Q-value, separately for movement and attack (a sketch of this selection follows the hyperparameter list below). An epsilon value dictates the trade-off between exploration and exploitation to ensure that we keep exploring the environment. The agent also perceives states and rewards and stores them in the “Experience Buffer”, from which transitions are later sampled and replayed to train the “Brain”.

Hyper Parameters:

  • Learning Rate
  • Initial Epsilon
  • Final Epsilon
  • Epsilon Decay
  • Gamma (Discount factor for Q value)
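
Putting these together, the action selection and the epsilon schedule look roughly like this sketch (function and variable names are illustrative):

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise pick the
    action with the highest predicted Q-value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def decay_epsilon(epsilon, final_epsilon=0.05, epsilon_decay=0.999):
    """Multiplicative decay from the initial epsilon towards the final epsilon."""
    return max(final_epsilon, epsilon * epsilon_decay)

# Movement and attack are chosen independently from their own Q-networks:
# move = select_action(move_q_net, state, epsilon, n_move_actions)
# atk  = select_action(attack_q_net, state, epsilon, n_attack_actions)
```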

2.2 Brain

Initially, I used 4 simple two-layer neural networks (NNs) to represent the brain and to approximate the Q-values for the following:

  1. Movement (Target q-network, Q-network)
  2. Attack (Target q-network, Q-network)

Why 4 and not just 2? To avoid the Q-value overestimation problem, I used two NNs per action type: one is the target network and the other is the main Q-network.

Input: The features extracted from MobileNet

Output: Approximated Q-values (state-action values)

state_input -> relu activation -> dropout -> relu activation -> dropout -> q-value
               |________ Layer 1 ________|    |________ Layer 2 ________|
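
In PyTorch, one such head might look like the following sketch; the hidden width, dropout rate, and number of actions are illustrative, and only the overall two-layer shape matches the diagram above.

```python
import torch.nn as nn

class QHead(nn.Module):
    """Two-layer MLP mapping MobileNet features to Q-values for one action group
    (one instance each for movement and attack, plus their target-network copies)."""
    def __init__(self, feature_dim=1280, hidden_dim=256, n_actions=5, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(), nn.Dropout(p_drop),  # layer 1
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(p_drop),   # layer 2
            nn.Linear(hidden_dim, n_actions),                                   # q-values
        )

    def forward(self, features):
        return self.net(features)
```

The target copies are typically kept frozen and periodically synced with their main networks.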

3. Error Analysis

After watching the agent play during its training process, I noticed several problems that are very obvious to the human eye but not so obvious to the agent, and that may take a very long time for the agent to improve on. Below are some of these problems:

Initially, the agent spams the movement and attack keys randomly, which is expected given the epsilon-greedy approach starting at 100% randomness and slowly decaying to around 5% by the end of the training process. However, the obvious problem is how quickly the agent learns to navigate properly (not walking into walls) versus how slowly it learns not to constantly waste its ammo. It would be helpful to somehow build the concept/model of ammo into its state so the relationship between ammo and attack actions can be better coordinated.

E.g. at around 140 episodes, the agent still fires its attacks pretty much whenever they are available (once every 0.7-0.8s). But it is able to walk continuously in a straight path, suddenly stop (pressing no keys), and take a different straight path towards the enemy targets.

4. Challenges & Future Steps

  • The current agent’s training speed is bounded by the gameplay speed. In other words, there is no simulator to speed up the training process, and I can’t alter the game speed by any means, so one second of gameplay equals one second of actual training time. This has also been a constraint on my own learning, since the slower the training goes, the slower I can identify problems in my approach.

  • The sequence of frames (states) and actions are both important for the agent to learn from. Some sequences, where lots of game mechanics are involved, have more learning value than others, where the agent is just waiting to respawn after being killed. A prioritized experience replay buffer would help address this challenge.

  • Since this is an online game and I don’t have a fixed version to experiment with, keeping this project maintained as the game updates over time would be equivalent to shooting at a moving target. Therefore, I decided to discontinue the project. This project was based on Brawlstars version 16.176. It was overall a great experience to apply machine learning to a game I enjoy playing personally, and I really learnt a lot.

On to the next challenge.

- Henry