% Playing Classic Arcade Games Using Q-Learning
% John Kubach
% 12/15/20

# Concept

Due to their relative simplicity and immediate feedback, older video games are
ideal for experiments in machine learning. Probably the most well-known
example, the MarI/O[^1] project, uses deep Q-learning to learn and navigate the
levels of the classic game Super Mario World. My plan is to create a machine
learning system that is able to play classic arcade games such as Pacman, Space
Invaders, and Centipede.

# Goals

The end goal of this machine learning system is to play an arcade game at a
level of proficiency similar to that of a casual human player. I also want the
system to be "fair". That is, it is not allowed to view code or memory states
that a human would not be able to see.

# Methodology

## MAME

I used the MAME[^2] arcade emulator framework to run the games on my computer.
In addition to running these games, MAME allows for viewing the memory states
of a given game ROM. Many of these games have already had their memory states
mapped and documented by passionate fans online. Luckily for me, Space
Invaders is a rather well-documented game.[^3]

In order to view the memory locations in my code, I used the MAMEToolkit
framework.[^4] Using MAMEToolkit, I was able to map core memory locations to
Python variables. In addition to memory viewing, MAMEToolkit allows you to step
through each frame of the game and output the values of the current memory
locations. This is helpful for a Q-Learning setup, as I can get immediate
feedback on each action performed.

## Q-Learning

The core of my learning agent is a Q-Learning algorithm. Though most modern
approaches tend to use a neural network of some sort, I was interested in how
far I could push a straight Q-Learning implementation. I opted for an
epsilon-greedy approach for my agent. The high-score-driven nature of arcade
games seems to be a good fit for this method of Q-Learning.

The movements in Space Invaders can be mapped to four actions.

1. Shoot
2. Left
3. Right
4. Idle

These actions are pretty self-explanatory. The "Idle" action is equivalent to
not inputting any buttons or directions. This can be advantageous, as you do not
necessarily always want to be moving. Staying hidden behind a shield is a good
example of this.

As for states, many can be defined for this game. After some experimentation, I
found the following gave me the best results in training time and success rate.

1. Score point (shoot alien)
2. Lose life (shot by alien)
3. "Nothing"
4. Shot currently moving (has not collided yet)
5. Shot hit something other than an alien (shield, ceiling)

The "Nothing" state signifies that none of the other states are currently
happening. Obviously, the game is still in play, but there is no player shot
currently out, and the agent hasn't been hit by an alien shot. I include "Shot
currently moving" as its own state because I do not want to preemptively
penalize the agent for a shot still on its way to the target. This way, I do
not discourage the agent from shooting. Shots hitting anything besides aliens
result in a minor penalty. The rationale was that this would incentivize the
agent to place more accurate shots, rather than wildly shooting at all times.
This matters more in Space Invaders because only one player shot can be on the
field at a given time.
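To make the epsilon-greedy setup concrete, below is a minimal sketch of the
action selection and table update in Python. The state encoding, the `env`
wrapper, and the helper names (`choose_action`, `update`, `training_step`) are
illustrative assumptions rather than the actual implementation; in the real
setup the state and reward are derived from the memory locations read through
MAMEToolkit each frame.

```python
import random
from collections import defaultdict

# Values taken from the final settings listed later in this writeup.
ACTIONS = ["shoot", "left", "right", "idle"]
EPSILON = 0.5
LEARNING_RATE = 0.1
DISCOUNT = 0.95

# Q-table: state -> {action: estimated value}. Unseen states start at zero.
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def choose_action(state, epsilon=EPSILON):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)


def update(state, action, reward, next_state):
    """One-step Q-learning update of the table entry for (state, action)."""
    best_next = max(q_table[next_state].values())
    old = q_table[state][action]
    q_table[state][action] = old + LEARNING_RATE * (
        reward + DISCOUNT * best_next - old
    )


def training_step(env, state, epsilon):
    """One frame of interaction, assuming a hypothetical `env` wrapper around
    MAMEToolkit that exposes step(action) -> (next_state, reward)."""
    action = choose_action(state, epsilon)
    next_state, reward = env.step(action)
    update(state, action, reward, next_state)
    return next_state
```

Because the table is a `defaultdict`, an unseen state starts with all action
values at zero, so early behaviour is driven almost entirely by the epsilon
exploration.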
I spent a good amount of time tweaking the Q-Learning variables, as well as
the reward / penalty values. These are the final values I settled on.

    # q-learning variables
    EPSILON = 0.5
    EPISODE_DECAY = 0.9998
    LEARNING_RATE = 0.1
    DISCOUNT = 0.95

    # rewards / penalties
    SHOOT_PENALTY = 0
    DEATH_PENALTY = 50
    MISS_PENALTY = 10
    IDLE_PENALTY = 1
    KILL_REWARD = 500

I found the most difficult part to be properly tuning the reward / penalty
system. It was easy to over-penalize and discourage the agent from making any
moves whatsoever. I originally played with a small penalty for shooting in
general, which would be refunded on an alien hit, but this caused the agent to
not take any risks and either not move or hide behind the shields.
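To make the reward shaping concrete, here is a minimal sketch of how the
per-frame reward might be wired to the five states listed earlier, using the
final values above. The exact mapping of penalties onto states is an
assumption, and the boolean flags and the `frame_reward` name are hypothetical
stand-ins for checks against the memory locations exposed through MAMEToolkit;
the actual implementation may differ.

```python
# Final reward / penalty values from the section above.
SHOOT_PENALTY = 0
DEATH_PENALTY = 50
MISS_PENALTY = 10
IDLE_PENALTY = 1
KILL_REWARD = 500


def frame_reward(alien_killed, player_hit, shot_missed, shot_in_flight, action):
    """Map the observed frame state to a reward.

    The boolean flags are hypothetical stand-ins for checks against the
    Space Invaders memory locations (score, lives, player-shot status)
    read through MAMEToolkit each frame.
    """
    if alien_killed:           # state 1: score point (shot an alien)
        return KILL_REWARD
    if player_hit:             # state 2: lose a life
        return -DEATH_PENALTY
    if shot_missed:            # state 5: shot hit a shield or the ceiling
        return -MISS_PENALTY
    if shot_in_flight:         # state 4: shot still travelling, no verdict yet
        return -SHOOT_PENALTY  # zero with the final values above
    if action == "idle":       # state 3: "nothing" happening while idling
        return -IDLE_PENALTY
    return 0
```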
# Analysis and Training

Depending on the variable settings, I found 20-50 full games to be an optimal
amount of training. For the agent, this results in about 3000-10000 episodes
of training, where an episode is a single event frame returned from
MAMEToolkit. I found that, depending on the variables, training time, and
general randomness, the agent would tend to develop certain "strategies".
Listed below are a few that I noticed.

* Stay near a corner and shoot
* Shoot & dodge
* Duck between shields
* Shoot once and don't move

The first strategy usually arises if the agent gets shot immediately after
spawning too many times. The agent becomes hesitant to move beyond the first
shield. This behavior compounds if the agent is penalized for hitting the
shield. Used in moderation, however, this is a valid strategy: provided the
player is able to keep shooting the aliens, they can keep the alien row from
bumping into the edge, preventing the aliens from coming closer.

The second strategy is the one most commonly seen from human players. The
agent has a tendency to dodge right before an alien shot hits it. Sometimes
this works, sometimes the agent is too late, and sometimes the agent dodges
right into another alien shot.

Strategy number three is similar to the previous one. The main difference is
that the agent fires far fewer shots, preferring to place more precise shots
while hiding behind the shields. This is what I imagined the ideal outcome of
the rewards / penalties would be.

The last strategy occurs when I overtrain the agent, when the penalties are
too harsh, or both. The agent simply fires one shot and does nothing,
eventually getting shot by the aliens.

# Results

![Sample Q-Table After Training](q_table.png){ width=700px }

![Average Reward Given - Epsilon = 0.5](Figure_1.png){ width=700px }

![Average Reward Given - Epsilon = 0.7](Figure_2.png){ width=700px }

![Average Reward Given - Epsilon = 0.1](Figure_3.png){ width=700px }

![Average Reward Given - Epsilon = 0.5, No Miss Penalty](Figure_4.png){ width=700px }

![Agent High Score - 590 points](high_score.png){ width=500px }

\newpage

From the above results, I found that keeping epsilon = 0.5 gave me the best
results. Epsilon = 0.7 was a bit more volatile and seemed to work in the short
term; however, the increased randomness is less sustainable than epsilon = 0.5.
Epsilon = 0.1 gave poor results: with too little randomness, the agent settled
on strategies that were less than ideal.

Interestingly, removing the miss penalty led to a strong start by the agent.
However, the average results began to decay over extended runs. If the
training time is moderated, this could be the strongest method.

The final results met my goals. Space Invaders is low-scoring compared to
other games, so a score of 590 is respectable. A score like that is very much
what you would expect from a beginner. In order to improve these results, I
would likely need to add more states, convert the agent to a Deep Q-Learning
network, or both.

[^1]: https://github.com/aleju/mario-ai
[^2]: https://www.mamedev.org/
[^3]: https://www.computerarcheology.com/Arcade/SpaceInvaders/RAMUse.html
[^4]: https://github.com/M-J-Murray/MAMEToolkit