spaceinvaders_ai

Space Invaders AI using Q-Learning

commit 347ed4abeb98ab95c2fe3cd1f81c92126f28e49d
parent cf6c2196b189e163c44ae8ce3e30f7ac222cfa81
Author: John Kubach <johnkubach@gmail.com>
Date:   Tue, 28 Sep 2021 11:26:04 -0400

README

Diffstat:
A README | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 172 insertions(+), 0 deletions(-)

diff --git a/README b/README
@@ -0,0 +1,172 @@

% Playing Classic Arcade Games Using Q-Learning
% John Kubach
% 12/15/20

# Concept

Due to their relative simplicity and immediate feedback, older video games are
ideal for machine learning experiments. Probably the most well-known example,
the MarI/O[^1] project, uses deep Q-learning to learn and navigate the levels
of the classic game Super Mario World. My plan is to create a machine learning
system that is able to play classic arcade games such as Pacman, Space
Invaders, and Centipede.

# Goals

The end goal of this machine learning system is to play an arcade game with a
level of proficiency similar to that of a casual human player. I also want the
system to be "fair". That is, it is not allowed to view code or memory states
that a human player would not be able to see.

# Methodology

## MAME

I used the MAME[^2] arcade emulator framework to run the games on my computer.
In addition to running these games, MAME allows for the viewing of the memory
states of a given game ROM. Many of these games have already had their memory
states mapped and documented by passionate fans online. Luckily for me, Space
Invaders is a rather well-documented game.[^3]

In order to view the memory locations in my code, I used the MAMEToolkit
framework.[^4] Using MAMEToolkit, I was able to map core memory locations to
Python variables. In addition to memory viewing, MAMEToolkit allows you to
step through each frame of the game and output the values of the watched
memory locations. This is helpful for a Q-Learning setup, as I get immediate
feedback for each action performed.

## Q-Learning

The core of my learning agent is a Q-Learning algorithm. Though most modern
approaches tend to use a neural network of some sort, I was interested in how
far I could push a straight Q-Learning implementation. I opted for an
epsilon-greedy approach to my agent. The high-score-driven nature of arcade
games seems to fit this method of Q-Learning best.

The movements in Space Invaders can be mapped to four actions.

1. Shoot
2. Left
3. Right
4. Idle

These actions are pretty self-explanatory. The "Idle" action is equivalent to
not pressing any buttons or directions. This can be advantageous, as you do
not necessarily always want to be moving. Staying hidden behind a shield is a
good example of this.

As for states, many can be derived from this game. After some experimentation,
I found the following gave me the best results in training time and success
rate.

1. Score point (shoot alien)
2. Lose life (shot by alien)
3. "Nothing"
4. Shot currently moving (has not collided yet)
5. Shot hit something other than an alien (shield, ceiling)

The "Nothing" state signifies that none of the other states are currently
happening: the game is still in play, but there is no player shot currently
out, and the agent has not been hit by an alien shot. I include "Shot
currently moving" as its own state because I do not want to preemptively
penalize the agent for a shot that is still on its way to its target; this
way, I do not discourage the agent from shooting. Shots that hit anything
besides an alien result in a minor penalty. The rationale was that this would
incentivize the agent to place more accurate shots, rather than shooting
wildly at all times. This matters more in Space Invaders because only one
player shot can be on the field at a given time.
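
To make the memory-mapping step from the MAME section concrete, here is a
rough sketch modeled on the examples in the MAMEToolkit README. The addresses,
ROM id, and input port names below are illustrative placeholders rather than
the exact values this project uses, and the `Emulator` argument order may
differ between MAMEToolkit versions; real addresses should be taken from the
Space Invaders RAM map.[^3]

    from MAMEToolkit.emulator import Emulator, Address, Action

    # Illustrative memory locations -- verify each one against the
    # Space Invaders RAM documentation before relying on it.
    memory_addresses = {
        "player_x":    Address("0x201B", "u8"),  # player X position
        "shot_status": Address("0x2025", "u8"),  # player shot state
        "num_aliens":  Address("0x2082", "u8"),  # aliens remaining
        "score_lo":    Address("0x20F8", "u8"),  # BCD score, low byte
        "score_hi":    Address("0x20F9", "u8"),  # BCD score, high byte
    }

    emulator = Emulator("env1", "roms/", "invaders", memory_addresses)

    # Input port / field names are placeholders here; they have to be looked
    # up in the invaders driver's input definitions.
    shoot = Action(":CONTP1", "P1 Button 1")

    # Each step advances the emulator one event frame and returns the video
    # frame along with the value of every watched memory location.
    data = emulator.step([shoot])
    print(data["num_aliens"], data["score_hi"], data["score_lo"])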
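
With those per-frame memory values in hand, the Q-learning core itself stays
small. The sketch below is a minimal illustration of the Q-table over the
states and actions listed above, the epsilon-greedy choice, and the standard
Q-update; the constants mirror the variable names tuned in the next section,
and the reward mapping (including sign conventions) is my reading of the
scheme described here, not the project's exact code.

    import random

    ACTIONS = ["shoot", "left", "right", "idle"]
    STATES = ["score_point", "lose_life", "nothing", "shot_moving", "shot_missed"]

    # Q-table: one row of action-values per state, initialised to zero.
    q_table = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

    EPSILON = 0.5
    LEARNING_RATE = 0.1
    DISCOUNT = 0.95

    # Assumed mapping of resulting state to reward / penalty (values from
    # the tuning section below; negative numbers are penalties).
    REWARDS = {
        "score_point": 500,   # KILL_REWARD
        "lose_life":   -50,   # DEATH_PENALTY
        "shot_missed": -10,   # MISS_PENALTY
        "shot_moving":   0,
        "nothing":       0,
    }

    def choose_action(state):
        """Epsilon-greedy: explore with probability EPSILON, else exploit."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(q_table[state], key=q_table[state].get)

    def update(state, action, reward, next_state):
        """Move Q(s, a) toward reward + discounted best value of next state."""
        best_next = max(q_table[next_state].values())
        q_table[state][action] += LEARNING_RATE * (
            reward + DISCOUNT * best_next - q_table[state][action]
        )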
I spent a good amount of time tweaking the Q-Learning variables, as well as
the reward / penalty values. These are the final values I settled on.

    # q-learning variables
    EPSILON = 0.5
    EPISODE_DECAY = 0.9998
    LEARNING_RATE = 0.1
    DISCOUNT = 0.95

    # rewards / penalties
    SHOOT_PENALTY = 0
    DEATH_PENALTY = 50
    MISS_PENALTY = 10
    IDLE_PENALTY = 1
    KILL_REWARD = 500

I found the most difficult part to be properly tuning the reward / penalty
system. It was easy to over-penalize and discourage the agent from making any
moves whatsoever. I originally played with a small penalty for shooting in
general, which would be refunded on an alien hit, but this caused the agent to
avoid taking any risks and either not move at all or hide behind shields.

# Analysis and Training

Depending on the variable settings, I found 20-50 full games to be an optimal
amount of training. For the agent, this works out to roughly 3000-10000
episodes of training, where an episode is a single event frame returned by
MAMEToolkit. I found that depending on the variables, training time, and
general randomness, the agent would tend to develop certain "strategies".
Listed below are a few that I noticed.

* Stay near a corner and shoot
* Shoot & dodge
* Duck between shields
* Shoot once and don't move

The first strategy usually arises if the agent gets shot immediately after
spawning too many times. The agent becomes hesitant to move beyond the first
shield. This behavior compounds if the agent is penalized for hitting the
shield. Played in moderation, however, this is a valid strategy: a player who
keeps shooting aliens from there can keep the alien row from bumping into the
edge, which prevents the aliens from coming closer.

The second strategy is the most common one seen from human players. The agent
has a tendency to dodge right before an alien shot hits it. Sometimes this
works, sometimes the agent is too late, and sometimes the agent dodges right
into another alien shot.

The third strategy is similar to the previous one. The main difference is that
the agent fires far fewer shots, preferring to place more precise shots while
hiding behind the shields. This is what I imagined the ideal outcome of the
rewards / penalties would be.

The last strategy occurs when I overtrain the agent, when the penalties are
too harsh, or both. The agent simply fires one shot and does nothing,
eventually getting shot by the aliens.

# Results

![Sample Q-Table After Training](q_table.png){ width=700px }

![Average Reward Given - Epsilon = 0.5](Figure_1.png){ width=700px }

![Average Reward Given - Epsilon = 0.7](Figure_2.png){ width=700px }

![Average Reward Given - Epsilon = 0.1](Figure_3.png){ width=700px }

![Average Reward Given - Epsilon = 0.5 No Miss Penalty](Figure_4.png){ width=700px }

![Agent High Score - 590 points](high_score.png){ width=500px }

\newpage

From the above results, I found that epsilon = 0.5 gave me the best results.
Epsilon = 0.7 was a bit more volatile; it seemed to work in the short term,
but the increased randomness proved less sustainable than epsilon = 0.5.
Epsilon = 0.1 gave poor results: with too little exploration, the agent
settled on strategies that were less than ideal.

Interestingly, removing the miss penalty led to a strong start by the agent,
but the average reward began to decay over extended runs. If the training time
is moderated, this could be the strongest method.
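
As a footnote to the epsilon comparison above: I read EPISODE_DECAY as a
per-episode multiplier on epsilon, which is an assumption on my part, but it
explains why long runs end up far less exploratory than the nominal starting
values suggest.

    EPSILON = 0.5
    EPISODE_DECAY = 0.9998

    # Assumed schedule: exploration falls to roughly 0.27 after 3000
    # episodes and to about 0.07 after 10000.
    epsilon = EPSILON
    for episode in range(10000):
        # ... play one episode (event frame), exploring with probability epsilon ...
        epsilon *= EPISODE_DECAY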
The final results met my goals. Space Invaders is a low-scoring game compared
to other arcade titles, so a score of 590 is respectable; it is very much the
score you would expect from a beginner. To improve these results, I would
likely need to either add more states, convert the agent to a Deep Q-Learning
network, or both.

[^1]: https://github.com/aleju/mario-ai
[^2]: https://www.mamedev.org/
[^3]: https://www.computerarcheology.com/Arcade/SpaceInvaders/RAMUse.html
[^4]: https://github.com/M-J-Murray/MAMEToolkit