commit 347ed4abeb98ab95c2fe3cd1f81c92126f28e49d
parent cf6c2196b189e163c44ae8ce3e30f7ac222cfa81
Author: John Kubach <johnkubach@gmail.com>
Date: Tue, 28 Sep 2021 11:26:04 -0400
README
Diffstat:
A README | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 172 insertions(+), 0 deletions(-)
diff --git a/README b/README
@@ -0,0 +1,172 @@
+% Playing Classic Arcade Games Using Q-Learning
+% John Kubach
+% 12/15/20
+
+# Concept
+
+Due to their relative simplicity and immediate feedback, older video games are
+ideal for machine learning experiments. Probably the most well-known
+example, the MarI/O[^1] project, uses deep Q-learning to learn and navigate the
+levels of the classic game Super Mario World. My plan is to create a machine
+learning system that is able to play classic arcade games such as Pacman, Space
+Invaders, and Centipede.
+
+# Goals
+
+The end goal of this machine learning system is to play an arcade game with a
+level of proficiency similar to that of a casual human player. I also want the
+system to be "fair": it is not allowed to view code or memory states that a
+human player would not be able to see.
+
+# Methodology
+
+## MAME
+
+I used the MAME[^2] arcade emulator framework to run the games on my computer.
+In addition to running these games, MAME allows you to view the memory state of
+a given game ROM. Many of these games have already had their memory states
+mapped and documented by passionate fans online. Luckily for me, Space Invaders
+is a rather well-documented game.[^3]
+
+In order to view the memory locations in my code, I used the MAMEToolkit
+framework.[^4] Using MAMEToolkit, I was able to map core memory locations to
+Python variables. In addition to memory viewing, MAMEToolkit allows you to step
+through each frame of the game and output the values of the current memory
+locations. This is helpful for a Q-Learning setup, as I can get immediate
+feedback for each action performed.
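+
+As a rough sketch (not my exact code), mapping memory locations and stepping
+frames with MAMEToolkit looks something like the following. The addresses are
+illustrative placeholders rather than the real entries from the RAM map,[^3]
+and the constructor arguments follow the MAMEToolkit README, so they may vary
+between versions.
+
+    from MAMEToolkit.emulator import Emulator, Address
+
+    roms_path = "roms/"  # directory containing the Space Invaders ROM set
+    memory_addresses = {
+        "score": Address("0x20F8", "u8"),  # placeholder address
+        "lives": Address("0x21FF", "u8"),  # placeholder address
+        "shot":  Address("0x2025", "u8"),  # placeholder address
+    }
+    emulator = Emulator("env1", roms_path, "invaders", memory_addresses)
+
+    # Advance one frame with no inputs and read back the mapped memory values.
+    frame_data = emulator.step([])
+    print(frame_data["score"], frame_data["lives"], frame_data["shot"])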
+
+## Q-Learning
+
+The core of my learning agent is a Q-Learning algorithm. Though most modern
+approaches tend to use a neural network of some sort, I was interested in how
+far I could push a straight Q-Learning implementation. I opted for an
+epsilon-greedy approach for my agent. The high-score-driven nature of arcade
+games seems to be a good fit for this method of Q-Learning.
+
+The inputs in Space Invaders can be mapped to four actions.
+
+1. Shoot
+2. Left
+3. Right
+4. Idle
+
+These actions are pretty self-explanatory. The "Idle" action is equivalent to
+not inputting any buttons or directions. This can be advantageous, as you do not
+necessarily always want to be moving. Staying hidden behind a shield is a good
+example of this.
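+
+As a minimal sketch, the epsilon-greedy choice over these four actions looks
+something like this (the list-based Q-table layout is illustrative, not a
+literal excerpt from my code):
+
+    import random
+
+    ACTIONS = ["shoot", "left", "right", "idle"]
+
+    def choose_action(q_table, state, epsilon):
+        """With probability epsilon explore; otherwise exploit the Q-table."""
+        if random.random() < epsilon:
+            return random.randrange(len(ACTIONS))   # explore: random action
+        # exploit: pick the action with the highest Q-value for this state
+        return max(range(len(ACTIONS)), key=lambda a: q_table[state][a])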
+
+As for states, many could be defined for this game. After some experimentation, I
+found the following gave me the best results in training time and success rate.
+
+1. Score point (shoot alien)
+2. Lose life (shot by alien)
+3. "Nothing"
+4. Shot currently moving (has not collided yet)
+5. Shot hit something other than an alien (shield, ceiling)
+
+The "Nothing" state signifies that none of the other states are currently
+happening. Obviously, the game is still in play, but there is no player shot
+currently out, and the agent has not been hit by an alien shot. I include "Shot
+currently moving" as its own state because I do not want to preemptively
+penalize the agent for a shot still on its way to the target. This way, I do not
+discourage the agent from shooting. Shots hitting anything besides aliens result
+in a minor penalty. The rationale was that this would incentivize the agent to
+place more accurate shots, rather than wildly shooting at all times. This
+matters more in Space Invaders because only one player shot can be on the screen
+at a given time.
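+
+As an illustration, reducing two consecutive memory snapshots to a
+(state, reward) pair looks roughly like this. The field names are the
+placeholders from the earlier sketch, and the penalties are written as negative
+rewards matching the constants listed below:
+
+    def classify(prev, curr):
+        """Map two consecutive memory snapshots to a (state, reward) pair."""
+        if curr["score"] > prev["score"]:
+            return "kill", 500           # shot an alien (KILL_REWARD)
+        if curr["lives"] < prev["lives"]:
+            return "death", -50          # shot by an alien (DEATH_PENALTY)
+        if curr["shot"]:
+            return "shot_moving", 0      # player shot still in flight, no verdict yet
+        if prev["shot"]:
+            return "miss", -10           # shot vanished with no score gain (MISS_PENALTY)
+        return "nothing", -1             # nothing happening (IDLE_PENALTY)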
+
+I spent a good amount of time tweaking the Q-Learning variables, as well as
+reward / penalty values. These are the final values I settled on.
+
+ # q-learning variables
+ EPSILON = 0.5
+ EPISODE_DECAY = 0.9998
+ LEARNING_RATE = 0.1
+ DISCOUNT = 0.95
+
+ # rewards / penalties
+ SHOOT_PENALTY = 0
+ DEATH_PENALTY = 50
+ MISS_PENALTY = 10
+ IDLE_PENALTY = 1
+ KILL_REWARD = 500
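+
+For reference, these constants feed the standard tabular Q-learning update,
+with $\alpha$ as the learning rate and $\gamma$ as the discount:
+
+$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$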
+
+I found the most difficult part to be properly tuning the reward / penalty
+system. It was easy to over-penalize and discourage the agent from making any
+moves whatsoever. I originally played with a small penalty for shooting in
+general, which would be rewarded back on an alien hit, but this caused the
+agent to avoid taking risks and either not move or hide behind shields.
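+
+Putting the pieces together, the per-frame loop looks roughly like the sketch
+below. It reuses the illustrative helpers and memory fields from the earlier
+snippets along with the constants above; ACTION_INPUTS stands in for a mapping
+from action index to MAMEToolkit inputs.
+
+    STATES = ["kill", "death", "nothing", "shot_moving", "miss"]
+    q_table = {s: [0.0] * len(ACTIONS) for s in STATES}
+
+    # Placeholder: the real mapping would build MAMEToolkit input actions.
+    ACTION_INPUTS = {0: [], 1: [], 2: [], 3: []}
+
+    epsilon = EPSILON
+    state, prev = "nothing", emulator.step([])
+    for episode in range(10000):              # one episode per event frame
+        action = choose_action(q_table, state, epsilon)
+        curr = emulator.step(ACTION_INPUTS[action])
+        next_state, reward = classify(prev, curr)
+
+        # Standard tabular Q-learning update using the constants above.
+        best_next = max(q_table[next_state])
+        q_table[state][action] += LEARNING_RATE * (
+            reward + DISCOUNT * best_next - q_table[state][action])
+
+        state, prev = next_state, curr
+        epsilon *= EPISODE_DECAY              # decay exploration each frame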
+
+# Analysis and Training
+
+Depending on variable settings, I found 20-50 full games to be an optimal amount
+of training. For the agent, this results in about 3000-10000 episodes of
+training, where an episode is a single event frame stepped through MAMEToolkit.
+I found that, depending on the variables, training time, and general randomness,
+the agent would tend to develop certain "strategies". Listed below are a few
+that I noticed.
+
+* Stay near corner and shoot
+* Shoot & dodge
+* Duck between shields
+* Shoot once and don't move
+
+The first strategy usually arises if the agent gets shot immediately after
+spawning too many times. The agent becomes hesitant to move beyond the first
+shield. This behavior compounds if the agent is penalized for hitting the
+shield. Used in moderation, however, this is a valid strategy: provided the
+player is able to keep shooting the aliens, they can keep the alien rows from
+bumping into the edge of the screen, which is what brings the aliens closer.
+
+The second strategy is the one most commonly seen from human players. The agent has a
+tendency to dodge right before an alien shot hits it. Sometimes this works,
+sometimes the agent is too late, and sometimes the agent dodges right into
+another alien shot.
+
+Strategy number three is similar to the previous one. The main difference is
+that the agent fires far fewer shots, preferring to place more precise shots
+while hiding behind the shields. This is what I imagined the ideal outcome of
+the rewards / penalties would be.
+
+The last strategy occurs when I overtrain the agent, the penalties are too
+harsh, or both. The agent simply fires one shot and does nothing, eventually
+getting shot by the aliens.
+
+
+# Results
+
+
+![Sample Q-Table After Training](q_table.png){ width=700px }
+
+![Average Reward Given - Epsilon = 0.5](Figure_1.png){ width=700px }
+
+![Average Reward Given - Epsilon = 0.7](Figure_2.png){ width=700px }
+
+![Average Reward Given - Epsilon = 0.1](Figure_3.png){ width=700px }
+
+![Average Reward Given - Epsilon = 0.5 No Miss Penalty](Figure_4.png){ width=700px }
+
+![Agent High Score - 590 points](high_score.png){ width=500px }
+
+\newpage
+
+From the above results, I found that keeping epsilon = 0.5 gave me the best
+results. Epsilon = 0.7 was a bit more volatile; it seemed to work in the short
+term, but the increased randomness was less sustainable than epsilon = 0.5.
+Epsilon = 0.1 gave poor results, as too little randomness left the agent picking
+strategies that were less than ideal.
+
+Interestingly, removing the miss penalty led to a strong start by the agent.
+However, the average results began to decay over extended runs. If the training
+time is moderated, this could be the strongest method.
+
+The final results met my goals. Space Invaders is a low-scoring game compared
+to others, so a score of 590 is respectable; it is about what one would expect
+from a beginner player. In order to improve these results, I would likely need
+to either add more states, convert the agent to a Deep Q-Learning network, or
+both.
+
+[^1]: https://github.com/aleju/mario-ai
+[^2]: https://www.mamedev.org/
+[^3]: https://www.computerarcheology.com/Arcade/SpaceInvaders/RAMUse.html
+[^4]: https://github.com/M-J-Murray/MAMEToolkit