% Playing Classic Arcade Games Using Q-Learning
% John Kubach
% 12/15/20

# Concept

Due to their relative simplicity and immediate feedback, older video games are
ideal for machine learning experiments. Probably the most well-known example,
the MarI/O[^1] project, uses deep Q-learning to learn and navigate the levels of
the classic game Super Mario World. My plan is to create a machine learning
system that is able to play classic arcade games such as Pac-Man, Space
Invaders, and Centipede.

# Goals

The end goal of this machine learning system is to play an arcade game at a
level of proficiency similar to that of a casual human player. I also want the
system to be "fair". That is, it is not allowed to view code or memory states
that a human player would not be able to see.

# Methodology

## MAME

I used the MAME[^2] arcade emulator framework to run the games on my computer.
In addition to running these games, MAME allows for viewing the memory states of
a given game ROM. Many of these games have already had their memory states
mapped and documented by passionate fans online. Luckily for me, Space Invaders
is a rather well documented game.[^3]

In order to read these memory locations from my code, I used the MAMEToolkit
framework.[^4] Using MAMEToolkit, I was able to map core memory locations to
Python variables. In addition to memory viewing, MAMEToolkit lets you step
through each frame of the game and output the values of the current memory
locations. This is helpful for a Q-Learning setup, as I can get immediate
feedback for each action performed.
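
The snippet below is a minimal sketch of that setup, following the usage pattern
from the MAMEToolkit README. The game id and the mapped addresses are
illustrative placeholders, not the documented Space Invaders RAM locations.

    # Sketch: wiring MAMEToolkit to a few memory locations.
    # The addresses below are placeholders, not the real RAM map entries.
    from MAMEToolkit.emulator import Emulator, Address

    roms_path = "roms/"
    game_id = "invaders"
    memory_addresses = {
        "score":       Address("0x20F8", "u8"),
        "lives":       Address("0x21FF", "u8"),
        "shot_active": Address("0x2025", "u8"),
    }

    emulator = Emulator("env1", roms_path, game_id, memory_addresses)

    # step() advances the emulator one event frame and returns a dict with
    # the rendered frame plus the current value at every mapped address.
    data = emulator.step([])
    print(data["score"], data["lives"], data["shot_active"])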

## Q-Learning

The core of my learning agent is a Q-Learning algorithm. Though most modern
approaches tend to use a neural network of some sort, I was interested in how
far I could push a straight Q-Learning implementation. I opted for an
epsilon-greedy approach for my agent. The high-score driven nature of arcade
games seems to be a good fit for this method of Q-Learning.

The movements in Space Invaders can be mapped onto four actions.

1. Shoot
2. Left
3. Right
4. Idle

These actions are pretty self-explanatory. The "Idle" action is equivalent to
not pressing any buttons or directions. This can be advantageous, as you do not
necessarily always want to be moving. Staying hidden behind a shield is a good
example of this.
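
As a rough illustration, picking between these four actions with an
epsilon-greedy policy can be as simple as the sketch below. The Q-table here is
assumed to be a plain dictionary keyed by (state, action) pairs, which is a
simplification for this writeup rather than the exact layout of my code.

    import random

    ACTIONS = ["shoot", "left", "right", "idle"]

    def choose_action(q_table, state, epsilon):
        # Explore with probability epsilon, otherwise exploit the
        # best-known action for this state.
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))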

As for states, many can be derived from this game. After some experimentation, I
found the following gave me the best results in training time and success rate.

1. Score point (shoot alien)
2. Lose life (shot by alien)
3. "Nothing"
4. Shot currently moving (has not collided yet)
5. Shot hit something other than an alien (shield, ceiling)

The "Nothing" state signifies that none of the other states are currently
happening. The game is still in play, but there is no player shot currently out,
and the agent has not been hit by an alien shot. I include "Shot currently
moving" as its own state because I do not want to preemptively penalize the
agent for a shot still on its way to the target. This way, I do not discourage
the agent from shooting. Shots hitting anything besides aliens result in a minor
penalty. The rationale was that this would incentivize the agent to place more
accurate shots, rather than wildly shooting at all times. This matters more in
Space Invaders because only one player shot can be on the field at a given time.
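
To make that concrete, here is one way the five states could be derived from two
consecutive memory readings. The field names (score, lives, shot_active) are
hypothetical stand-ins for whatever the mapped RAM locations actually expose, so
treat this as a sketch rather than the exact logic.

    SCORE_POINT, LOSE_LIFE, NOTHING, SHOT_MOVING, SHOT_MISSED = range(5)

    def detect_state(prev, curr):
        # Compare two consecutive frames of memory values.
        if curr["score"] > prev["score"]:
            return SCORE_POINT          # an alien was hit
        if curr["lives"] < prev["lives"]:
            return LOSE_LIFE            # the player was hit
        if curr["shot_active"]:
            return SHOT_MOVING          # shot still on its way
        if prev["shot_active"]:
            return SHOT_MISSED          # shot ended without scoring
        return NOTHING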

I spent a good amount of time tweaking the Q-Learning variables, as well as
reward / penalty values. These are the final values I settled on.

    # q-learning variables
    EPSILON = 0.5
    EPISODE_DECAY = 0.9998
    LEARNING_RATE = 0.1
    DISCOUNT = 0.95

    # rewards / penalties
    SHOOT_PENALTY = 0
    DEATH_PENALTY = 50
    MISS_PENALTY = 10
    IDLE_PENALTY = 1
    KILL_REWARD = 500
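
These constants feed the standard tabular Q-Learning update. A minimal version
of that update, reusing the (state, action)-keyed dictionary from the earlier
sketch, looks roughly like this:

    def update_q(q_table, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + lr * (reward + discount * max_a' Q(s',a') - Q(s,a))
        old = q_table.get((state, action), 0.0)
        best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
        q_table[(state, action)] = old + LEARNING_RATE * (
            reward + DISCOUNT * best_next - old
        )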

I found the most difficult part to be properly tuning the reward / penalty
system. It was easy to over-penalize and discourage the agent from making any
moves whatsoever. I originally played with a small penalty for shooting in
general, which would be refunded on an alien hit, but this caused the agent to
not take any risks and either not move or hide behind shields.

# Analysis and Training

Depending on variable settings, I found 20-50 full games to be an optimal amount
of training. For the agent, this results in about 3000-10000 episodes of
training, an episode being one event frame stepped through MAMEToolkit (a rough
sketch of the training loop follows the list below). I found that depending on
the variables, training time, and general randomness, the agent would tend to
develop certain "strategies". Listed below are a few that I noticed.

* Stay near corner and shoot
* Shoot & dodge
* Duck between shields
* Shoot once and don't move
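
Putting the earlier sketches together, the training loop has roughly the shape
below. The ACTION_INPUTS and REWARDS mappings are hypothetical placeholders for
the action-to-input translation and the reward table built from the constants
above, and decaying epsilon by EPISODE_DECAY each episode is my reading of that
constant.

    q_table = {}
    epsilon = EPSILON
    prev = emulator.step([])               # initial memory snapshot
    state = NOTHING

    for episode in range(10000):
        action = choose_action(q_table, state, epsilon)
        curr = emulator.step(ACTION_INPUTS[action])   # hypothetical input map
        next_state = detect_state(prev, curr)
        reward = REWARDS[next_state]                  # hypothetical reward table
        update_q(q_table, state, action, reward, next_state)
        state, prev = next_state, curr
        epsilon *= EPISODE_DECAY           # slowly reduce exploration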

The first strategy usually arises if the agent gets shot immediately after
spawning too many times. The agent becomes hesitant to move beyond the first
shield. This behavior compounds if the agent is penalized for hitting the
shield. Used in moderation, however, this is a valid strategy: a player who can
keep shooting the aliens is able to keep the alien row from bumping into the
edge, which prevents the aliens from coming closer.

The second strategy is the most standard seen by human players. The agent has a
tendency to dodge right before an alien shot hits it. Sometimes this works,
sometimes the agent is too late, and sometimes the agent dodges right into
another alien shot.

Strategy number three is similar to the previous one. The main difference is
that the agent fires far fewer shots, and prefers to place more precise shots
while hiding behind the shields. This is what I imagined the ideal outcome of
the rewards / penalties would be.

The last strategy occurs when I overtrain the agent, when the penalties are too
harsh, or both. The agent simply fires one shot and does nothing, eventually
getting shot by the aliens.

# Results

![Sample Q-Table After Training](q_table.png){ width=700px }

![Average Reward Given - Epsilon = 0.5](Figure_1.png){ width=700px }

![Average Reward Given - Epsilon = 0.7](Figure_2.png){ width=700px }

![Average Reward Given - Epsilon = 0.1](Figure_3.png){ width=700px }

![Average Reward Given - Epsilon = 0.5 No Miss Penalty](Figure_4.png){ width=700px }

![Agent High Score - 590 points](high_score.png){ width=500px }

\newpage

From the above results, I found that keeping epsilon = 0.5 gave me the best
results. Epsilon = 0.7 was a bit more volatile and seemed to work in the short
term, but the increased randomness is less sustainable than epsilon = 0.5.
Epsilon = 0.1 gave poor results: not enough randomness left the agent picking
strategies that were less than ideal.

Interestingly, removing the miss penalty led to a strong start by the agent.
However, the average results began decaying over extended runs. If the training
time is moderated, this could be the strongest configuration.

The final results met my goals. Space Invaders is low-scoring compared to other
games, so a score of 590 is respectable; it is roughly what you would expect
from a beginner. In order to improve these results, I would likely need to
either add more states, convert the agent to a Deep Q-Learning network, or both.

[^1]: https://github.com/aleju/mario-ai
[^2]: https://www.mamedev.org/
[^3]: https://www.computerarcheology.com/Arcade/SpaceInvaders/RAMUse.html
[^4]: https://github.com/M-J-Murray/MAMEToolkit