% Playing Classic Arcade Games Using Q-Learning
% John Kubach
% 12/15/20

# Concept

Due to their relative simplicity and immediate feedback, older video games are
ideal for machine learning experiments. Probably the most well-known example,
the MarI/O[^1] project, uses deep Q-learning to learn and navigate the levels of
the classic game Super Mario World. My plan is to create a machine learning
system that is able to play classic arcade games such as Pac-Man, Space
Invaders, and Centipede.

# Goals

The end goal of this machine learning system is to play an arcade game with a
level of proficiency similar to that of a casual human player. I also want the
system to be "fair". That is, it is not allowed to view code or memory states
that a human player would not be able to see.

# Methodology

## MAME

I used the MAME[^2] arcade emulator framework to run the games on my computer.
In addition to running these games, MAME allows for the viewing of the memory
states of a given game ROM. Many of these games have already had their memory
states mapped and documented by passionate fans online. Luckily for me, Space
Invaders is a rather well-documented game.[^3]

In order to view the memory locations in my code, I used the MAMEToolkit
framework.[^4] Using MAMEToolkit, I was able to map core memory locations to
Python variables. In addition to memory viewing, MAMEToolkit allows you to step
through each frame of the game and output the values of the current memory
locations. This is helpful for a Q-Learning setup, as I can get immediate
feedback for each action performed.
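Below is a minimal sketch of that setup, assuming the `Emulator` and `Address`
interface shown in the MAMEToolkit README. The memory addresses, key names, and
the 60-frame loop are illustrative placeholders rather than the exact mapping
used for this project; the real Space Invaders locations are documented in the
RAM map referenced above.

    from MAMEToolkit.emulator import Emulator, Address

    # Map a few memory locations to named values. The addresses below are
    # placeholders -- the real Space Invaders locations are documented in
    # the RAM map cited in the Methodology section.
    memory_addresses = {
        "score": Address("0x20F8", "u8"),        # placeholder: score byte
        "shot_status": Address("0x2025", "u8"),  # placeholder: player shot state
        "num_aliens": Address("0x2082", "u8"),   # placeholder: aliens remaining
    }

    roms_path = "roms/"  # directory containing the Space Invaders ROM
    emulator = Emulator("env1", roms_path, "invaders", memory_addresses)

    # step() advances the emulator and returns a dict holding the rendered
    # frame plus the current value of every mapped address. Button presses
    # are passed in as a list of MAMEToolkit Action objects; an empty list
    # means "press nothing" (the Idle action).
    for _ in range(60):
        data = emulator.step([])
        print(data["score"], data["shot_status"], data["num_aliens"])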
## Q-Learning

The core of my learning agent is a Q-Learning algorithm. Though most modern
approaches tend to use a neural network of some sort, I was interested in how
far I could push a straight Q-Learning implementation. I opted for an
epsilon-greedy approach to my agent. The high-score-driven nature of arcade
games seems to fit this method of Q-Learning best.

The movements in Space Invaders can be mapped into four actions.

1. Shoot
2. Left
3. Right
4. Idle

These actions are pretty self-explanatory. The "Idle" action is equivalent to
not inputting any buttons or directions. This can be advantageous, as you do not
necessarily always want to be moving. Staying hidden behind a shield is a good
example of this.

As for states, many can be derived from this game. After some experimentation, I
found the following gave me the best results in training time and success rate.

1. Score point (shoot alien)
2. Lose life (shot by alien)
3. "Nothing"
4. Shot currently moving (has not collided yet)
5. Shot hit something other than an alien (shield, ceiling)

The "Nothing" state signifies that none of the other states are currently
happening. Obviously, the game is still in play, but there is no player shot
currently out, and the agent has not been hit by an alien shot. I include "Shot
currently moving" as its own state because I do not want to preemptively
penalize the agent for a shot still on its way to the target. This way, I do not
discourage the agent from shooting. Shots hitting anything besides aliens result
in a minor penalty. The rationale was that this would incentivize the agent to
place more accurate shots, rather than wildly shooting at all times. This
matters more in Space Invaders because only one player shot can be on the field
at a given time.

I spent a good amount of time tweaking the Q-Learning variables, as well as the
reward / penalty values. These are the final values I settled on.

    # q-learning variables
    EPSILON = 0.5
    EPISODE_DECAY = 0.9998
    LEARNING_RATE = 0.1
    DISCOUNT = 0.95

    # rewards / penalties
    SHOOT_PENALTY = 0
    DEATH_PENALTY = 50
    MISS_PENALTY = 10
    IDLE_PENALTY = 1
    KILL_REWARD = 500

I found the most difficult part to be properly tuning the reward / penalty
system. It was easy to over-penalize and discourage the agent from making any
moves whatsoever. I originally played with a small penalty for shooting in
general, which would be rewarded back on an alien hit, but this caused the agent
to not take any risks and either not move or hide behind shields.
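To show how these values fit together, here is a minimal sketch of the
epsilon-greedy loop: a tabular Q-table, a reward derived from the observed
state, and the standard one-step update. The state names and the `reward_for`
helper are illustrative, and the way the action-based penalties (idle, shoot)
combine with the state-based ones is an assumption rather than the exact code
behind the results below.

    import random
    from collections import defaultdict

    # Assumes the constants listed above (KILL_REWARD, DEATH_PENALTY, ...) are defined.
    ACTIONS = ["shoot", "left", "right", "idle"]

    # Q-table: each observed state maps to one value per action, starting at zero.
    q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

    def reward_for(state, action):
        """Translate the observed state (and chosen action) into a reward."""
        if state == "score_point":
            return KILL_REWARD
        if state == "lose_life":
            return -DEATH_PENALTY
        if state == "shot_missed":      # shot hit a shield or the ceiling
            return -MISS_PENALTY
        if action == "idle":
            return -IDLE_PENALTY
        if action == "shoot":
            return -SHOOT_PENALTY       # zero in the final settings
        return 0                        # "nothing" or shot still in flight

    def choose_action(state, epsilon):
        """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
        if random.random() < epsilon:
            return random.randrange(len(ACTIONS))
        values = q_table[state]
        return values.index(max(values))

    def update(state, action, reward, next_state):
        """Standard one-step Q-Learning update."""
        best_next = max(q_table[next_state])
        q_table[state][action] += LEARNING_RATE * (
            reward + DISCOUNT * best_next - q_table[state][action]
        )

    # After each episode, epsilon is decayed so the agent gradually shifts
    # from exploring to exploiting:  epsilon *= EPISODE_DECAY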
# Analysis and Training

Depending on variable settings, I found 20-50 full games to be an optimal amount
of training. For the agent, this results in about 3000-10000 episodes of
training, where an episode is an event frame returned by MAMEToolkit. I found
that, depending on the variables, training time, and general randomness, the
agent would tend to develop certain "strategies". Listed below are a few that I
noticed.

* Stay near corner and shoot
* Shoot & dodge
* Duck between shields
* Shoot once and don't move

The first strategy usually arises if the agent gets shot immediately after
spawning too many times. The agent becomes hesitant to move beyond the first
shield. This behavior compounds if the agent is penalized for hitting the
shield. Played in moderation, however, this is a valid strategy: provided the
player is able to keep shooting aliens, it can keep the alien row from bumping
into the edge, preventing the aliens from coming closer.

The second strategy is the most standard one seen from human players. The agent
has a tendency to dodge right before an alien shot hits it. Sometimes this
works, sometimes the agent is too late, and sometimes the agent dodges right
into another alien shot.

Strategy number three is similar to the previous one. The main difference is
that the agent fires far fewer shots and prefers to place more precise shots
while hiding behind the shields. This is what I imagined the ideal outcome of
the rewards / penalties would be.

The last strategy occurs when I overtrain the agent, the penalties are too
harsh, or both. The agent simply fires one shot and does nothing, eventually
getting shot by the aliens.

# Results

![Sample Q-Table After Training](q_table.png){ width=700px }

![Average Reward Given - Epsilon = 0.5](Figure_1.png){ width=700px }

![Average Reward Given - Epsilon = 0.7](Figure_2.png){ width=700px }

![Average Reward Given - Epsilon = 0.1](Figure_3.png){ width=700px }

![Average Reward Given - Epsilon = 0.5 No Miss Penalty](Figure_4.png){ width=700px }

![Agent High Score - 590 points](high_score.png){ width=500px }

\newpage

From the above results, I found that epsilon = 0.5 worked best. Epsilon = 0.7
was a bit more volatile and seemed to work in the short term; however, the
increased randomness is less sustainable than epsilon = 0.5. Epsilon = 0.1 gave
poor results. Not enough randomness left the agent picking strategies that were
less than ideal.

Interestingly, removing the miss penalty led to a strong start by the agent.
However, the average results began to decay over extended runs. If the training
time is moderated, this could be the strongest method.

The final results met my goals. Space Invaders is low-scoring compared to other
games, so a score of 590 is respectable. A score like that is very much what you
would expect from a beginner. In order to improve these results, I would likely
need to either add more states, convert the agent to a Deep Q-Learning network,
or both.

[^1]: https://github.com/aleju/mario-ai
[^2]: https://www.mamedev.org/
[^3]: https://www.computerarcheology.com/Arcade/SpaceInvaders/RAMUse.html
[^4]: https://github.com/M-J-Murray/MAMEToolkit