CS 5541 Program 2

CS 5541
Program 4 - Reinforcement Learning in a Maze
Due December 16, 2010 (No LATE Programs!)
60 points

Introduction

Reinforcement learning provides a method for learning in games. In this program you will implement a simple reinforcement learning mechanism to learn in a simple maze game involving a starting point, a goal, obstacles, and an opponent that is trying to catch you.

Some Implementation Details

In this code (prog4.lisp) a simple maze game is implemented in lisp. The code implements two routines of interest:

play-games <numgames> <showonscreen> <training> - play-games is the basic routine for playing the first argument, <numgames>, number of games. The second argument <showonscreen> is set to T if you want to see the game sequence and NIL if you do not. The third argument, <training>, is set to T if you want the model to be trained using the sequence of examples observed and NIL if you do not.
run-experiment - takes no arguments and makes a series of calls to play-games to train a model for a certain number of games (where training is set on) and then tests with 100 games where training is set off (testing).

To see how the game works you can load the initial code (load "prog4.lisp") and then type (play-games 5 T nil) to play 5 games interactively. the code shows the maze to the keyboard and requests that a user (you) supply an action to choose. The Maze shown has Xs for obstacles, a p for the player, an O for the opponent who is chasing you and a G for the goal. Actions are 1 (left), 2 (up), 3 (right), 4 (down). The code as written queries the user about the move to make. You will replace this code with code to have a computer learner play the game. The relevant routines you need to implement are:

choose-player-action <board> <player_location> <opponent_location> <training> - if training is set to nil you are testing, this means that you should take the state of the game and pick which of the four actions has the highest Q value for that state. If training is not set to nil (T for example) you are updating the model. For this work you will use a simple exploration/exploitation model. 97% of the time you will choose the action with the highest Q value for the state (if more than one action has the same Q value pick one of them randomly) and the other 3% of the time you will pick an action randomly.
initialize-model - you should use this variable to set up your Q table (see below) and your NumVisits table (also below).
do-train <prev_state> <action> <curr_state> <game_status> - this routine is called when a state,action,resulting state sequence has been encountered (or state,action,game ends status is encountered). You can try running the initial code as (play-games 5 T T), this will show you the situations in which do-train is called. Note that a state is a list of two lists, the first list has the row and column of the player and the second list has the row and column of the opponent. There are two possible ways do-train can be called:
- curr_state is nil - in this case the game ended and only prev_state and action are set, you need to then check game_status to see if the player or opponent won (and if neither of those is true the game is a draw), based on the case you will then have a new estimate of Q(<prev_state>,<action>) as the value of the player winning, the opponent winning or a draw (100, -100, or 0, respectively).
- curr_state is not nil - the game is not over, so you first determine the max value for Q(<curr_state>,A) for all for actions A, your new estimate for Q(<prev_state>,<action>) is then 0 (the value of any step not ending the game) + discountfactor * max Q(<curr_state>,A)
Once you have the new estimate for Q(<prev_state>,<action>) you should update its value in the table like this:
1. Visits(<prev_state>,<action>) = Visits(<prev_state>,<action>) + 1
2. Q(<prev_state>,<action>) = (1 / Visits(<prev_state>,<action>)) * (new estimate of Q(<prev_state>,<action>)) + (1 - (1 / Visits(<prev_state>,<action>))) * (current value of Q(<prev_state>,<action>)

You will need to create a description of the board (the state of the board). Since the obstacles are all fixed and the goal is always in the same location, the only thing that changes about the board is the position of the player and the opponent. You will need to come up with a mechanism to assign a unique state number to each possible combination of positions of the player and opponent (I suggest you write a function to do this).

You will then need to create a Q table and a Visits table. The former will need to have a row for each possible state number and then 4 columns for each action. The initial values of this table should all be 0.0. The Visits table will count how many times you have tried each state/action combination. This will also have a row for each possible state number and 4 columns for each action. The initial values of this table should be 0.

Testing

Once you have your code learning you should use the code run-experiment to generate five learning curves (call it five times). This code trains your model using 100 games and then plays 100 games with learning turned off, then 100 more games (for a total of 200 training games), then plays 100 games without learning, then 100 more, etc. This code should produce something like this:

Training up to 100
Training up to 200
Training up to 300
Training up to 400
Training up to 500
Training up to 600
Training up to 700
Training up to 800
Training up to 900
Training up to 1000
Training up to 1250
Training up to 1500
Training up to 2000
Training up to 2500
Training up to 3000
 100:   21 Wins,  78 Losses,   1 Draws
 200:   36 Wins,  42 Losses,  22 Draws
 300:   40 Wins,  25 Losses,  35 Draws
 400:   41 Wins,  22 Losses,  37 Draws
 500:   54 Wins,  22 Losses,  24 Draws
 600:   55 Wins,  21 Losses,  24 Draws
 700:   63 Wins,   6 Losses,  31 Draws
 800:   68 Wins,   7 Losses,  25 Draws
 900:   56 Wins,   6 Losses,  38 Draws
1000:   62 Wins,   8 Losses,  30 Draws
1250:   63 Wins,  12 Losses,  25 Draws
1500:   72 Wins,   5 Losses,  23 Draws
2000:   77 Wins,   2 Losses,  21 Draws
2500:   73 Wins,   0 Losses,  27 Draws
3000:   72 Wins,   3 Losses,  25 Draws

The first part (e.g., Training up to 100, etc. simply shows the progress of the training), the second part shows how many wins/losses/draws the player gets after that amount of training (so 200: 36 Wins, 42 Losses, 22 Draws indicates that after 200 training games the computer player wins 36 games, loses 42 and has 22 draws when testing). Average these results for your five runs and present the result as a graph with the x axis being the number of training games and the y axis showing the average number of wins, losses and draws (three separate lines). Discuss your results in the material you hand in.

What To Hand In

Turn in a commented copy of your code, a printout of the five runs of run-experiment you run and the graph of your results along with a discussion of these results.

CS 5541 Program 4 - Reinforcement Learning in a Maze Due December 16, 2010 (No LATE Programs!) 60 points

Introduction

Some Implementation Details

Testing

What To Hand In

CS 5541
Program 4 - Reinforcement Learning in a Maze
Due December 16, 2010 (No LATE Programs!)
60 points