CS 5751 (Spring 2001) Program 6

Computer Science 5751
Machine Learning

Programming Assignment 6
Reinforcement Learning (65 points)
Due Friday, May 4, 2001 by 3:30 pm, NO LATE PROGRAMS ALLOWED!

Introduction

In this lab you will implement a simple Q-learning algorithm to solve a particular task: balancing a pole on a cart. Your team assignments can be found here.

The Pole Balancing Problem

The pole balancing problem is a classic one in reinforcement learning. The controller tries to keep a pole that is balanced on a cart from falling over by applying pushes to the cart. The code for this problem can be found in the archive sbp.tar.Z. It includes a simple simulator with a window that shows the current state of the cart and pole. To generate the window output I used a package called EZX, which is why the simulation code is written in C rather than C++. If you like, you may convert this code to Java and write your simulator in Java.

The code provided includes a simple game version of the problem. The game asks you, the user, to choose among five actions to try to balance the pole: a small positive push, a medium positive push, a small negative push, a medium negative push, or no push. Try playing the game several times to get a feel for how it works.

Once you understand the simulator, you should implement a new version of the code that learns a Q table and uses it to choose actions. You will then test how well your Q representation works by adding code that periodically evaluates the table at set points during training.

The Q Representation

The game version of this problem has five possible actions. You should use the same set of actions in your learned controller. You then need to decide how to represent the current pole balancing situation as a state. You should consider representing the state of the system using two or three values that describe the cart and pole:

The angle of the pole: since this is a continuous value you should divide the possible angles into a set of discrete bins. For example, break the angle values into groups like -90 degrees to -30 degrees (bin 0), -30 degrees to -20 degrees (bin 1), -20 to -10 degrees (bin 2), -10 degrees to -5 degrees (bin 3), -5 degrees to 0 (bin 4), etc.
The angular velocity of the pole: this is also a continuous value, so again you should pick a set of bins. You may want to change the simulator to print out the velocity values so you can choose an appropriate set of bins.
Since the simulator also terminates when the cart runs into the left or right wall, you may want to add something to the state indicating when the cart is close to either wall (e.g., three bins, one when the cart is near the left wall, one when it is near the right wall, and one when it is not near either).
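The binning described above can be sketched as a small lookup over a sorted list of bin edges. This is only an illustration: the function names are made up, the first five edges follow the angle example, and the remaining edges (mirrored on the positive side) are an assumption. Check whether your simulator reports angles in degrees or radians before reusing the numbers.

```c
#include <stddef.h>

/* Hypothetical upper edges (in degrees) for 10 angle bins. The first five
 * follow the example in the text; the mirrored positive edges are an
 * assumption made for this sketch. */
static const double ANGLE_EDGES[] = {-30.0, -20.0, -10.0, -5.0, 0.0,
                                     5.0, 10.0, 20.0, 30.0};
#define NUM_ANGLE_EDGES (sizeof ANGLE_EDGES / sizeof ANGLE_EDGES[0])

/* Return the index of the first bin whose upper edge is greater than value.
 * With n edges there are n + 1 bins, numbered 0 .. n. */
int find_bin(double value, const double *edges, size_t num_edges)
{
    size_t i;
    for (i = 0; i < num_edges; i++) {
        if (value < edges[i]) {
            return (int)i;
        }
    }
    return (int)num_edges;   /* value is at or beyond the last edge */
}

/* Convenience wrapper for the angle; bins are numbered from 0 here. */
int angle_bin(double angle_degrees)
{
    return find_bin(angle_degrees, ANGLE_EDGES, NUM_ANGLE_EDGES);
}
```

The same find_bin function can be reused for the angular velocity and the cart position by passing a different edge array, which also makes it easy to read the thresholds from input when you try different representations.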

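If your representation has 10 bins for angle, eight for angular velocity, and three for the cart position, then there are 10 * 8 * 3 = 240 possible states. With bins numbered starting from 1, you can give each state a unique number from 0 to 239 by calculating a value like this (if you number your bins from 0, as in the angle example above, omit the "- 1" terms):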

  state = (ANGLEBIN - 1) * 8 * 3 + (VELOCITYBIN - 1) * 3 + (POSITION - 1)

Then your Q table is simply a two-dimensional array, with the first dimension being the number of states and the second dimension the number of actions.
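As a minimal sketch, assuming the 10 x 8 x 3 representation used in the example, the state numbering and the Q table might look like this in C. The constant and function names are illustrative and are not part of the provided code.

```c
#define NUM_ANGLE_BINS    10
#define NUM_VELOCITY_BINS 8
#define NUM_POSITION_BINS 3
#define NUM_STATES  (NUM_ANGLE_BINS * NUM_VELOCITY_BINS * NUM_POSITION_BINS)
#define NUM_ACTIONS 5

/* Q[s][a] estimates the value of taking action a in state s.
 * Arrays declared at file scope in C start out filled with zeros. */
double Q[NUM_STATES][NUM_ACTIONS];

/* Combine three bin numbers into one state index in 0 .. NUM_STATES - 1.
 * Bins are numbered from 1 here to match the formula in the text; drop the
 * "- 1" terms if your bins are numbered from 0. */
int state_index(int angle_bin, int velocity_bin, int position_bin)
{
    return (angle_bin - 1) * NUM_VELOCITY_BINS * NUM_POSITION_BINS
         + (velocity_bin - 1) * NUM_POSITION_BINS
         + (position_bin - 1);
}
```

For example, state_index(1, 1, 1) gives 0 and state_index(10, 8, 3) gives 239, so every combination of bins maps to its own row of Q.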

Learning the Q Table

To learn the Q table you should run a large number of pole balancing games. A game should end when the pole drops, when the cart hits a wall, or when 500 steps are reached. The reward the controller receives is a large negative value when the pole drops or when the cart hits the wall. All other rewards should be 0.

For the discount factor I would suggest a high value such as 0.9 or 0.95 (you may want to make this an input to your system). The Q update rule you should use is the one from class (on the second "Nondeterministic Case" overhead). To select among the actions, I would suggest choosing the "best" action (the one with the highest Q value) with some probability p at each step, and otherwise choosing an action at random. To make this technique work, I would start p at a low value during early learning (allowing lots of exploration) and then increase it for later games (allowing more exploitation).
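The action selection and update steps might be sketched as follows. Several things here are assumptions and not part of the assignment: the function names, the linear schedule for p, and in particular the form of the update rule. The update shown is the commonly used nondeterministic Q-learning rule with a learning rate that decays with the number of visits to a state-action pair; check it against the class overhead before relying on it.

```c
#include <stdlib.h>

#define NUM_ACTIONS 5

/* Index of the action with the highest Q value in one row of the table
 * (ties go to the lowest-numbered action). */
int best_action(const double q_row[NUM_ACTIONS])
{
    int a, best = 0;
    for (a = 1; a < NUM_ACTIONS; a++) {
        if (q_row[a] > q_row[best]) {
            best = a;
        }
    }
    return best;
}

/* With probability p_best pick the best action, otherwise a random one. */
int select_action(const double q_row[NUM_ACTIONS], double p_best)
{
    double u = rand() / (RAND_MAX + 1.0);   /* uniform in [0, 1) */
    if (u < p_best) {
        return best_action(q_row);
    }
    return rand() % NUM_ACTIONS;
}

/* Assumed schedule: raise p linearly from p_start to p_end over training. */
double p_best_for_game(int game, int max_games, double p_start, double p_end)
{
    return p_start + (p_end - p_start) * (double)game / (double)max_games;
}

/* Assumed update rule (verify against the overhead):
 *   alpha  = 1 / (1 + visits(s, a))
 *   Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max_next_q)
 * max_next_q is the highest Q value in the next state; pass 0.0 when the
 * game has just ended and there is no next state. */
void q_update(double *q_sa, int *visits_sa,
              double reward, double gamma, double max_next_q)
{
    double alpha;
    *visits_sa += 1;
    alpha = 1.0 / (1.0 + *visits_sa);
    *q_sa = (1.0 - alpha) * (*q_sa) + alpha * (reward + gamma * max_next_q);
}
```

Keeping a visit count per state-action pair means you need a second table of the same shape as Q. If the rule from class uses a fixed learning rate instead, only the line computing alpha changes.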

Your general learning approach should work as follows:

  FOR game = 1 TO MaxNumberOfTrainingGames DO

    initialize a pole balancing game

    REPEAT
      determine the state
      select an action
      perform that action
      measure the reward
      update the Q function
    UNTIL the pole drops OR the cart hits a wall OR 500 steps have happened

    IF a certain number of games have passed THEN
      evaluate the current Q table

To evaluate the current Q table you should run a certain number of games (perhaps 50) and record for each game how many steps occur before the pole drops (or 500 if the step limit was reached). Note that this differs slightly from the learning process: when you select an action during testing you should always choose the action with the highest Q value.
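One possible shape for the evaluation code is sketched below. The SimHooks structure is a hypothetical interface standing in for whatever functions your adapted simulator provides, and the toy simulator at the end exists only so the sketch can be exercised on its own; neither comes from the provided code. Averaging the step counts is also an assumption, chosen because it gives a single number to graph.

```c
#define NUM_ACTIONS 5
#define MAX_STEPS   500

/* Hypothetical simulator hooks; replace with calls into your own simulator. */
typedef struct {
    void (*reset)(void);               /* start a new game */
    int  (*current_state)(void);       /* index of the current state */
    int  (*apply_action)(int action);  /* returns nonzero when the game ends */
} SimHooks;

/* Play one game, always taking the highest-Q action, and return the number
 * of steps taken before the game ended (at most MAX_STEPS). */
int play_greedy_game(const SimHooks *sim, double Q[][NUM_ACTIONS])
{
    int steps = 0;
    sim->reset();
    while (steps < MAX_STEPS) {
        int s = sim->current_state();
        int a, best = 0;
        for (a = 1; a < NUM_ACTIONS; a++) {
            if (Q[s][a] > Q[s][best]) {
                best = a;
            }
        }
        steps++;
        if (sim->apply_action(best)) {
            break;
        }
    }
    return steps;
}

/* Average number of steps over num_games greedy games. */
double evaluate(const SimHooks *sim, double Q[][NUM_ACTIONS], int num_games)
{
    long total = 0;
    int g;
    for (g = 0; g < num_games; g++) {
        total += play_greedy_game(sim, Q);
    }
    return (double)total / num_games;
}

/* Toy stand-in simulator, used only to exercise the code above: there is a
 * single state, and the game ends as soon as any action other than 2 is
 * taken. */
static void toy_reset(void) {}
static int toy_state(void) { return 0; }
static int toy_apply(int action) { return action != 2; }
const SimHooks TOY_SIM = { toy_reset, toy_state, toy_apply };
```

With the toy simulator, a table that prefers action 2 reaches the 500-step limit in every game, while a table that prefers any other action fails on the first step. No exploration happens in this code, which is the difference from training noted above.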

Experiments

Conduct several experiments to see how well your state representation works. Note that you may want to make the definition of the states (i.e., how many bins, what the bin thresholds are, etc.) inputs to your system so that you can try different representations. Train each of your representations several times for a large number of games (100,000), stopping periodically (after every 1,000 games) to evaluate how good the solution is so far. Graph your results for each experiment and discuss how well each of your representations works.

What to Turn In

Print out a copy of your team's code. Also, write a report on your experiments from the previous section, including the graphs and the discussion requested there.

Next, your team should write up a short report (at least one page, no more than three) discussing your design decisions in implementing the Q-learning algorithm.

Finally, each member of your team should write a short report (at least half a page and no more than one page) discussing what you contributed to the effort and how the overall team approach worked for you.