Computer Science 4611
Database Management Systems

Programming Assignment 4
B+ Tree Files and the System Catalog (75 points)
Due Thursday, December 2, 2004

Introduction

In this part of the assignment you will build on your buffer manager and add two critical layers to your DBMS. In this assignment you will add a File Manager layer that creates B+ tree files. You will use this representation as the primary representation for each file (we will be using Alternative 1 representations for both). There will be no separate indexes. You will also be implementing a system catalog. This catalog will be implemented as a set of relations stored as relation files. You will have a relation describing your relations with at least the relation name, how many attributes the relation has and how many index attributes it has. You will also have a relation describing the relation attributes with the relation an attribute belongs to, the name of the attribute, the type of the attribute (int, float or string with a size), the number (in order for the tuple) of the attribute and the size if a string attribute. Finally, you will have a relation describing the index attributes with at least the relation each index attribute belongs to.

The Code

To test and assess your code you will be connecting your routines to a set of commands in a simple query language. To set this up you will need to download the the file fc.tar.Z. This is a tared archive file. To unpack this file you should do the following:

  % uncompress fc.tar
  % tar xvf fc.tar

This will create a directory FileCat containing a set of provided code. You should add your implementation of buf.C and buf.h to this directory and then try compiling. The resulting code is called as follows:

  querycmd [gory] [db] [bm] [hf] [qe] [dbfile]

The options gory, db, bm, hf and qe set debugging flags as in the BufMgr which you may take advantage of. If you include a filename at the end of the querycmd line the program will attempt to open a database (that we have previously created) with that name, and this file will then be updated and saved when your session ends (making it possible to use the file again in the future).

Note that if you start up a session without providing a db file name the code will create a file named __minidb_file. This is important because if your code unexpectedly terminates before completing the resulting file will likely be left in a state such that it can no longer be read in (and you will have to delete the file).

The Query Language

The query language is a very simple query language designed for this project. Note that there is currently no error checking (in large part because many errors can only be checked using the code you provide). The following commands are provided in the language (and should be completed in this project):

The query language also contains hooks for PROJECT, SELECT and JOIN queries, we will be completing these in the next section.

Implementation Suggestions

In order to make it impossible for users to change the system catalog relations directly you should give these relations names that could not be generated by the user (the easiest way is to start the name with an underscore character). The system catalog relations should be stored as relation files and searched appropriately.

For the B+ tree files you should have a base page with simple information on it. This should include the page # of the root page, the page # of the first actual data page and the page # of the last actual data page (remember that the data pages are in some sense a sorted list of the data).

Each data page should devote the first X byes of data to header information. This would include the previous and next page numbers, the parent page (in the non-data pages) and a slot directory (you should assume tuples are of a fixed size). The slot directory should have a number indicating the number of slots followed by the actual slot directory with one bit for each record slot. You may also want to retain a count of the number of slots currently full (so you can decide when you need to split or merge as the case may be).

Each tree page should also have some header information. This would include a pointer (page #) of the parent page as well as a number indicating how many of the slots on the page are in use. Note that as with the data pages, you will need to calculate how many items can fit. Each item on a tree page requires enough bytes to hold the primary key values and enough bytes for a page pointer (a page number of a child). Note that you also need one extra child pointer.

One tactic you may employ is anticipatory splitting on insertion. Under this variation, if you are performing an insert, as you search the tree, you may split any node that is full into two half-full nodes as you move down the tree looking for the appropriate place to insert the new tuple. Under this approach all insertions should proceed strictly top down - you should not need to move back up the tree.

For deletion you may want to consider performing a similar operation (anticipatory merging), but you may also want to simply perform normal deletion.

When processing a command you should start by figuring out the relation or relations needed. To begin to work with these relations you should get the first (header) page of each relation (use the db command get_file_entry) to get the first page of the file corresponding to that relation. Then process the file as needed.

For example, to process an insert tuple command, you will first need to verify that the relation exists (either by looking it up in the db using get_file_entry or by looking it up in the relation relation). Then you need to go through the relation relation to find out how many attributes the relation has and the format of its file, then through the attribute relation to find the names of the attributes and then look through the index relation to find the names of the index attributes. Then you can check the validity of the insert fields.

You may want to create one (or two) scanning mechanisms to connect to each file (these will be useful in the next part of the DBMS). For a scan you would initialize the scan to return the first record from the file. Then everytime you tug on the scan you will be given the next record from the file.

What to Turn In, and When

Print out your versions of all the code you add. You should test your code by showing examples of the various query commands. Note that you should test your code extensively considering many possible combinations of commands. You should hand in copies of all of your tests. Next, write up a team report of how your code is implemented. This report should give an overview of how you completed the various sections of this project. It should also discuss the algorithms you used to solve the problem. This report should be at least two pages long but no longer than four pages. Each team member should also write up an individual report (at least half a page but no more than a page) discussing their contributions to the coding process and how the overall team interaction went.

You must also submit your code electronically (but only once for each team). To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs4611-1-f2004/uploa d.cgi and follow the directions for uploading a file.

To make your code easier to check and grade please use the following procedure for collecting the code before uploading it: