CS 4611 (Spring 2004) Program 3

Computer Science 4611
Database Management Systems

Programming Assignment 4
B+ Tree Files and the System Catalog (75 points)
Due Thursday, December 2, 2004

Introduction

In this part of the assignment you will build on your buffer manager and add two critical layers to your DBMS. In this assignment you will add a File Manager layer that creates B+ tree files. You will use this representation as the primary representation for each file (we will be using Alternative 1 representations for both). There will be no separate indexes. You will also be implementing a system catalog. This catalog will be implemented as a set of relations stored as relation files. You will have a relation describing your relations with at least the relation name, how many attributes the relation has and how many index attributes it has. You will also have a relation describing the relation attributes with the relation an attribute belongs to, the name of the attribute, the type of the attribute (int, float or string with a size), the number (in order for the tuple) of the attribute and the size if a string attribute. Finally, you will have a relation describing the index attributes with at least the relation each index attribute belongs to.

The Code

To test and assess your code you will be connecting your routines to a set of commands in a simple query language. To set this up you will need to download the the file fc.tar.Z. This is a tared archive file. To unpack this file you should do the following:

  % uncompress fc.tar
  % tar xvf fc.tar

This will create a directory FileCat containing a set of provided code. You should add your implementation of buf.C and buf.h to this directory and then try compiling. The resulting code is called as follows:

  querycmd [gory] [db] [bm] [hf] [qe] [dbfile]

The options gory, db, bm, hf and qe set debugging flags as in the BufMgr which you may take advantage of. If you include a filename at the end of the querycmd line the program will attempt to open a database (that we have previously created) with that name, and this file will then be updated and saved when your session ends (making it possible to use the file again in the future).

Note that if you start up a session without providing a db file name the code will create a file named __minidb_file. This is important because if your code unexpectedly terminates before completing the resulting file will likely be left in a state such that it can no longer be read in (and you will have to delete the file).

The Query Language

The query language is a very simple query language designed for this project. Note that there is currently no error checking (in large part because many errors can only be checked using the code you provide). The following commands are provided in the language (and should be completed in this project):

CREATE RELATION relID ( ID1 : type1 , ID : type2 , ... ) ;
CREATE RELATION relID ( ID : type1 , ID : type2 , ... ) PRIMARYKEY ( PID1 , PID2 , ... ) ;
This requests the DB to create a relation named relID with attributes named ID1,ID2,... with types type1,type2,... where the type string can be INT, FLOAT, or STRING(num) where num is an integer indicating the size of the string. The names of the relation (and all other names in the query language must start with a letter and can be followed by an arbitrary number of letter, digit and underscore characters.

When creating a relation, you may also list a subset of the attributes that are to be used as a key.

Your code should check all appropriate errors (does the relation name already exist, is an attribute name inappropriate, is a string size inappropriate, does an attribute name occur more than once, does each primary key attribute correspond to an actual attribute, is any primary key attribute used twice, etc.) and then make appropriate entries into the system catalog and set up the initial header page for the file.
DROP RELATION relID ;
Delete the requested relation (get rid of the relation file and the entries in the system catalog).
DELETE ( cond ) FROM [ relname variablename ] ;
Delete all of the tuples matching the condition cond from the relation named relname. In order to be able to refer to fields of the relation you will supply after the relation name the name of a variable that can be used in the condition to specify attributes.

The following are legal conditions:
- An integer literal (e.g., 1, -2, 341), a float literal (-6.34, 42.12, 134.5), and string literals (e.g.,"a string","another string") are legal conditions
- An attribute field referred to as relationvariablename.attributename is a legal condition
- If cond is a legal condition then so is:
  - ( cond )
  - NOT cond
- If cond1 and cond2 are legal conditions then so are:
  - cond1 == cond2
  - cond1 != cond2
  - cond1 < cond2
  - cond1 <= cond2
  - cond1 > cond2
  - cond1 >= cond2
  - cond1 AND cond2
  - cond1 OR cond2
For evaluation purposes, a condition corresponding to an integer is false if the value is 0 and true otherwise, for a float, false if the value is 0.0 and true otherwise and for a string, false if the string is the empty string "" and true otherwise. Note also that float literals can have an optional minus sign at the start and then follow that with one or more digit characters followed by a period followed by one or more digit characters.
INSERT INTO relname FROM filestring;
Insert tuples into the relation relname from the file (note that the file string is a string literal, the string will be contained between double quotes). You should assume that the file consists of a set of tuples, each on a line with the attribute values separated by tab characters.
INSERT INTO relname ( value1 , value2, ... );
Insert a tuple into the relation relname with values value1,value2,... Note that the values must appear in the order of the attributes (and their types should match).
SHOW CATALOG ;
Print out the three relations of the system catalog in a nice tabular format (as in SQL).
SHOW RELATION relname ;
Print out the tuples of the relation relname in a nice tabular format.
DUMP CATALOG ;
Print out the contents of the system catalog relations in a format showing the details of the implementation of that relation by page (show each page of each relation, what data is contained on that page, etc.).
DUMP RELATION relname ;
Same as DUMP CATALOG but for named relations.
To quit type <ctrl>D at the command prompt.

The query language also contains hooks for PROJECT, SELECT and JOIN queries, we will be completing these in the next section.

Implementation Suggestions

In order to make it impossible for users to change the system catalog relations directly you should give these relations names that could not be generated by the user (the easiest way is to start the name with an underscore character). The system catalog relations should be stored as relation files and searched appropriately.

For the B+ tree files you should have a base page with simple information on it. This should include the page # of the root page, the page # of the first actual data page and the page # of the last actual data page (remember that the data pages are in some sense a sorted list of the data).

Each data page should devote the first X byes of data to header information. This would include the previous and next page numbers, the parent page (in the non-data pages) and a slot directory (you should assume tuples are of a fixed size). The slot directory should have a number indicating the number of slots followed by the actual slot directory with one bit for each record slot. You may also want to retain a count of the number of slots currently full (so you can decide when you need to split or merge as the case may be).

Each tree page should also have some header information. This would include a pointer (page #) of the parent page as well as a number indicating how many of the slots on the page are in use. Note that as with the data pages, you will need to calculate how many items can fit. Each item on a tree page requires enough bytes to hold the primary key values and enough bytes for a page pointer (a page number of a child). Note that you also need one extra child pointer.

One tactic you may employ is anticipatory splitting on insertion. Under this variation, if you are performing an insert, as you search the tree, you may split any node that is full into two half-full nodes as you move down the tree looking for the appropriate place to insert the new tuple. Under this approach all insertions should proceed strictly top down - you should not need to move back up the tree.

For deletion you may want to consider performing a similar operation (anticipatory merging), but you may also want to simply perform normal deletion.

When processing a command you should start by figuring out the relation or relations needed. To begin to work with these relations you should get the first (header) page of each relation (use the db command get_file_entry) to get the first page of the file corresponding to that relation. Then process the file as needed.

For example, to process an insert tuple command, you will first need to verify that the relation exists (either by looking it up in the db using get_file_entry or by looking it up in the relation relation). Then you need to go through the relation relation to find out how many attributes the relation has and the format of its file, then through the attribute relation to find the names of the attributes and then look through the index relation to find the names of the index attributes. Then you can check the validity of the insert fields.

You may want to create one (or two) scanning mechanisms to connect to each file (these will be useful in the next part of the DBMS). For a scan you would initialize the scan to return the first record from the file. Then everytime you tug on the scan you will be given the next record from the file.

What to Turn In, and When

Print out your versions of all the code you add. You should test your code by showing examples of the various query commands. Note that you should test your code extensively considering many possible combinations of commands. You should hand in copies of all of your tests. Next, write up a team report of how your code is implemented. This report should give an overview of how you completed the various sections of this project. It should also discuss the algorithms you used to solve the problem. This report should be at least two pages long but no longer than four pages. Each team member should also write up an individual report (at least half a page but no more than a page) discussing their contributions to the coding process and how the overall team interaction went.

You must also submit your code electronically (but only once for each team). To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs4611-1-f2004/uploa d.cgi and follow the directions for uploading a file.

To make your code easier to check and grade please use the following procedure for collecting the code before uploading it: