CS 4611 (Spring 2003) Program 3

Computer Science 4611
Database Management Systems

Programming Assignment 4
System Catalog and File Manager (75 points)
Due Wednesday, April 16, 2003

Introduction

In this part of the assignment you will build on your buffer manager and add two critical layers to your DBMS. In this assignment you will add a File Manager layer that is able to implement both Heap Files and Linearly Hashed Files. In both cases you will use each approach as the primary representation for the file (we will be using Alternative 1 representations for both). There will be no separate indexes. You will also be implementing a system catalog. This catalog will be implemented as a set of relations stored as Heap Files. You will have a relation describing your relations with at least the relation name, whether the relation is a Heap File, how many attributes the relation has and how many index attributes it has if it has an index. You will also have a relation describing the relation attributes with the relation an attribute belongs to, the name of the attribute, the type of the attribute (int, float or string with a size), the number (in order for the tuple) of the attribute and the size if a string attribute. Finally, you will have a relation describing the index attributes with at least the relation each index attribute belongs to and the number of the index.

The Code

To test and assess your code you will be connecting your routines to a set of commands in a simple query language. To set this up you will need to download the the file fc.tar.Z. This is a tared archive file. To unpack this file you should do the following:

  % uncompress fc.tar
  % tar xvf fc.tar

This will create a directory FileCat containing a set of provided code. You should add your implementation of buf.C and buf.h to this directory and then try compiling. The resulting code is called as follows:

  querycmd [gory] [db] [bm] [hf] [qe] [dbfile]

The options gory, db, bm, hf and qe set debugging flags as in the BufMgr which you may take advantage of. If you include a filename at the end of the querycmd line the program will attempt to open a database (that we have previously created) with that name, and this file will then be updated and saved when your session ends (making it possible to use the file again in the future).

Note that if you start up a session without providing a db file name the code will create a file named __minidb_file. This is important because if your code unexpectedly terminates before completing the resulting file will likely be left in a state such that it can no longer be read in (and you will have to delete the file).

The Query Language

The query language is a very simple query language designed for this project. Note that there is currently no error checking (in large part because many errors can only be checked using the code you provide). The following commands are provided in the language (and should be completed in this project):

CREATE RELATION relID ( ID1 : type1 , ID : type2 , ... ) ;
CREATE RELATION relID ( ID : type1 , ID : type2 , ... ) PRIMARYKEY ( PID1 , PID2 , ... ) ;
This requests the DB to create a relation named relID with attributes named ID1,ID2,... with types type1,type2,... where the type string can be INT, FLOAT, or STRING(num) where num is an integer indicating the size of the string. The names of the relation (and all other names in the query language must start with a letter and can be followed by an arbitrary number of letter, digit and underscore characters.

When creating a relation, you may also list a subset of the attributes that are to be used as a key. If no primary key is listed the file should be implemented as a heap. If there is a primary key, the file should be implemented using linear hashing.

Your code should check all appropriate errors (does the relation name already exist, is an attribute name inappropriate, is a string size inappropriate, does an attribute name occur more than once, does each primary key attribute correspond to an actual attribute, is any primary key attribute used twice, etc.) and then make appropriate entries into the system catalog and set up the initial header page for the file.
DROP RELATION relID ;
Delete the requested relation (get rid of the relation file and the entries in the system catalog).
DELETE ( cond ) FROM [ relname variablename ] ;
Delete all of the tuples matching the condition cond from the relation named relname. In order to be able to refer to fields of the relation you will supply after the relation name the name of a variable that can be used in the condition to specify attributes.

The following are legal conditions:
- An integer literal (e.g., 1, -2, 341), a float literal (-6.34, 42.12, 134.5), and string literals (e.g.,"a string","another string") are legal conditions
- An attribute field referred to as relationvariablename.attributename is a legal condition
- If cond is a legal condition then so is:
  - ( cond )
  - NOT cond
- If cond1 and cond2 are legal conditions then so are:
  - cond1 == cond2
  - cond1 != cond2
  - cond1 < cond2
  - cond1 <= cond2
  - cond1 > cond2
  - cond1 >= cond2
  - cond1 AND cond2
  - cond1 OR cond2
For evaluation purposes, a condition corresponding to an integer is false if the value is 0 and true otherwise, for a float, false if the value is 0.0 and true otherwise and for a string, false if the string is the empty string "" and true otherwise. Note also that float literals can have an optional minus sign at the start and then follow that with one or more digit characters followed by a period followed by one or more digit characters.
INSERT INTO relname FROM filestring;
Insert tuples into the relation relname from the file (note that the file string is a string literal, the string will be contained between double quotes). You should assume that the file consists of a set of tuples, each on a line with the attribute values separated by tab characters.
INSERT INTO relname ( value1 , value2, ... );
Insert a tuple into the relation relname with values value1,value2,... Note that the values must appear in the order of the attributes (and their types should match).
SHOW CATALOG ;
Print out the three relations of the system catalog in a nice tabular format (as in SQL).
SHOW RELATION relname ;
Print out the tuples of the relation relname in a nice tabular format.
DUMP CATALOG ;
Print out the contents of the system catalog relations in a format showing the details of the implementation of that relation by page (show each page of each relation, what data is contained on that page, etc.).
DUMP RELATION relname ;
Same as DUMP CATALOG but for named relations.
To quit type <ctrl>D at the command prompt.

The query language also contains hooks for PROJECT, SELECT and JOIN queries, we will be completing these in the next section.

Implementation Suggestions

In order to make it impossible for users to change the system catalog relations directly you should give these relations names that could not be generated by the user (the easiest way is to start the name with an underscore character). The system catalog relations should be stored as heap files and searched appropriately.

For the heap files, each file should have a header page with base information on it (pointers to a doubly-linked list of full heap pages and to a doubly-linked list of not full pages). Each remaining heap file page (the non header pages) should devote the first X bytes to header information. This would include the previous and next page numbers (use -1 for nil). This should also include info for a slot directory (you should assume that tuples are always of a fixed size). The slot directory should have a number indicating the number of slots followed by the actually slot directory with one bit for each record slot.

For the linearly hashed files you should have header pages corresponding to the page directory. The first of these pages should have a next pointer (a number in the directory) indicating where the next split should occur. The first page should also contain the number of entries in the directory currently and the level number. Each page should also have an available pointer to point to the next header page (you may also want to have a value on each page indicating what the first directory entry on the page's number is). The remainder of each page should contain page numbers corresponding to buckets. If the first header page fills you should add extra header pages as needed.

Each directory entry in the linearly hashed file should point to a bucket page. Each bucket page should have many of the same fields as the heap file pages.

When processing a command you should start by figuring out the relation or relations needed. To begin to work with these relations you should get the first page of each relation (use the db command get_file_entry) to get the first page of the file corresponding to that relation. Then process the file as needed.

For example, to process an insert tuple command, you will first need to verify that the relation exists (either by looking it up in the db using get_file_entry or by looking it up in the relation relation). Then you need to go through the relation relation to find out how many attributes the relation has and the format of its file, then through the attribute relation to find the names of the attributes and if needed through the index relation to find the names of the index attributes. Then you can check the validity of the insert fields. Then depending on the type of the file you can attempt to insert the tuple. If the file is a heap file you will look for the next page with an available slot and add the tuple to that page. If the resulting page becomes full you will need to remove it from the not full page list and move it to the full page list (pinning and dirtying all affected pages along the way).

If the file is a linearly hashed file you will need to find the first bucket page and then search through all of the bucket pages to make sure a tuple with the same key has not already been inserted (otherwise reject the insertion). Once you determine no matching tuple exists you should add the tuple to the first available page with an open slot. If no such page exists you should create another overlow page (and trigger the process of extending the directory).

For each type of file you may want to create one (or two) scanning mechanisms to connect to each file (these will be useful in the next part of the DBMS). For a scan you would initialize the scan to return the first record from the file. Then everytime you tug on the scan you will be given the next record from the file. For the linear hashing files you should create a scan mechanism that returns tuples from all of the pages associated with the file, and one that returns only those pages corresponding to a particular bucket value. These scans could then be used in working through a file.

What to Turn In, and When

Print out your versions of all the code you add. You should test your code by showing examples of the various query commands. Note that you should test your code extensively considering many possible combinations of commands. You may also want to edit the code at some point to reduce the page size and run some tests with a much smaller page size to show a case involving a linear hashing file where the header page of the linear hashing file covers more than one page. You should hand in copies of all of your tests. Next, write up a team report of how your code is implemented. This report should give an overview of how you completed the various sections of this project. It should also discuss the algorithms you used to solve the problem. This report should be at least two pages long but no longer than four pages. Each team member should also write up an individual report (at least half a page but no more than a page) discussing their contributions to the coding process and how the overall team interaction went.

You must also submit your code electronically (but only once for each team). To do this go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs4611-1-s2003/uploa d.cgi and follow the directions for uploading a file.

To make your code easier to check and grade please use the following procedure for collecting the code before uploading it: