Computer Science 5641
Compiler Design
Project Part 1 (20 points)
Due September 30, 2004

Over the course of the semester we will be implementing a working interpreter for a simple language. The project will be divided into several smaller parts, each building on previous parts. In this first section we will begin with something (hopefully) straightforward and concentrate on developing team skills.

Part I: Scanner Support

One of the simple but important tasks in implementing a compiler or interpreter is managing the identifiers provided by the user. This is generally done with a hash table of identifiers, where the characters of the identifiers themselves are stored in a string space to manage memory. This allows the lookup and comparison of strings to occur very quickly. For this work you will implement a StringSpace class and a StringTable class as described below:

StringSpace - a string space consists of a series of pages of data each consisting of a series of unique strings. The string space is used by the compiler whenever a user supplies an identifier (and we will also use it for reserved words). If the string is new the compiler will add it to the top page of data and from that point on use a pointer to the page and the offset on that page as the unique reference to that string (so that string comparison operators can mostly be avoided). To this end you should implement this data structure.

A StringSpace object is one that maintains (internally) a set of data pages on which it stores strings (as many as will fit on each page). For a real system, the size of the page would be determined by system parameters (page sizes in memory and/or on disk). For our work, the page size should be a parameter used to create the initial StringSpace object (you can set the page size small when testing your programs and increase the size later when using this module). A StringSpace provides one function for other users, an insert_string function that takes a pointer to a string, stores the string on some page, and returns a reference to where the string is stored in the form of a related data structure called a StringSpaceEntry. A StringSpaceEntry maintains (internally) a pointer to the page of data containing the string and an offset on that page where the string can be found. A StringSpaceEntry should include a function to return the actual string corresponding to that entry, as well as functions to compare that entry to a new string (and possibly to another StringSpaceEntry).

Data in the string space is maintained in pages. Each page has an array of characters it can use to store one or more strings. When a request to insert a string comes in, you should add the string to the top page if there is enough space; otherwise, add a new page as the top page and then add the string to that page.
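
The paging behavior described above can be sketched as follows. This is only an illustration, written in Python for brevity; the handout does not prescribe a language. The method names string, matches, same_as and page_count, the stored length field, and the decision to reject a string longer than a page are all assumptions made for the sketch, not requirements. Note that this sketch does not check for duplicates itself; it assumes the caller (the string table) only inserts strings that are new.

```python
class StringSpaceEntry:
    """Reference to one stored string: a page plus an offset on that page."""

    def __init__(self, page, offset, length):
        self._page = page      # the page (a bytearray) holding the string
        self._offset = offset  # index of the string's first character
        self._length = length  # assumption: store a length; a terminator would also work

    def string(self):
        """Return the actual string this entry refers to."""
        end = self._offset + self._length
        return self._page[self._offset:end].decode("ascii")

    def matches(self, text):
        """Compare this entry to a new string, character by character."""
        return self.string() == text

    def same_as(self, other):
        """Compare two entries by page and offset only (no string comparison)."""
        return self._page is other._page and self._offset == other._offset


class StringSpace:
    def __init__(self, page_size):
        self._page_size = page_size
        self._pages = [bytearray()]  # the last element is the "top" page

    def insert_string(self, text):
        data = text.encode("ascii")
        if len(data) > self._page_size:
            # Assumption: the handout does not say what to do in this case.
            raise ValueError("string does not fit on a single page")
        top = self._pages[-1]
        if len(top) + len(data) > self._page_size:
            top = bytearray()        # not enough room: start a new top page
            self._pages.append(top)
        offset = len(top)
        top.extend(data)
        return StringSpaceEntry(top, offset, len(data))

    def page_count(self):
        return len(self._pages)
```

With a page size of 8, inserting "while" and then "if" fills seven characters of the first page, so a following "else" no longer fits and lands at offset 0 of a new top page. A small page size like this makes the page-overflow path easy to exercise in a test program.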

StringTable - a string table is a hash table used to store strings encountered by the compiler. For our work we will also use it to recognize keywords (to make scanning a bit easier). When a string is encountered during scanning the compiler will look up that string in the StringTable. If the string is not already in the table it will be added to the StringTable. A string is added to the string table by adding the string to the string space and then using the StringSpaceEntry as the data for the hash table. Your hash entry should also include a token number. You should plan on implementing the hash table using a linked list to deal with collisions. The number of buckets in the hash table should be a parameter used in creating the StringTable so that you can set it small for testing and make it larger later. For the moment, the token number corresponding to a hash table entry can be a random value. During scanning we will start by inserting all of the reserved words, along with their corresponding token numbers, into the StringTable. Then, when we find an identifier in the program we will determine if that identifier corresponds to a reserved word by looking it up in the string table.
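
A minimal sketch of such a table, with a linked list per bucket, might look like the following. It is written in Python for illustration only. To keep the example self-contained, each node stores the string directly where your implementation would store a StringSpaceEntry. The names lookup and insert, the hash function, and the token numbers used in the example are assumptions, not part of the assignment.

```python
class _Node:
    """One hash entry: the string, its token number, and the next entry in the chain."""

    def __init__(self, text, token, next_node):
        self.text = text      # stand-in for a StringSpaceEntry
        self.token = token
        self.next = next_node


class StringTable:
    def __init__(self, bucket_count):
        # Each bucket holds the head of a linked list (None when empty).
        self._buckets = [None] * bucket_count

    def _index(self, text):
        # Assumption: a simple multiplicative hash; any reasonable hash would do.
        h = 0
        for ch in text:
            h = (h * 31 + ord(ch)) % len(self._buckets)
        return h

    def lookup(self, text):
        """Return the entry for text, or None if it is not in the table."""
        node = self._buckets[self._index(text)]
        while node is not None:
            if node.text == text:
                return node
            node = node.next
        return None

    def insert(self, text, token):
        """Return the existing entry for text, or add a new one with this token."""
        found = self.lookup(text)
        if found is not None:
            return found
        i = self._index(text)
        node = _Node(text, token, self._buckets[i])  # push on the front of the chain
        self._buckets[i] = node
        return node
```

Creating the table with only two buckets, for example StringTable(2), forces collisions after a few insertions, which exercises the linked-list code. Inserting a string that is already present returns the existing entry, so a reserved word keeps the token number it was given at initialization.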

Testing - make sure to carefully test your code (as your later code will depend on this). You should plan to implement test programs to test both your string space and string table implementations and submit results from this testing to demonstrate that your program is working.

Writeup - document your code and your testing. You should write a joint group summary for the work (one page for this part of the project). I would also like each team member to write up an individual summary of the project and how the team interaction is going (less than one page).

Part II: A Transducer for the Tokens

The set of tokens that will be used in the language we will be implementing is described below:

Punctuation:    ;    ,    (    )    {    }

Operators:    =    +    -    *    /    ==    !=    <    >    <=    >=    !    &&    ||   <<   >>   .

Reserved words: char int float if else fi struct while elihw (case matters)

Identifier: a letter followed by 0 or more letters, digits, or underscore (_) characters

Character constant: a single typed character between single quotes (') -- but the newline character cannot appear between the single quotes, nor can a tab, the single quote character itself, or the backslash character (\). These character constants are represented (respectively) by '\n' (newline), '\t' (tab), '\'' (single quote) and '\\' (backslash).

Integer constant: one or more digit characters

Float constant: one or more digit characters, followed by a period (.), followed by one or more digit characters

String constant: a double quote character followed by 0 or more string components ending with a double quote character. A component of a string can include any character except a newline character, a tab character, a backslash character or a double quote character. These characters are represented (respectively) by \n (newline), \t (tab), \\ (backslash) and \" (double quote).
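
As an illustration of the string constant rule, here is a small Python sketch that scans one string constant and translates the four escape sequences. The function name scan_string_constant and its return convention are hypothetical. Treating any other character after a backslash as an error is an assumption; the rule above only lists the four escapes.

```python
# Character after the backslash -> the character it represents.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def scan_string_constant(text, pos):
    """Scan the string constant that starts at text[pos].

    Returns (value, next_pos) on success, or None if text[pos:] does not
    begin with a well-formed string constant.
    """
    if pos >= len(text) or text[pos] != '"':
        return None
    value = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            return "".join(value), i + 1   # closing quote ends the constant
        if ch == "\n" or ch == "\t":
            return None                    # raw newline or tab is not allowed
        if ch == "\\":
            if i + 1 >= len(text) or text[i + 1] not in ESCAPES:
                return None                # assumption: unknown escapes are errors
            value.append(ESCAPES[text[i + 1]])
            i += 2
            continue
        value.append(ch)                   # any other character stands for itself
        i += 1
    return None                            # ran out of input before a closing quote
```

Each branch of the loop corresponds to one kind of arc you would draw out of the "inside a string" state of the transducer: closing quote, backslash, illegal character, and everything else.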

Transducer - draw (by hand or you can use a program) a transducer that would recognize these tokens (you may want to spend some time on this as you will be implementing the resulting transducer as part of the second part of the project). You should assume that the reserved words are recognized initially as Identifiers and then as reserved words by looking them up in the string table (we will initialize the string table with the set of reserved words).
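
To give a feel for how a drawn transducer turns into code, here is a table-driven Python sketch covering only three token classes: identifiers, integer constants, and float constants. The state names, the token kind names, and the function scan_token are hypothetical. A plain set of reserved words stands in for the string table lookup. The sketch remembers the last accepting state it passed through, which is one common way (assumed here, not required) to handle input such as 12.x, where the period turns out not to be part of a float constant.

```python
RESERVED = {"char", "int", "float", "if", "else", "fi", "struct", "while", "elihw"}


def char_class(ch):
    """Map a character to the label used on the transducer's arcs."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch == "_":
        return "underscore"
    if ch == ".":
        return "period"
    return "other"


# (state, character class) -> next state; a missing pair means "no arc".
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("ident", "underscore"): "ident",
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "period"): "int_dot",
    ("int_dot", "digit"): "float",
    ("float", "digit"): "float",
}

# Accepting states and the kind of token each one recognizes.
ACCEPTING = {"ident": "IDENTIFIER", "int": "INTEGER", "float": "FLOAT"}


def scan_token(text, pos):
    """Return (kind, lexeme, next_pos) for the longest token at text[pos], or None."""
    state = "start"
    last_accept = None
    i = pos
    while i < len(text):
        next_state = TRANSITIONS.get((state, char_class(text[i])))
        if next_state is None:
            break
        state = next_state
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)  # remember the longest match so far
    if last_accept is None:
        return None
    kind, end = last_accept
    lexeme = text[pos:end]
    if kind == "IDENTIFIER" and lexeme in RESERVED:
        kind = "RESERVED"  # stand-in for the string table lookup
    return kind, lexeme, end
```

For example, scanning "12.x" from position 0 yields the integer constant 12 and stops before the period, since int_dot is not an accepting state, while "While" comes back as an ordinary identifier because case matters for reserved words. A full transducer would add states for punctuation, operators, character constants, and string constants.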