Options for Scanner Interfaces


Assembly Language Input Structure

Assembly language input files have several levels of structure. A scanner interface will contain functions for dealing with most or all of these levels. Scanner designers need to decide how the scanner interacts with its clients at each level.

The Top Level: The Entire Input File

The top-level input structure for a scanner is an entire input file. The main question that arises at this level is "Does the scanner open the file, or is it opened by one of the other modules?" If the scanner opens the file then the file name must be passed as a parameter to some scanner function. Otherwise, a FILE * variable is passed to the scanner. For a two-pass assembler, the file either needs to be opened twice or else opened once and reset to the beginning using rewind() or fseek(). (lseek() works on file descriptors rather than FILE * streams.)
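
The two options can be sketched in C as follows. This is only an illustration: the Scanner type and the scanner_open(), scanner_attach(), scanner_restart(), and scanner_close() names are assumptions made for this example, not part of any particular assembler. The restart function uses the standard rewind() call, which resets a FILE * stream to its beginning.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical scanner state; all names here are illustrative only. */
typedef struct {
    FILE *in;        /* the input stream being scanned */
    int   owns_file; /* nonzero if the scanner opened the stream itself */
} Scanner;

/* Option 1: the scanner opens the file, so the client passes a name.
   Returns 1 on success and 0 if the file could not be opened. */
int scanner_open(Scanner *s, const char *file_name)
{
    assert(s != NULL && file_name != NULL);
    s->in = fopen(file_name, "r");
    s->owns_file = 1;
    return s->in != NULL;
}

/* Option 2: another module opened the file and passes the stream. */
void scanner_attach(Scanner *s, FILE *in)
{
    assert(s != NULL && in != NULL);
    s->in = in;
    s->owns_file = 0;
}

/* Between pass 1 and pass 2: return to the start of the input. */
void scanner_restart(Scanner *s)
{
    assert(s != NULL && s->in != NULL);
    rewind(s->in);
}

/* Close the stream only if the scanner was the module that opened it. */
void scanner_close(Scanner *s)
{
    assert(s != NULL);
    if (s->owns_file && s->in != NULL)
        fclose(s->in);
    s->in = NULL;
}
```

With either option, a two-pass driver would run pass 1, call scanner_restart(), and then run pass 2 over the same stream.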

When deciding which module is responsible for doing something, there are two considerations: the kind of abstractions that the modules are designed to handle and the complexity of the modules. It is generally best to aim for module designs in which each module deals with a single kind of abstraction. When this principle does not settle the question, it is reasonable to base the decision on balancing the complexity of the modules. That is, the responsibility goes to the module that would otherwise have the least complexity.

Lines and Statements

An input file is a sequence of lines. If lines are significant units of information then a scanner needs to provide some functions for dealing with them. For languages like C and Pascal, lines have no significance for syntax, so a scanner for a C or Pascal compiler would not need functions for controlling the advance from one line to the next. The scanner could possibly provide line number information, but it is only used for error reporting.

Lines in assembly language are significant structures. In assembly language, each line is a statement. The statements may be trivial for blank lines or lines with only labels or comments. Thus an assembler scanner does need functions for accessing lines.

Whenever one input structure is a sequence of lower-level input structures, there are options regarding how the lower-level structures are accessed. One possibility is that they are assigned indices, which the client supplies whenever it wants to access a particular lower-level structure. Another possibility is that the scanner provides iteration functions, allowing the client to decide when to advance to the next lower-level structure without specifying an index.

For example, an assembler scanner could assign line numbers for each line, requiring clients to specify the line number to access a particular line. Or, the scanner could provide an advance() function that steps from one line to the next.

Generally, the iterative approach is easier for the client, provided that the client does not need to access low-level structures in random order. If sequential access meets the needs of clients then the use of indices forces clients to do extra bookkeeping that is not inherent in the work that they need to do. For an assembler scanner, a two-pass structure deals with all dependencies between lines that could otherwise require random access to lines. Thus an assembler scanner does not need to require its clients to specify line numbers to access a line. Like a compiler scanner, the assembler scanner may provide line numbers that can be used by the higher-level modules for error reports.
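
A minimal sketch of the iterative approach in C is shown here. The LineScanner type, the fixed MAX_LINE bound, and the accessor names are assumptions made for this example; only the idea of an advance() function comes from the discussion above. The client never supplies a line number, but it can ask for the current one when it needs to report an error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 256   /* assumed bound; longer lines would be split */

/* Hypothetical line-level scanner state. */
typedef struct {
    FILE *in;
    char  line[MAX_LINE]; /* text of the current line, newline removed */
    int   line_number;    /* 1-based; 0 before the first advance() */
} LineScanner;

void line_scanner_init(LineScanner *s, FILE *in)
{
    assert(s != NULL && in != NULL);
    s->in = in;
    s->line[0] = '\0';
    s->line_number = 0;
}

/* Step to the next line. Returns 1 on success, 0 at end of input. */
int advance(LineScanner *s)
{
    assert(s != NULL);
    if (fgets(s->line, sizeof s->line, s->in) == NULL)
        return 0;
    s->line[strcspn(s->line, "\n")] = '\0';  /* strip trailing newline */
    s->line_number++;
    return 1;
}

/* The current line's text, valid until the next call to advance(). */
const char *current_line(const LineScanner *s)
{
    assert(s != NULL);
    return s->line;
}

/* Line number of the current line, intended for error reports. */
int current_line_number(const LineScanner *s)
{
    assert(s != NULL);
    return s->line_number;
}
```

A client loop then reduces to "while (advance(&s)) { ... }", with no index bookkeeping on the client side.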

Lower-Level Units

One issue that scanner designers eventually need to decide is the lowest level of input structure that the scanner provides to its clients. These lowest-level units are called tokens. The next two sections outline two possible choices for scanner tokens; the section after that discusses the considerations involved in choosing between them. The two possibilities are extremes; there are numerous intermediate possibilities.

Labels, Operations, and Operand Information

Consider the following MIPS assembly language statement.

    else:   lw      $ra,    4($sp)

For the higher-level modules, this statement contains five important pieces of information. Pass 1 must deal with the label else, and needs to look at the operation lw in case it is an assembler directive. Once pass 1 sees a machine instruction, it knows it is 4 bytes long. That is all pass 1 needs to know in order to update its location counter.

Pass 2 is not interested in the label, but must know the operation in order to generate binary code. For code generation, pass 2 also needs to know the destination register name $ra, the displacement 4, and the base register name $sp. Pass 2 also needs to know that the displacement and base register are two parts of the same operand.
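
To make pass 2's needs concrete, here is a small C sketch of encoding the lw instruction above. The function name encode_lw() is an assumption made for this example, and it takes register numbers rather than names; translating $ra to 31 and $sp to 29 (the standard MIPS register numbers) is left to a lookup that is not shown. The sketch uses the MIPS I-type layout: a 6-bit opcode, a 5-bit base register field, a 5-bit destination register field, and a 16-bit displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Encode the MIPS I-type instruction: lw rt, displacement(base).
   rt and base are register numbers (0..31), not register names. */
uint32_t encode_lw(unsigned rt, unsigned base, int displacement)
{
    assert(rt < 32 && base < 32);
    assert(displacement >= -32768 && displacement <= 32767);
    return (UINT32_C(0x23) << 26)                 /* opcode for lw */
         | ((uint32_t)base << 21)                 /* rs field: base register */
         | ((uint32_t)rt << 16)                   /* rt field: destination */
         | ((uint32_t)displacement & 0xFFFFu);    /* 16-bit signed offset */
}
```

For the example statement, encode_lw(31, 29, 4) yields 0x8FBF0004. The displacement and the base register go into the same instruction word, which is why pass 2 needs to know that they belong to one operand.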

In one possible scanner design, the scanner provides functions that pass back the units that are meaningful to the higher-level modules: labels, operations, and parts of operands. The scanner strips out other parts of the input such as colons, commas, and comments.
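
The following C sketch shows what such a high-level interface might look like. Everything in it is an assumption made for illustration: the Statement and Operand types, the scan_statement() function, the fixed size limits, and the simplified syntax it accepts (an optional label, an operation, and up to three operands of the form name, integer, or integer(name), with # starting a comment).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME     32
#define MAX_OPERANDS 3

typedef struct {
    int  has_displacement;  /* nonzero if an integer part was present */
    long displacement;
    char name[MAX_NAME];    /* register or symbol name, "" if none */
} Operand;

typedef struct {
    char    label[MAX_NAME];      /* "" if the line has no label */
    char    operation[MAX_NAME];  /* "" for blank or comment-only lines */
    int     operand_count;
    Operand operands[MAX_OPERANDS];
} Statement;

static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t') p++;
    return p;
}

/* Copy a run of name characters into out; return the position after it. */
static const char *read_name(const char *p, char *out)
{
    size_t n = 0;
    while ((isalnum((unsigned char)*p) || *p == '_' || *p == '$' || *p == '.')
           && n < MAX_NAME - 1)
        out[n++] = *p++;
    out[n] = '\0';
    return p;
}

/* Fill in st from one line; colons, commas, and comments are dropped. */
void scan_statement(const char *line, Statement *st)
{
    char word[MAX_NAME];
    const char *p;
    assert(line != NULL && st != NULL);
    memset(st, 0, sizeof *st);
    p = read_name(skip_blanks(line), word);
    if (*p == ':') {                      /* the first word was a label */
        strcpy(st->label, word);
        p = read_name(skip_blanks(p + 1), word);
    }
    strcpy(st->operation, word);
    p = skip_blanks(p);
    while (*p != '\0' && *p != '#' && st->operand_count < MAX_OPERANDS) {
        Operand *op = &st->operands[st->operand_count];
        if (isdigit((unsigned char)*p) || *p == '-') {
            char *end;
            op->displacement = strtol(p, &end, 10);
            if (end == p) break;          /* malformed number: give up */
            op->has_displacement = 1;
            p = end;
            if (*p == '(') {              /* base register follows */
                p = read_name(p + 1, op->name);
                if (*p == ')') p++;
            }
        } else {
            const char *q = read_name(p, op->name);
            if (q == p) break;            /* unexpected character: give up */
            p = q;
        }
        st->operand_count++;
        p = skip_blanks(p);
        if (*p == ',') p = skip_blanks(p + 1);
    }
}
```

For the example statement, the client receives the label else, the operation lw, and two operands: the register $ra, and the displacement 4 paired with the base register $sp. The client never sees the colon, the comma, or the parentheses.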

Lower-Level Tokens

The statement example above could be handled differently by the scanner. The statement could be regarded as a sequence of lower-level units: identifiers, colons, register names, commas, integers, and parentheses. Thus the scanner passes everything it sees to the higher-level modules, packaged in small units.
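
A low-level interface could be sketched in C as below. The TokenKind and Token types and the next_token() function are assumptions made for this example, and the lexical rules are deliberately simplified (no signed integers, no string literals, # starts a comment). Each token records where it starts in the line and how long it is, so nothing is copied.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum {
    TOK_IDENTIFIER, TOK_REGISTER, TOK_INTEGER,
    TOK_COLON, TOK_COMMA, TOK_LPAREN, TOK_RPAREN,
    TOK_UNKNOWN,  /* any character this sketch does not recognize */
    TOK_END       /* end of line or start of a comment */
} TokenKind;

typedef struct {
    TokenKind   kind;
    const char *start;   /* first character of the token within the line */
    size_t      length;  /* number of characters in the token */
} Token;

/* Scan one token starting at *cursor and move *cursor past it. */
Token next_token(const char **cursor)
{
    const char *p;
    Token t;
    assert(cursor != NULL && *cursor != NULL);
    p = *cursor;
    while (*p == ' ' || *p == '\t') p++;      /* skip leading blanks */
    t.start = p;
    if (*p == '\0' || *p == '\n' || *p == '#') {
        t.kind = TOK_END;
        t.length = 0;
        *cursor = p;                          /* stay at the end */
        return t;
    }
    if (*p == '$' || isalpha((unsigned char)*p) || *p == '_') {
        t.kind = (*p == '$') ? TOK_REGISTER : TOK_IDENTIFIER;
        p++;
        while (isalnum((unsigned char)*p) || *p == '_') p++;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_INTEGER;
        while (isdigit((unsigned char)*p)) p++;
    } else {
        switch (*p) {
        case ':': t.kind = TOK_COLON;   break;
        case ',': t.kind = TOK_COMMA;   break;
        case '(': t.kind = TOK_LPAREN;  break;
        case ')': t.kind = TOK_RPAREN;  break;
        default:  t.kind = TOK_UNKNOWN; break;
        }
        p++;
    }
    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

Applied repeatedly to the example statement, next_token() produces an identifier (else), a colon, an identifier (lw), a register ($ra), a comma, an integer (4), a left parenthesis, a register ($sp), a right parenthesis, and then the end marker. Recognizing that else is a label and that 4($sp) is one operand is left to the client.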

Considerations Involved in the Choice of Tokens

The choice of the level of the scanner tokens must be based on consideration of the types of abstractions that the scanner and its clients will be forced to deal with, and on the balance of work between the scanner and its clients. For example, consider a scanner for a compiler. High-level languages combine low-level tokens in a wide variety of ways. Thus the use of high-level tokens would require that scanner implementors work with abstractions that are far from the primary responsibility of a scanner: dealing with input at a character level. For this reason, compiler scanners invariably use low-level tokens. This results in more work for a higher-level module (called a parser), but the extra work that the parser has to do involves the same kind of abstraction that it is already involved with.

To see this, consider a possible analog to providing operand information in an assembler. A compiler scanner could attempt to form tokens for each parameter in a function call. However, the fact that high-level languages allow complex expressions to be used as parameters means that the scanner would have to deal with information at the same level as the parser.

For assemblers, the choice is not as clear-cut. In assembly language, the low-level tokens are combined in a very limited number of ways, making it possible for the scanner to form higher-level tokens without dealing with a very different kind of abstraction. Low-level and high-level tokens both allow the scanner implementors to deal with a single kind of abstraction, so the choice can reasonably be based on balancing the complexity between the scanner and its clients.


Page URL: http://www.d.umn.edu/~gshute/asm/options.html
Page Author: Gary Shute
Last Modified: Saturday, 24-Mar-2012 10:04:19 CDT
Comments to: gshute@d.umn.edu