Assembler Organization


Assembler Functions

Two important steps are involved in designing software: dividing the software into smaller, more manageable components or modules, and determining how the modules will communicate. For both of these steps, the designer needs to keep the major kinds of software functionality in mind. The following are the primary functions of an assembler.

Assembly Passes

The simplest organization for an assembler is a two-pass organization. The need for this organization arises from consideration of the assembler functions and the nature of assembly language code.

The memory address for a label cannot be assigned until its definition has been read, but code generation for jumps and branches requires knowing the address of the jump or branch target. A branch may refer to a label that is defined later in the source, so the assembler makes two passes through the input. During the first pass, the assembler assigns memory addresses to labels. Machine code is generated during the second pass.

Dealing with Characters

The input to an assembler consists of a stream of characters which represent assembly information in several different ways: integers, real numbers, labels, quoted characters and strings, register names, and various kinds of punctuation. If character processing is mixed in with algorithms for building symbol tables or algorithms for code generation, the result is a complex mess that would challenge even the best programmers. The problem is that this organization (or disorganization) forces programmers to deal with two distinct kinds of abstraction simultaneously. The result is difficult algorithm development, a large number of errors, and difficulty in debugging. These problems are magnified if the assembler requires maintenance at a later time.

A good general design principle is to assign responsibility for different kinds of abstractions to different program modules. This principle is crucial for large designs, but is a good practice for smaller designs as well. Dividing responsibilities for different kinds of abstractions into different modules allows programmers to focus their attention on one aspect of the problem at a time.

For an assembler, the implication is that there should be a module dedicated to handling text at the level of characters. This module is called a lexical analyser or scanner.

Assembler Modules

Thus an assembler has four main modules: a scanner, a pass 1 module, a pass 2 module, and a main program module. The responsibilities of these modules are described in more detail in the following sections.

Lexical analysis: the scanner

The primary purpose of the scanner module is processing characters into higher level units that are more meaningful for the pass 1 and pass 2 modules. These units are called tokens. The process of forming these groups is called lexical analysis. The need for a scanner arises in most programs that deal with complex input.

Part of the design of a scanner involves deciding precisely what a token is; that is, deciding the level of the units that the scanner delivers to the pass modules. Some possibilities are discussed in Options for Scanner Interfaces.
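As a rough illustration of one possible token level, the following C sketch groups characters into identifiers, integers, single punctuation characters, and end-of-line markers. The names TokenKind, Token, and next_token are hypothetical, chosen only for this example, and a real scanner would also handle comments, quoted strings, register names, and other number formats.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical token kinds; a real assembler would need more. */
typedef enum { TOK_IDENT, TOK_INT, TOK_PUNCT, TOK_EOL, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* spelling of an identifier or punctuation mark */
    long value;      /* numeric value when kind == TOK_INT */
} Token;

/* Read the next token from *src and advance *src past it. */
Token next_token(const char **src)
{
    Token tok = { TOK_EOF, "", 0 };
    const char *p = *src;

    while (*p == ' ' || *p == '\t')       /* skip blanks, but not newlines */
        p++;

    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (*p == '\n') {
        tok.kind = TOK_EOL;
        p++;
    } else if (isdigit((unsigned char)*p)) {
        char *end;
        tok.kind = TOK_INT;
        tok.value = strtol(p, &end, 10);  /* decimal integers only */
        p = end;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        size_t n = 0;
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof tok.text - 1)  /* silently truncate long names */
                tok.text[n++] = *p;
            p++;
        }
        tok.text[n] = '\0';
    } else {
        tok.kind = TOK_PUNCT;             /* any other single character */
        tok.text[0] = *p++;
        tok.text[1] = '\0';
    }

    *src = p;
    return tok;
}
```

With this interface, a pass module calls next_token repeatedly and never looks at individual characters. Whether the scanner should deliver tokens this small, or larger units such as whole operands, is exactly the kind of design decision described above.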

Building the symbol table: pass 1

During pass 1, the input is read and memory addresses are assigned to program labels. Memory is allocated sequentially, so the pass 1 module can use a location counter. For each input statement, this counter is incremented by the amount of memory that the statement occupies. Whenever a label is encountered, it is recorded in the symbol table. The address assigned is the value of the location counter at the beginning of the statement.
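The location counter logic can be sketched in C as follows. This is a minimal sketch under simplifying assumptions: statements are assumed to be already reduced to an optional label and a size in bytes (the hypothetical Statement type), and the symbol table is a small fixed-size array searched linearly. The names SymbolTable, symtab_define, symtab_lookup, and pass1 are made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_NAME 32

typedef struct {
    char name[MAX_NAME];
    unsigned address;
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    size_t count;
} SymbolTable;

/* Hypothetical pre-scanned statement: an optional label plus the number
   of bytes the statement occupies in memory. */
typedef struct {
    const char *label;   /* NULL when the statement has no label */
    unsigned size;
} Statement;

/* Return the entry for name, or NULL if it has not been defined. */
const Symbol *symtab_lookup(const SymbolTable *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return &t->entries[i];
    return NULL;
}

/* Record name at address. Returns 0 on success, or -1 if the name is
   too long, is already defined, or the table is full. */
int symtab_define(SymbolTable *t, const char *name, unsigned address)
{
    if (strlen(name) >= MAX_NAME || t->count == MAX_SYMBOLS
        || symtab_lookup(t, name) != NULL)
        return -1;
    strcpy(t->entries[t->count].name, name);
    t->entries[t->count].address = address;
    t->count++;
    return 0;
}

/* Pass 1: assign an address to every label. Returns the number of
   labels that could not be recorded. */
int pass1(const Statement *stmts, size_t n, SymbolTable *table)
{
    unsigned location = 0;   /* the location counter */
    int errors = 0;

    table->count = 0;
    for (size_t i = 0; i < n; i++) {
        /* A label gets the counter value at the start of its statement. */
        if (stmts[i].label != NULL
            && symtab_define(table, stmts[i].label, location) != 0)
            errors++;
        location += stmts[i].size;
    }
    return errors;
}
```

After pass1 returns, pass 2 can call symtab_lookup for any label, including labels defined later in the source than the instruction that refers to them.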

Most assemblers need to do some processing of assembler directives to determine the size of the data involved. Modern RISC processors have fixed instruction lengths, so machine instructions require very little processing during pass 1.

Some assemblers keep data and instructions in separate regions of memory. If this is done then two separate location counters are used, one for data and one for instructions.

The symbol table could be treated as a submodule of the pass 1 module. This choice has little if any effect on the complexity of coding, but a separately compiled submodule does facilitate separate testing.

Generating output: pass 2

The primary effort in pass 2 is translating instructions into machine code. If the assembler mixes data and instructions in the same area of memory, then translation of data must be done in pass 2. For assemblers that use separate areas of memory for data and instructions, translation of data could be moved into pass 1. This is somewhat advantageous in that it results in a better balance of the complexities of the two pass modules.

Most of the error reports generated by an assembler are generated during pass 2. These reports can be interleaved with assembler listing output so that the assembly language programmer can readily associate an error report with the code that caused it.

While pass 2 is running, machine code is saved in a byte array (two arrays if data and instructions are kept separate). If there are no errors, then at the end of pass 2 each array is written to a file in binary form. It can also be displayed as a hexadecimal dump for the assembly language programmer. The Assembler Output web page describes C programming techniques for saving binary data in an array and writing the array to a file. There is enough complexity involved in handling the binary data arrays that a separate submodule could be used.
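One way such a submodule might look in C is sketched below. It is not taken from the Assembler Output page; the names CodeBuffer, emit_word, write_binary, and hex_dump are hypothetical, and the 32-bit word size and big-endian byte order are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CODE_MAX 4096

typedef struct {
    uint8_t bytes[CODE_MAX];
    size_t length;           /* number of bytes emitted so far */
} CodeBuffer;

/* Append a 32-bit word, most significant byte first (an assumption).
   Returns 0 on success, -1 if the buffer is full. */
int emit_word(CodeBuffer *buf, uint32_t word)
{
    if (buf->length + 4 > CODE_MAX)
        return -1;
    buf->bytes[buf->length++] = (uint8_t)(word >> 24);
    buf->bytes[buf->length++] = (uint8_t)(word >> 16);
    buf->bytes[buf->length++] = (uint8_t)(word >> 8);
    buf->bytes[buf->length++] = (uint8_t)word;
    return 0;
}

/* Write the buffer to path in binary form. Returns 0 on success. */
int write_binary(const CodeBuffer *buf, const char *path)
{
    FILE *f = fopen(path, "wb");   /* "b" matters on some platforms */
    if (f == NULL)
        return -1;
    size_t written = fwrite(buf->bytes, 1, buf->length, f);
    int close_failed = (fclose(f) != 0);
    return (written == buf->length && !close_failed) ? 0 : -1;
}

/* Print the buffer as hexadecimal, 16 bytes per line. */
void hex_dump(const CodeBuffer *buf, FILE *out)
{
    for (size_t i = 0; i < buf->length; i++) {
        int end_of_line = (i % 16 == 15) || (i + 1 == buf->length);
        fprintf(out, "%02x%c", (unsigned)buf->bytes[i],
                end_of_line ? '\n' : ' ');
    }
}
```

Pass 2 would call emit_word once per translated instruction and call write_binary only if no errors were reported.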

For large instruction sets, a Table-Driven Design is useful. In this approach, instructions are classified according to their operand types. This classification information, along with other coding information, is stored in a table. In a language like C that allows initialization of arrays, the table does not require any runtime code for its construction. It is just an initialized array. The pass 2 module uses the information in the table to determine the kind of information it seeks from the scanner, and the order of that information.
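A small C sketch of such a table follows. The mnemonics, opcode values, and operand classes are invented for illustration and do not describe any real instruction set. The names OperandClass, InstrInfo, instr_table, and find_instr are likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operand classes: what pass 2 should ask the scanner for. */
typedef enum {
    OPS_NONE,          /* no operands */
    OPS_REG_REG_REG,   /* three registers */
    OPS_REG_IMM,       /* register, then immediate value */
    OPS_LABEL          /* a single label */
} OperandClass;

typedef struct {
    const char *mnemonic;
    uint8_t opcode;
    OperandClass operands;
} InstrInfo;

/* The table is just an initialized array; no runtime code builds it. */
static const InstrInfo instr_table[] = {
    { "add",  0x01, OPS_REG_REG_REG },
    { "li",   0x02, OPS_REG_IMM },
    { "jmp",  0x03, OPS_LABEL },
    { "halt", 0x00, OPS_NONE },
};

/* Return the table entry for mnemonic, or NULL if it is unknown. */
const InstrInfo *find_instr(const char *mnemonic)
{
    size_t n = sizeof instr_table / sizeof instr_table[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(instr_table[i].mnemonic, mnemonic) == 0)
            return &instr_table[i];
    return NULL;
}
```

Pass 2 can then switch on the operands field of the entry it finds, so supporting a new instruction with an existing operand class only requires a new row in the table.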

A table-driven design could also be used for handling directives, in pass 1 as well as in pass 2. However, if the number of assembler directives is small, a table is less important for directives than it is for machine instructions.

Odds and ends: the main program

The main program in an assembler is not complex. The primary work is providing file parameters for function calls to the other modules and passing an error boolean from pass 1 to pass 2 so that an executable output is not generated when there is an error in the assembly language source.
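The control flow of the main program might be sketched in C as follows. The pass1 and pass2 functions here are empty stand-ins with hypothetical signatures, and the function name assemble is also made up. The sketch assumes that pass 2 still runs after a pass 1 error so that it can report further errors, which is only one possible policy.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real pass modules. Each returns true
   when it found an error in the assembly language source. */
static bool pass1(FILE *source)
{
    (void)source;
    return false;
}

static bool pass2(FILE *source, bool write_output, const char *output_path)
{
    (void)source;
    (void)write_output;   /* false means: report errors, write no file */
    (void)output_path;
    return false;
}

/* Run both passes over source_path. Returns 0 on success, 1 on failure. */
int assemble(const char *source_path, const char *output_path)
{
    FILE *source = fopen(source_path, "r");
    if (source == NULL) {
        fprintf(stderr, "cannot open %s\n", source_path);
        return 1;
    }

    bool error = pass1(source);
    rewind(source);   /* pass 2 reads the same input from the start */

    /* Pass the error indication on, so that no executable output is
       written when pass 1 has already failed. */
    error = pass2(source, !error, output_path) || error;

    fclose(source);
    return error ? 1 : 0;
}
```

A real main function would take the file names from its command-line arguments and return the result of assemble as its exit status.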

Communication between Modules

The diagram below indicates the communication pathways between the modules of an assembler. For each arrow in the diagram, the module at the tail of the arrow plays the role of a client and the module at the head of the arrow plays the role of a server. This means that the client calls functions provided by the server.

The communication between the main program and the two pass modules is quite simple: the main program just calls the pass 1 and pass 2 functions, directing them to do their work in the proper order. The functions do not return any data except for possibly an error indication.

The communication between the main program and the scanner is also simple. The name of the file to be assembled is known directly in the main program. The main program either passes the name to the scanner or opens the file and passes it to the scanner. This could be done indirectly through the pass 1 and pass 2 modules.

The communication between pass 1 and pass 2 involves symbol table information. Pass 2 needs to get addresses for labels and values of defined constants from the symbol table. Although there is a fair amount of communication, the interface is simple. It can be a standard table interface.

The scanner is the communication focal point of an assembler. All of the other modules communicate with the scanner. The communication between the pass modules and the scanner is more complex than the communication along other pathways. For this reason, the best place to start working on assembler communication is the scanner client interface. This is an important aspect of assembler design. It cannot be taken lightly.