Assembler Organization

Two important steps are involved in designing software: dividing the software into smaller, more manageable components, and determining how the components will communicate. For both of these steps, the designer needs to keep the major kinds of software functionality in mind.

In this web page we will first look at a simple assembler - one that does not expand pseudoinstructions and does not deal with programs assembled from multiple assembly language files. This gives rise to a design that was typical of early assemblers and was especially well suited to machines with limited memory.

Later, we will look at how an assembler needs to be redesigned to deal with more complicated requirements. The resulting organization is closer to that needed for more general language translation applications such as compilers and web browsers.

Responsibilities

A simple assembler has four major responsibilities.

Generate Machine Code

This is the primary purpose of an assembler. The assembler usually generates a file copy of data and machine instructions that will later be loaded into a computer in preparation for execution. The file copy must also contain a starting address - the address of the first instruction to be executed.

Provide Error Information

Assembly language programming is difficult, and assembly language programmers make mistakes. Some of these mistakes can be caught by the assembler. The ease of assembly language programming depends to a large extent on the quality of the assembler's error messages.

Provide Machine Code Information

The assembler cannot catch all programmer errors. Some can only be detected by executing the assembled program. The assembler can, however, provide information about the machine code that aids the programmer in debugging runtime errors. For difficult debugging problems, the programmer may need to know what code was generated and where data and instructions are located in memory.

Assign Memory

Early assemblers forced programmers to assign memory addresses for all data and keep track of addresses assigned for instructions. Modern assemblers allow programmers to use symbols (usually statement labels) to represent addresses for data or for branch and jump targets. This makes the programmer's life much simpler. However, addresses are still required for machine code generation. Thus the assembler must pick up the responsibility of assigning addresses to program symbols. These assignments must be remembered for use when the symbols appear in instruction operands.

Assembler Design

Much of the breakdown of an assembler into components is driven by three major considerations.

Assembly Passes

The simplest organization for an assembler is a two-pass organization. The need for this organization arises from consideration of the assembler functions and the nature of assembly language code.

The memory address for a label cannot be assigned until its definition has been read, but code generation for jumps and branches requires knowing the address of the jump or branch target, which may be defined later in the source. Thus the assembler makes two passes through the input. During the first pass, the assembler assigns memory addresses to labels. Machine code is generated during the second pass.
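The forward reference problem can be illustrated with a minimal sketch. The toy program, the statement layout (label, mnemonic, operand), and the fixed 4-byte statement size are all assumptions made for illustration; they do not describe any particular assembly language.

```python
# Hypothetical toy program: each statement is (label, mnemonic, operand).
PROGRAM = [
    (None,   "jump", "done"),   # forward reference: "done" is defined later
    ("loop", "add",  None),
    (None,   "jump", "loop"),   # backward reference: "loop" is already known
    ("done", "halt", None),
]

WORD = 4  # assumed fixed statement size in bytes


def unresolved_in_one_pass(program):
    """Return the operands whose labels are not yet defined when first read."""
    seen = set()
    unresolved = []
    for label, _mnemonic, operand in program:
        if label is not None:
            seen.add(label)
        if operand is not None and operand not in seen:
            unresolved.append(operand)
    return unresolved


def two_pass_targets(program):
    """Pass 1 assigns label addresses; pass 2 resolves every operand."""
    addresses = {}
    for index, (label, _mnemonic, _operand) in enumerate(program):
        if label is not None:
            addresses[label] = index * WORD
    return [addresses[operand]
            for _label, _mnemonic, operand in program
            if operand is not None]


print(unresolved_in_one_pass(PROGRAM))  # ['done']
print(two_pass_targets(PROGRAM))        # [12, 4]
```

A single pass fails only on the forward jump; once all label addresses have been collected, both jumps can be resolved.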

Dealing with Characters

The input to an assembler consists of a stream of characters which represent assembly information in several different ways: integers, real numbers, labels, quoted characters and strings, register names, and various kinds of punctuation. If character processing is mixed in with algorithms for building symbol tables or algorithms for code generation, the result is a complex mess that would challenge even the best programmers. The problem is that this organization (or disorganization) forces programmers to deal with two distinct kinds of abstraction simultaneously. The result is difficult algorithm development, a large number of errors, and difficulty in debugging. These problems are magnified if the assembler requires maintenance at a later time.

A good general design principle is to assign responsibility for different kinds of abstractions to different program components. This principle is crucial for large designs, but is a good practice for smaller designs as well. Dividing responsibilities for different kinds of abstractions into different components allows programmers to focus their attention on one aspect of the problem at a time.

For an assembler, the implication is that there should be a component dedicated to handling text at the level of characters. This component is called a lexical analyser or scanner.

Saved Information

An assembler needs to save information generated in early passes that is needed in later passes. As in more complex language translators, the saved information is about labels (symbols) that appear in the source code. For an assembler the information is simple - just addresses - so a simple table suffices. The component that holds it is called a symbol table.

Assembler Components

Thus an assembler has four primary components: a scanner, a symbol table, a pass 1 component, and a pass 2 component. These are coordinated by a fifth component, a main program.

Lexical analysis: the scanner

The primary purpose of the scanner component is processing characters into higher level units that are more meaningful for the pass 1 and pass 2 components. These units are called tokens. The process of forming these groups is called lexical analysis. The need for a scanner arises in most programs that deal with complex input.
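A scanner of this kind can be sketched with regular expressions. The token kinds, the "$name" register syntax, and the "#" comment character below are assumptions chosen for illustration, not the rules of a specific assembly language.

```python
import re

# Hypothetical token classes; earlier entries take priority over later ones.
TOKEN_SPEC = [
    ("COMMENT",  r"#[^\n]*"),
    ("STRING",   r'"[^"\n]*"'),
    ("REGISTER", r"\$[A-Za-z0-9]+"),
    ("INTEGER",  r"-?\d+"),
    ("NAME",     r"[A-Za-z_.][A-Za-z0-9_.]*"),
    ("PUNCT",    r"[:,()]"),
    ("SKIP",     r"[ \t]+"),
    ("ERROR",    r"."),          # any other single character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})"
                               for name, pattern in TOKEN_SPEC))


def scan_line(line):
    """Turn one source line into a list of (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind = match.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue  # whitespace and comments carry no meaning for the passes
        tokens.append((kind, match.group()))
    return tokens


print(scan_line("loop: addi $t0, $t0, -1  # count down"))
```

The pass 1 and pass 2 components then work only with tokens such as ("REGISTER", "$t0") and never inspect individual characters.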

The Symbol Table

An assembler needs to assign memory for its instructions and explicitly declared data. An assembly language program contains label definitions that mark the memory locations. The labels can be used elsewhere in the program to refer to the marked locations.

The symbol table is a simple table structure whose entries contain memory addresses keyed by the program labels. When the assembler determines the address for a label it adds an entry to the symbol table. When it encounters a use of the label it can look up the address in the symbol table.
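A minimal sketch of such a table follows. The class name, the method names, and the choice to reject duplicate definitions are assumptions for illustration.

```python
class SymbolTable:
    """Maps program labels to memory addresses."""

    def __init__(self):
        self._addresses = {}

    def define(self, label, address):
        """Record the address of a label; a label may be defined only once."""
        if label in self._addresses:
            raise ValueError(f"duplicate label: {label}")
        self._addresses[label] = address

    def lookup(self, label):
        """Return the address of a label, or None if it was never defined."""
        return self._addresses.get(label)


table = SymbolTable()
table.define("loop", 0x0004)
print(table.lookup("loop"))     # 4
print(table.lookup("missing"))  # None
```

Returning None for an undefined label leaves it to the caller, typically pass 2, to report the error next to the offending statement.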

Populating the Symbol Table: Pass 1

During pass 1, the input is read and memory addresses are assigned to program labels. Because memory is allocated sequentially, the pass 1 component can track the next free address with a location counter. For each input statement, this counter is incremented by the size of the memory allocated. Whenever a label definition is encountered, it is recorded in the symbol table. The address assigned is the value of the location counter at the beginning of the statement.

Most assemblers need to do some processing of assembler directives to determine the size of the data involved. Modern RISC processors have fixed instruction lengths, so machine instructions require very little processing during pass 1.
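The location counter and the sizing of directives can be sketched as follows. The 4-byte instruction size, the ".word" and ".space" directive names, and the pre-tokenized statement layout (label, mnemonic, operands) are assumptions for illustration.

```python
INSTRUCTION_SIZE = 4  # assumed fixed instruction length in bytes


def statement_size(mnemonic, operands):
    """Return the number of bytes a statement occupies."""
    if mnemonic == ".word":      # hypothetical directive: one 4-byte word per operand
        return 4 * len(operands)
    if mnemonic == ".space":     # hypothetical directive: reserve N bytes
        return int(operands[0])
    return INSTRUCTION_SIZE      # every machine instruction has the same size


def pass1(statements, origin=0):
    """Assign an address to every label; return (symbols, errors)."""
    symbols = {}
    errors = []
    location = origin
    for label, mnemonic, operands in statements:
        if label is not None:
            if label in symbols:
                errors.append(f"duplicate label: {label}")
            else:
                symbols[label] = location  # counter value at start of statement
        if mnemonic is not None:           # a line may hold only a label
            location += statement_size(mnemonic, operands)
    return symbols, errors


source = [
    ("start", "li",     ["$t0", "3"]),
    ("loop",  "addi",   ["$t0", "$t0", "-1"]),
    (None,    "bne",    ["$t0", "$zero", "loop"]),
    ("table", ".word",  ["1", "2", "3"]),
    ("buf",   ".space", ["10"]),
    ("end",   None,     []),
]
print(pass1(source)[0])
# {'start': 0, 'loop': 4, 'table': 12, 'buf': 24, 'end': 34}
```

Only the directives need their operands examined here; the instructions contribute a constant amount to the counter.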

Some assemblers keep data and instructions in separate regions of memory. If this is done then two separate location counters are used, one for data and one for instructions.
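A sketch of the two-counter arrangement is shown below. The ".text" and ".data" directives used to switch regions, the origin addresses, and the statement layout (label, mnemonic, size in bytes) are assumptions for illustration.

```python
def pass1_two_counters(statements, text_origin=0x0000, data_origin=0x1000):
    """Assign label addresses using one location counter per memory region."""
    counters = {"text": text_origin, "data": data_origin}
    current = "text"
    symbols = {}
    for label, mnemonic, size in statements:
        if mnemonic in (".text", ".data"):
            current = mnemonic[1:]   # switch the active location counter
            continue
        if label is not None:
            symbols[label] = counters[current]
        counters[current] += size
    return symbols


source = [
    (None,   ".data",  0),
    ("x",    ".word",  4),
    ("y",    ".word",  8),
    (None,   ".text",  0),
    ("main", "lw",     4),
    ("loop", "addi",   4),
    (None,   ".data",  0),
    ("z",    ".space", 16),
]
print(pass1_two_counters(source))
# {'x': 4096, 'y': 4100, 'main': 0, 'loop': 4, 'z': 4108}
```

Each counter resumes where it left off when its region is selected again, so data declared in several places still ends up contiguous.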

The symbol table could be treated as a subcomponent of the pass 1 component. This choice has little if any effect on the complexity of coding, but a separately compiled subcomponent does facilitate separate testing.

Generating Output: Pass 2

The primary effort in pass 2 is translating instructions into machine code. If the assembler mixes data and instructions in the same area of memory then translation of data must be done in pass 2. For assemblers that use separate areas of memory for data and instructions, translation of data could be moved into pass 1. This is somewhat advantageous in that it better balances the complexity of the two pass components.

Most of the error reports generated by an assembler are generated during pass 2. These reports can be interleaved with assembler listing output so that the assembly language programmer can readily associate an error report with the code that caused it.

While pass 2 is running, machine code is saved in a byte array (two arrays if data and instructions are kept separate). If there are no errors then at the end of pass 2 the contents of the array or arrays are written to a file in binary form. They can also be displayed in hexadecimal form as a listing for the assembly language programmer.
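The byte array and the hexadecimal listing can be sketched as follows. The 4-byte word size, the big-endian byte order, the listing layout, and the function names are assumptions for illustration.

```python
def emit_word(code, word):
    """Append one 32-bit word to the code array (big-endian by assumption)."""
    code.extend(word.to_bytes(4, "big"))


def hex_listing(code, origin=0, width=4):
    """Return listing lines of the form 'address  bytes', both in hexadecimal."""
    lines = []
    for offset in range(0, len(code), width):
        chunk = code[offset:offset + width]
        lines.append(f"{origin + offset:08x}  {chunk.hex()}")
    return lines


def write_output(code, had_errors, path):
    """Write the binary file only when assembly produced no errors."""
    if had_errors:
        return False
    with open(path, "wb") as out:
        out.write(code)
    return True


code = bytearray()
emit_word(code, 0x2008002A)
emit_word(code, 0x0000000C)
for line in hex_listing(code):
    print(line)
# 00000000  2008002a
# 00000004  0000000c
```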

For large instruction sets, a table-driven design is useful. In this approach, instructions are classified according to their operand types. This classification information, along with other coding information, is stored in a table. The pass 2 component uses the information in the table to determine the kind of information it needs from the scanner, and the order of that information.
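A minimal sketch of a table-driven operand handler follows. The instruction set, the opcode values, the operand kinds, and the "rN" register syntax are all hypothetical; a real table would also carry the encoding details for each instruction format.

```python
# Hypothetical table: mnemonic -> (opcode, operand kinds in source order).
INSTRUCTION_TABLE = {
    "add":  (0x01, ("reg", "reg", "reg")),
    "addi": (0x02, ("reg", "reg", "imm")),
    "jump": (0x03, ("label",)),
    "halt": (0x04, ()),
}


def decode_operands(mnemonic, operands, symbols):
    """Check operands against the table; return (opcode, numeric values)."""
    if mnemonic not in INSTRUCTION_TABLE:
        raise ValueError(f"unknown instruction: {mnemonic}")
    opcode, kinds = INSTRUCTION_TABLE[mnemonic]
    if len(operands) != len(kinds):
        raise ValueError(f"{mnemonic} expects {len(kinds)} operand(s)")
    values = []
    for kind, text in zip(kinds, operands):
        if kind == "reg":
            if not (text.startswith("r") and text[1:].isdigit()):
                raise ValueError(f"expected a register, found {text}")
            values.append(int(text[1:]))
        elif kind == "imm":
            values.append(int(text, 0))   # accepts decimal or 0x-prefixed hex
        else:  # kind == "label"
            if text not in symbols:
                raise ValueError(f"undefined label: {text}")
            values.append(symbols[text])
    return opcode, values


print(decode_operands("addi", ["r1", "r2", "-1"], {}))   # (2, [1, 2, -1])
print(decode_operands("jump", ["loop"], {"loop": 8}))    # (3, [8])
```

Adding an instruction with an existing operand pattern then requires only a new table entry, not new code.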

A table-driven design could also be used for handling directives, in pass 1 as well as pass 2. However, if the number of assembler directives is small then a table is less important for directives than it is for machine instructions.

Coordinating the Primary Components: The Main Program

The main program in an assembler is not complex. The primary work is providing file parameters for function calls to the other components and dealing with errors. Typically pass 1 returns an error boolean that can be forwarded to pass 2 so that an executable output is not generated when there is an error in the assembly language source.
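The coordination can be sketched as below, with deliberately tiny stand-ins for the two passes. The source syntax ("label: mnemonic operand"), the 4-byte statement size, and the form of the output are assumptions for illustration; only the flow of the error flag follows the description above.

```python
WORD = 4  # assumed fixed statement size in bytes


def split_statement(line):
    """Split 'label: mnemonic operand' into (label, words); label may be None."""
    label, separator, rest = line.partition(":")
    if not separator:
        return None, line.split()
    return label.strip(), rest.split()


def pass1(lines, symbols):
    """Fill the symbol table; return True if an error was found."""
    had_error = False
    for index, line in enumerate(lines):
        label, _words = split_statement(line)
        if label is None:
            continue
        if label in symbols:
            had_error = True          # duplicate label definition
        else:
            symbols[label] = index * WORD
    return had_error


def pass2(lines, symbols, had_error):
    """Resolve operands; produce output only if no pass found an error."""
    output = []
    for line in lines:
        _label, words = split_statement(line)
        if not words:
            continue
        resolved = []
        for operand in words[1:]:
            if operand in symbols:
                resolved.append(symbols[operand])
            else:
                had_error = True      # undefined label
        output.append((words[0], resolved))
    return had_error, (None if had_error else output)


def assemble(lines):
    """Main program: run the passes in order and forward the error flag."""
    symbols = {}
    had_error = pass1(lines, symbols)
    return pass2(lines, symbols, had_error)


print(assemble(["jump done", "loop: nop", "jump loop", "done: halt"]))
# (False, [('jump', [12]), ('nop', []), ('jump', [4]), ('halt', [])])
print(assemble(["jump nowhere"]))
# (True, None)
```

Pass 2 still runs after a pass 1 error so that it can report further errors, but it suppresses the output.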

Communication between Components

The communication between the main program and the two pass components is quite simple - the main program just calls pass 1 or 2 functions directing them to do their work in the proper order. The functions do not return any data except for an error indication.

The communication between the main program and the scanner is also simple. The name of the file to be assembled is known directly in the main program. The main program either passes the name to the scanner or opens the file and passes it to the scanner. This could be done indirectly through the pass 1 and pass 2 components.

The communication between pass 1 and pass 2 is indirect through the symbol table. Pass 2 needs to get addresses for labels and values of defined constants from the symbol table. Although there is a fair amount of communication, the interface is simple. It is a standard table interface.

Complexities

The design described above is suitable for a simple assembly language. It has the advantage that the only stored information is the symbol table. This reduces the amount of memory used, which was crucial in early processors.

Today, reducing the memory footprint is not as important. Using more memory to save results of earlier processing can simplify design. This is especially important when software evolves to support more complex functionality. We will consider one kind of change that is common to many kinds of application that deal with language translation: retaining an internal representation of the structures represented by a language. The need for this is exemplified by a need for pseudoinstruction expansion in most modern assembly languages. Pseudoinstruction expansion results in a need for additional passes through the assembly language code. The strategy for handling these additional passes leads to a design that can be adapted for more complex language translation applications.
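The effect of an expansion pass on a stored representation can be sketched as follows. The Statement record, the pass names, and the choice of a MIPS-style "li" that expands into "lui" followed by "ori" are assumptions for illustration. The point is only that expansion changes statement counts, so label addresses must be computed on the expanded representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Statement:
    """One partially processed source statement (hypothetical representation)."""
    label: Optional[str]
    mnemonic: str
    operands: List[str]


def expand_pass(statements):
    """Replace each pseudoinstruction with the real instructions it stands for."""
    expanded = []
    for st in statements:
        if st.mnemonic == "li":   # assumed pseudoinstruction: load a 32-bit constant
            reg, value = st.operands[0], int(st.operands[1], 0)
            upper, lower = (value >> 16) & 0xFFFF, value & 0xFFFF
            expanded.append(Statement(st.label, "lui", [reg, str(upper)]))
            expanded.append(Statement(None, "ori", [reg, reg, str(lower)]))
        else:
            expanded.append(st)
    return expanded


def address_pass(statements, size=4):
    """Assign label addresses, assuming fixed-size statements."""
    return {st.label: index * size
            for index, st in enumerate(statements)
            if st.label is not None}


program = [
    Statement("start", "li", ["$t0", "0x12345678"]),
    Statement("next", "add", ["$t1", "$t0", "$t0"]),
]
print(address_pass(program))               # {'start': 0, 'next': 4}
print(address_pass(expand_pass(program)))  # {'start': 0, 'next': 8}
```

Because each pass takes a list of statements and returns a list or a table, further passes can be added without rereading the source text.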

Pseudoinstructions

Modern RISC processors have a limited instruction set. Assemblers augment the instruction set with pseudoinstructions. These pseudoinstructions can differ from the machine instructions in at least three ways:

Strategy

The general strategy for dealing with complexities is to add a structure for representing partially processed source code. Each assembler pass is designed to modify this representation in various ways:

Adaptation

The multiple pass organization is used by most modern compilers. Most high-level language compilers use a parse tree as the structure for representing partially processed code.

The parse tree is not only useful in compilers. Modern integrated development environments (IDEs) also build a parse tree to represent the source code. The parse tree can be used for more than just code generation. Modern IDEs can also use a parse tree for automatic formatting of source code.

Parse trees are usually designed to support a Visitor design pattern. This lets designers add new kinds of functionality without having to modify the parse tree code.
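A minimal sketch of the Visitor pattern on a very small tree follows. The node classes and the two visitors are hypothetical; a real parse tree has many more node types.

```python
class Number:
    """Leaf node holding an integer literal."""

    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        return visitor.visit_number(self)


class Add:
    """Interior node representing left + right."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def accept(self, visitor):
        return visitor.visit_add(self)


class Evaluator:
    """One kind of functionality: compute the value of the tree."""

    def visit_number(self, node):
        return node.value

    def visit_add(self, node):
        return node.left.accept(self) + node.right.accept(self)


class Formatter:
    """Another kind of functionality: produce formatted source text."""

    def visit_number(self, node):
        return str(node.value)

    def visit_add(self, node):
        return f"({node.left.accept(self)} + {node.right.accept(self)})"


tree = Add(Number(1), Add(Number(2), Number(3)))
print(tree.accept(Evaluator()))  # 6
print(tree.accept(Formatter()))  # (1 + (2 + 3))
```

Formatter was added without touching Number or Add, which is the property that makes the pattern attractive for tools such as formatters built on top of an existing parse tree.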