Language Translation

Language translation generally involves at least two major components.

A scanner (or lexical analyser)
A scanner groups characters into tokens, which are low-level units of meaning in the source text.
A parser
A parser builds an internal representation of the text structure. This representation is called a parse tree.

If the interpreted language has variables, types, functions, or procedures then an interpreter needs an additional type of major component.

Symbol tables
A symbol table is used to record information about identifiers. An identifier is a type of token that is used for naming variables, types, functions, and procedures.

A scanner (or lexical analyser) groups characters into tokens, which are low-level units of meaning in the source code. Tokens are similar to words and punctuation marks in natural languages.

A scanner typically provides an iterator interface for use by the parser. The interface has an implicit current position in the source code text, which is either at a token or beyond the end of the text. It provides operations for

determining if there is a token at the current position,
classifying the token at the current position,
retrieving the value of the token at the current position, and
advancing the current position.

Scanners will often omit the first operation. Instead, they return a special end token for the token at the current position when the current position is beyond the end of the text. For some languages the scanner may also need to be able to retrieve tokens beyond the current position without advancing. This is called look ahead.
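
The iterator interface described above can be sketched in Java as follows. This is a minimal illustration, not a production scanner: the names TokenKind and SimpleScanner, the three token categories, and the rule that any other non-blank character is a one-character operator are all assumptions made for the example. It keeps the first operation (hasToken) and also reports a special END kind, so either style of end-of-text handling can be seen.

```java
// Hypothetical token kinds for a tiny expression language (an assumption).
enum TokenKind { IDENTIFIER, NUMBER, OPERATOR, END }

final class SimpleScanner {
    private final String text;
    private int pos = 0;       // index just past the current token
    private TokenKind kind;    // classification of the current token
    private String value;      // source text of the current token

    SimpleScanner(String text) {
        this.text = text;
        advance();             // position the scanner on the first token
    }

    // Is there a token at the current position?
    boolean hasToken() { return kind != TokenKind.END; }

    // Classify the token at the current position.
    TokenKind kind() { return kind; }

    // Retrieve the value of the token at the current position.
    String value() { return value; }

    // Advance the current position to the next token, or to END.
    void advance() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            kind = TokenKind.END;
            value = "";
            return;
        }
        int start = pos;
        char c = text.charAt(pos);
        if (Character.isLetter(c) || c == '_') {
            while (pos < text.length()
                    && (Character.isLetterOrDigit(text.charAt(pos)) || text.charAt(pos) == '_')) {
                pos++;
            }
            kind = TokenKind.IDENTIFIER;
        } else if (Character.isDigit(c)) {
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
                pos++;
            }
            kind = TokenKind.NUMBER;
        } else {
            pos++;             // any other character is a one-character operator
            kind = TokenKind.OPERATOR;
        }
        value = text.substring(start, pos);
    }
}
```

A parser would typically loop while hasToken() is true, inspect kind() and value(), and call advance() once it has consumed the token. Calling advance() at the end of the text simply leaves the scanner at END.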

A symbol table is used to record information about identifiers, one of the types of tokens in the source code. Identifiers are used for naming variables, types, and subprograms, so the symbol tables must be capable of recording different kinds of information. The symbol tables must also be structured to reflect the different kinds of name scopes in the language.

Many high-level languages have some form of block structuring, which determines a variety of nested scopes for the identifiers. Since the meaning of an identifier can depend on the scope, symbol tables are usually organized as a stack of simple tables. A simple table is pushed onto the stack when parsing enters a new scope and popped when parsing leaves that scope. When information about an identifier is needed, the simple tables are scanned from top to bottom for entries about the identifier. The first entry encountered is the one that is used. In short, a Chain of Responsibility design pattern is used, with the chaining following the nesting.
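
The stack-of-tables organization can be sketched in Java as follows. The class name ScopedSymbolTable and its method names are hypothetical, and the information recorded for each identifier is reduced to a single string; a real symbol table would store richer entries.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class ScopedSymbolTable {
    // The innermost scope is at the head of the deque (the top of the stack).
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopedSymbolTable() {
        enterScope();          // start with one outermost (global) scope
    }

    // Called when parsing enters a new scope.
    void enterScope() { scopes.push(new HashMap<>()); }

    // Called when parsing leaves the current scope.
    void leaveScope() { scopes.pop(); }

    // Record information about an identifier in the innermost scope.
    void declare(String name, String info) { scopes.peek().put(name, info); }

    // Scan the simple tables from top to bottom; the first entry found wins.
    Optional<String> lookup(String name) {
        for (Map<String, String> scope : scopes) {
            String info = scope.get(name);
            if (info != null) {
                return Optional.of(info);
            }
        }
        return Optional.empty();
    }
}
```

With this sketch, an inner declaration of x hides an outer one until leaveScope() is called, after which lookup("x") finds the outer entry again.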

In a typed, class-based object-oriented language, each type has an associated scope. In the following Java code example, there is a scope for the System class, in which the member out is declared. Since out is declared to have type PrintStream, there is another scope for the PrintStream type. The println() method is defined in this last scope.

      System.out.println("Hello, world!");

A compiler uses a symbol table for each of these type scopes. There is a type table that contains all of these symbol tables as entries. The entries are keyed by the type name.
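
A type table of this kind might be sketched as a map from type names to per-type symbol tables. The class TypeTable and its methods are hypothetical, and each member entry records only the name of the member's declared type (for a method, its return type), which is a deliberate simplification.

```java
import java.util.HashMap;
import java.util.Map;

final class TypeTable {
    // type name -> (member name -> name of the member's declared type)
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    // Record a member in the symbol table for the given type.
    void declareMember(String typeName, String member, String memberType) {
        tables.computeIfAbsent(typeName, k -> new HashMap<>()).put(member, memberType);
    }

    // Resolve a chain such as System.out.println by moving from one type
    // scope to the next. Returns the declared type of the last member,
    // or null if any step of the chain is unknown.
    String resolve(String typeName, String... members) {
        String current = typeName;
        for (String member : members) {
            Map<String, String> scope = tables.get(current);
            if (scope == null) {
                return null;
            }
            current = scope.get(member);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

For the example above, out would be declared in the System table with type PrintStream, and println in the PrintStream table, so resolving the chain visits both type scopes in turn.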

Modern parsers usually build an internal structural representation of the source code. This representation is called a parse tree. To understand parse trees, we need to look first at some grammar definitions. We will look at definitions for expressions and statements in the C programming language. These definitions are only an approximation to the truth.

In the following syntax definitions the color red indicates a token type and the color blue indicates a nonterminal type.

An expression is a sequence of one or more terms separated by + and - operators
A term is a sequence of one or more factors separated by *, /, and % operators
A factor is one of the following:
- An identifier-expression
- A numeric-literal
- An expression between ( and )
An identifier-expression is an identifier followed by optional parameters
A parameters is a parameter-list enclosed by ( and )
A parameter-list is a sequence of zero or more expressions separated by , tokens
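
Each of these definitions maps naturally onto one method of a recursive-descent recognizer. The Java sketch below only checks whether a text conforms to the grammar; it does not build a parse tree. The class name ExpressionRecognizer, the regular expression used to split the text into tokens, and the restriction of identifiers and numbers to ASCII are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpressionRecognizer {
    // Identifier, unsigned integer literal, or any single non-blank character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|[0-9]+|\\S");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private ExpressionRecognizer(String text) {
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    // True if the whole text is a single expression under the grammar.
    static boolean accepts(String text) {
        ExpressionRecognizer r = new ExpressionRecognizer(text);
        return r.expression() && r.pos == r.tokens.size();
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private boolean take(String t) {
        if (peek().equals(t)) { pos++; return true; }
        return false;
    }

    // expression: one or more terms separated by + and -
    private boolean expression() {
        if (!term()) return false;
        while (take("+") || take("-")) {
            if (!term()) return false;
        }
        return true;
    }

    // term: one or more factors separated by *, /, and %
    private boolean term() {
        if (!factor()) return false;
        while (take("*") || take("/") || take("%")) {
            if (!factor()) return false;
        }
        return true;
    }

    // factor: identifier-expression, numeric-literal, or ( expression )
    private boolean factor() {
        String t = peek();
        if (take("(")) return expression() && take(")");
        if (t.matches("[0-9]+")) { pos++; return true; }
        if (t.matches("[A-Za-z_][A-Za-z_0-9]*")) { pos++; return parameters(); }
        return false;
    }

    // parameters are optional: ( parameter-list ), where the list may be empty
    private boolean parameters() {
        if (!take("(")) return true;     // no parameters present
        if (take(")")) return true;      // empty parameter-list
        do {
            if (!expression()) return false;
        } while (take(","));
        return take(")");
    }
}
```

Under these assumptions, ExpressionRecognizer.accepts("f(a, b+1) - (3 % n)") returns true, while accepts("x +") returns false because a term is missing after the operator.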

In the following syntax definitions the color red indicates a token type and the color blue indicates a nonterminal type. These definitions are far from complete.

A statement is one of the following:
- A block-statement
- An assignment-statement
- An if-statement
- A while-statement
- A for-statement
- A switch-statement
- A break-statement
- A continue-statement
- A return-statement
A block-statement is a sequence of statements between { and }
A while-statement has the following form:
- while parenthesized-expression statement

A parse tree is a structural representation of a particular source text as interpreted in a specific grammar. It has node types for each type of grammar component in the language. Each subtree corresponds to a structural component of the source code. There are two general types of nodes.

Leaves -
The leaves of a parse tree are tokens from the source text. They are Leaves in the Composite design pattern and TerminalExpressions in the Interpreter design pattern.
Interior Nodes -
The interior nodes of a parse tree are roots of subtrees corresponding to higher-level structures in the source text. The type of an interior node indicates the grammar rule that governs the children of that node. Interior nodes are Composites in the Composite design pattern and NonterminalExpressions in the Interpreter design pattern.
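
The two node types can be sketched in Java following the Composite pattern. The names ParseNode, TokenLeaf, and InteriorNode are hypothetical, the grammar rule of an interior node is represented simply by its name, and the only operation provided is one that reassembles the source text covered by a subtree.

```java
import java.util.List;

// Component role in the Composite pattern.
interface ParseNode {
    // The source text covered by this subtree (whitespace is not preserved).
    String text();
}

// Leaf: a single token from the source text.
final class TokenLeaf implements ParseNode {
    private final String kind;     // token type, for example "identifier"
    private final String value;    // token text, for example "x"

    TokenLeaf(String kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    String kind() { return kind; }

    @Override
    public String text() { return value; }
}

// Composite: an interior node whose type names the grammar rule
// that governs its children.
final class InteriorNode implements ParseNode {
    private final String rule;
    private final List<ParseNode> children;

    InteriorNode(String rule, List<ParseNode> children) {
        this.rule = rule;
        this.children = children;
    }

    String rule() { return rule; }

    List<ParseNode> children() { return children; }

    @Override
    public String text() {
        StringBuilder sb = new StringBuilder();
        for (ParseNode child : children) {
            sb.append(child.text());   // delegate to each subtree in order
        }
        return sb.toString();
    }
}
```

One plausible tree for x*y under the expression grammar is a term node whose children are a factor subtree for x, a leaf for the * token, and a factor subtree for y. Calling text() on that term node would then return "x*y".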

The diagram to the left shows a simple parse tree for the simple expression x*y. Leaves (tokens) are shown in red and interior nodes are shown in blue.

The construction of parse tree nodes for the numeric expressions grammar is shown below, along with diagrams of typical partial parse trees.

Parse Tree Construction

Constructing Factor Nodes