Machine Coding and Translation


Coding Data

All information that we can communicate to other people is coded in some way. This is readily apparent when you consider our concepts of the physical world. To someone in France, our "two" is coded as "deux". Here we have two different codings for the same number. We have a single idea, but two different words for expressing it. One person can even use two codings for the same idea: "2" and "two" are both common English codes for the same idea.

Moreover, we use two different common media for coding. All of our words have both a written and a spoken version. We have a graphical system for coding information as visual images, and a phonetic system for coding information as sounds.

Like humans, computers need to have coding schemes for representing information. The following subsections describe some coding schemes for computer information.

Boolean Data

Boolean data has two possible values: true or false. In some contexts, you can also think of boolean data as having values yes or no.

For computers, boolean data is the fundamental type of data because it is so easy to code. Most computer circuits just use two different voltage levels, one for true and one for false. Since it is so important for computers, a unit of boolean data has a name: it is called a bit.

Character Data

Characters in a computer are coded as a group of bits. Over the history of computing, several different schemes have been used. Most of the schemes died out over time so that by the early 1990s there was only one major survivor called ASCII coding. ASCII character coding used a group of 8 bits to code a character. With 8 bits, each true or false, you can have 28 = 256 distinct possible values. This is more than enough to have a separate code for each upper and lower case English letter, each digit, and each common punctuation mark. Since the ASCII code wads used widely, the 8 bit grouping got its own name: it is called a byte.

More recently, software developers have made an effort to deal with alphabets other than the English alphabet. This has lead to a new character coding scheme called Unicode. The Java programming language was the first major language to use Unicode for character coding.

String Data

String data refers to sequences of characters. Computers make use of the same idea that we use in written text: just put character codes into a sequence of memory locations. For processing the characters in the sequence, it is necessary to know where it ends. There are two common conventions that deal with this need: prefixing the characters with an integer specifying the length, and appending a special character at the end. The C programming language and some of its derivatives use the latter approach, using a zero byte as a terminator.

Whole Number Data

From the beginning, computers have dealt with whole numbers, or integers, as an important part of their work. Except for the very earliest days of computing, a binary coding scheme has been used for dealing with integers. This scheme is similar to our decimal coding except that it is based on powers of 2 instead of powers of 10. In binary coding, each binary "digit" has two possible values, 0 or 1. These binary digits play roles similar to our decimal digits 0 through 9, each being a multiplier for some power of 2.

There are two powerful reasons for using binary coding for integers. First, since each binary digit has two possible values, we can code each binary digit as a bit. In fact, the name "bit" originated as an abbreviation of "binary digit". The second reason is that the circuitry for doing binary arithmetic is much simpler and faster than circuitry for doing decimal arithmetic.

Just as humans use different ways of coding integers, so do computers. Part of the need for this arises from that fact that computers must be able to deal with human coding in order to be useful. When we send integer data to a program, we are sending in a string of decimal digits. When we look at computer integer output, we expect to see a string of decimal digits. In modern computers, there is no built-in capability for converting between strings of decimal digits and binary code, but it is easy to do the conversion using a sequence of basic arithmetic instructions. Compilers can automatically insert code for doing the conversion where it is needed.

Another reason for using different codings is dealing with numbers of different sizes. The size of the largest integer that you can represent depends on the number of bits that you use in the binary coding. To simplify the arithmetic circuitry, computers are designed to work with a small nuber of fixed size binary codings. The most common size in use today is 4 bytes (32 bits). This size gives you the capability of dealing with any 9 digit decimal integer. Most modern computers can also handle 8 byte (64 bit) integers. This size gives you the capability of dealing with any 18 digit decimal integer.

Fractional Decimal Number Data

In early computers there were numerous ways of coding fractional decimal numbers, which created problems when there was a need for transferring data from one machine to another. In 1985, the Institute of Electrical and Electronics Engineers (IEEE) wrote a proprosed standard, IEEE 754, with the aim of standardizing fractional decimal number coding. By the mid 1990s this standard was in use in all of the major commercial computers.

The IEEE 754 standard defines several coding schemes. All are based on a binary version of scientific notation. These schemes differ in their precision and range. The precision is the accuracy of representation, which can be specified as the number of significant digits. The range is the size of numbers that can be represented. The two most widely used are single precision numbers with a 4 byte (32 bit) code and double precision with an 8 byte (64 bit) code.

Machine Instructions

Machine instructions are the basic operations that can be performed by a computer processor. Most processors today support only a small set of simple machine instructions that can be combined in various ways to perform more complex operations. For example, with a single instruction you can add two numbers or multiply two numbers. Combined with instructions to support repetitive operations, these instructions can be used to do complex processing of tabular information. There are also machine instructions that can cause the processor to execute or not execute a sequence of instructions, based on a boolean condition. These instructions allow a program to take different actions depending on the values of the data that it is processing.

There are several popular families of computer processors in use today. Among the most popular are Sun Microsystems Sparc family, Intel's Pentium family, and the Motorola-IBM PowerPC family. Each of these families uses a different scheme of machine instruction coding. The modern philosophy of computer processor design dictates that machine instruction coding be chosen to make the processor run as fast as possible. Translation technology, described in the next section, has advance to the point that there is little need to standardize the machine language.

Data Types and Programming

The data in a computer is usually stored in memory without any indication of its coding scheme. Four bytes of memory could represent four characters, or a C/C++/Java int, or a C/C++/Java float. There is no way of telling what the four bytes mean just by looking at their bits. That is, computer data is untyped.

Computer operations, on the other hand, are necessarily typed. The bit manipulations involved in adding two integers are different from the bit manipulions involved in adding two real numbers due to the differences in the way that they are coded.

When a programmer is programming in a high-level language, such as C, C++, or Java, the compiler assigns locations for varibles and maintains a symbol table with information about data locations and types. Programming can also be done in assembly language, which is closer to the language of the machine. Since the computer data is not typed, it is the responsibility of an assembly language programmer to keep track of data types and use the appropriate operation for the type.

Program Translation

Modern programming languages attempt to give programmers the capability of doing complex things with a computer, while writing instructions for the computer in a language close to their own natural language. For example, most programmers are comfortable enough with standard mathematical language to use expessions such as 1/3*l*w*h (the volume of a pyramid with base length l, base width w, and height h). They do not care to know the sequence of machine instructions that is needed to evaluate this expression, much less the machine coding for those instructions. Also, when a programmer wants to specify a numerical value, they prefer to specify it as a string of decimal digits rather than machine binary code. Furthermore, a programmer has no interest in the character coding used for the decimal digits. Thus programming languages require powerful mechanisms for translating the language of a programmer into a language that the machine understands.

Historically, there have been two primary approaches to the problem of translating programming languages into machine language: compilers and interpreter. These two approaches can also be combined into a two-step approach using virtual machines. These approaches and and optimization technique are described in the following subsections.

Compilers

The distinguishing characteristic of a compiler is that it is used once to translate a program into machine language. The machine language is saved in a file so that it can be executed as many times as you want without having to be translated again. This allows program execution at the full speed of the computer processor by eliminating the translation overhead. The drawback is that the translated program can only run on the family of processors for which the compiler was designed.

Interpreters

The distinguishing characteristic of an interpreter is that it translates a program into machine language every time that the program is executed. There is no saved translation. If an interpreter is written for each machine then you can move the program from one machine to another and still expect it to execute correctly. It is important to note that even though you need a different interpreter on each machine, the interpreter is the only software that is not portable. Programs written in the language for which the interpreter is designed are portable.

The drawback of interpreters is that the program is translated into machine instructions every time it is executed. This can slow down execution by a factor of a few hundred.

Virtual Machines

A virtual machine is a machine that is implemented in software rather than hardware. The idea is to introduce an artificial machine language that is close to the language of real processors. Then the translation overhead for converting from the artificial (or virtual) machine language to a real machine language can be reduced significantly.

With a virtual machine, program translation is a two-step process. First, a program is compiled into the language of the virtual machine. When the program is executed, the virtual machine language is interpreted into the language of the real machine. This gives you the portability of an interpreter with signicantly reduced execution overhead. With a virtual machine interpreter, the execution slowdown can be reduced to a factor of 5 to 25, depending on the nature of the program.

The only non-portable software in a program language that uses a virtual machine is the virtual machine interpreter. It is considerable simpler that an interpreter that translates directly from the programming language to real machine language because the virtual machine is already close to real machine language. The compiler from the programming language to the virtual machine language can be itself be compiled to virtual machine language so that it is portable to any machine with a virtual machine interpreter.

Historically, the idea of a virtual machine has been around for about 30 years. In the 1970s, the Pascal programming language used virtual machines on some platforms to allow easy porting of Pascal programs from one machine to another. One virtual machine language, called P-code, was used to supprt development of Pascal programs on early microprocessors. In addition to the portability advantage, P-code was exceptionally compact, an important consideration given the small memory capacities at that time.

Just-In-Time Compilation

Portability considerations are a powerful motivation for using interpreters. However, even with a virtual machine, the translation overhead for interpretation may be too much for some programs. This has lead to a lot of research on how to reduce the overhead. Within the past few years, one idea has moved from the realm of research to the practical realm: Just-In-Time (JIT) compilation.

A JIT compiler is a component of an interpreter, compiling virtual machine language (usually) into real machine language. Unlike a normal compiler, it does not save the compilation results in a file. Instead, they are saved in memory only for the duration of a program's execution. When a potion of code is first executed, it is first translated to real machine language, just in time to be executed. The next time that portion of code is executed, the saved machine language instructions are executed directly.

You can get a pretty good idea of why JIT compilation is so effective by considering the speed of modern processors and the size of the programs that they execute. Recent processors are capable of executing about four billion instructions per second. A large program could contain 10 million instructions. This means that in order to execute for one second, each instruction must be executed 400 times on the average. Generally, programs that take a reasonable time to develop can only have significant run time problems if they are executing some portions of their code a large number of times.

With JIT compilation, you only have translation overhead the first time a portion of code is executed. After that, you are executing the code at full speed. When you average the run time over all of the times that the code portion is executed, the overhead is just a small percentage.

Mixing Languages

The translation overhead of a virtual machine can also be reduced by defining a standardized way of incorporating code that is compiled into the machine language of the underlying machine. For example, the Java language has a Java Native Interface (JNI) standard that defines how code compiled from other languages can be incorporated into a Java program. Using JNI, a Java programmer can write most of the code in Java but use a small amount of C or assembly language code for the time-critical parts of the program.


Page URL: http://www.d.umn.edu/~gshute/asm/coding.html
Page Author: Gary Shute
Last Modified: Tuesday, 26-Jun-2007 12:29:08 CDT
Comments to: gshute@d.umn.edu