Machine Coding

All information that we can communicate to other people is coded in some way. This is readily apparent when you consider our concepts of the physical world. To someone in France, our "two" is coded as "deux". Here we have two different codings for the same number. We have a single idea, but two different words for expressing it. One person can even use two codings for the same idea: "2" and "two" are both common English codes for the same idea.

Moreover, we use two different common media for coding. All of our words have both a written and a spoken version. We have a graphical system for coding information as visual images, and a phonetic system for coding information as sounds.

Like humans, computers need to have coding schemes for representing information. Most of the following sections describe some coding schemes for computer information. The last section deals with additional issues that arise in communication; that is, when information is transferred from one context to another.

Boolean data has two possible values: true or false. In some contexts, you can also think of boolean data as having values "yes" or "no". People who work with the circuitry in a computer often think of the values as 1 or 0.

For computers, boolean data is the fundamental type of data because it is so easy to code. Most computer circuits just use two different voltage levels, one for true and one for false. Since it is so important for computers, a unit of boolean data has a name: it is called a bit. Modern computers organize their memory into groups of 8 bits. These groups are called bytes.

Characters in a computer are coded as a group of bits. Over the history of computing, several different schemes have been used. Most of the schemes died out over time so that by the early 1990s there was only one major survivor called ASCII coding. ASCII defines 128 character codes, which need only 7 bits, but each character is normally stored in a byte. With 8 bits in a byte, each true or false, you can have 2⁸ = 256 distinct possible values. This is more than enough to have a separate code for each upper and lower case English letter, each digit, and each common punctuation mark.
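As a small illustration of character coding, the following Python sketch shows the bit pattern stored for a few ASCII characters. The function name ascii_bits is made up for this example; ord and format are standard Python built-ins. ASCII codes run from 0 to 127, so the sketch rejects anything larger.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit pattern used to store one ASCII character in a byte."""
    code = ord(ch)                 # numeric code assigned to the character
    if code > 127:                 # ASCII only defines codes 0 through 127
        raise ValueError("not an ASCII character")
    return format(code, "08b")     # pad to the 8 bits of one byte

# Letters, digits, and punctuation each get their own code.
for ch in "Aa7?":
    print(ch, ord(ch), ascii_bits(ch))

# A byte of 8 two-valued bits can take this many distinct values.
print(2 ** 8)
```

Note that the digit character "7" is coded as 55, not as the number 7. The character code and the integer value are different codings.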

More recently, software developers have made an effort to deal with alphabets other than the English alphabet. This has led to a new multi-byte character coding scheme called Unicode. The Java programming language was the first major language to use Unicode for character coding. Unicode is now supported by most major programming languages.
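To see what "multi-byte" means in practice, here is a hedged Python sketch that counts the bytes needed for a few characters. It uses UTF-8 and UTF-16, two common byte encodings of Unicode that the text above does not name. The function name encoded_sizes is an assumption made for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return, per character, its Unicode code point and encoded byte counts."""
    sizes = {}
    for ch in text:
        sizes[ch] = {
            "code_point": ord(ch),                        # Unicode number
            "utf8_bytes": len(ch.encode("utf-8")),        # variable length
            "utf16_bytes": len(ch.encode("utf-16-be")),   # 2 or 4 bytes
        }
    return sizes

# An English letter, an accented letter, and the euro sign.
for ch, info in encoded_sizes("Aé€").items():
    print(ch, info)
```

In UTF-8 the English letter still fits in one byte with its ASCII code, while the other two characters need more than one byte.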

String data refers to sequences of characters. Computers make use of the same idea that we use in written text: just put character codes into a sequence of memory locations. For processing the characters in the sequence, it is necessary to know where it ends. There are two common conventions that deal with this need: prefixing the characters with an integer specifying the length, and appending a special character at the end. The C programming language and some of its derivatives use the latter approach, using a zero byte as a terminator.
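The two conventions can be sketched in Python as follows. All four function names are invented for this example, and the one-byte length prefix (which limits strings to 255 characters) is an assumption chosen to keep the sketch short. Real systems often use wider length fields.

```python
def encode_length_prefixed(s: str) -> bytes:
    """Code a string as a one-byte length followed by its character codes."""
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("too long for a one-byte length prefix")
    return bytes([len(data)]) + data

def decode_length_prefixed(buf: bytes) -> str:
    """Read the length first, then take exactly that many characters."""
    length = buf[0]
    return buf[1:1 + length].decode("ascii")

def encode_zero_terminated(s: str) -> bytes:
    """Code a string as its character codes followed by a zero byte (C style)."""
    data = s.encode("ascii")
    if 0 in data:
        raise ValueError("a zero byte inside the string would end it early")
    return data + b"\x00"

def decode_zero_terminated(buf: bytes) -> str:
    """Scan forward to the first zero byte; everything before it is the string."""
    end = buf.index(0)
    return buf[:end].decode("ascii")

print(encode_length_prefixed("hi"))
print(encode_zero_terminated("hi"))
```

The length-prefixed form knows where the string ends without scanning it. The terminated form has no length limit but cannot contain the terminator itself.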

From the beginning, computers have dealt with whole numbers, or integers, as an important part of their work. Except for the very earliest days of computing, a binary coding scheme has been used for dealing with integers. This scheme is similar to our decimal coding except that it is based on powers of 2 instead of powers of 10. In binary coding, each binary "digit" has two possible values, 0 or 1. These binary digits play roles similar to our decimal digits 0 through 9, each being a multiplier for some power of 2.

The binary integer coding scheme comes in two varieties: an unsigned scheme for nonnegative integers and a signed scheme that includes negative numbers.
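A rough Python sketch of both varieties follows. The text does not say which signed scheme is meant. This example assumes two's complement, the scheme used by virtually all current processors. The function names and the 8-bit default width are choices made for illustration.

```python
def to_unsigned_bits(n: int, width: int = 8) -> str:
    """Code a nonnegative integer as a fixed-width string of binary digits."""
    if not 0 <= n < 2 ** width:
        raise ValueError("value does not fit in the given width")
    return format(n, f"0{width}b")

def from_unsigned_bits(bits: str) -> int:
    """Decode by treating each digit as a multiplier for a power of 2."""
    value = 0
    for b in bits:
        value = value * 2 + (1 if b == "1" else 0)
    return value

def to_signed_bits(n: int, width: int = 8) -> str:
    """Two's complement (assumed): negative n is coded as n + 2**width."""
    if not -(2 ** (width - 1)) <= n < 2 ** (width - 1):
        raise ValueError("value does not fit in the given width")
    return format(n % 2 ** width, f"0{width}b")

def from_signed_bits(bits: str) -> int:
    """A leading 1 marks a negative number in two's complement."""
    value = from_unsigned_bits(bits)
    if bits[0] == "1":
        value -= 2 ** len(bits)
    return value

print(to_unsigned_bits(13))   # 8 + 4 + 1
print(to_signed_bits(-13))
```

The same 8 bits mean different numbers under the two schemes, so a program must know which coding was used.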

There are two powerful reasons for using binary coding for integers. First, since each binary digit has two possible values, we can code each binary digit as a bit. In fact, the name "bit" originated as an abbreviation of "binary digit". The second reason is that the circuitry for doing binary arithmetic is much simpler and faster than circuitry for doing decimal arithmetic.

Just as humans use different ways of coding integers, so do computers. Part of the need for this arises from the fact that computers must be able to deal with human coding in order to be useful. When we send integer data to a program, we are sending in a string of decimal digits. When we look at computer integer output, we expect to see a string of decimal digits. In modern computers, there is no built-in capability for converting between strings of decimal digits and binary code, but it is easy to do the conversion using a sequence of basic arithmetic instructions. Compilers can automatically insert code for doing the conversion where it is needed.
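The conversion described above can be sketched with nothing more than multiplication, addition, and division. This is an illustrative Python version for nonnegative integers only. The names parse_decimal and format_decimal are made up for this example, and real library routines also handle signs, whitespace, and errors more carefully.

```python
def parse_decimal(text: str) -> int:
    """Convert a string of decimal digit characters to an integer."""
    value = 0
    for ch in text:
        digit = ord(ch) - ord("0")     # character code -> digit value
        if not 0 <= digit <= 9:
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + digit     # shift left one decimal place, add digit
    return value

def format_decimal(value: int) -> str:
    """Convert a nonnegative integer to a string of decimal digit characters."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 10)   # peel off the lowest digit
        digits.append(chr(ord("0") + remainder))
    return "".join(reversed(digits))           # digits were produced low to high

print(parse_decimal("1985") + 1)
print(format_decimal(754))
```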

Another reason for using different codings is dealing with numbers of different sizes. The size of the largest integer that you can represent depends on the number of bits that you use in the binary coding. To simplify the arithmetic circuitry, computers are designed to work with a small number of fixed-size binary codings. The most common size in use today is 4 bytes (32 bits). This size gives you the capability of dealing with any 9-digit decimal integer. Most modern computers can also handle 8-byte (64-bit) integers. This size gives you the capability of dealing with any 18-digit decimal integer.
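The digit counts quoted above can be checked with a short calculation. This Python sketch assumes signed two's complement integers, which the text does not state explicitly. The function names are invented for the example.

```python
def signed_range(bits: int) -> tuple:
    """Smallest and largest values of a signed two's complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def guaranteed_decimal_digits(bits: int) -> int:
    """Largest d such that every d-digit decimal integer fits in the range."""
    _, largest = signed_range(bits)
    d = 0
    # 10**(d+1) - 1 is the largest integer with d+1 decimal digits.
    while 10 ** (d + 1) - 1 <= largest:
        d += 1
    return d

for bits in (32, 64):
    print(bits, signed_range(bits), guaranteed_decimal_digits(bits))
```

Some 10-digit numbers do fit in 32 bits, but not all of them, which is why the guarantee stops at 9 digits.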

In early computers there were numerous ways of coding numbers with fractional parts, which created problems when there was a need for transferring data from one machine to another. In 1985, the Institute of Electrical and Electronics Engineers (IEEE) published a standard, IEEE 754, with the aim of standardizing the coding of these numbers. By the mid-1990s this standard was in use in all of the major commercial computers.

The IEEE 754 standard defines several coding schemes. All are based on a binary version of scientific notation. These schemes differ in their precision and range. The precision is the accuracy of representation, which can be specified as the number of significant digits. The range is the size of numbers that can be represented. The two most widely used are single precision numbers with a 4 byte (32 bit) code and double precision with an 8 byte (64 bit) code.
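As a hedged illustration, the Python sketch below pulls apart the bits of a single precision number using the standard struct module. The field layout it assumes (1 sign bit, 8 exponent bits, 23 fraction bits) is that of the IEEE 754 32-bit format and is not spelled out in the text above. The function names are made up for this example.

```python
import struct

def float_fields(x: float) -> tuple:
    """Split the 32-bit single precision code of x into (sign, exponent, fraction)."""
    # Pack as a 4-byte float, then reinterpret the same bytes as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                 # 1 bit
    exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = raw & 0x7FFFFF        # 23 bits after the implied leading 1
    return sign, exponent, fraction

def round_trip_single(x: float) -> float:
    """Store x in 4 bytes and read it back, losing any precision that does not fit."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(float_fields(1.0))
print(float_fields(-2.5))
print(len(struct.pack(">f", 0.1)), len(struct.pack(">d", 0.1)))
print(round_trip_single(0.1) == 0.1)
```

The last line compares a value squeezed through single precision with the double precision value Python normally uses, which shows the difference in precision between the two codes.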

Machine instructions are the basic operations that can be performed by a computer processor. Most processors today support only a small set of simple machine instructions that can be combined in various ways to perform more complex operations. For example, with a single instruction you can add two numbers or multiply two numbers. Combined with instructions to support repetitive operations, these instructions can be used to do complex processing of tabular information. There are also machine instructions that can cause the processor to execute or not execute a sequence of instructions, based on a boolean condition. These instructions allow a program to take different actions depending on the values of the data that it is processing.

There are several popular families of computer processors in use today. Among the most popular are Sun Microsystems' SPARC family, Intel's Pentium family, and the Motorola-IBM PowerPC family. Each of these families uses a different scheme of machine instruction coding. The modern philosophy of computer processor design dictates that machine instruction coding be chosen to make the processor run as fast as possible. Program translation technology has advanced to the point that there is little need to standardize the machine language.

Much of the data that we work with has a structure consisting of simpler components. In most cases the structure can be built up from simple types of data using a combination of two types of data sequences that are provided in most programming languages.

Arrays are one type of data sequence. In most programming languages they have fixed length and all of the items in the sequence have the same type. This means that each of the items in the sequence is coded in the same way. Strings, for example, are arrays of characters. The items are usually distinguished by an integer called an index.

The other type of sequence goes by different names in different programming languages: records, structs, and objects. Whatever they are called, they can contain data items with different types (encodings). The items are usually given different names. The coding of a machine instruction can be viewed as a struct with one data item specifying the operation to be performed and others specifying the data that will be used in the operation and where the result should be put.
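The view of an instruction as a struct can be sketched in Python. The 16-bit layout here (four 4-bit fields) and the opcode number for ADD are entirely hypothetical. They do not correspond to any real processor and are chosen only to keep the example small.

```python
from typing import NamedTuple

class Instruction(NamedTuple):
    """A hypothetical instruction: one named item per field."""
    opcode: int   # which operation to perform
    dest: int     # register that receives the result
    src1: int     # register holding the first operand
    src2: int     # register holding the second operand

ADD = 1  # hypothetical opcode number

def encode(ins: Instruction) -> int:
    """Pack the four 4-bit fields into one 16-bit word."""
    for field in ins:
        if not 0 <= field <= 15:
            raise ValueError("each field must fit in 4 bits")
    return (ins.opcode << 12) | (ins.dest << 8) | (ins.src1 << 4) | ins.src2

def decode(word: int) -> Instruction:
    """Recover the named fields from a 16-bit word."""
    return Instruction((word >> 12) & 0xF, (word >> 8) & 0xF,
                       (word >> 4) & 0xF, word & 0xF)

word = encode(Instruction(ADD, dest=3, src1=1, src2=2))  # "r3 = r1 + r2"
print(format(word, "016b"), decode(word))
```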

Arrays and records, structs, or objects are coded by concatenating the codes for their data items to form a longer string of bits. Part of the reason for having types in programming languages is to ensure that machine code generated by a compiler is able to access components of structured data and use the appropriate coding for the components.
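The concatenation idea can be demonstrated with Python's standard struct module. The record layout below (a 32-bit integer, a double, and one character) is invented for illustration. The "<" prefix asks for no padding between items, whereas real compilers often insert padding bytes for alignment.

```python
import struct

# A hypothetical record: student id (4 bytes), score (8 bytes), grade (1 byte).
record = struct.pack("<ids", 42, 91.5, b"A")

# The record's code is just the item codes placed one after another.
pieces = struct.pack("<i", 42) + struct.pack("<d", 91.5) + b"A"
print(record == pieces, len(record))

# An array of three 32-bit integers: every item is coded the same way,
# so item k starts at byte offset k * 4.
array = struct.pack("<3i", 10, 20, 30)

def array_item(buf: bytes, index: int) -> int:
    """Fetch one item using its index and the known item size."""
    (value,) = struct.unpack_from("<i", buf, index * 4)
    return value

print(array_item(array, 2))

# Knowing the types tells us where each record field lives and how to decode it.
(score,) = struct.unpack_from("<d", record, 4)
print(score)
```

Decoding the score requires knowing both its offset (after the 4-byte id) and its coding (an 8-byte double), which is the information a compiler gets from type declarations.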

Without communication, encoding data is of little use. At the very least, a computer needs to be able to communicate with humans who are using the computer.

Communication requires agreement on a language - a scheme for encoding data. But a language, by itself, is not enough to deal with all of the problems of communication.

Communication is characterized by a back-and-forth exchange of information between two or more parties. It is interactive in the sense that the information exchanged later in the communication may be dependent on information exchanged earlier. Hopefully, the answer given by a student in response to an instructor's question depends on the content of the question.

A protocol is a set of rules for dealing with additional problems arising in communication. A protocol may need to deal with the following issues:

How does one party respond to information sent by another party?
What happens if information is lost or garbled?
If several parties are sharing a communication channel, how do they determine who gets to use it next?
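To make the first two questions concrete, here is a toy Python sketch of one possible rule set. The sender appends a checksum, the receiver answers ACK or NAK, and the sender retransmits on NAK. Everything in it (the frame layout, the one-byte sum checksum, the function names, and the simulated channel) is an assumption made for illustration, not a description of any real protocol.

```python
def make_frame(payload: bytes) -> bytes:
    """Sender side: append a one-byte checksum so garbling can be detected."""
    return payload + bytes([sum(payload) % 256])

def receive(frame: bytes) -> tuple:
    """Receiver side: reply ACK with the payload, or NAK if the frame is bad."""
    if len(frame) < 1:
        return "NAK", None
    payload, checksum = frame[:-1], frame[-1]
    if sum(payload) % 256 != checksum:
        return "NAK", None
    return "ACK", payload

def send_with_retry(payload: bytes, channel, max_tries: int = 3) -> int:
    """Resend until the receiver acknowledges; return the number of attempts."""
    frame = make_frame(payload)
    for attempt in range(1, max_tries + 1):
        reply, _ = receive(channel(frame))
        if reply == "ACK":
            return attempt
    raise RuntimeError("no acknowledgement received")

def flaky_channel():
    """Simulated channel that garbles one bit of the first frame it carries."""
    calls = {"n": 0}
    def transmit(frame: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame
    return transmit

print(send_with_retry(b"hello", flaky_channel()))
```

A simple sum cannot catch every kind of damage, and this sketch does not address the third question of sharing a channel among several parties.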