Computer Science 1511
Computer Science I

Programming Assignment 6
Text Files (35 points)
Due Wednesday, November 22, 2000

Introduction

In this assignment you will read and analyze a text file. The problem is to read a text file character by character and print out a "token map" of the file, where tokens are meaningful objects from the file such as words, numbers, or quoted sequences of characters. In the token map you will print a message of the form Xlength for each word, number, or quoted sequence. For a word or number, length is the number of characters in it; for a quoted sequence, length is the number of characters between the quotes. X is a letter identifying the kind of token (in the sample output below, w marks a word, n a whole number, f a decimal number, and Q or q a quoted sequence).

After each (whole or decimal) number token you should print, in parentheses, the value of that number. Also, at the end of the program you should print two separate totals: the sum of all the whole number tokens in the file and the sum of all the decimal number tokens in the file. For example, suppose the text file contains the following:

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">
<title>We need a few good RatCog programmers!</title>
</head>
<p><a href="SCiP2000.pdf">Read a paper on RatCog</a> (<a
href="SCiP2000.pdf">in .pdf format</a>; <a href="SCiP2000.doc">in
Word format</a>)</p>
</body>
</html>
This is just some brazen test data. 123.9
1 87 963 665 1.9 9.107
"" '" '" "a" " "

Your program should produce the following response:

   1: w4 
   2: w4 
   3: w4 w4 w5 Q12 
   4: w7 Q29 
   5: w4 w4 Q9 w7 Q31 
   6: w5 w2 w4 w1 w3 w4 w6 w11 p w5 
   7: w4 
   8: w1 w1 w4 Q12 w4 w1 w5 w2 w6 w1 p w1 
   9: w4 Q12 w2 p w3 w6 w1 p w1 w4 Q12 w2 
  10: w4 w6 w1 p w1 
  11: w4 
  12: w4 
  13: w4 w2 w4 w4 w6 w4 w4 p f5(123.900002) 
  14: n1(1) n2(87) n3(963) n3(665) f3(1.900000) f5(9.107000) 
  15: q0 q0 q0 q1 q1 
Sum of whole numbers in file: 1716
Sum of decimal numbers in file: 134.906998

Press enter to finish...
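
The trailing digits in the sample output (123.900002 rather than 123.900000, and a decimal sum of 134.906998) are consistent with keeping the values in C's single-precision float type and printing them with the %f format. That is an inference from the sample, not something the assignment states. The sketch below, which uses the hypothetical helper names sum_single and format_fixed, shows how those digits can arise:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: adds the values one at a time in single precision.
   Using float here is an assumption made to match the sample output. */
float sum_single(const double *values, size_t count) {
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += (float)values[i];   /* each value is rounded to float first */
    }
    return total;
}

/* Hypothetical helper: writes value into buf the way printf("%f") prints it,
   that is, with six digits after the decimal point. */
void format_fixed(double value, char *buf, size_t size) {
    snprintf(buf, size, "%f", value);
}
```

Passing (float)123.9 to format_fixed should give 123.900002 on a typical machine, because 123.9 has no exact single-precision representation. If your program stores decimal values in a double instead, expect the last few digits to differ from the sample.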

Note that:

How To Proceed

I suggest that you proceed in stages:

While you are free to design your program any way you wish, you must follow good top-down design principles. For example, you might write your program so that each time it reads the start of a word, number, or quoted sequence, it calls a function (or functions) that reads to the end of that word, number, or quoted sequence.
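
One possible shape for such functions is sketched below in C. The names read_word and read_whole_number are hypothetical, and the sketch assumes that a word is a run of letters and a whole number is a run of digits; decimal numbers and quoted sequences are left out. Each function reads until it sees a character that does not belong to the token, then pushes that character back with ungetc so the caller can examine it.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: call after the first letter of a word has been read.
   Reads the rest of the word and returns the word's total length. */
int read_word(FILE *in) {
    int length = 1;                    /* the caller already read one letter */
    int c;
    while ((c = fgetc(in)) != EOF && isalpha(c)) {
        length++;
    }
    if (c != EOF) {
        ungetc(c, in);                 /* leave the next character unread */
    }
    return length;
}

/* Hypothetical helper: call with the first digit of a whole number.
   Reads the remaining digits, stores the value, and returns the length. */
int read_whole_number(FILE *in, int first_digit, long *value) {
    int length = 1;
    int c;
    *value = first_digit - '0';
    while ((c = fgetc(in)) != EOF && isdigit(c)) {
        *value = *value * 10 + (c - '0');
        length++;
    }
    if (c != EOF) {
        ungetc(c, in);
    }
    return length;
}
```

A main loop could then read one character at a time, decide from that character which kind of token is starting, and call the matching function. Keeping each kind of token in its own function makes it easier to test the pieces separately.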

What To Hand In

Hand in a lab report with a copy of each of your test data files and the output for each. Include a second copy of each test data file. On this copy, underline the different types of tokens using different colors of ink (for example, you might underline words in red, whole numbers in green, decimal numbers in black, character sequences in purple, and punctuation in pink). Also write the length of each word, number, or character sequence above it.

EXTRA CREDIT

2 extra points - allow two consecutive single quote characters to appear inside a word. For example, don''t would count as one word of length 5 rather than as a word of length 3 followed by an empty quoted sequence and a word of length 1. (Note that this change rules out empty quoted sequences written with the single quote character.)
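
As a rough illustration of this rule, the C sketch below measures a word held in a string. The function name word_length is hypothetical, and the sketch assumes that a pair of single quotes counts as one character and is accepted only when a letter comes before and after it, which is the reading under which don''t has length 5.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the word at the start of s.
   A pair of single quotes between letters is counted as one character. */
int word_length(const char *s) {
    int length = 0;
    size_t i = 0;
    while (s[i] != '\0') {
        if (isalpha((unsigned char)s[i])) {
            length++;
            i++;
        } else if (length > 0 && s[i] == '\'' && s[i + 1] == '\'' &&
                   isalpha((unsigned char)s[i + 2])) {
            length++;                  /* the quote pair counts once */
            i += 2;
        } else {
            break;                     /* anything else ends the word */
        }
    }
    return length;
}
```

With these assumptions, word_length("don''t") is 5, while a quote pair that is not followed by a letter ends the word before the quotes.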

2 extra points - allow a run of consecutive dashes (and ONLY dashes), such as -- (two consecutive dashes) or --- (three consecutive dashes), to be treated as a single punctuation mark. When a punctuation mark is longer than one character, your program should print p followed by the number of characters in the mark; a single-character punctuation mark is still printed as just p.
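
A minimal C sketch of this output rule follows. The names dash_run_length and format_punctuation are hypothetical, and the sketch works on a string rather than a file to keep it short.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: counts the consecutive dashes at the start of s. */
size_t dash_run_length(const char *s) {
    size_t n = 0;
    while (s[n] == '-') {
        n++;
    }
    return n;
}

/* Hypothetical helper: writes the token map entry for a punctuation mark.
   The length is appended only when the mark has more than one character. */
void format_punctuation(size_t length, char *buf, size_t size) {
    if (length > 1) {
        snprintf(buf, size, "p%zu", length);
    } else {
        snprintf(buf, size, "p");
    }
}
```

Under these assumptions a run of three dashes would appear in the map as p3, and a lone punctuation character would still appear as p.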