Computer Science 1511
Computer Science I
Programming Assignment 6
Text Files (35 points)
Due Wednesday, November 22, 2000
Introduction
In this assignment you will read and analyze a text file. The problem is to
read a text file character by character and print out a "token map" of the
file, where tokens are meaningful objects from the file such as words, numbers,
or quoted sequences of characters. In the token map you will print a message of
the form Xlength for each word, number, or quoted sequence. Here length is the
number of characters in the word or number, or the number of characters between
the quotes for a quoted sequence, and X is:
- 'w' - if the token is a word
- 'n' - if the token is a whole number
- 'f' - if the token is a decimal number
- 'q' - if the token is 0 or 1 characters between single or double
quotes
- 'Q' - if the token is 2 or more characters between single or double
quotes
- 'p' - for punctuation other than as used above
After each (whole or decimal) number token you should print out, between
parentheses, the value of that number. Also, at the end of the program you
should print out the sum of all the whole number tokens in the file and,
separately, the sum of all the decimal number tokens in the file. For example,
suppose the text file contains the following:
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">
<title>We need a few good RatCog programmers!</title>
</head>
<p><a href="SCiP2000.pdf">Read a paper on RatCog</a> (<a
href="SCiP2000.pdf">in .pdf format</a>; <a href="SCiP2000.doc">in
Word format</a>)</p>
</body>
</html>
This is just some brazen test data. 123.9
1 87 963 665 1.9 9.107
"" '" '" "a" " "
Your program should produce the following response:
1: w4
2: w4
3: w4 w4 w5 Q12
4: w7 Q29
5: w4 w4 Q9 w7 Q31
6: w5 w2 w4 w1 w3 w4 w6 w11 p w5
7: w4
8: w1 w1 w4 Q12 w4 w1 w5 w2 w6 w1 p w1
9: w4 Q12 w2 p w3 w6 w1 p w1 w4 Q12 w2
10: w4 w6 w1 p w1
11: w4
12: w4
13: w4 w2 w4 w4 w6 w4 w4 p f5(123.900002)
14: n1(1) n2(87) n3(963) n3(665) f3(1.900000) f5(9.107000)
15: q0 q0 q0 q1 q1
Sum of whole numbers in file: 1716
Sum of decimal numbers in file: 134.906998
Press enter to finish...
Note that:
- Words consist of sequences of characters 'A' to 'Z' or 'a' to 'z'.
- Whole numbers are sequences of characters consisting of digit characters
'0' to '9'.
- A decimal number is a whole number followed by a '.' followed by a whole
number. The '.' in a decimal number does not count as a punctuation character
in our token map.
- Ignore preceding signs (+/-) on both whole and decimal numbers.
- A quoted sequence is a quote, followed by any number of non-quote
characters (including punctuation and white space but not including the
newline) followed by another quote (the second quote does not have to be the
same type). Consider a newline to terminate a quote if the newline occurs
before the ending quote.
- Punctuation characters are: . , ! ? ; : ( )
- White space (spaces, newlines, and tab characters) should be ignored,
except as above. (Newlines, of course, are also used to determine when the
next line starts.)
- You do not have to deal with any other characters.
- You will read the data from a test file that you create. You should create
more than one test file to test all aspects of your program code.
- You may write the "token map" to an output text file.
- You may (if you wish) print an extra line number at the end of the output.
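The character classes in the notes above can be captured in a few small test functions. The following is only a sketch in C: the function names (is_word_char, is_digit_char, is_quote_char, is_punct_char) are hypothetical and are not required by the assignment.

```c
#include <string.h>

/* Hypothetical helper names; the assignment does not prescribe them. */

/* A word character is a letter 'A' to 'Z' or 'a' to 'z'. */
int is_word_char(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* A digit character is '0' to '9'. */
int is_digit_char(int c)
{
    return c >= '0' && c <= '9';
}

/* Either quote character may open or close a quoted sequence. */
int is_quote_char(int c)
{
    return c == '\'' || c == '"';
}

/* The punctuation set from the notes: . , ! ? ; : ( )
   The '\0' check is needed because strchr also matches the terminator. */
int is_punct_char(int c)
{
    return c != '\0' && strchr(".,!?;:()", c) != NULL;
}
```

Characters such as '<', '>', '/', '=' and '-' in the sample file fall into none of these classes, so a main loop built on these tests would simply skip them.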
How To Proceed
I suggest that you proceed in stages:
- Implement the program given in class to count the number of
characters in a line (it is in the on-line class notes) and run it to get
practice.
- Change the program to print out the line number 1 before the first line
and then print out a new line number after each newline character is
encountered.
- Change the program to print out a w every time a word starts, an
n every time a whole number starts, an f every time a decimal
number starts, a q every time a quoted sequence occurs, and a p
every time you encounter punctuation. (You are at the start of a word or
number if the previous character is NOT part of a word, a number, or a quoted
sequence.)
- Add code to count the length of the word or number or quoted sequence (you
will need mechanisms to determine that a word or number or quoted sequence
continues to be read). Print out these lengths at the end of the
word/number/quoted sequence.
- When printing out the length of quoted sequences containing more than 1
character, print out Q, then the length for the multiple character sequence.
- While processing a whole number add code to determine the integer
"value" of the number (see the end of the Chapter 7 notes). Print this value
after printing out the length of numbers.
- While processing a decimal number, add code to determine the float
"value" of the number. You can use the whole number conversion on the whole
number portions of the number before and after the decimal point. Assume that
int X is the whole number portion before the decimal point and int Y is the
whole number portion after the decimal point. Also assume that int M is 10 if
there is 1 digit after the decimal point, 100 if there are 2 digits, 1000 if
there are 3 digits, etc.
You can then convert: float F = (float) X + (float) Y / (float) M;
- Add code to keep two separate running sums: one of the whole number values
and one of the decimal number values.
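The two number conversions in the stages above can be sketched as follows. This is a minimal illustration, not the method from the class notes: the function names add_digit and decimal_value are hypothetical, and m is assumed to be the power of ten that matches the number of digits read after the decimal point.

```c
/* Hypothetical helper: fold one more digit character into a running
   whole number value, e.g. 96 and '3' give 963. */
int add_digit(int value, char digit)
{
    return value * 10 + (digit - '0');
}

/* Hypothetical helper: combine the two whole number portions of a
   decimal number.
   x = value of the digits before the '.'
   y = value of the digits after the '.'
   m = 10 for one digit after the '.', 100 for two, 1000 for three, ... */
float decimal_value(int x, int y, int m)
{
    return (float) x + (float) y / (float) m;
}
```

For 9.107 this gives x = 9, y = 107 and m = 1000. Because m is driven by the count of digits read rather than by the size of y, an input such as 1.05 (y = 5, m = 100) also comes out right.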
While you are free to design your program any way you wish, you must follow
good top-down design principles. For example, you might write your program so
that each time the start of a word, number, or quoted sequence is read, a
function (or functions) is called to read to the end of that word, number, or
quoted sequence.
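As one possible illustration of this kind of design, the sketch below shows a function that a main loop could call right after it has read the first letter of a word. The name read_word and the use of ungetc to hand the following character back to the main loop are assumptions made for this example, not requirements.

```c
#include <stdio.h>

/* Hypothetical helper: the caller has already read the first letter of a
   word from in. Read to the end of the word and return its total length. */
int read_word(FILE *in)
{
    int length = 1;   /* counts the letter the caller already consumed */
    int c;

    while ((c = fgetc(in)) != EOF &&
           ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))) {
        length++;
    }
    if (c != EOF) {
        ungetc(c, in);   /* leave the non-letter for the main loop */
    }
    return length;
}
```

The main loop would then print w followed by the returned length. Similar functions for numbers and quoted sequences could follow the same pattern, each with its own rule for when the token ends.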
What To Hand In
Hand in a lab report with a copy of each of your test data files and the
output for each. Include a second copy of each test data file. On this copy
underline the different types of tokens using different colors of ink (for
example, you might underline words in red, whole numbers in green, decimal
numbers in black, character sequences in purple, and punctuation in pink). Also
write a value indicating the length of the word/number/character sequence over
the word/number/character sequence.
EXTRA CREDIT
2 extra points - make it so that double single quote characters may appear in
a word. For example, don''t would count as one word of length 5 rather than
as a word of length 3 followed by an empty quoted sequence and a word of
length one. (Note that with this change, this will disallow using empty quoted
sequences with the single quote character).
2 extra points - allow multiple dashes (and ONLY dashes) as in -- (2
consecutive dashes) or --- (3 consecutive dashes) etc. to be treated as a
single punctuation mark. If the punctuation mark has more than 1 character,
your program should print p followed by the number of characters in the mark
(for example, p3 for ---). A single-character punctuation mark is still
printed as just p.
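The output rule for multi-character punctuation can be sketched as a small formatting function. The name format_punct is hypothetical, and the sketch assumes the length of the dash run has already been counted elsewhere.

```c
#include <stdio.h>

/* Hypothetical helper: write "p" for a single punctuation character and
   "p" followed by the count (for example "p3") for a longer run of dashes. */
void format_punct(char *out, size_t size, int length)
{
    if (length > 1) {
        snprintf(out, size, "p%d", length);
    } else {
        snprintf(out, size, "p");
    }
}
```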