NAME

giga2plain.pl

SYNOPSIS

giga2plain.pl is a document format converter that transforms GigaCorpus files into plain text.

DESCRIPTION

The program accepts one or more input files in the English GigaWord format and outputs plain text.

The English GigaWord format looks like:

This is a test

WASHINGTON, May 12 (AFP)

To demonstrate the options available we have incorporated in our description this example.

We include punctutation ``examples'' such as the period, comma, and quotations.

We also include some numbers to demonstrate how the 1 program takes numbers 100 percent into consideration no matter if you have 10.0 words or 100,000 words.

With no options set, the plain text output would look as follows:

This is a test
WASHINGTON , May 12 ( AFP )
To demonstrate the options available we have incorporated in our description this example .
We include punctutation " examples " such as the period , comma , and quotations .
We also include some numbers to demonstrate how the 1 program takes numbers 100 percent into consideration no matter if you have 10 . 0 words or 100 , 000 words .

Only the headline, the dateline and the paragraph text are kept. The sentences between the paragraph tags are put on a single line. Punctuation is separated from the words. The `` and '' quotation marks are replaced with the " quotation mark.

Seven options are implemented:

1. --lowercase converts all of the text to lower case, including the headline and the dateline

    this is a test
    washington , may 12 ( afp )
    to demonstrate ...

2. --removeheadline removes the headline from the text

    WASHINGTON , May 12 ( AFP )
    To demonstrate ...

3. --removedateline removes the dateline from the text

    This is a test
    To demonstrate ...

4. --removenumbers removes all of the numbers from the text

    ... We also include some numbers to demonstrate how the program takes numbers percent into consideration no matter if you have words or words . ...

5. --removepunctuation removes all of the punctuation from the text

    ... We include punctutation examples such as the period comma and quotations ...

6. --replacenumbers replaces each number with the NUMBER tag

    ... We also include some numbers to demonstrate how the NUMBER program takes numbers NUMBER percent into consideration no matter if you have NUMBER words or NUMBER words . ...

7. --replacepunctuation replaces each punctuation mark with the PUNCTUATION tag

    ... We include punctutation PUNCTUATION examples PUNCTUATION such as the period PUNCTUATION comma PUNCTUATION and quotations PUNCTUATION ...

AUTHOR

Bridget T McInnes, bthomson@cs.umn.edu, University of Minnesota

Ted Pedersen, tpederse@d.umn.edu, University of Minnesota, Duluth

BUGS

COPYRIGHT

Copyright (C) 2005-2006, Ted Pedersen and Bridget T McInnes

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.