** This is specific to BSP v0.3, v0.4 and does not apply to NSP v0.5 **

** NSP v0.5 allows a user to specify their own tokenization scheme and 
thereby avoids these kinds of problems. **

BSP is geared to English language/ASCII text. We are working to change 
this by using the Unicode support that comes with Perl version 5.6. Until 
the next version of BSP is available, the following may give some guidance 
on how to proceed with other alphabets. 

The problem lies in line 165 of count.pl. The character class \w matches 
the characters A-Z, a-z, 0-9, and the underscore. This is the "word" 
(essentially alphanumeric) character set available in standard ASCII. So 
we consider words to be strings of such characters, which unfortunately 
excludes many alphabets. 

Here's an idea (courtesy of Michal Kren) - you can make the following 
modification to line 165 of count.pl (in v0.3):

while ( /(([\w\x80-\xff]+)|[,.!?;:])/g )

This will extend the "matching" for words to include the 8-bit characters
numbered 128 to 255 (the upper half of the table, which lies outside
standard ASCII). This includes a number of accented characters and other
alphabets, so it might possibly include the characters you are interested
in. It may also result in words that include punctuation and other
symbols found in that range, but this is at least a stopgap.
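
As a rough illustration of what the modified pattern does, here is a
small stand-alone sketch. It is not taken from count.pl: the subroutine
name tokenize_bytes and the sample text are made up for this example,
and the input is assumed to be in a one-byte-per-character encoding
such as ISO 8859-1.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of count.pl): split a line into tokens
# using the modified pattern. A token is either a run of \w characters
# and/or bytes in the range 0x80-0xFF, or one of the listed punctuation
# marks. Anything else (spaces, apostrophes, ...) is skipped.
sub tokenize_bytes {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ /(([\w\x80-\xff]+)|[,.!?;:])/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

# \xe9 and \xee are the ISO 8859-1 bytes for e-acute and i-circumflex.
my @tokens = tokenize_bytes("caf\xe9 au lait, s'il vous pla\xeet!");
print scalar(@tokens), " tokens\n";
```

With the original \w-only pattern the accented bytes would not be part
of any word, so the first word would be cut short at "caf". With the
extended class it stays in one piece.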

If you are adventuresome, you could try to add Unicode support
on your own. You will need to use Perl version 5.6 (or better) and
the utf8 pragma. Then you can use \p{IsWord} and \p{IsPunct} which
are Unicode character classes. This is relatively simple and seems
to work. However, we have run into problems thereafter, in particular
with count.pl and the hashing that it does. So, we continue to work
on this but if you have any insights on the matter we'd be grateful
to hear of them!
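
For anyone who wants to experiment, the following is a minimal sketch
of tokenizing with the Unicode character classes mentioned above. It is
only an illustration, not the change we are making to count.pl: the
subroutine name tokenize_unicode and the sample text are made up, and
the sketch assumes the text is already a Perl character string. Text
read from a file would first have to be decoded, which is not shown
here. It does not address the hashing problems mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of count.pl): a token is either a run
# of Unicode word characters or a single Unicode punctuation character.
sub tokenize_unicode {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ /(\p{IsWord}+|\p{IsPunct})/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

# \x{...} escapes build the string by code point, so the example does
# not depend on how this source file itself is encoded.
# U+00FD is y-acute and U+011B is e-caron.
my @tokens = tokenize_unicode("Dobr\x{FD} den, sv\x{11B}te!");
print scalar(@tokens), " tokens\n";
```

Unlike the byte-range pattern, this one treats punctuation found
outside ASCII as punctuation instead of folding it into words.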

Your comments and suggestions would be most welcome.

02/16/01
Ted Pedersen
tpederse@d.umn.edu