** This is specific to BSP v0.3 and v0.4 and does not apply to NSP v0.5. **
** NSP v0.5 allows a user to specify their own tokenization scheme and thereby avoids these kinds of problems. **

BSP is geared to English language/ASCII text. We are working to change this by using the Unicode support that comes with Perl version 5.6. Until the next version of BSP is available, the following may give some guidance on how to proceed with other alphabets.

The problem lies in line 165 of count.pl. The character class \w matches the characters A-Z, a-z, 0-9, and the underscore. This is the "alphanumeric" character set supported in standard ASCII. So we consider words to be strings of alphanumerics, which unfortunately excludes many alphabets.

Here's an idea (courtesy of Michal Kren): you can make the following modification to line 165 of count.pl (in v0.3):

    while ( /(([\w\x80-\xff]+)|[,.!?;:])/g )

This will extend the "matching" for words to include the 8-bit characters numbered 128 to 255 (the upper half of an 8-bit character table; ASCII proper ends at 127). This range includes a number of accented characters and other alphabets, so it might include the characters you are interested in. It may also result in words that include punctuation and other characters, but this is at least a stopgap.

If you are adventurous, you could try to include Unicode support on your own. You will need to use Perl version 5.6 (or better) and the utf8 pragma. Then you can use \p{IsWord} and \p{IsPunct}, which are Unicode character classes. This is relatively simple and seems to work. However, we have run into problems thereafter, in particular with count.pl and the hashing that it does. So we continue to work on this, and if you have any insights on the matter we'd be grateful to hear of them!

Your comments and suggestions would be most welcome.

02/16/01
Ted Pedersen
tpederse@d.umn.edu
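
As a rough illustration of the line 165 modification described above, here is a small standalone sketch that compares the two patterns on a string containing 8-bit accented characters. It is not taken from count.pl: the subroutine names tokenize_ascii and tokenize_extended and the sample string are hypothetical, and the ASCII class is spelled out as [A-Za-z0-9_] so that the comparison does not depend on how a given Perl version treats \w for bytes above 127.

```perl
use strict;
use warnings;

# Hypothetical helper: words are runs of ASCII alphanumerics or underscore,
# which is what the note describes as the behavior before the change.
sub tokenize_ascii {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ /(([A-Za-z0-9_]+)|[,.!?;:])/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

# Hypothetical helper: the same loop with the extended pattern from the
# note, which also accepts bytes 0x80 through 0xff inside a word.
sub tokenize_extended {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ /(([\w\x80-\xff]+)|[,.!?;:])/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

# Sample input (an assumption for illustration): 0xe9 and 0xef are the
# ISO-8859-1 codes for e-acute and i-diaeresis.
my $line = "caf\xe9, na\xefve.";

my @ascii    = tokenize_ascii($line);
my @extended = tokenize_extended($line);

# The ASCII pattern breaks the second word at the accented byte and drops
# that byte; the extended pattern keeps each word whole.
printf "ascii tokens:    %d\n", scalar @ascii;
printf "extended tokens: %d\n", scalar @extended;
```

With the ASCII-only pattern the accented bytes are skipped, so the second word is split in two and the first loses its final letter. With the extended pattern each word comes through as a single token, at the cost of also accepting any non-letter symbols that live in the 128 to 255 range.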