Crushing UTF-8 into ASCII

2010/07/03

Sometimes the lost context really doesn’t matter, especially when Perl doesn’t recognise ñ as the lower case of Ñ unless you jump through all sorts of locale hoops, even though it’s in Latin-1 and should be easy. This means I can’t just uc() the input to group all the case variations, because uc(‘peña’) returns ‘PEñA’. Accurate case-sensitive parsers reading my output then fail to match my PEñA with PEÑA (which is what it should be). So if everything goes to PENA, that’s fine for this case. This method uses only modules that ship with core Perl 5.8.
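To make the uc() problem concrete, here is a minimal sketch, assuming the input arrives as raw UTF-8 bytes and no locale pragma is in effect. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# "peña" as UTF-8 bytes: ñ is the two bytes 0xC3 0xB1.
my $bytes = "pe\xc3\xb1a";

# On a byte string, uc() only touches a-z, so the ñ bytes are left alone.
my $upper_bytes = uc($bytes);                              # "PE\xc3\xb1A", i.e. PEñA

# Decoding to characters first lets uc() apply Unicode case rules.
my $upper_chars = encode_utf8(uc(decode_utf8($bytes)));    # "PE\xc3\x91A", i.e. PEÑA

print $upper_bytes eq $upper_chars ? "same\n" : "different\n";
```

The two results differ, which is why a plain uc() over undecoded input cannot be used as a grouping key.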

It might not be the best method, but it does seem to work on my very large international input file, where I wanted to convert Peña/PEÑA to PENA and not PEñA.

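use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFD);

# Read lines from STDIN or from the files named on the command line.
while (<>) {
    $_ = NFD(decode_utf8($_));  # decode the bytes, then split accented letters into base + combining marks
    s/\pM//g;                   # drop the combining marks
    s/[^\0-\x7f]//g;            # drop anything still outside ASCII
    print;
}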
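The same steps can be wrapped in a small function and combined with uc() to build a grouping key. This is only a sketch: ascii_key is a hypothetical helper name, and it assumes each input string is UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: UTF-8 bytes in, upper-case ASCII key out.
sub ascii_key {
    my ($bytes) = @_;
    my $text = NFD(decode_utf8($bytes));  # decompose: ñ becomes n + combining tilde
    $text =~ s/\pM//g;                    # remove the combining marks
    $text =~ s/[^\0-\x7f]//g;             # remove whatever is still non-ASCII
    return uc $text;                      # safe now: only ASCII letters remain
}

# Count case and accent variants of the same name under one key.
my %count;
$count{ ascii_key($_) }++ for "Pe\xc3\xb1a", "PE\xc3\x91A", "pe\xc3\xb1a", "Pena";
print "$_: $count{$_}\n" for sort keys %count;    # prints "PENA: 4"
```

Note that the final substitution deletes characters that have no decomposition (ß or ø, for example) rather than transliterating them, so this is lossy by design.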
