How can I change extended Latin characters to their unaccented ASCII equivalents?

萌比男神i 2021-01-18 09:50

I need a generic transliteration or substitution regex that will map extended Latin characters to similar-looking ASCII characters, and all other extended characters to '' (the empty string).

5 Answers

Happy的楠姐 (original poster)

2021-01-18 10:31
All brilliant answers, but none of them actually worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or in various code and text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I wanted.
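For context, here is a minimal sketch of how Unicode::Normalize is commonly used to strip accents: decompose each character with NFD, then delete the combining marks. This is an illustration, not the exact code tried above, and `strip_marks` is a hypothetical helper name. It assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose characters, then delete combining marks.
# Expects a decoded character string, not raw encoded bytes.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);      # e.g. U+00C9 becomes "E" + U+0301
    $decomposed =~ s/\p{Mn}+//g;      # remove the nonspacing marks
    return $decomposed;
}

# U+00C9, U+00E9 and U+010E all decompose into a base letter plus a mark.
print "accents stripped\n" if strip_marks("\x{C9}t\x{E9} \x{10E}") eq "Ete D";

# U+00D8 (O with stroke) and U+00C6 (AE ligature) have no canonical
# decomposition, so they pass through unchanged.
print "stroke and ligature kept\n" if strip_marks("\x{D8}\x{C6}") eq "\x{D8}\x{C6}";
```

One known limitation of this approach is that letters without a canonical decomposition, such as Ø, Æ and Ð, are left as they are and still need explicit handling.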

In the end I just enumerated all the characters I wanted transliterated myself, by code point. They all fall in the Latin-1 range (U+00C0 to U+00FF), which covers the accented characters found most frequently in my input data.

I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters each.

For interested parties, the final code is (the tr is a single line):
```
# Map accented letters in U+00C0..U+00FF to their closest ASCII letters.
# Keep the tr on one line: a newline inside either list would itself be mapped.
$word =~ tr/\xC0-\xC5\xC7-\xD6\xD8-\xDD\xE0-\xE5\xE7-\xF6\xF8-\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
# Æ and æ expand to two characters, so they need s/// rather than tr.
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
# Delete everything else outside 7-bit ASCII.
$word =~ s/[^\x00-\x7F]+//g;
```
Characters like Ď lie outside the Latin-1 range and don't occur nearly as often in my input data, so for anything the table doesn't cover I chose to just lose everything above 127.
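The substitutions above match character code points, so they only behave as intended once the input has been decoded. A minimal sketch of how they might be wrapped, assuming the input arrives as UTF-8 encoded bytes (that encoding and the helper name `to_ascii` are assumptions for illustration, not part of the original code):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes UTF-8 encoded bytes, returns a plain ASCII string.
sub to_ascii {
    my ($bytes) = @_;
    # Decode first so \xC0..\xFF below match characters, not raw UTF-8 bytes.
    my $word = decode('UTF-8', $bytes);
    $word =~ tr/\xC0-\xC5\xC7-\xD6\xD8-\xDD\xE0-\xE5\xE7-\xF6\xF8-\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
    $word =~ s/\xC6/AE/g;             # AE ligature, upper case
    $word =~ s/\xE6/ae/g;             # ae ligature, lower case
    $word =~ s/[^\x00-\x7F]+//g;      # drop whatever is still non-ASCII
    return $word;
}

# Input bytes are written as escapes to keep the source file pure ASCII.
print to_ascii("Cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e"), "\n";   # prints "Creme brulee"
print to_ascii("\xC3\x86on"), "\n";                          # prints "AEon"
print to_ascii("\xC4\x8Eakujem"), "\n";                      # prints "akujem" (U+010E is dropped)
```

Writing the test strings as byte escapes avoids putting extended characters in the source file, which was the portability problem described at the start of this answer.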