How can I change extended latin characters to their unaccented ASCII equivalents?

萌比男神i 2021-01-18 09:50

I need a generic transliteration or substitution regex that maps extended Latin characters to similar-looking ASCII characters, and all other extended characters to '' (the empty string).

5 answers
  •  Happy的楠姐
    2021-01-18 10:31

    All brilliant answers, but none of them really worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or in various code/text editors across platforms. I tried Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I want.
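
    For comparison, here is a minimal sketch of the decomposition approach usually taken with Unicode::Normalize. It may illustrate one reason that route falls short here. The helper name strip_marks is hypothetical, and this is not necessarily the code that was actually tried: it decomposes each character with NFD and then deletes the combining marks, so letters without a canonical decomposition (Æ, æ, Ø, ø, Ð, ð) pass through untouched.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strips accents by canonical decomposition.
# Expects a string of characters (code points), not raw UTF-8 bytes.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);      # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}+//g;      # remove the nonspacing combining marks
    return $decomposed;
}

print strip_marks("na\xEFve caf\xE9"), "\n";                      # naive cafe
print strip_marks("\xC6") eq "\xC6" ? "unchanged\n" : "changed\n"; # unchanged
```

    Characters such as Æ therefore still need an explicit mapping, whichever approach is used.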

    In the end I just enumerated all the characters I wanted transliterated myself, by code point in the \xC0-\xFF (Latin-1) range, which covers the accented characters found most often in my input data.

    I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters each.

    For anyone interested, the final code is below (the tr is a single line):

    # Map accented letters to ASCII look-alikes. The tr must stay on one line:
    # a line break inside either list would become part of the mapping.
    # The ranges skip \xC6, \xD7, \xDE, \xDF, \xE6, \xF7 and \xFE.
    $word =~ tr/\xC0-\xC5\xC7-\xD6\xD8-\xDD\xE0-\xE5\xE7-\xF6\xF8-\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
    # Æ and æ expand to two characters, which tr cannot do, so use s///.
    $word =~ s/\xC6/AE/g;
    $word =~ s/\xE6/ae/g;
    # Drop every remaining character outside the ASCII range.
    $word =~ s/[^\x00-\x7F]+//g;
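
    The substitutions above match code points, so raw UTF-8 bytes would need decoding first. Below is a minimal sketch of how they might be wrapped and called, assuming the input arrives as UTF-8 bytes. The sub name latin1_to_ascii and the sample string are hypothetical; Encode's decode is the standard core-module call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper around the substitutions shown above.
# Expects decoded characters (code points), not raw UTF-8 bytes.
sub latin1_to_ascii {
    my ($word) = @_;
    # Accented letters to ASCII look-alikes (same mapping as above).
    $word =~ tr/\xC0-\xC5\xC7-\xD6\xD8-\xDD\xE0-\xE5\xE7-\xF6\xF8-\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
    $word =~ s/\xC6/AE/g;          # Æ becomes two characters
    $word =~ s/\xE6/ae/g;          # æ becomes two characters
    $word =~ s/[^\x00-\x7F]+//g;   # drop everything else above 127
    return $word;
}

# Assumed input: the UTF-8 bytes for "naïve Æon".
my $bytes = "na\xC3\xAFve \xC3\x86on";
my $text  = decode('UTF-8', $bytes);   # bytes to characters
print latin1_to_ascii($text), "\n";    # naive AEon
```

    Without the decode step, the two bytes of each UTF-8 sequence would be treated as separate characters and mapped or stripped individually.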
    

    Characters such as Ď lie outside the \xC0-\xFF range (they can be encoded in UTF-8, but are not part of Latin-1), and they occur far less often in my input data. For those, I chose to just lose everything above 127.
