I need a generic transliteration or substitution regex that will map extended latin characters to similar looking ASCII characters, and all other extended characters to \'\'
All brilliant answers. But none actually really worked. Putting extended characters directly in the source-code caused problems when working in terminal windows or various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wan't able to get any of them to do exactly what I want.
In the end I just enumerated all the characters I wanted transliterated myself for UTF-8 (which is most frequent code page found in my input data).
I needed two extra substitutions to take care of æ and Æ which I want mapping to two characters
For interested parties the final code is: (the tr is a single line)
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF
\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4
\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8
\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoo
oooouuuuyy/;
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
$word =~ s/[^\x00-\x7F]+//g;
Since things like Ď are not part of UTF-8, they don't occur nearly so often in my input data. For non-UTF-8 input, I chose to just loose everything above 127.