I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete
The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to:
[a-zA-Z\u00C0-\u024F]
[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
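As a quick check, here is a minimal sketch of how the two character classes might be used to validate whole names. The ^...$ anchors, the + quantifier, the variable names, and the sample names are my own assumptions for illustration; only the character classes come from the regexes above.

```javascript
// Assumption: a "name" is one or more letters and nothing else.
const basicLatin = /^[a-zA-Z\u00C0-\u024F]+$/;
const withAdditional = /^[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF]+$/;

// Hypothetical sample names, written with escapes so file encoding doesn't matter.
const samples = [
  "Zo\u00EB",     // Zoë: ë is in Latin-1 Supplement
  "\u0141ukasz",  // Łukasz: Ł is in Latin Extended-A
  "Nguy\u1EC5n",  // Nguyễn: ễ is in Latin Extended Additional
];

for (const name of samples) {
  console.log(name, basicLatin.test(name), withAdditional.test(name));
}
```

Only the last name needs the \u1E00-\u1EFF range. Both patterns assume precomposed characters; a name stored with separate combining marks would need normalizing first.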
I added these Unicode blocks (\u00C0-\u024F covers three adjacent blocks at once):
\u00C0-\u00FF  Latin-1 Supplement
\u0100-\u017F  Latin Extended-A
\u0180-\u024F  Latin Extended-B
\u1E00-\u1EFF  Latin Extended Additional
Note that \u00C0-\u00FF is actually only part of Latin-1 Supplement. It skips the unprintable control characters and all symbols except for the awkwardly placed multiplication sign × (\u00D7) and division sign ÷ (\u00F7). To exclude those two as well:
[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
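To see what splitting the range buys you, here is a small sketch comparing the two classes. The anchors and variable names are assumptions of mine; the character classes are the ones from this answer.

```javascript
// Continuous range: lets × (U+00D7) and ÷ (U+00F7) through.
const withSymbols = /^[a-zA-Z\u00C0-\u024F]+$/;
// Split range: steps around U+00D7 and U+00F7.
const lettersOnly = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

console.log(withSymbols.test("a\u00D7b"));   // "a×b" -> true
console.log(lettersOnly.test("a\u00D7b"));   // "a×b" -> false
console.log(lettersOnly.test("6\u00F72"));   // "6÷2" -> false (digits aren't allowed either)
console.log(lettersOnly.test("\u00D6zil"));  // "Özil" -> true, Ö (U+00D6) sits right before ×
console.log(lettersOnly.test("S\u00F8ren")); // "Søren" -> true, ø (U+00F8) sits right after ÷
```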
If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.
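If you do end up adding more ranges, typing the escapes by hand gets error-prone. One way to keep it manageable, sketched here under my own assumptions, is to build the character class from a list of code point pairs. The helper names (esc, buildLatinClass) are hypothetical; the Latin Extended-C, D, and E boundaries are the ones given in the Unicode block charts, and I included them only to show how a range gets added.

```javascript
// [first, last] code point of each range to allow. All are in the BMP.
const ranges = [
  [0x00C0, 0x00D6], // Latin-1 Supplement letters before ×
  [0x00D8, 0x00F6], // Latin-1 Supplement letters between × and ÷
  [0x00F8, 0x024F], // rest of Latin-1 Supplement, Latin Extended-A and -B
  [0x1E00, 0x1EFF], // Latin Extended Additional
  [0x2C60, 0x2C7F], // Latin Extended-C (optional)
  [0xA720, 0xA7FF], // Latin Extended-D (optional)
  [0xAB30, 0xAB6F], // Latin Extended-E (optional)
];

// Format a BMP code point as a \uXXXX regex escape.
const esc = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");

// Hypothetical helper: turn the list into an anchored "letters only" regex.
function buildLatinClass(list) {
  const body = list.map(([lo, hi]) => `${esc(lo)}-${esc(hi)}`).join("");
  return new RegExp(`^[a-zA-Z${body}]+$`);
}

const latinName = buildLatinClass(ranges);
console.log(latinName.source);
```

Dropping a block is then a one-line change, and the comments document which block each pair belongs to.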
The original regex, which stopped at \u017F, borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with an S-cedilla, \u015E: "Şenol". But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
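The failure is easy to reproduce. This is a minimal sketch; the anchors and variable names are my own, and the two spellings are written as escapes so the look-alike characters can't get mixed up.

```javascript
const original = /^[a-zA-Z\u00C0-\u017F]+$/; // stops at the end of Latin Extended-A
const extended = /^[a-zA-Z\u00C0-\u024F]+$/; // also covers Latin Extended-B

const commaBelow = "\u0218enol"; // Șenol, S with comma below
const cedilla = "\u015Eenol";    // Şenol, S with cedilla

console.log(original.test(commaBelow)); // false
console.log(extended.test(commaBelow)); // true
console.log(original.test(cedilla));    // true

// When a name fails unexpectedly, printing the code point shows which character it really is.
console.log(commaBelow.codePointAt(0).toString(16)); // "218"
```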