Concrete JavaScript Regex for Accented Characters (Diacritics)

庸人自扰 2020-11-22 17:22

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete

9 Answers
  •  无人及你
    2020-11-22 17:39

    The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

    [a-zA-Z\u00C0-\u024F]
    [a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars
    

    I added these Unicode blocks (\u00C0-\u024F covers three adjacent blocks at once):

    • \u00C0-\u00FF Latin-1 Supplement
    • \u0100-\u017F Latin Extended-A
    • \u0180-\u024F Latin Extended-B
    • \u1E00-\u1EFF Latin Extended Additional
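
As a minimal sketch of how the ranges listed above could be used to validate a whole name, the character class can be assembled from one entry per block. The names `LATIN_BLOCKS`, `latinNamePattern`, and `isLatinName` are made up for this illustration, the pattern allows letters only (no spaces, hyphens, or apostrophes), and the NFC normalization step is an assumption about how the input might arrive, not something the answer states.

```javascript
// Hypothetical names for illustration; only the ranges come from the answer.
const LATIN_BLOCKS = {
  basicLetters: "a-zA-Z",
  latin1Upper: "\\u00C0-\\u00FF",        // part of Latin-1 Supplement (includes × and ÷)
  extendedA: "\\u0100-\\u017F",          // Latin Extended-A
  extendedB: "\\u0180-\\u024F",          // Latin Extended-B
  extendedAdditional: "\\u1E00-\\u1EFF", // Latin Extended Additional
};

// Anchoring with ^ and $ requires every character to fall in one of the ranges.
const latinNamePattern = new RegExp(
  "^[" + Object.values(LATIN_BLOCKS).join("") + "]+$"
);

function isLatinName(name) {
  // Assumption: input may arrive decomposed (letter + combining mark),
  // so compose it to single code points before testing.
  return latinNamePattern.test(name.normalize("NFC"));
}

console.log(isLatinName("Zo\u00EB"));    // "Zoë" → true
console.log(isLatinName("Zoe\u0308"));   // decomposed "Zoë" → true after NFC
console.log(isLatinName("Nguy\u1EC5n")); // "Nguyễn" → true (Latin Extended Additional)
console.log(isLatinName("\u0418\u0432\u0430\u043D")); // Cyrillic "Иван" → false
```

Without the `normalize("NFC")` call, the decomposed spelling would fail, because combining marks such as \u0308 lie outside all of these ranges.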

    Note that \u00C0-\u00FF is actually only part of the Latin-1 Supplement block. The range skips the unprintable control characters and all symbols except the awkwardly placed multiplication sign × (\u00D7) and division sign ÷ (\u00F7). To exclude those two as well:

    [a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F] // exclude ×÷
    
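The difference between the two character classes can be checked directly. This is only a small sketch: the variable names are hypothetical, and the `^…+$` anchors are added here so that each single-character sample has to match in full.

```javascript
// The full \u00C0-\u024F range versus the variant that skips × and ÷.
const withSymbols = /^[a-zA-Z\u00C0-\u024F]+$/;
const lettersOnly = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+$/;

// × ÷ and their immediate neighbours Ö Ø ö ø
const boundarySamples = ["\u00D7", "\u00F7", "\u00D6", "\u00D8", "\u00F6", "\u00F8"];

for (const ch of boundarySamples) {
  console.log(ch, withSymbols.test(ch), lettersOnly.test(ch));
}
// × and ÷ match only withSymbols; the four letters match both patterns.
```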

    If you need more code points, you can find more ranges on Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D, and E, but I left them out because only historians seem interested in them now, and the D and E sets don't even render correctly in my browser.

    The original regex stopping at \u017F borked on the name "Șenol". According to FontSpace's Unicode Analyzer, that first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yeah, it's usually spelled with a cedilla-S \u015E, "Şenol." But I'm not flying to Turkey to go tell him, "You're spelling your name wrong!")
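
The "Șenol" case can be reproduced with a short check. This is a sketch under the assumption that the original pattern was `[a-zA-Z\u00C0-\u017F]` anchored to the whole string; the variable names are made up for the example.

```javascript
const commaBelowName = "\u0218enol"; // "Șenol", S with comma below
const cedillaName = "\u015Eenol";    // "Şenol", S with cedilla

const upToExtendedA = /^[a-zA-Z\u00C0-\u017F]+$/; // stops at the end of Latin Extended-A
const upToExtendedB = /^[a-zA-Z\u00C0-\u024F]+$/; // also covers Latin Extended-B

// Format the first code point as U+XXXX to confirm which character it is.
const firstCodePoint = commaBelowName
  .codePointAt(0)
  .toString(16)
  .toUpperCase()
  .padStart(4, "0");

console.log("U+" + firstCodePoint);              // U+0218
console.log(upToExtendedA.test(commaBelowName)); // false, \u0218 is past \u017F
console.log(upToExtendedB.test(commaBelowName)); // true
console.log(upToExtendedA.test(cedillaName));    // true, \u015E is in Latin Extended-A
```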
