Question
When I use \b as a word boundary, it seems to recognize only ASCII letters.
For example, the pattern
\bM\b will match aaaa M bbbbbb
but if I have
aaaaa Mädchen
it will match too, because it treats ä as the end of a word.
Are there any flags I can set so that this regex library handles Unicode strings too? It seems very unlikely that the library would be this primitive, but I see nothing relevant among the options:
TRegExOption = (roNone, roIgnoreCase, roMultiLine, roExplicitCapture,
roCompiled, roSingleLine, roIgnorePatternSpace);
Answer 1:
According to regular-expressions.info, Delphi's regex library is based on PCRE, and the predefined character class \w in PCRE matches only ASCII characters by default; therefore \b, which is defined in terms of \w, is also ASCII-only.
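To check this on a particular Delphi installation, a minimal console sketch could test whether \w matches a non-ASCII letter. It assumes the System.RegularExpressions unit (TRegEx); the program and constant names are hypothetical, and the result may depend on the Delphi and PCRE versions in use.

```pascal
program CheckWordClass;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  // U+00E4 (a with diaeresis), written as a character code so the check
  // does not depend on the encoding of the source file.
  AUmlaut: string = #$00E4;

begin
  // If \w is ASCII-only, the first line reports False, and \b will then
  // see a word boundary between "M" and this letter.
  Writeln('\w matches U+00E4: ', BoolToStr(TRegEx.IsMatch(AUmlaut, '\w'), True));
  Writeln('\w matches "a":    ', BoolToStr(TRegEx.IsMatch('a', '\w'), True));
end.
```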
Answer 2:
You can use lookaround to build your own word boundaries that fit your preferred definition of a "word". For example, if you want to match "M" as a word and treat all Unicode letters, numbers and marks as word characters, use:
(?<![\pL\pN\pM])M(?![\pL\pN\pM])
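As a rough sketch of how the two patterns could be compared side by side, the console program below runs both against the strings from the question. It assumes the System.RegularExpressions unit and that the PCRE build behind TRegEx supports the \p Unicode properties; the program, constant and procedure names are hypothetical.

```pascal
program CompareBoundaries;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

const
  AsciiBoundary   = '\bM\b';
  // Custom boundary: no Unicode letter, number or mark on either side of M.
  UnicodeBoundary = '(?<![\pL\pN\pM])M(?![\pL\pN\pM])';
  Standalone: string = 'aaaa M bbbbbb';
  // "Mädchen", with ä given as a character code to avoid source-encoding issues.
  InsideWord: string = 'aaaaa M'#$00E4'dchen';

// Prints whether Pattern finds a match in each of the two sample strings.
procedure Report(const Name, Pattern: string);
begin
  Writeln(Name);
  Writeln('  standalone M:  ', BoolToStr(TRegEx.IsMatch(Standalone, Pattern), True));
  Writeln('  M in a word:   ', BoolToStr(TRegEx.IsMatch(InsideWord, Pattern), True));
end;

begin
  Report('\bM\b', AsciiBoundary);
  Report('lookaround', UnicodeBoundary);
end.
```

If the lookaround pattern reports False for the second string while \bM\b reports True, the custom boundary is treating ä as a word character, which is the behaviour the question asks for.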
Source: https://stackoverflow.com/questions/14208789/delphi-regex-library-and-unicode-characters