Delphi RegEx library and Unicode characters

Posted by 旧城冷巷雨未停 on 2020-03-04 07:10:41

Question


When I use \b as a word boundary, it seems to recognize only the ASCII alphabet. For example, the pattern

\bM\b

matches the M in

aaaa M bbbbbb

as expected. But it also matches in

aaaaa Mädchen

because ä is treated as a non-word character, so the engine sees a word boundary right after M.

Is there a flag that makes this regex library treat Unicode letters as word characters? It seems unlikely that the library would be this limited, but I cannot find anything relevant among the options:

TRegExOption = (roNone, roIgnoreCase, roMultiLine, roExplicitCapture,
roCompiled, roSingleLine, roIgnorePatternSpace);

Answer 1:


According to regular-expressions.info, Delphi's regex library is based on PCRE. By default, PCRE's predefined character class \w matches only ASCII word characters, and since \b is defined in terms of \w, it is ASCII-only as well.




Answer 2:


You can use lookaround to make your own word boundaries to fit your preferred definition of a "word". E.g. if you want to match "M" as a word and treat all Unicode letters, numbers and marks as word characters, use:

(?<![\pL\pN\pM])M(?![\pL\pN\pM])
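
As a rough illustration, the lookaround pattern above could be tried in a small console program like the one below. This is only a sketch under a few assumptions: it uses the System.RegularExpressions unit (named RegularExpressions in Delphi XE), it assumes the bundled PCRE supports the \p Unicode property classes, and ContainsWordM is a hypothetical helper name that is not part of the library.

```delphi
program UnicodeWordBoundary;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // Lookaround-based "word boundary" from the answer above: M must not be
  // preceded or followed by a Unicode letter, number, or mark.
  WholeWordM = '(?<![\pL\pN\pM])M(?![\pL\pN\pM])';

// Hypothetical helper: True if Text contains "M" as a standalone word.
function ContainsWordM(const Text: string): Boolean;
begin
  Result := TRegEx.IsMatch(Text, WholeWordM);
end;

begin
  // Standalone "M" surrounded by spaces: this should match.
  Writeln(ContainsWordM('aaaa M bbbbbb'));

  // "Mädchen", with ä written as #$00E4 so the result does not depend on
  // the source file encoding: the letter after M should block the match.
  Writeln(ContainsWordM('aaaaa M'#$00E4'dchen'));
end.
```

To match an arbitrary word instead of a fixed "M", the same lookbehind and lookahead could be wrapped around the output of TRegEx.Escape applied to that word.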


Source: https://stackoverflow.com/questions/14208789/delphi-regex-library-and-unicode-characters
