C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Submitted by 醉酒当歌 on 2019-11-29 10:36:57

They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
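To make the 16-bit limit concrete, here is a minimal sketch of how a code point above U+FFFF is split into two char values. The class name SurrogateDemo and the helper ToSurrogates are made up for illustration; char.ConvertFromUtf32 and char.IsSurrogatePair are the framework calls that do the same job.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: splits a code point above U+FFFF into its
    // UTF-16 high and low surrogate code units.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20 bits remain
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    static void Main()
    {
        string s = char.ConvertFromUtf32(0x10000);     // first non-BMP code point
        Console.WriteLine(s.Length);                   // 2: one code point, two chars
        Console.WriteLine(char.IsSurrogatePair(s, 0)); // True

        var (high, low) = ToSurrogates(0x10000);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // D800 DC00
    }
}
```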

Unfortunately, it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters that aren't in the Basic Multilingual Plane (BMP). (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

To work around this with the .NET regex engine, I use the following trick: the range

[\U00010000-\U0010FFFF]

is replaced with

[\uD800-\uDBFF][\uDC00-\uDFFF]

The idea is that, because .NET regexes operate on UTF-16 code units rather than code points, we can give the engine the surrogate ranges as if they were ordinary characters. Narrower ranges are also possible by handling the edges separately. For example,

[\U00011DEF-\U00013E07]

is the same as

(?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])

It's harder to read and maintain, and it's less flexible, but it works as a workaround.
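The surrogate-range trick can be checked with a short sketch like the one below. The class and field names (NonBmpRegex, AnyNonBmp, Narrow) are invented for this example, and the anchors ^ and $ on the narrow pattern are an addition so that it tests a single code point. The patterns are verbatim strings, so the \uXXXX escapes are interpreted by the regex engine rather than by the C# compiler.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonBmpRegex
{
    // Any supplementary code point: a high surrogate followed by a low surrogate.
    public static readonly Regex AnyNonBmp =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

    // U+11DEF..U+13E07, split at the high-surrogate boundaries.
    public static readonly Regex Narrow = new Regex(
        @"^(?:\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])$");

    static void Main()
    {
        Console.WriteLine(AnyNonBmp.IsMatch("abc"));                                 // False
        Console.WriteLine(AnyNonBmp.IsMatch("a" + char.ConvertFromUtf32(0x1F600))); // True

        Console.WriteLine(Narrow.IsMatch(char.ConvertFromUtf32(0x11DEF))); // True  (lower edge)
        Console.WriteLine(Narrow.IsMatch(char.ConvertFromUtf32(0x13E07))); // True  (upper edge)
        Console.WriteLine(Narrow.IsMatch(char.ConvertFromUtf32(0x13E08))); // False (just past it)
    }
}
```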

@Jon Skeet

So what you are telling me is that there is no way to use the Regex tools in .NET to match characters outside the 16-bit range (that is, outside the BMP)?

The full regex is:

^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\U00010000-\U0010FFFF])+$

I am attempting to check whether a string contains only what a YAML document defines as printable Unicode characters.
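One possible way to express this check with the surrogate-pair workaround is sketched below. It keeps the same alternatives as the pattern in the question but folds the BMP ranges into a single character class and writes the last alternative as a high surrogate followed by a low surrogate. The names YamlPrintable and IsPrintable are hypothetical, and this has not been checked against the YAML specification beyond what the question's pattern states.

```csharp
using System;
using System.Text.RegularExpressions;

static class YamlPrintable
{
    // BMP ranges from the question in one class, plus a surrogate pair in
    // place of [\U00010000-\U0010FFFF].
    // Note: in .NET, $ also matches just before a final '\n'; use \z instead
    // if the end of the string must be matched strictly.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\u0009\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]" +
        @"|[\uD800-\uDBFF][\uDC00-\uDFFF])+$");

    public static bool IsPrintable(string text) => Pattern.IsMatch(text);

    static void Main()
    {
        Console.WriteLine(IsPrintable("hello\tworld"));                       // True
        Console.WriteLine(IsPrintable("x" + char.ConvertFromUtf32(0x1F600))); // True
        Console.WriteLine(IsPrintable("a\u0000b"));                           // False (NUL)
        Console.WriteLine(IsPrintable("\uD800"));                             // False (lone surrogate)
    }
}
```

Because each surrogate half is excluded from the BMP class (it stops at \uD7FF and resumes at \uE000), an unpaired surrogate fails the match instead of slipping through.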
