Is There a Way to Match Any Unicode non-Alphabetic Character?

被撕碎了的回忆 2020-12-18 21:19

I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up with lots of random Unicode punctuation where the converter messe

2 Answers
  • 2020-12-18 22:14

    Depending on which language you're using, the regular expression engine may or may not be Unicode aware. If it is, it may or may not know the \p{} property tokens. If it does, your answer is in Unicode Characters and Properties in Jan Goyvaerts' regex tutorial.
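
    As a minimal sketch of "may or may not know the \p{} tokens", the snippet below probes Python's built-in re module. The helper name engine_accepts is made up for illustration, and the result reflects the standard-library engine only; other engines (including the third-party regex package for Python) behave differently.

```python
import re

def engine_accepts(pattern: str) -> bool:
    """Return True if the built-in re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Unknown escapes such as \p are rejected at compile time.
        return False

# The standard-library engine is Unicode aware (\w works on any script)
# but does not understand \p{...} property tokens.
print(engine_accepts(r"\p{L}"))
print(engine_accepts(r"\w+"))
```

    A probe like this is a quick way to find out which side of the "may or may not" you are on before building a larger pattern.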

    You can use \p{Latin}, if supported, to match every character that belongs to the Latin script, or the negated form \P{Latin} to match every character that does not.
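
    Where the engine has no script property (Python's built-in re is one such case), a rough stand-in is to look at Unicode character names. This is a heuristic swapped in for \p{Latin}, not an equivalent: looks_latin is a hypothetical helper, and a few Latin-script characters whose names do not contain "LATIN" will be missed.

```python
import unicodedata

def looks_latin(ch: str) -> bool:
    """Heuristic: a letter whose Unicode character name mentions LATIN."""
    if not unicodedata.category(ch).startswith("L"):
        return False  # not a letter at all
    return "LATIN" in unicodedata.name(ch, "")

def non_latin_letters(text: str) -> str:
    """Collect the letters in text that the heuristic does not treat as Latin."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith("L") and not looks_latin(ch)
    )

sample = "na\u00efve \u041f\u0440\u0438\u0432\u0435\u0442 \u6771\u4eac caf\u00e9"
print(non_latin_letters(sample))
```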

  • 2020-12-18 22:23

    Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably

    \p{L}
    

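    which matches any letter or ideograph. To match the opposite, any character that is not a letter, use the negated form \P{L} (or [^\p{L}]). Letters can also be followed by combining marks, so to match a letter together with its marks you could use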

    \p{L}\p{M}*
    

    In any case, all the different types of character properties are detailed in the first link.
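
    For an engine without \p{} support, the same idea can be sketched with Unicode general categories directly. The example below uses Python's unicodedata module instead of a regex; the function names are illustrative, and keeping spaces is an assumption about what the cleaned text should look like.

```python
import unicodedata

def is_letter_or_mark(ch: str) -> bool:
    # General categories beginning with "L" are letters (what \p{L} covers);
    # those beginning with "M" are combining marks (what \p{M} covers).
    return unicodedata.category(ch)[0] in ("L", "M")

def strip_non_letters(text: str, keep: str = " ") -> str:
    """Drop every character that is not a letter, a mark, or listed in keep."""
    return "".join(ch for ch in text if is_letter_or_mark(ch) or ch in keep)

# Ellipsis, dagger, and "!" are punctuation; "e" + U+0301 is a letter plus a mark.
sample = "Hello\u2026 w\u00f6rld\u2020 e\u0301t\u00e9!"
print(strip_non_letters(sample))
```

    Unlike \p{L}\p{M}*, this keeps a combining mark even when it does not follow a letter, which is usually harmless for cleanup but worth knowing.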

    Edit: You may also want to look at this Stack Overflow answer discussing whether \w matches Unicode characters. It suggests that you could also use \p{Word} or \p{Alnum}: Does \w match all alphanumeric characters defined in the Unicode standard?
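
    How \w behaves is engine specific. As one data point, in Python 3 the built-in re module treats \w as Unicode aware for str patterns, so a letters-only class can be approximated by subtracting digits and the underscore. This is a sketch of that trick, not an exact equivalent of \p{L}: it may also admit some numeric symbols that are not decimal digits.

```python
import re

# \w covers Unicode letters, digits, and "_"; the double negation below
# keeps word characters while excluding \d and the underscore.
LETTERS = re.compile(r"[^\W\d_]+")

def letter_runs(text: str) -> list:
    """Return the runs of consecutive letter-like characters in text."""
    return LETTERS.findall(text)

print(letter_runs("abc_123 \u0434\u043e\u043c42 \u6771\u4eac"))
```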
