Is There a Way to Match Any Unicode non-Alphabetic Character?

被撕碎了的回忆 2020-12-18 21:19

I have some documents that went through OCR conversion from PDF to HTML. Because of that, they wound up with lots of random Unicode punctuation where the converter messe

2 Answers
  •  轮回少年
    2020-12-18 22:14

    Depending on which language you're using, the regular expression engine may or may not be Unicode-aware. If it is, it may or may not support the \p{} property tokens. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regex tutorial.

    You can use \p{Latin}, if supported, to match any character that belongs to the Unicode Latin script, or the negated form \P{Latin} to match any character that does not.
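
As a rough illustration of the two points above, here is a minimal TypeScript sketch, assuming a JavaScript engine with ES2018 Unicode property escapes. The helper names (tryRegExp, stripNonLetters, findNonLatin) are made up for this example and are not from the question or the tutorial. Note that JavaScript spells the script test \p{Script=Latin}; the bare \p{Latin} form mentioned above is accepted by some other engines but not by this one.

```typescript
// Build patterns from strings so that an engine without \p{} support throws
// at run time (caught here) instead of failing when the file is parsed.
function tryRegExp(source: string, flags: string): RegExp | null {
  try {
    return new RegExp(source, flags);
  } catch {
    return null; // the engine does not understand this pattern or flag
  }
}

// \P{L} (capital P) matches any code point that is NOT a Unicode letter;
// the trailing + groups a run of such characters into one match.
const nonLetterRun = tryRegExp("\\P{L}+", "gu");

// Negated script test: anything outside the Latin script.
const nonLatin = tryRegExp("\\P{Script=Latin}", "gu");

// Hypothetical helper: replace each run of non-letters with a single space.
function stripNonLetters(text: string): string {
  if (nonLetterRun === null) {
    throw new Error("This engine does not support \\p{} property escapes");
  }
  return text.replace(nonLetterRun, " ");
}

// Hypothetical helper: list every character outside the Latin script.
function findNonLatin(text: string): string[] {
  if (nonLatin === null) {
    throw new Error("This engine does not support \\p{} property escapes");
  }
  return text.match(nonLatin) ?? [];
}

// U+2020 and U+2021 (dagger, double dagger) stand in for stray OCR punctuation.
console.log(stripNonLetters("caf\u00e9\u2020\u2021na\u00efve")); // "café naïve"
// U+0416 is a Cyrillic letter, so it is the only non-Latin character here.
console.log(findNonLatin("abc\u0416\u00e9")); // [ "Ж" ]
```

Two caveats with this sketch: \P{L} also removes digits, whitespace and combining marks (so accents stored in decomposed form would be stripped), and \P{Script=Latin} also matches digits, spaces and ordinary punctuation, because Unicode assigns those to the Common script rather than to Latin.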
