Is There a Way to Match Any Unicode non-Alphabetic Character?


被撕碎了的回忆 2020-12-18 21:19

I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random Unicode punctuation where the converter messe

2 answers

轮回少年 (original poster)

2020-12-18 22:14

Whether this works depends on the language you're using: its regular expression engine may or may not be Unicode-aware, and even if it is, it may or may not support the \p{} property tokens. If it does, your answer is in the "Unicode Characters and Properties" section of Jan Goyvaerts' regex tutorial.

You can use \p{Latin}, if supported, to match every character that belongs to the Latin script (which spans several Unicode blocks), or \P{Latin} to match everything that does not.
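As an illustration of an engine that is Unicode-aware but does not know \p{}, here is a minimal sketch for Python, whose built-in re module lacks the property tokens. It falls back on the standard unicodedata module instead. The helper names (is_letter, strip_non_letters, looks_latin) are made up for this example, and looks_latin is only a name-based heuristic, not a true equivalent of \p{Latin}. The third-party regex package does support \p{Latin} if installing it is an option.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # The general categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")


def strip_non_letters(text: str, keep: str = " \n") -> str:
    # Compose base letters with their accents first, so that "e" + U+0301
    # becomes one letter rather than a letter followed by a combining mark.
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch in keep or is_letter(ch))


def looks_latin(ch: str) -> bool:
    # Heuristic stand-in for \p{Latin} (an assumption of this sketch):
    # most Latin-script letters have "LATIN" in their Unicode character name,
    # but a few, such as U+00AA FEMININE ORDINAL INDICATOR, do not.
    return is_letter(ch) and "LATIN" in unicodedata.name(ch, "")


if __name__ == "__main__":
    # Stray OCR punctuation is dropped; letters and spaces are kept.
    print(strip_non_letters("héllo… wörld¶!"))
```

Filtering character by character is slower than a compiled regex, but it relies only on the standard library and makes the "letter" test explicit.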
