Is There a Way to Match Any Unicode non-Alphabetic Character?


I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random unicode punctuation where the converter messe


2 Answers

轮回少年

2020-12-18 22:14

Depending on which language you're using, the regular expression engine may or may not be Unicode-aware. If it is, it may or may not support the \p{...} property tokens. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regex tutorial.

You can use \p{Latin}, if supported, to match every character that belongs to the Latin script (a Unicode script property, not a block), or the negated form \P{Latin} to match every character that does not.
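
To make the "may or may not know \p{}" point concrete, here is a rough Python sketch. Python's built-in `re` module is one engine that is Unicode-aware but does not accept `\p{...}`, so the sketch probes for support and falls back to `unicodedata`. The helper names (`supports_property_tokens`, `is_letter`, `non_letters`) are made up for illustration; the third-party `regex` package on PyPI is, as far as I know, the usual way to get real `\p{L}` support in Python.

```python
import re
import unicodedata


def supports_property_tokens():
    # Hypothetical helper: probe whether the stdlib engine accepts \p{...}.
    # Current CPython versions reject it with a "bad escape" error.
    try:
        re.compile(r"\p{L}")
        return True
    except re.error:
        return False


def is_letter(ch):
    # General categories starting with "L" (Lu, Ll, Lt, Lm, Lo) are letters,
    # which is roughly what \p{L} matches in engines that support it.
    return unicodedata.category(ch).startswith("L")


def non_letters(text):
    # Fallback for \P{L}: collect every character that is not a letter.
    return "".join(ch for ch in text if not is_letter(ch))


if __name__ == "__main__":
    print(supports_property_tokens())
    print(repr(non_letters("h\u00e9llo, w\u00f6rld! \u6f22\u5b57 42")))
```

The fallback walks the string one code point at a time, so it is fine for cleaning up documents but is not a drop-in replacement for a regex inside a larger pattern.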

盖世英雄少女心

2020-12-18 22:23
Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably
```
\p{L}
```
which matches any letter or ideograph; the negated form, \P{L}, matches everything that is not a letter. You may also want to include letters with combining marks on them, so you could use
```
\p{L}\p{M}*
```
In any case, all the different types of character properties are detailed in the first link.
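
If your engine lacks `\p{...}` (Python's built-in `re` is one example), the idea behind `\p{L}\p{M}*` can be approximated by hand. This is only a sketch under that assumption, and `keep_letters_with_marks` is a name I made up: it keeps each letter plus any combining marks that directly follow it, and drops everything else.

```python
import unicodedata


def keep_letters_with_marks(text):
    # Emulates joining all matches of \p{L}\p{M}* in the input.
    kept = []
    in_cluster = False  # True while we are inside a letter+marks run
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            kept.append(ch)
            in_cluster = True
        elif category.startswith("M") and in_cluster:
            # A combining mark attached to the preceding letter.
            kept.append(ch)
        else:
            # Punctuation, digits, spaces, or a stray mark with no base letter.
            in_cluster = False
    return "".join(kept)


if __name__ == "__main__":
    # "e" + COMBINING ACUTE ACCENT survives as a unit; digits and "!" go.
    print(repr(keep_letters_with_marks("e\u0301t\u00e9 2020!")))
```

Without the mark handling, a decomposed "é" (base letter plus U+0301) would lose its accent, which is exactly the case the `\p{M}*` suffix is there to cover.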

Edit: You may also want to look at the Stack Overflow question "Does \w match all alphanumeric characters defined in the Unicode standard?", which discusses whether \w matches Unicode characters. The answers there suggest that you could also use \p{Word} or \p{Alnum}.
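
As one data point on the `\w` question, here is a small Python sketch. For `str` patterns, Python's `re` treats `\w` as Unicode-aware by default, and the character class `[^\W\d_]` is a common trick for "word characters minus digits and underscore". Treat it as an approximation of `\p{L}`, not an equivalent: it can still let through a few numeric characters that `\d` does not cover.

```python
import re

# Sample text: Latin letters with an accent, underscore, digits, CJK, punctuation.
text = "caf\u00e9_42 \u6f22\u5b57!"

# \w+ matches runs of Unicode letters, digits, and underscores.
words = re.findall(r"\w+", text)

# [^\W\d_]+ removes digits and the underscore from \w, leaving (mostly) letters.
letters = re.findall(r"[^\W\d_]+", text)

if __name__ == "__main__":
    print(words)
    print(letters)
```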