Umlauts in regexp matching (via locale?)


I'm surprised that I'm not able to match a German umlaut in a regexp. I have tried several approaches, most of them involving setting locales, but so far to no avail.


2 answers

再見小時候

2020-12-17 20:01
Have you tried using the re.UNICODE flag, as described in the documentation?
```
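>>> import re
>>> # Python 2 session: the input is a UTF-8 byte string, so 'ü' shows up as two bytes
>>> re.findall(r'\w+', 'abc def güi jkl', re.UNICODE)
['abc', 'def', 'g\xc3\xbci', 'jkl']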
```
A quick search points to this thread that gives some explanation:

re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. UTF-8 encodes codepoints outside the ASCII range to multiple bytes per codepoint, and the re module will treat each of those bytes as a separate character.
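
To make the byte-versus-character point concrete, here is a minimal sketch assuming Python 3, where `str` patterns are Unicode-aware by default and `bytes` patterns match ASCII only. The variable names are illustrative and not taken from the thread.

```python
import re

text = 'abc def güi jkl'

# str pattern on str input: \w is Unicode-aware by default in Python 3.
words_str = re.findall(r'\w+', text)
print(words_str)    # ['abc', 'def', 'güi', 'jkl']

# bytes pattern on UTF-8 bytes: \w only matches ASCII word characters,
# so the two bytes that encode 'ü' break the word apart.
data = text.encode('utf-8')
words_bytes = re.findall(rb'\w+', data)
print(words_bytes)  # [b'abc', b'def', b'g', b'i', b'jkl']

# Decoding to str before matching restores the expected result.
words_decoded = re.findall(r'\w+', data.decode('utf-8'))
print(words_decoded)  # ['abc', 'def', 'güi', 'jkl']
```

The practical takeaway is to decode input to text before matching, rather than relying on a locale.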
庸人自扰

2020-12-17 20:17

In my case, \S gave me better results than \w, combined with saving the file as UTF-8 and using re.UNICODE.
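
One possible reason \S appears to work better, sketched here under the assumption of Python 3 and a made-up sample string: \S matches anything that is not whitespace, so it never splits on the bytes of an umlaut, but it also keeps adjacent punctuation.

```python
import re

text = 'Grüße, Welt!'  # hypothetical sample input

# \w+ on str: Unicode letters match, punctuation is excluded.
word_tokens = re.findall(r'\w+', text)
print(word_tokens)   # ['Grüße', 'Welt']

# \S+ on str: everything between whitespace, punctuation included.
nonspace_tokens = re.findall(r'\S+', text)
print(nonspace_tokens)  # ['Grüße,', 'Welt!']

# \S+ on UTF-8 bytes: multi-byte characters stay together because
# none of their bytes is ASCII whitespace.
byte_tokens = re.findall(rb'\S+', text.encode('utf-8'))
print(byte_tokens)   # [b'Gr\xc3\xbc\xc3\x9fe,', b'Welt!']
```

So \S is a workable fallback when the input cannot be decoded, at the cost of having to strip punctuation afterwards.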
