How to portably parse the (Unicode) degree symbol with regular expressions?

Posted by 流过昼夜 on 2019-12-01 15:08:01
reclosedev

Possible portable solution:

Convert input data to unicode, and use re.UNICODE flag in regular expressions.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


data = u'temp1:        +31.0°C  (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+' 
                     ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)

print temp_re.findall(data)

Output

[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
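The snippet above uses Python 2 syntax (the ur'' prefix and the print statement). As a rough sketch only, assuming Python 3, where str patterns are matched as Unicode by default and the re.UNICODE flag is therefore redundant, the same idea might look like this. The [+-] character class is a stylistic substitute for (\+|-), and the trailing .* is dropped because it does not affect the captured groups:

```python
import re

# Same sample line as in the answer; in Python 3 a plain str is already Unicode.
data = 'temp1:        +31.0°C  (crit = +107.0°C)'

# Python 3 str patterns are Unicode-aware by default, so no flag is passed.
temp_re = re.compile(
    r'(temp1:)\s+([+-])(\d+\.\d+)°C\s+'
    r'\(crit\s+=\s+([+-])(\d+\.\d+)°C\)'
)

matches = temp_re.findall(data)
print(matches)
```

This should print the same five groups as above, just without the u prefixes.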

EDIT

@netvope already pointed this out in the comments on the question.

Update

Notes from J.F. Sebastian comments about input encoding:

check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.

So, to decode input data to unicode, basically* you should use encoding from system locale using locale.getpreferredencoding() e.g.:

data = subprocess.check_output(...).decode(locale.getpreferredencoding())
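A minimal sketch of that decoding step, assuming Python 3. The helper name run_and_decode and the child command are hypothetical and only for illustration; the actual command from the question is not reproduced here:

```python
import locale
import subprocess
import sys


def run_and_decode(cmd, encoding=None):
    """Run cmd and return its stdout as text (hypothetical helper)."""
    raw = subprocess.check_output(cmd)  # check_output() returns bytes here
    if encoding is None:
        # Fall back to the locale's preferred encoding, as suggested above.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


# Illustrative child process that prints ASCII-only text.
text = run_and_decode([sys.executable, '-c', 'print("temp1: +31.0")'])
print(text.strip())
```

The optional encoding parameter is there for the cases where the locale guess is wrong and the real encoding is known.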

With the data decoded correctly, you'll get the same output without re.UNICODE in this case.


Why basically? Because on a Russian Win7 system with cp1251 as the preferred encoding, suppose we have, for example, a script.py that encodes its output as UTF-8:

#!/usr/bin/env python
# -*- coding: utf8 -*-

print u'temp1: +31.0°C  (crit = +107.0°C)'.encode('utf-8')

And we need to parse its output:

subprocess.check_output(['python', 
                         'script.py']).decode(locale.getpreferredencoding())

will produce the wrong result: 'В°' instead of '°'.

So, in some cases, you need to know the actual encoding of the input data.
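The mismatch can be reproduced without a subprocess. The following is a small sketch, assuming Python 3, that encodes the degree sign as UTF-8 and then decodes the bytes both ways. The escape \u00b0 stands for the degree sign and \u0412 for the Cyrillic letter В, so the example does not depend on the encoding of the source file:

```python
# UTF-8 encodes the degree sign (U+00B0) as the two bytes 0xC2 0xB0.
raw = 'temp1: +31.0\u00b0C'.encode('utf-8')

# Decoding with the wrong codec: cp1251 maps 0xC2 to U+0412 and 0xB0 to U+00B0.
wrong = raw.decode('cp1251')

# Decoding with the codec the producer actually used.
right = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

Only the second decode round-trips the original text; the first yields the stray 'В' before the degree sign.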
