Python: replace Unicode characters

Submitted by 纵饮孤独 on 2019-12-07 06:19:58

Question


I wrote a program to read a Windows DNS debugging log, but the domain field always contains some odd characters.

Below is one example:

'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

I want to replace every \x.. sequence with a ?.

Explicitly typing \xc2, as follows, works:

import re
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
# result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it does not work if I write it as follows:

re.sub('\\\x..', '?', line)

How can I write a regular expression that replaces them all?


Answer 1:


There are better tools for this job than regex. For example, you could try:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

That skips non-ASCII characters. Or, with 'replace', each undecodable byte becomes the Unicode replacement character U+FFFD, which you can then swap for a '?' placeholder if you want one:

>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
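The session above is Python 2, where a plain string is a byte string. A minimal sketch of the same idea in Python 3, assuming the log line has been read as bytes (the variable names here are only illustrative):

```python
# Assumption: the log line was read in binary mode, so it is a bytes object.
raw = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# 'ignore' silently drops every byte that is not valid ASCII.
dropped = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each invalid byte; swapping that
# character for '?' gives the placeholder asked for in the question.
masked = raw.decode('ascii', 'replace').replace('\ufffd', '?')

print(dropped)  # (13)p(5)example(3)com(0)
print(masked)   # (13)????????p????(5)example(3)com(0)
```

Each non-ASCII byte yields one placeholder, so the length of the label is preserved.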

But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages.

There is an excellent answer about unbaking mojibake here. Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.
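As a rough illustration of what recovering the data might look like, here is a Python 3 sketch. It assumes, without confirmation from the log itself, that the original label bytes were wrongly read as Latin-1 and then written out as UTF-8. The list of candidate code pages is a guess and should be replaced with whatever encodings are plausible for the machines producing the log.

```python
# The garbled label from the question, as bytes (Python 3).
garbled = b'\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d'

# Assumption: undo a Latin-1 -> UTF-8 double encoding to get the raw bytes back.
original_bytes = garbled.decode('utf-8').encode('latin-1')

# Hypothetical candidates; none of these is known to be the right one.
candidates = ['gbk', 'big5', 'shift_jis', 'cp1251']

results = {}
for codec in candidates:
    try:
        results[codec] = original_bytes.decode(codec)
    except UnicodeDecodeError as exc:
        results[codec] = 'failed: %s' % exc

for codec, text in results.items():
    # ascii() keeps the output printable on any console.
    print(codec, ascii(text))
```

A candidate that decodes without error is not necessarily correct; the result still has to be checked by eye against names that are expected to appear in the log.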




Answer 2:


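What about this?

import re
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# The string holds real non-ASCII characters, not literal backslashes,
# so match anything outside the ASCII range 0x00-0x7f.
pattern = r'[^\x00-\x7f]'
re.sub(pattern, '?', line)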
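Which pattern is appropriate depends on whether the string holds real non-ASCII characters or the literal four-character text \xc2 (for instance, if the log was already written out in escaped form). A sketch covering both cases in Python 3, with hypothetical helper names:

```python
import re

# Matches any single character outside the ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

# Matches the literal text backslash, 'x', and two hex digits.
LITERAL_HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')

def mask_non_ascii(text):
    """Replace each real non-ASCII character with '?'."""
    return NON_ASCII.sub('?', text)

def mask_literal_escapes(text):
    """Replace each literal \\xNN escape sequence in the text with '?'."""
    return LITERAL_HEX_ESCAPE.sub('?', text)

# Normal literal: the escapes are interpreted, so the string has real characters.
real = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
# Raw literal: the backslashes are kept as ordinary text.
escaped = r'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

print(mask_non_ascii(real))
print(mask_literal_escapes(escaped))
```

Bounding the match to exactly two hex digits matters: a greedy pattern such as `\\x.+` would swallow everything from the first escape to the end of the line.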


Source: https://stackoverflow.com/questions/39751705/python-replace-unicode-characters
