URL component % and \x

Question

I have a question.

import urllib2

st = "b%C3%BCrokommunikation"
urllib2.unquote(st)

OUTPUT: 'b\xc3\xbcrokommunikation'

But if I print it:

print urllib2.unquote(st)

OUTPUT: bürokommunikation

Why the difference? I need to write bürokommunikation, not 'b\xc3\xbcrokommunikation', to a file.

My problem is that I have a lot of data with values like this extracted from URLs, and I have to store them in a text file in readable form, e.g. bürokommunikation.

Answer 1:

When you print the string, your terminal emulator interprets the two bytes \xc3\xbc as the UTF-8 encoding of the character ü and displays it correctly.

However, as @MarkDickinson says in the comments, ü doesn't exist in ASCII, so you need to decode the bytes into a unicode string and tell Python which encoding to use when writing it to the file, for instance UTF-8.

This is very easy using the codecs library:

import codecs
import urllib2

# First decode the unquoted UTF-8 bytes into a Python unicode string
st = "b%C3%BCrokommunikation"
decoded_string = urllib2.unquote(st).decode('utf-8')

# Write it to a file, encoding it as UTF-8
with codecs.open('my_file.txt', 'w', 'utf-8') as f:
    f.write(decoded_string)
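
Since the question mentions many such values, here is a minimal sketch of the same idea applied to a list, assuming Python 3 rather than the Python 2 used above. In Python 3, unquote lives in urllib.parse and decodes the percent-escapes as UTF-8 by default, and the built-in open accepts an encoding argument. The helper name write_unquoted, the sample values, and the output path are hypothetical and only for illustration.

```python
import os
import tempfile
from urllib.parse import unquote  # Python 3 location of unquote


def write_unquoted(values, path):
    """Percent-decode each value and write one per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for value in values:
            # unquote() returns a text string; escapes are decoded as UTF-8
            f.write(unquote(value) + "\n")


# Hypothetical sample data and output path, for illustration only.
samples = ["b%C3%BCrokommunikation", "caf%C3%A9"]
out_path = os.path.join(tempfile.gettempdir(), "unquoted_example.txt")
write_unquoted(samples, out_path)

# Read the file back to check what was stored.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Passing the encoding explicitly on both the write and the read avoids depending on the platform's default encoding.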

Answer 2:

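You are looking at the same result in both cases. When you evaluate the expression at the interactive prompt without print, the interpreter shows the __repr__() of the string, which escapes non-ASCII bytes as \x sequences. When you use print, the raw bytes are sent to the terminal, which displays them as the character ü.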
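
The distinction between the escaped representation and the actual text can be sketched as follows. This assumes Python 3, where unquote_to_bytes from urllib.parse returns the same raw bytes that urllib2.unquote returns as a str in Python 2. The variable names are illustrative.

```python
from urllib.parse import unquote_to_bytes

# The raw UTF-8 bytes behind the percent-escapes
raw = unquote_to_bytes("b%C3%BCrokommunikation")

# The repr is what the interactive prompt echoes: non-ASCII bytes
# appear as \x escapes, but the underlying data is unchanged.
shown = repr(raw)

# Decoding the same bytes as UTF-8 yields the readable text.
text = raw.decode("utf-8")

print(shown)
print(len(raw), len(text))  # ü occupies two bytes but is one character
```

The byte string is one element longer than the decoded text because ü is a single character stored as two bytes in UTF-8.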

Source: https://stackoverflow.com/questions/34379432/url-component-and-x

Tags

python

urllib2

urllib
