Convert a unicode string containing UTF-8 bytes to str

Anonymous (unverified), submitted 2019-12-03 02:49:01

Question:

I'm using pyquery to parse a page:

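from pyquery import PyQuery

# Fetch the printable zh-cn variant of the page and take the first paragraph's text
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()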

But what I get in content is a unicode string whose code points are the bytes of the UTF-8 encoded text:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...' 

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Answer 1:

If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':

content = content.encode('latin1') 

This works because the Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 encoding, so encoding to Latin-1 turns each code point into the byte with the same value and your data comes out as the literal bytes.

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
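
The session above is Python 2. As a rough sketch of the same round trip on Python 3 (an assumption, since the question does not mention Python 3), where the text type is str and the byte type is bytes, something like the following should behave the same way. The helper name fix_mojibake is hypothetical and is not part of pyquery or the standard library.

```python
# Python 3 sketch of the Latin-1 round trip; the original answer targets Python 2.

def fix_mojibake(text):
    """Hypothetical helper: recover text whose UTF-8 bytes were mis-decoded as Latin-1."""
    raw = text.encode('latin-1')  # code points U+0000..U+00FF become bytes of the same value
    return raw.decode('utf-8')    # reinterpret those bytes as UTF-8

# Each code point below 256 encodes to the single byte with the same value,
# which is why the Latin-1 encode step loses nothing.
assert all(chr(i).encode('latin-1') == bytes([i]) for i in range(256))

content = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
raw = content.encode('latin-1')   # the byte string the question asks for
fixed = fix_mojibake(content)     # the readable text

print(raw)
print(fixed)
```

Note that encode('latin-1') raises UnicodeEncodeError if the string contains any code point above U+00FF, so this only applies to strings that really do hold byte values in unicode form.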

