Python3, how to encode this string correctly?

北战南征 submitted on 2019-12-13 03:14:58

Question


Disclaimer: I have already researched this at length on my own, but most of the questions I found here concern Python 2.7 or do not solve my problem.

Let's say I have the following (the example comes from the BeautifulSoup docs; I'm trying to solve a bigger issue):

>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(markup)
<h1>SacrÃ© bleu!</h1>

As I see it, markup should be a bytes object, so that I could do:

>>> markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> print(str(markup, 'utf-8'))
<h1>Sacré bleu!</h1>

Great! But how do I get from "<h1>Sacr\xc3\xa9 bleu!</h1>", which is wrong, to b"<h1>Sacr\xc3\xa9 bleu!</h1>"?

Because if I do:

>>> markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
>>> bytes(markup, "utf-8")
b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'

You see? It inserted \x83\xc2 for free.

>>> print(bytes(markup))
TypeError: string argument without an encoding

Answer 1:


If you have the Unicode string "<h1>Sacr\xc3\xa9 bleu!</h1>", something has already gone wrong. Either your input is broken, or you did something wrong when processing it. For example, here, you've copied a Python 2 example into a Python 3 interpreter.
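
To see how such a string can come about, here is a minimal sketch. The variable names are illustrative and not from the question, and decoding with the wrong codec is only one plausible cause. The UTF-8 bytes for "é" are decoded as latin-1, which turns them into the two characters "Ã" and "©":

```python
# The intended text and its UTF-8 encoding.
correct = "<h1>Sacré bleu!</h1>"
utf8_bytes = correct.encode("utf-8")
print(utf8_bytes)  # b'<h1>Sacr\xc3\xa9 bleu!</h1>'

# Decoding those bytes as latin-1 maps every byte to one code point,
# which reproduces the broken string from the question.
broken = utf8_bytes.decode("latin-1")
print(broken == "<h1>Sacr\xc3\xa9 bleu!</h1>")  # True
print(len(correct), len(broken))  # 20 21
```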

If you have your broken string because you did something wrong to get it, then you should really fix whatever it was you did wrong. If you need to convert "<h1>Sacr\xc3\xa9 bleu!</h1>" to b"<h1>Sacr\xc3\xa9 bleu!</h1>" anyway, then encode it in latin-1:

bytestring = broken_unicode.encode('latin1')
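
As a hedged illustration of that line, the sketch below assumes broken_unicode holds the string from the question; fixed is a name introduced here. latin-1 maps the code points U+0000 to U+00FF one-to-one onto the bytes 0x00 to 0xFF, so encoding recovers the original bytes, which can then be decoded as UTF-8. Encoding as UTF-8 instead turns "Ã" and "©" into two bytes each, which is where the extra \x83\xc2 comes from:

```python
# Assumption: this is the broken string from the question.
broken_unicode = "<h1>Sacr\xc3\xa9 bleu!</h1>"

# latin-1 turns each code point below 256 back into the byte of the same value.
bytestring = broken_unicode.encode("latin1")
print(bytestring)  # b'<h1>Sacr\xc3\xa9 bleu!</h1>'

# The recovered bytes are valid UTF-8 and decode to the intended text.
fixed = bytestring.decode("utf-8")
print(fixed)  # <h1>Sacré bleu!</h1>

# For contrast: UTF-8 encodes U+00C3 as c3 83 and U+00A9 as c2 a9.
print(broken_unicode.encode("utf-8"))  # b'<h1>Sacr\xc3\x83\xc2\xa9 bleu!</h1>'
```

Note that encode('latin1') raises UnicodeEncodeError if the string contains any character above U+00FF, so this repair only applies when every character in the broken string stands for a single byte.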


Source: https://stackoverflow.com/questions/52123701/python3-how-to-encode-this-string-correctly
