Bulletproof work with encoding in Python


A question about Unicode in Python 2.

As far as I know, I should always decode everything I read from outside (files, network).

3 answers
  •  [愿得一人]
    2021-01-13 17:19

    ... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.

    ...

    These statements are right, ain't they?

    No. Bytes from outside are binary data; they are not a unicode string. Calling .decode("utf8") produces a Python unicode object by interpreting the bytes as UTF-8, and it raises an error if the bytes cannot be decoded as UTF-8.
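
    A minimal sketch of that distinction. The sample byte strings are made up for illustration. The code is written to run under both Python 2 and Python 3; in Python 3 the result of decode is called str rather than unicode.

```python
from __future__ import print_function

raw = b"caf\xc3\xa9"           # bytes as they might arrive from a file or socket
text = raw.decode("utf8")      # unicode object in Python 2, str in Python 3
print(len(raw), len(text))     # byte count vs. character count

bad = b"caf\xe9"               # Latin-1 encoded bytes, not valid UTF-8
try:
    bad.decode("utf8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)

# The "replace" error handler substitutes U+FFFD instead of raising.
lossy = bad.decode("utf8", "replace")
```

    Whether to raise or to substitute a replacement character depends on whether the application can tolerate lossy text.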

    Determining the encoding of any given document is not necessarily a simple task. You either need some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding, and then use that encoding to decode the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
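
    The two-pass idea could be sketched roughly as follows. This is a deliberately simplified illustration, not the HTML Standard's sniffing algorithm: the regular expression, the 1024-byte window, the UTF-8 default and the function names sniff_html_encoding and decode_html are all assumptions made for the example.

```python
import re

# Simplified pattern: finds charset=... in a meta tag (assumption, not the full spec).
_META_CHARSET = re.compile(br'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_html_encoding(data, default="utf-8"):
    """Pass 1: look for a charset declaration in the first 1024 bytes."""
    match = _META_CHARSET.search(data[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_html(data):
    """Pass 2: decode with the declared encoding, which may still be wrong."""
    encoding = sniff_html_encoding(data)
    try:
        return data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError: unknown encoding name.
        # UnicodeDecodeError: the bytes do not match the declared encoding.
        return data.decode("utf-8", "replace")
```

    The fallback branch covers the case described above, where a document declares an encoding that its bytes do not actually follow.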

    There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, not necessarily correct). But they can have their own issues such as performance, and they may not recognize the encoding of your document.
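
    A hedged sketch of using chardet this way. chardet is a third-party package, so the import is guarded; the helper name guess_decode and the UTF-8 fallback are assumptions for illustration. chardet.detect returns a dict whose "encoding" entry is only a guess and can be None.

```python
try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None


def guess_decode(data, fallback="utf-8"):
    """Decode bytes using chardet's guess, falling back to a fixed encoding."""
    encoding = None
    if chardet is not None:
        encoding = chardet.detect(data)["encoding"]  # a guess; may be None
    try:
        return data.decode(encoding or fallback)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong or unknown to Python: decode lossily instead.
        return data.decode(fallback, "replace")
```

    Since the result rests on a guess, it is best treated as a last resort when no declared or externally known encoding is available.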
