Read a Unicode file in Python which declares its encoding in the same way as Python source

情书的邮戳 2021-02-14 19:07

I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin

3 Answers
  •  轮回少年
    2021-02-14 19:40

    From said PEP (0263):

    Python's tokenizer/compiler combo will need to be updated to work as follows:

    1. read the file

    2. decode it into Unicode assuming a fixed per-file encoding

    3. convert it into a UTF-8 byte string

    4. tokenize the UTF-8 content

    5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
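Steps 1 to 3 of that list can be sketched in a few lines of Python 3. This is only an illustration, not what the interpreter itself runs: the helper name `read_as_utf8` and the demo file are made up for the example, and the encoding is passed in by hand rather than detected.

```python
import os
import tempfile


def read_as_utf8(path, encoding="utf-8"):
    """Hypothetical helper mirroring steps 1-3 of the quoted list."""
    with open(path, "rb") as f:
        raw = f.read()                # 1. read the file (raw bytes)
    text = raw.decode(encoding)       # 2. decode using the per-file encoding
    return text.encode("utf-8")       # 3. convert to a UTF-8 byte string


# Demo: a throwaway Latin-1 file containing "café".
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("caf\u00e9\n".encode("latin-1"))
    demo_path = tmp.name

utf8_bytes = read_as_utf8(demo_path, "latin-1")
os.remove(demo_path)
```

For a plain text file you would normally stop after step 2 and work with the decoded `str`; step 3 only matters to the tokenizer.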

    Indeed, if you check Parser/tokenizer.c in the Python source you'll find functions get_coding_spec and check_coding_spec which are responsible for finding this information on a line being examined in decoding_fgets.
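A rough pure-Python approximation of that lookup is sketched below. It is not a port of the C functions: the pattern follows the regular expression given in PEP 263 as I recall it, the function name `find_declared_encoding` is invented for the example, and it simply checks the first two lines without the tokenizer's extra rules about BOMs or what the first line must look like.

```python
import codecs
import re

# Declaration pattern in the style of PEP 263 (assumed; check the PEP text).
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(data, default="utf-8"):
    """Return the encoding declared in the first two lines of `data` (bytes).

    Falls back to `default` when no declaration is found.
    """
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for an unknown codec
            return name
    return default
```

With the declared name in hand, `data.decode(find_declared_encoding(data))` yields the text.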

    It doesn't look like this capability is exposed anywhere as a Python API (at least, these specific functions aren't Py-prefixed), so your options are a third-party library and/or repurposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
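One caveat to the last point: on Python 3 the `tokenize` module provides `detect_encoding`, which reads up to two lines through a `readline` callable and returns the declared encoding, defaulting to UTF-8. A minimal sketch, assuming Python 3 and a file small enough to read whole; `read_declared_text` and the demo file are made-up names for illustration.

```python
import os
import tempfile
import tokenize


def read_declared_text(path):
    """Decode a file using the encoding it declares, as Python source would."""
    with open(path, "rb") as f:
        # detect_encoding consumes at most two lines, so rewind afterwards.
        encoding, _first_lines = tokenize.detect_encoding(f.readline)
        f.seek(0)
        return f.read().decode(encoding)


# Demo: a throwaway file that declares Latin-1 on its first line.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"# -*- coding: latin-1 -*-\ncaf\xe9\n")
    demo_path = tmp.name

demo_text = read_declared_text(demo_path)
os.remove(demo_path)
```

Note that `detect_encoding` raises `SyntaxError` for an unknown codec name, or when no declaration is present and the first lines are not valid UTF-8, so callers reading arbitrary text may want to catch that.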
