Read a unicode file in python which declares its encoding in the same way as python source


情书的邮戳 2021-02-14 19:07

I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin

3 answers

轮回少年 (original poster)

2021-02-14 19:40
From said PEP (0263):
Python's tokenizer/compiler combo will need to be updated to work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
Indeed, if you check Parser/tokenizer.c in the Python source you'll find functions get_coding_spec and check_coding_spec which are responsible for finding this information on a line being examined in decoding_fgets.
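To illustrate what that lookup amounts to, here is a rough pure-Python sketch. It is not a port of tokenizer.c: the function name find_coding_declaration is made up for this example, the regular expression follows the form given in PEP 263, and the sketch simplifies things by checking the first two lines unconditionally and ignoring any byte order mark.

```python
import re

# Pattern of the form given in PEP 263 for an encoding declaration comment.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_coding_declaration(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the first two lines, or the default.

    Hypothetical helper: a simplified approximation of the lookup,
    not the logic CPython actually runs.
    """
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            # Encoding names in a declaration are plain ASCII.
            return match.group(1).decode("ascii")
    return default
```

The returned name can then be passed to bytes.decode, which raises LookupError if the name is not a codec Python knows.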

It doesn't look like this capability is exposed anywhere as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or repurposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
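For readers on Python 3, it may be worth checking the tokenize module: tokenize.detect_encoding reads up to two lines through a readline callable and returns the encoding it finds (a byte order mark or a PEP 263 style declaration), defaulting to UTF-8. Below is a minimal sketch that assumes the whole file is already in memory as bytes; the wrapper name decode_with_declared_encoding is hypothetical.

```python
import io
import tokenize


def decode_with_declared_encoding(data: bytes) -> str:
    """Decode bytes using the encoding declared the way Python source does.

    Hypothetical wrapper around tokenize.detect_encoding, which raises
    SyntaxError when the declaration names an unknown encoding.
    """
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(data).readline)
    return data.decode(encoding)
```

For a file on disk, tokenize.open(path) applies the same detection and returns a text-mode file object. Note that the declaration line itself stays in the decoded text, so strip it if it is not part of the data.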