I wish to write a Python program which reads files containing Unicode text. These files are normally encoded with UTF-8, but might not be; if they aren't, the alternate encodin
From said PEP (0263):
Python's tokenizer/compiler combo will need to be updated to work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
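The decode and convert steps in that list can be illustrated from Python itself. This is only a sketch of the idea, not what the C tokenizer literally does; the sample bytes and the `declared_encoding` variable are made up for illustration.

```python
# Bytes as they might sit on disk: one source line stored as Latin-1.
raw = "s = 'café'\n".encode("latin-1")

# Assumed to have been read from the file's coding declaration.
declared_encoding = "latin-1"

# Decode into Unicode using the single per-file encoding.
text = raw.decode(declared_encoding)

# Convert to a UTF-8 byte string, the form the tokenizer then consumes.
utf8_bytes = text.encode("utf-8")
```

The on-disk bytes and the UTF-8 bytes differ only where the text contains non-ASCII characters, which is why guessing the wrong per-file encoding often goes unnoticed until an accented character shows up.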
Indeed, if you check Parser/tokenizer.c in the Python source, you'll find the functions get_coding_spec and check_coding_spec, which are responsible for finding this information on a line being examined in decoding_fgets.
It doesn't look like this capability is exposed anywhere as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or repurposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
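Repurposing the check does not necessarily require a C extension. As a minimal sketch, the coding-spec lookup can be approximated in pure Python with a regular expression along the lines of the one given in PEP 263, applied to the first two lines of the file. The names `find_declared_encoding` and `decode_with_declaration` are hypothetical, and the sketch makes simplifying assumptions: it ignores a UTF-8 byte order mark and does not check that the first line is blank or a comment before looking at the second. If you are on Python 3, `tokenize.detect_encoding` is worth checking first, since it covers the same ground.

```python
import codecs
import re

# Pattern modeled on the coding-declaration regex described in PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared in the first two lines, else `default`."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for an unknown codec name
            return name
    return default


def decode_with_declaration(raw: bytes, default: str = "utf-8") -> str:
    """Decode `raw` using its declared encoding, falling back to `default`."""
    return raw.decode(find_declared_encoding(raw, default))
```

A possible use, reading the file in binary mode so the declaration can be inspected before any decoding happens:

```python
sample = "# -*- coding: latin-1 -*-\nname = 'café'\n".encode("latin-1")
print(find_declared_encoding(sample))
print(decode_with_declaration(sample))
```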