I want to write a Python program that reads files containing Unicode text. These files are normally encoded in UTF-8, but might not be; if they aren't, the alternate encoding is explicitly declared at the beginning of the file. More precisely, it is declared using exactly the same rules that Python itself uses to let Python source code declare its encoding (as in PEP 0263; see https://www.python.org/dev/peps/pep-0263/ for details). Just to be clear, the files being processed are not actually Python source, but they do declare their encodings (when not UTF-8) using the same rules.
If one knows the encoding of a file before one opens it, Python provides a very easy way to read the file with automatic decoding: the codecs.open function. For instance, one might do:
    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)
Each line we get in the loop will be a unicode string. Is there a Python library which does a similar thing, but chooses the encoding according to the rules above (which are Python 3.0's rules, I think)? For example, does Python expose to the language the 'read file with self-declared encoding' routine it uses to read source? If not, what's the easiest way to achieve the desired effect?
One thought is to open the file using the usual open, read the first two lines, interpret them as UTF-8, look for a coding declaration using the regexp in the PEP, and, if one is found, start decoding all subsequent lines using the declared encoding. For this to be sure to work, we need to know that for all the encodings that Python allows in Python source, the usual Python readline will correctly split the file into lines. That is, we need to know that for all those encodings, the byte string '\n' always really means newline, and isn't part of some multi-byte sequence encoding another character. (In fact I need to worry about '\r\n' as well.) Does anyone know if this is true? The docs were not very specific.
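To illustrate why this matters, here is a small sketch (written for Python 3, which is an assumption; the snippets in this thread are Python 2) showing an encoding in which the byte 0x0A does occur inside another character. UTF-16 is used purely as an example of a codec that is not an ASCII superset; the sketch does not claim anything about which encodings Python accepts for source files.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) in little-endian UTF-16
# is the two bytes 0x0A 0x01, so it contains a raw "newline" byte.
encoded = '\u010a'.encode('utf-16-le')
contains_newline_byte = b'\n' in encoded

# Splitting on the raw byte cuts the character in half.
pieces = 'a\u010ab'.encode('utf-16-le').split(b'\n')

# In UTF-8 every byte of a multi-byte sequence is >= 0x80, so 0x0A
# can only ever be a real line feed.
utf8_bytes = 'a\u010ab'.encode('utf-8')
utf8_has_newline_byte = b'\n' in utf8_bytes

print(contains_newline_byte, len(pieces), utf8_has_newline_byte)
```

For ASCII-superset single-byte encodings such as cp1252 or latin-1, and for UTF-8, a byte-level split on '\n' is safe; the risk only arises with codecs laid out like UTF-16.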
Another thought is to look in the Python sources. Does anyone know where in the Python source the source-code-encoding-processing is done?
You should be able to roll your own decoder in Python. If you're only supporting 8-bit encodings which are supersets of ASCII, the code below should work as-is.
If you need to support 2-byte encodings like UTF-16, you'd need to augment the pattern to match \x00c\x00o.. or the reverse, depending on the byte order mark.
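As a quick illustration of that point (a Python 3 sketch; the variable names are only for this example), here is what the word "coding" looks like once encoded as UTF-16 in either byte order, and why a plain ASCII pattern no longer finds it:

```python
ascii_pattern = b'coding'

# Big-endian puts the zero byte first, little-endian puts it second.
utf16_be = 'coding'.encode('utf-16-be')
utf16_le = 'coding'.encode('utf-16-le')

print(utf16_be)  # interleaved as \x00c\x00o...
print(utf16_le)  # interleaved as c\x00o\x00...

# The ASCII byte pattern does not occur in either form.
found_in_be = ascii_pattern in utf16_be
found_in_le = ascii_pattern in utf16_le
```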
First, generate a few test files which advertise their encoding:
    import codecs, sys

    for encoding in ('utf-8', 'cp1252'):
        out = codecs.open('%s.txt' % encoding, 'w', encoding)
        out.write('# coding = %s\n' % encoding)
        out.write(u'\u201chello se\u00f1nor\u201d')
        out.close()
Then write the decoder:
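    import codecs, re

    def open_detect(path):
        fin = open(path, 'rb')
        # Look for a declaration such as "# coding = cp1252" near the start.
        prefix = fin.read(80)
        encs = re.findall(r'#\s*coding\s*=\s*([\w\d\-]+)\s+', prefix)
        # findall returns a list; take the first match, else default to UTF-8.
        encoding = encs[0] if encs else 'utf-8'
        fin.seek(0)
        # Reads return UTF-8 byte strings transcoded from the file's encoding.
        return codecs.EncodedFile(fin, 'utf-8', encoding)

    for path in ('utf-8.txt', 'cp1252.txt'):
        fin = open_detect(path)
        print repr(fin.readlines())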
Output:

    ['# coding = utf-8\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
    ['# coding = cp1252\n', '\xe2\x80\x9chello se\xc3\xb1nor\xe2\x80\x9d']
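For comparison, PEP 263 recognizes a declaration only when "coding" is immediately followed by ":" or "=", so the spaced form used in the test files above would not count under Python's own rules. The Python 3 sketch below uses the pattern found in Python 3's tokenize module; the name COOKIE_RE, the helper declared_encoding and the sample lines are illustrative assumptions.

```python
import re

# Pattern as used by Python 3's tokenize module for PEP 263 declarations.
COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', re.ASCII)

def declared_encoding(line):
    """Return the encoding named in a PEP 263 style comment, or None."""
    match = COOKIE_RE.match(line)
    return match.group(1) if match else None

emacs_style = declared_encoding('# -*- coding: latin-1 -*-')
vim_style = declared_encoding('# vim: set fileencoding=cp1252 :')
spaced = declared_encoding('# coding = cp1252')   # space before '='

print(emacs_style, vim_style, spaced)
```

If strict PEP 263 behaviour matters, a pattern like this one could replace the one in open_detect, with the test files written as "# coding: cp1252" instead.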
I examined the sources of tokenizer.c (thanks to @Ninefingers for suggesting this in another answer and giving a link to the source browser). It seems that the exact algorithm used by Python is (equivalent to) the following. In various places I'll describe the algorithm as reading byte by byte; obviously one wants to do something buffered in practice, but it's easier to describe this way. The initial part of the file is processed as follows:
1. Upon opening a file, attempt to recognize the UTF-8 BOM at the beginning of the file. If you see it, eat it and make a note of the fact that you saw it. Do not recognize the UTF-16 byte order mark.
2. Read 'a line' of text from the file. 'A line' is defined as follows: you keep reading bytes until you see one of the strings '\n', '\r' or '\r\n' (trying to match as long a string as possible; this means that if you see '\r' you have to speculatively read the next character, and if it's not a '\n', put it back). The terminator is included in the line, as is usual Python practice.
3. Decode this string using the UTF-8 codec. Unless you have seen the UTF-8 BOM, generate an error message if you see any non-ASCII characters (i.e. any characters above 127). (Python 3.0 does not, of course, generate an error here.) Pass this decoded line on to the user for processing.
4. Attempt to interpret this line as a comment containing a coding declaration, using the regexp in PEP 0263. If you find a coding declaration, skip to the instructions below for 'I found a coding declaration'.
5. OK, so you didn't find a coding declaration. Read another line from the input, using the same rules as in step 2 above.
6. Decode it, using the same rules as step 3, and pass it on to the user for processing.
7. Attempt again to interpret this line as a coding declaration comment, as in step 4. If you find one, skip to the instructions below for 'I found a coding declaration'.
8. OK. We've now checked the first two lines. According to PEP 0263, if there was going to be a coding declaration, it would have been on the first two lines, so we now know we're not going to see one. We now read the rest of the file using the same instructions as we used for the first two lines: we read the lines using the rules in step 2 and decode using the rules in step 3 (raising an error if we see non-ASCII bytes unless we saw a BOM).
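The first-two-lines check described above can be sketched in Python 3 roughly as follows. This is a simplified illustration under stated assumptions, not a port of tokenizer.c: the function name sniff_encoding is hypothetical, the regular expression is a bytes version of the one used by Python 3's tokenize module, lines are split on '\n' only (via readline, so a lone '\r' terminator is not handled), and the non-ASCII check from the decoding step is left out.

```python
import codecs
import io
import re

# Bytes version of the PEP 263 declaration pattern (an assumption: taken
# from Python 3's tokenize module rather than from tokenizer.c).
COOKIE_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def sniff_encoding(stream):
    """Return (declared_or_default_encoding, saw_bom) for a binary stream.

    The stream is left positioned at the start of the text, just past a
    UTF-8 BOM if there was one.
    """
    saw_bom = stream.read(3) == codecs.BOM_UTF8
    if not saw_bom:
        stream.seek(0)
    body_start = stream.tell()

    declared = None
    for _ in range(2):                 # only the first two lines count
        match = COOKIE_RE.match(stream.readline())
        if match:
            declared = match.group(1).decode('ascii')
            break

    stream.seek(body_start)
    return declared or 'utf-8', saw_bom

sample = io.BytesIO(b'#!/usr/bin/env python\n# coding=cp1252\nbody\n')
print(sniff_encoding(sample))
```

A caller would then apply the 'I found a coding declaration' rules to the returned name before decoding the rest of the stream.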
Now the rules for what to do when 'I found a coding declaration':
- If we previously saw a UTF-8 BOM, check that the coding declaration says 'utf-8' in some form. Throw an error otherwise. ('utf-8' in some form means anything which, after converting to lower case and converting underscores to hyphens, is either the literal string 'utf-8' or something beginning with 'utf-8-'.)
- Read the rest of the file using the decoder associated with the given encoding in the Python codecs module. In particular, the division of the rest of the bytes in the file into lines is the job of the new encoding.
- One final wrinkle: universal newline type stuff. The rules here are as follows. If the encoding is anything except 'utf-8' in some form or 'latin-1' in some form, do no universal-newline stuff at all; just pass out lines exactly as they come from the decoder in the codecs module. On the other hand, if the encoding is 'utf-8' in some form or 'latin-1' in some form, transform lines ending '\r' or '\r\n' into lines ending '\n'. ('utf-8' in some form means the same as before. 'latin-1' in some form means anything which, after converting to lower case and converting underscores to hyphens, is one of the literal strings 'latin-1', 'iso-latin-1' or 'iso-8859-1', or any string beginning with one of 'latin-1-', 'iso-latin-1-' or 'iso-8859-1-'.)
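The name matching and newline rule above might be sketched like this in Python 3. The function names are hypothetical, and the lists of accepted spellings are modeled on the private helper _get_normal_name in Python 3's tokenize module rather than copied from tokenizer.c, so treat them as an assumption to verify against the C source.

```python
def normalize_encoding_name(name):
    """Map 'utf-8'/'latin-1' spellings to a canonical name, else return name."""
    key = name.lower().replace('_', '-')
    if key == 'utf-8' or key.startswith('utf-8-'):
        return 'utf-8'
    latin = ('latin-1', 'iso-8859-1', 'iso-latin-1')
    if key in latin or key.startswith(tuple(s + '-' for s in latin)):
        return 'iso-8859-1'
    return name

def translate_line_ending(line, encoding):
    """Apply the newline rule: only utf-8 and latin-1 get '\\r' -> '\\n'."""
    if normalize_encoding_name(encoding) not in ('utf-8', 'iso-8859-1'):
        return line                      # other codecs: pass through as-is
    if line.endswith('\r\n'):
        return line[:-2] + '\n'
    if line.endswith('\r'):
        return line[:-1] + '\n'
    return line

print(normalize_encoding_name('UTF_8'), normalize_encoding_name('Latin-1'))
print(repr(translate_line_ending('abc\r\n', 'utf_8')))
print(repr(translate_line_ending('abc\r', 'cp1252')))
```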
For what I'm doing, fidelity to Python's behaviour is important. My plan is to roll an implementation of the algorithm above in Python, and use this. Thanks to everyone who answered!
From said PEP (0263):
Python's tokenizer/compiler combo will need to be updated to work as follows:
1. read the file
2. decode it into Unicode assuming a fixed per-file encoding
3. convert it into a UTF-8 byte string
4. tokenize the UTF-8 content
5. compile it, creating Unicode objects from the given Unicode data and creating string objects from the Unicode literal data by first reencoding the UTF-8 data into 8-bit string data using the given file encoding
Indeed, if you check Parser/tokenizer.c in the Python source you'll find the functions get_coding_spec and check_coding_spec, which are responsible for finding this information on a line being examined in decoding_fgets.
It doesn't look like this capability is exposed anywhere as a Python API (at least these specific functions aren't Py-prefixed), so your options are a third-party library and/or repurposing these functions as an extension. I don't personally know of any third-party libraries, and I can't see this functionality in the standard library either.
Starting from Python 3.4 there is a function which allows you to do what you're asking for: importlib.util.decode_source.
According to documentation:
Decode the given bytes representing source code and return it as a string with universal newlines (as required by importlib.abc.InspectLoader.get_source()).
Brett Cannon talks about this function in his talk From Source to Code: How CPython's Compiler Works.
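A minimal usage sketch of importlib.util.decode_source (Python 3.4 or later), assuming the file's bytes are already in memory; the sample content is made up for illustration:

```python
import importlib.util

# Bytes of a latin-1 file that declares its encoding and uses \r\n endings.
raw = '# -*- coding: latin-1 -*-\r\ncaf\u00e9\r\n'.encode('latin-1')

# The declaration is honoured and line endings are normalized to '\n'.
text = importlib.util.decode_source(raw)
print(repr(text))
```

Note that this takes the whole content as bytes and returns one string, so it suits files small enough to read in full.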
There is support for this in the standard library, even in Python 2. Here is code you can use:
    try:
        # Python 3
        from tokenize import open as open_with_encoding_check
    except ImportError:
        # Python 2
        from lib2to3.pgen2.tokenize import detect_encoding
        import io

        def open_with_encoding_check(filename):
            """Open a file in read only mode using the encoding
            detected by detect_encoding().
            """
            fp = io.open(filename, 'rb')
            try:
                encoding, lines = detect_encoding(fp.readline)
                fp.seek(0)
                text = io.TextIOWrapper(fp, encoding, line_buffering=True)
                text.mode = 'r'
                return text
            except:
                fp.close()
                raise
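As a quick check of the Python 3 branch, the following sketch writes a temporary cp1252 file with a declaration and reads it back through tokenize.detect_encoding and tokenize.open. The file content and variable names are made up for illustration.

```python
import os
import tempfile
import tokenize

# Write a small cp1252-encoded file that declares its own encoding.
payload = '# -*- coding: cp1252 -*-\n\u201chello\u201d\n'
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(payload.encode('cp1252'))
    path = tmp.name

try:
    # detect_encoding reads at most two lines and returns what it consumed.
    with open(path, 'rb') as raw:
        encoding, first_lines = tokenize.detect_encoding(raw.readline)

    # tokenize.open returns a read-only text file using that encoding.
    with tokenize.open(path) as f:
        text = f.read()
finally:
    os.remove(path)

print(encoding, first_lines, repr(text))
```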
Then personally I needed to parse and compile this source. In Python 2 it's an error to compile unicode text that includes an encoding declaration, so lines containing the declaration have to be made blank (not removed, as this changes line numbers) first. So I also made this function:
    def read_source_file(filename):
        from lib2to3.pgen2.tokenize import cookie_re

        with open_with_encoding_check(filename) as f:
            return ''.join([
                '\n' if i < 2 and cookie_re.match(line)
                else line
                for i, line in enumerate(f)
            ])
I'm using these in my package, the latest source (in case I find I need to change them) can be found here, while tests are here.