Unicode (UTF-8) reading and writing to files in Python

谎友^ 2020-11-22 17:10

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1


        
14 answers
  •  野趣味 (original poster)
    2020-11-22 17:22

    You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?

    Answer: You can't, unless the file format provides for this. An XML document, for example, can begin with a declaration that names its encoding:

    <?xml version="1.0" encoding="UTF-8"?>

    This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
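As a rough sketch of what "telling Python the encoding" looks like: the example below is written for Python 3, where the built-in open() accepts an encoding argument and plays the role that codecs.open(path, mode, encoding) plays in Python 2. The scratch directory and the file name f1.txt are made up for the illustration.

```python
import os
import tempfile

text = "Capit\xe1n\n"  # "Capitán" plus a newline, written with an escape

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp, "f1.txt")

    # Writing: the text is encoded to UTF-8 bytes on the way out.
    # newline="" turns off newline translation so the bytes are predictable.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Reading with the same encoding gives the original text back.
    with open(path, "r", encoding="utf-8", newline="") as f:
        round_trip = f.read()

    # Reading in binary mode shows the bytes that are really on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(repr(round_trip))
print(repr(raw))
```

The a-acute is one character in the text but two bytes (c3 a1) in the file. Nothing in the file itself records that those bytes are UTF-8, which is why the encoding has to be passed in explicitly.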

    As for your editor, you must check if it offers some way to set the encoding of a file.

    The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
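To make the "21-bit characters as an 8-bit stream" point concrete, here is a small sketch (Python 3 syntax; the sample characters are my own choice) that encodes a few code points and counts the bytes each one needs in UTF-8.

```python
# Sample code points: ASCII letter, a-acute, euro sign, an emoji.
samples = ["A", "\xe1", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print("U+%04X -> %d byte(s)" % (ord(ch), n))

# lengths is [1, 2, 3, 4]
```

ASCII characters stay at one byte, which is why plain ASCII text is also valid UTF-8. Everything else becomes a sequence of two to four bytes, each with its high bit set.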

    The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything with a character code >= 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented letter á itself and save the file).
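The difference between the escaped display and the actual content can be sketched like this (Python 3 syntax; ascii() is used here to force the escaped form, since a Python 3 console will often print the character directly).

```python
s = "Capit\xe1n"          # seven characters, one of them is a-acute
shown = ascii(s)          # the escaped form an ASCII-only display would show

# Typing the escape into a file literally gives different data:
# a backslash, an "x", an "e" and a "1" instead of one character.
typed_literally = r"Capit\xe1n"

print(shown)
print(len(s), len(typed_literally))
```

Here len(s) is 7 while len(typed_literally) is 10, so a file containing the typed escape is not the same text as one containing the real character.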

    That said, you can use the Python function eval() to turn an escaped string into a string:

    >>> x = eval("'Capit\\xc3\\xa1n\\n'")
    >>> x
    'Capit\xc3\xa1n\n'
    >>> x[5]
    '\xc3'
    >>> len(x[5])
    1
    

    As you can see, the escape sequence "\xc3" has been turned into a single byte. x is now an 8-bit string, UTF-8 encoded. To get Unicode:

    >>> x.decode('utf-8')
    u'Capit\xe1n\n'
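For readers on Python 3, where the 8-bit string type is called bytes, a roughly equivalent sketch needs no eval() at all. It uses a bytes literal in place of the Python 2 str shown above.

```python
x = b"Capit\xc3\xa1n\n"    # 8-bit data, UTF-8 encoded

first_byte = x[5:6]        # slicing keeps the bytes type: b'\xc3'
text = x.decode("utf-8")   # bytes -> text (Unicode)

print(repr(first_byte), len(first_byte))
print(repr(text))
print(len(x), len(text))   # 9 bytes, 8 characters
```

The two bytes c3 a1 collapse into the single character U+00E1 once decoded, which is why the byte count and the character count differ by one.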
    

    Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

    0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n
    

    codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?

    Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special", which is what the sequence "\x" does. It says: the next two hex digits are the code of a single character. "\u" does the same using four hex digits to encode Unicode up to 0xFFFF (65535).

    So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
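The options in the paragraph above can be sketched as follows (Python 3 syntax). The unicode_escape codec is one way to produce string escapes. It is my choice for the illustration, not something the question used.

```python
text = "Capit\xe1n"

# Option 0: plain ASCII cannot hold the a-acute at all.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: string escapes. The result is pure ASCII: the character
# becomes the four ASCII characters backslash, x, e, 1.
escaped = text.encode("unicode_escape")
restored = escaped.decode("unicode_escape")

# Option 2: UTF-8. Shorter, but needs an 8-bit safe stream.
utf8 = text.encode("utf-8")

print(ascii_failed, escaped, utf8)
```

Whoever reads the escaped file has to know it contains escapes and undo them. The convention is no more self-describing than the encoding was.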

    Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
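The 'string-escape' codec only exists in Python 2. As a hedged sketch of the same idea in Python 3, one known idiom is to go through unicode_escape and latin-1. This assumes the file holds only ASCII characters plus \x escapes, as f2 does, and it says nothing about the memory cost mentioned above.

```python
import binascii

# The 14 bytes of f2 as shown in the hex dump: literal backslashes, not UTF-8.
f2_bytes = b"Capit\\xc3\\xa1n"
hex_dump = binascii.hexlify(f2_bytes)

# Step 1: interpret the \x escapes; each becomes one code point below 256.
step1 = f2_bytes.decode("unicode_escape")
# Step 2: map those code points back to single bytes (latin-1 is 1:1).
step2 = step1.encode("latin-1")
# Step 3: now the bytes really are UTF-8 and can be decoded as such.
text = step2.decode("utf-8")

print(hex_dump)
print(repr(step2), repr(text))
```

The three steps mirror what decode('string-escape') followed by decode('utf-8') does in Python 2: first undo the escapes, then undo the UTF-8 encoding.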

    Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
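A short sketch of "the bytes have no meaning until you pick an encoding" (Python 3 syntax; latin-1 is just an arbitrary second encoding chosen for contrast):

```python
raw = b"\xc3\xa1"

as_utf8 = raw.decode("utf-8")       # one character: U+00E1, a-acute
as_latin1 = raw.decode("latin-1")   # two characters: U+00C3 and U+00A1

# "65 means 'A'" is itself a convention, here the ASCII one.
letter = bytearray([65]).decode("ascii")

print(len(as_utf8), len(as_latin1), letter)
```

Both decodes succeed without any error, and only one of them gives the text the writer intended. That is why a wrong guess about the encoding tends to show up as garbled characters rather than as an exception.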
