Converting from utf-16 to utf-8 in Python 3

℡╲_俬逩灬. Submitted on 2019-12-19 11:24:54

Question


I'm programming in Python 3 and I've run into a small problem that I can't find any reference to on the net.

As far as I understand, the default string encoding is utf-16, but I must work with utf-8, and I can't find the command that converts from the default one to utf-8. I'd appreciate your help very much.


Answer 1:


In Python 3 there are two different datatypes that matter when you are working with string manipulation. First there is the str class, an object that represents Unicode code points. The important thing to understand is that a str is not a bunch of bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing a string stored in some encoding (like utf-8 or iso-8859-15).
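To make the difference concrete, here is a minimal sketch (the variable names text and data are just illustrative) that shows the same word once as a str and once as bytes:

```python
text = 'ćma'                    # str: a sequence of three code points
data = text.encode('utf-8')     # bytes: the same text encoded as UTF-8

# The str has 3 characters, but 'ć' needs two bytes in UTF-8,
# so the bytes object has 4 elements.
print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(data)
```

The lengths differ because len() counts code points for a str and individual bytes for a bytes object.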

What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's make a program that replaces every 'ć' with 'ç':

def main():
    # First open the output file. Passing an encoding tells Python that whatever we print to the file should be encoded as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Open the input file with an encoding as well, so every line we read comes back as a Unicode string.
        with open('input_file', encoding='utf-8') as in_file:
            for line in in_file:
                # Replace the characters we want. A string literal in Python 3 is automatically a Unicode string, so no worries about encoding there.
                # Because out_file was opened with the utf-8 encoding, print() encodes the whole string to utf-8.
                # end='' avoids a doubled newline, since each line already ends with one.
                print(line.replace('ć', 'ç'), file=out_file, end='')

So when should you use bytes? Not often. One example I can think of is reading something from a socket. If you have the data in a bytes object, you can turn it into a Unicode string by calling its decode('encoding') method, and vice versa with str.encode('encoding'). But as said, you probably won't need it.
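A small sketch of that round trip, assuming the bytes arrived from something like a socket (here they are simply hard-coded, and the names received and to_send are made up for illustration):

```python
# Bytes as they might arrive from a socket; hard-coded for this sketch.
received = b'\xc4\x87ma'              # UTF-8 encoding of 'ćma'

text = received.decode('utf-8')       # bytes -> str
text = text.replace('ć', 'ç')         # work with characters, not bytes
to_send = text.encode('utf-8')        # str -> bytes, ready to send back

print(to_send)
```

The replacement happens on the str in the middle; decoding and encoding only occur at the boundaries.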

Still, because it is interesting, here is the hard way, where you encode everything yourself:

def main():
    # Open the file in binary mode, so we write bytes to it instead of strings
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open the file in binary mode, so we get bytes
        with open('input_file', 'rb') as in_file:
            for line_bytes in in_file:
                # Convert the bytes to a string
                line_string = line_bytes.decode('utf-8')
                # Replace the characters we want
                line_string = line_string.replace('ć', 'ç')
                # Encode the string back into bytes
                out_bytes = line_string.encode('utf-8')
                # Write the bytes; print() would write their repr (b'...') instead of the raw data
                out_file.write(out_bytes)

A good read on this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Highly recommended!

Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(P.S. As you can see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but that is totally irrelevant. Once you are working with a string, you work with characters (code points), not bytes.)
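If the data on disk really is utf-16 and you need utf-8, the same idea applies: decode on the way in, encode on the way out. The following is only a sketch under that assumption; convert_utf16_to_utf8 and the file names are hypothetical and not part of the answer above.

```python
import os
import tempfile

def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8 (hypothetical helper)."""
    # newline='' leaves the line endings exactly as they are in the source.
    with open(src_path, encoding='utf-16', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:      # each line is a str, independent of any encoding
            dst.write(line)

# Usage: create a small UTF-16 file, convert it, and inspect the raw result.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in_utf16.txt')
    dst = os.path.join(tmp, 'out_utf8.txt')
    with open(src, 'w', encoding='utf-16', newline='') as f:
        f.write('ćma\n')
    convert_utf16_to_utf8(src, dst)
    with open(dst, 'rb') as f:
        result = f.read()

print(result)
```

Between the two open() calls the text is just a str, which is why no explicit "utf-16 to utf-8" command is needed.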



Source: https://stackoverflow.com/questions/3140010/converting-from-utf-16-to-utf-8-in-python-3
