Python - read text file with weird utf-16 format

走了就别回头了 2021-01-17 17:00

I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual:

file = open('data.txt','r')

lines = file.         


        
4 answers
  •  刺人心 (OP)
    2021-01-17 17:33

    I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

    In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
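A quick way to see this byte layout for yourself, as a minimal Python 3 sketch (the sample string is made up for illustration):

```python
# Encode a short ASCII-only string both ways and compare the raw bytes.
text = "abc"
ascii_bytes = text.encode("ascii")
utf16_bytes = text.encode("utf-16-le")

print(ascii_bytes)  # b'abc'
print(utf16_bytes)  # b'a\x00b\x00c\x00'

# Taking every second byte of the UTF-16-LE data recovers the ASCII bytes.
assert utf16_bytes[::2] == ascii_bytes
# Decoding with the matching codec restores the original text.
assert utf16_bytes.decode("utf-16-le") == text
```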

    To fix this, just decode the data:

    # Python 2: line is a byte string read from the file
    print line.decode('utf-16-le').split()
    

    Or do the same thing at the file level with the io or codecs module:

    import io
    file = io.open('data.txt', 'r', encoding='utf-16-le')
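In Python 3, the built-in open() accepts the same encoding argument, so the file-level approach might look like the sketch below. The helper name read_utf16_lines and the sample file contents are hypothetical, and the sketch assumes the file really is UTF-16-LE, which the question does not confirm.

```python
import os
import tempfile


def read_utf16_lines(path, encoding="utf-16-le"):
    """Return each line of the file as a list of whitespace-separated fields."""
    with open(path, "r", encoding=encoding) as f:
        return [line.split() for line in f]


# Build a small UTF-16-LE sample file (made-up contents) so the sketch runs as is.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "data.txt")
with open(sample_path, "wb") as f:
    f.write("a b c\nd e f\n".encode("utf-16-le"))

rows = read_utf16_lines(sample_path)
print(rows)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

If the file starts with a byte order mark, passing encoding='utf-16' consumes it, whereas 'utf-16-le' leaves it in the text as a leading '\ufeff' character.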
    

    * This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
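For the curious, the footnote can be checked with a few lines of Python 3. The two sample characters are arbitrary picks, one inside the BMP and one outside it:

```python
bmp_char = "\u00e9"          # U+00E9, inside the BMP
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

bmp_bytes = bmp_char.encode("utf-16-le")
non_bmp_bytes = non_bmp_char.encode("utf-16-le")

print(len(bmp_bytes))       # 2
print(len(non_bmp_bytes))   # 4

# The four bytes are the surrogates U+D83D and U+DE00, each stored little-endian.
print(non_bmp_bytes.hex())  # 3dd800de
```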
