How to decode unicode in a Chinese text

情书的邮戳 2021-01-01 05:49
with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    pri         


        
4 Answers
  •  攒了一身酷
    2021-01-01 06:26

    data is a bytestring (the str type on Python 2). Your loop iterates over it one byte at a time, but a non-ASCII character may take more than one byte in UTF-8, so each i can be a fragment of a character rather than a whole one.
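
    A small sketch of that byte/character mismatch, assuming the file is UTF-8 encoded. The two-character sample string is made up for illustration and is not taken from result.txt; it is written with \u escapes so the snippet should run on both Python 2 and Python 3:

```python
# Hypothetical sample: two Chinese characters, U+4E2D and U+6587.
text = u"\u4e2d\u6587"

# Encoding yields the bytes that would be stored in a UTF-8 file.
raw = text.encode("utf-8")

char_count = len(text)  # number of characters (code points)
byte_count = len(raw)   # number of bytes

print(char_count, byte_count)
```

    Each of these characters takes three bytes in UTF-8, so a loop over raw visits six items while a loop over text visits two.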

    Don't call .encode() on bytes:

    $ python2
    >>> '\xe3'.encode('utf-8') #XXX don't do it
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
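
    The reason for the error: on Python 2, calling .encode() on a str makes Python first decode the bytes implicitly with the ascii codec, and that hidden step is what fails. Going from bytes to text is a .decode() call instead. A minimal sketch, assuming the bytes are UTF-8 (the sample bytes are illustrative, not from result.txt):

```python
# Illustrative input: the three bytes that encode U+4E2D in UTF-8.
raw = b"\xe4\xb8\xad"

# bytes -> text: decode with the encoding the data was written in.
text = raw.decode("utf-8")

# text -> bytes: encode is the reverse direction and belongs on text only.
encoded_again = text.encode("utf-8")

print(repr(text))
```

    The three bytes decode to a single character, and encoding that character gives back the original bytes.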
    

    You wrote: "I am trying to read the file and split the words by space and save them into a list."

    To work with Unicode text in Python 2, use the unicode type. io.open() decodes the file for you as you read it; the code below collects all whitespace-separated words into a list:

    #!/usr/bin/env python
    import io
    
    with io.open('result.txt', encoding='utf-8') as file:
        words = [word for line in file for word in line.split()]
    print "\n".join(words)
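
    For comparison, a hedged Python 3 sketch of the same idea. It assumes the file is UTF-8; read_words is a hypothetical helper name, and the sample text is written to a temporary file only so the snippet is self-contained:

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return all whitespace-separated words in a text file (hypothetical helper)."""
    # In Python 3, the built-in open() decodes bytes to str using `encoding`.
    with open(path, encoding=encoding) as f:
        return [word for line in f for word in line.split()]


# Illustrative sample, not the asker's data.
sample = "\u4f60\u597d \u4e16\u754c\nhello world\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    words = read_words(path)
finally:
    os.remove(path)

print(len(words))
```

    Note that split() with no argument splits on any run of whitespace, not only on the space character, which is usually what is wanted when collecting words.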
    
