How to decode unicode in a Chinese text


情书的邮戳 2021-01-01 05:49

with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    pri


      
      
        
4 answers

爱一瞬间的悲伤 (original poster) 2021-01-01 06:35
              

            
            
                        
Let me give you some hints:


You'll need to decode the bytes you read from UTF-8 into Unicode before you try to iterate over the words.
When you read a file, you won't get Unicode back. You'll just get plain bytes. (I think you knew that, since you're already using decode().)
There is a standard function to "split by space" called split().
When you say for i in data, you're saying you want to iterate over every byte of the file you just read. Each iteration of your loop yields a single byte, not a whole character. I'm not sure if that's what you want, because that would mean you'd have to do the UTF-8 decoding by hand, rather than using decode(), which should be given the entire UTF-8 string.
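To make the byte-versus-character point concrete, here is a small sketch written for Python 3 (the sample text and variable names are illustrative assumptions, not taken from the question). In Python 3, iterating over bytes yields integers rather than one-character strings, but the underlying issue is the same: a single byte is not a character.

```python
# Python 3 sketch; the sample string is an assumption chosen for illustration.
text = "中文 文本"                 # 5 code points: 4 CJK characters and a space
raw = text.encode("utf-8")        # what a file opened in binary mode would give you

# Iterating over bytes visits every byte, not every character.
byte_values = list(raw)

# Iterating over the decoded string visits one character per step.
characters = list(raw.decode("utf-8"))

print(len(byte_values))   # number of bytes
print(len(characters))    # number of characters

# Decoding a slice that cuts a character in half fails, which is why
# decode() should be given complete UTF-8 sequences.
try:
    raw[:1].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
print(partial_ok)
```

Decoding once, up front, and then looping over the resulting string avoids having to reassemble multi-byte characters yourself.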


In other words, here's one line of code that would do it:

open('file.txt').read().decode('utf-8').split()
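For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the file is UTF-8 encoded; the helper name read_words and the temporary sample file are made up for this example so that it can run on its own.

```python
import os
import tempfile

def read_words(path):
    """Read a UTF-8 text file and split it on whitespace (hypothetical helper)."""
    # Passing encoding="utf-8" makes open() do the decoding, so read() returns str.
    with open(path, encoding="utf-8") as f:
        return f.read().split()

# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("你好 世界\nhello world\n")
    sample_path = tmp.name

words = read_words(sample_path)
os.remove(sample_path)   # clean up the temporary file
print(words)
```

Using a with block also closes the file promptly, which the one-liner leaves to the garbage collector.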


If this is homework, please don't turn that in. Your teacher will be onto you. ;-)



Edit: Here's an example of how to encode and decode Unicode characters in Python:

>>> data = u"わかりません"
>>> data
u'\u308f\u304b\u308a\u307e\u305b\u3093'
>>> data_you_would_see_in_a_file = data.encode('utf-8')
>>> data_you_would_see_in_a_file
'\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
>>> for each_unicode_character in data_you_would_see_in_a_file.decode('utf-8'):
...     print each_unicode_character
... 
わ
か
り
ま
せ
ん


The first thing to note is that Python (well, at least Python 2) uses the u"" notation (note the u prefix) on string constants to show that they are Unicode. In Python 3, strings are Unicode by default, but you can use b"" if you want a byte string.

As you can see, the Unicode string is a sequence of code points (each shown above as a \uXXXX escape), and each of these characters takes three bytes once encoded as UTF-8. When you read the file, you get a string of one-byte characters (which is equivalent to what you get when you call .encode()). So if you have bytes from a file, you must call .decode() to convert them back into Unicode. Then you can iterate over each character.
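The same round trip can be sketched in Python 3, where str is the Unicode type and bytes is the raw type. This is only an illustration, and the variable names are assumptions; it prints the length of each form so the difference between characters and bytes is visible.

```python
# Python 3 sketch of the encode/decode round trip.
text = "わかりません"             # str: a sequence of Unicode code points
encoded = text.encode("utf-8")    # bytes: what is actually stored in a file

print(type(text).__name__, len(text))
print(type(encoded).__name__, len(encoded))

# Each of these hiragana characters takes three bytes in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
print(bytes_per_char)

# decode() reverses encode() and gives back the original string.
round_trip = encoded.decode("utf-8")
print(round_trip == text)

# A bytes literal uses the b prefix and must be decoded to become str.
ascii_text = b"abc".decode("ascii")
print(ascii_text == "abc")
```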

Splitting "by space" works differently from language to language, since many languages (for example, Chinese and Japanese) do not use the ' ' character to separate words the way most European languages do. I don't know how to do that in Python off the top of my head, but I'm sure there is a way.
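One crude way to approximate this with only the standard library is sketched below for Python 3. It is not real word segmentation: it assumes every character in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) is its own token and splits everything else on whitespace. The names TOKEN_PATTERN and rough_tokens are made up for this example, and proper Chinese or Japanese word boundaries would need a dedicated segmentation library.

```python
import re

# Assumption: each character in U+4E00..U+9FFF is treated as one token;
# any other run of non-whitespace characters is kept together.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")

def rough_tokens(text):
    """Return a rough token list for mixed CJK and space-separated text."""
    return TOKEN_PATTERN.findall(text)

tokens = rough_tokens("我爱 Python 编程")
print(tokens)
```

Per-character tokens are often enough for counting or indexing, but they will split multi-character words apart.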
    
             
                                                        
            
            
              
                