Decoding unknown encoded Traditional Chinese character strings using Python

前端未结


Hi, I have a website in Traditional Chinese. When I check the site statistics, they tell me that the search term for the website is å%8f°å%8d%97 è¦ªå%90é¤%90å»³


                      
2 Answers

        
                         				            
            
           
            
                              
                
              
              
                
眼角桃花, 2021-01-25 14:42
              
            
            
                                                                       
You can use chardet. Install the library with:

pip install chardet
# or for python3
pip3 install chardet


The library includes a CLI utility, chardetect (or chardetect3 accordingly), that takes the path to a file and reports the encoding it detects.
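chardet can also be called from Python instead of through the CLI. The following is a minimal sketch, assuming chardet is installed; `guess_encoding` and `guess_file_encoding` are hypothetical helper names, not part of the library. `chardet.detect` returns a best guess with a confidence score rather than a definitive answer, so short inputs may be misidentified.

```python
def guess_encoding(raw: bytes) -> dict:
    """Return chardet's best guess for the encoding of raw bytes.

    Requires the third-party package (pip install chardet). The import is
    done lazily so this module still loads when chardet is missing.
    """
    import chardet
    # detect() returns a dict that includes 'encoding' and 'confidence' keys
    return chardet.detect(raw)


def guess_file_encoding(path: str) -> dict:
    """Read a file as bytes (no decoding) and guess its encoding."""
    with open(path, "rb") as f:
        return guess_encoding(f.read())
```

For example, `guess_file_encoding('myfile.txt')['encoding']` gives a codec name (or `None` if nothing matched) that can be passed on to the decoding step, provided the reported confidence is high enough to trust.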

Once you know the encoding, you can use it in Python, for example like this:

import codecs
f = codecs.open('myfile.txt', 'r', 'GB2312')


or from shell:

iconv -f GB2312 -t UTF-8 myfile.txt -o decoded.txt
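The same conversion can also be sketched in pure Python with the standard library. This assumes the source encoding is already known; `transcode_file` is a hypothetical helper name, and the file names are the ones used in the commands above.

```python
def transcode_file(src_path: str, dst_path: str,
                   src_encoding: str, dst_encoding: str = "utf-8") -> None:
    """Re-encode a text file, like `iconv -f src_encoding -t dst_encoding`."""
    # newline="" keeps line endings exactly as they are in the source file
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

Calling `transcode_file('myfile.txt', 'decoded.txt', 'GB2312')` would mirror the iconv command. A `UnicodeDecodeError` at this point usually means the assumed source encoding was wrong.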


If you need more performance, there is also cchardet, a faster C-optimized version of chardet.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
暗喜, 2021-01-25 14:55
              
            
            
                                                                       
This is called a mutt encoding, also known as mojibake: the underlying bytes have been mangled beyond their original meaning, so the text is no longer in any real encoding.

The text was once URL-quoted UTF-8, but it has since been interpreted as Latin-1 without those URL escapes being unquoted first. I was able to un-mangle it by reversing those steps:

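>>> from urllib.parse import unquote_to_bytes  # Python 3; on Python 2 use: from urllib2 import unquote
>>> # \xad is an invisible soft hyphen (a byte of 子) that is easily lost when the string is copied
>>> bytesquoted = 'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
>>> unquoted = unquote_to_bytes(bytesquoted)
>>> print(unquoted.decode('utf8'))
台南 親子餐廳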
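The steps above can be wrapped in a small reusable function. This is a sketch for Python 3, not the original poster's code: `unmangle` and `mangle` are hypothetical names, and `mangle` rests on the assumption (read off the sample string) that only bytes in the range 0x80-0x9F were left as `%XX` escapes while all other bytes were shown as Latin-1 characters.

```python
from urllib.parse import unquote_to_bytes


def unmangle(text: str) -> str:
    """Undo 'URL-quoted UTF-8 that was displayed as Latin-1'."""
    raw = text.encode("latin-1")   # each character back to the byte it came from
    raw = unquote_to_bytes(raw)    # resolve the %XX escapes that were left behind
    return raw.decode("utf-8")     # the bytes are now plain UTF-8


def mangle(text: str) -> str:
    """Hypothetical helper: reproduce the damage, for testing only.

    Assumption: bytes 0x80-0x9F stay as %XX escapes, as in the sample,
    and every other byte is shown as its Latin-1 character.
    """
    return "".join(
        "%{:02x}".format(b) if 0x80 <= b <= 0x9F else chr(b)
        for b in text.encode("utf-8")
    )


original = "台南 親子餐廳"
# Round trip: damaging the text and then repairing it returns the original
assert unmangle(mangle(original)) == original
```

`unmangle` raises `UnicodeEncodeError` if the input contains characters outside Latin-1, and `UnicodeDecodeError` if the recovered bytes are not valid UTF-8 (for example when an invisible character was dropped in transit), which is a useful signal that the text was damaged in some other way.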

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         