Encoding in python with lxml - complex solution

忘了有多久 2020-12-13 16:22

I need to download and parse a web page with lxml and build UTF-8 XML output. I think a pseudocode sketch is more illustrative:

from lxml import etree

webfile

2 Answers

一向 (original poster) 2020-12-13 16:50

lxml can be a little wonky about input encodings. It is best to send UTF-8 in and get UTF-8 out.

You might want to use the chardet module or UnicodeDammit to decode the actual data.

You'd want to do something vaguely like:

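import chardet
from lxml import html
from urllib.request import urlopen  # urllib2.urlopen on Python 2

# url: address of the page to download
content = urlopen(url).read()
# chardet guesses the encoding from the raw bytes; the guess can be None
encoding = chardet.detect(content)['encoding']
if encoding and encoding.lower() != 'utf-8':
    # Re-encode as UTF-8, replacing bytes that cannot be decoded
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)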
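
The re-encoding step can also be pulled out into a small helper so it can be tried without a network connection. This is only a sketch: the name to_utf8 and the sample bytes are made up for illustration, and the encoding argument stands in for whatever chardet (or any other detector) reported.

```python
def to_utf8(content, encoding):
    """Return raw bytes re-encoded as UTF-8.

    `encoding` is the detected source encoding, or None if detection failed.
    """
    if not encoding or encoding.lower() in ("utf-8", "ascii"):
        # Already UTF-8 compatible, or there is no guess to work with
        return content
    # Undecodable bytes become U+FFFD instead of raising an error
    return content.decode(encoding, "replace").encode("utf-8")


# Hypothetical sample: a Latin-1 encoded fragment containing "café"
latin1_page = "<p>caf\u00e9</p>".encode("latin-1")
utf8_page = to_utf8(latin1_page, "ISO-8859-1")
```

Leaving the bytes untouched when the detector returns nothing is a judgment call; raising an error there would be just as defensible.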


I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?
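
For the UTF-8 XML output the question asks for, something along these lines might work. It is a rough sketch, not the asker's actual code: the sample markup is invented, and the parser is told the encoding explicitly so nothing depends on lxml's own guess. Elements produced by lxml.html are ordinary lxml.etree elements, so etree.tostring accepts them directly.

```python
from lxml import etree, html

# Hypothetical sample input: UTF-8 bytes, as produced by the re-encoding step
content = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

# State the input encoding explicitly rather than relying on detection
parser = html.HTMLParser(encoding="utf-8")
doc = html.fromstring(content, parser=parser)

# Serialize the parsed tree as UTF-8 encoded XML bytes with an XML declaration
xml_bytes = etree.tostring(doc, encoding="utf-8", xml_declaration=True)
```

Note that xml_bytes is a byte string, so it should be written to a file opened in binary mode.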
    
             
                                                        
            
            
              
                