Python ElementTree won't convert non-breaking spaces when using UTF-8 for output


I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as E


                      
5 answers

抹茶落季, 2021-02-20 15:13

Your &nbsp; is being converted to '\xa0', which is the Latin-1 (ISO-8859-1) encoding of a non-breaking space, not an ASCII character (the UTF-8 encoding is '\xc2\xa0'). The line

'\xa0'.encode('utf-8')


results in a UnicodeDecodeError, because Python 2 first decodes the byte string implicitly with the default codec, ascii, which only covers code points 0 to 127, and ord('\xa0') = 160. Setting the default encoding to something else, e.g.:

import sys
reload(sys)  # must reload sys to use 'setdefaultencoding'
sys.setdefaultencoding('latin-1')

print '\xa0'.encode('utf-8', "xmlcharrefreplace")


should solve your problem.
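
To make the byte-level picture concrete, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2), where the decode step that Python 2 performs implicitly has to be written out; the variable names are only illustrative.

```python
# Python 3 sketch: bytes and text are separate types, so each step is explicit.
raw = b"\xa0"  # one byte: the Latin-1 encoding of U+00A0 (non-breaking space)

try:
    raw.decode("ascii")  # the step Python 2 attempts implicitly before encoding
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

text = raw.decode("latin-1")       # the single character U+00A0
utf8_bytes = text.encode("utf-8")  # the same character as two UTF-8 bytes

print(ascii_error)
print(ord(text), utf8_bytes)
```

Decoding with the right codec first and only then encoding to UTF-8 avoids the error without touching the interpreter-wide default encoding.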
                                                                        
                                                        
            
            
              
                
梦谈多话, 2021-02-20 15:33

XML predefines only five entities: &lt;, &gt;, &apos;, &quot; and &amp;. &nbsp; and the other named entities come from HTML, so an XML parser does not know them. So you have a couple of choices.


You can change your source to use numeric character references, like &#160; or &#xA0;, both of which are equivalent to &nbsp;.
You can use a DTD that defines those entities.


There is some useful information (it is written about XSLT, but XSLT is written using XML, so the same applies) at the XSLT FAQ.
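
Both choices can be tried directly with ElementTree. The sketch below assumes Python 3; named_to_numeric is a hypothetical helper written for this illustration (not part of any library), and the one-entity internal DTD subset only stands in for a fuller DTD.

```python
import re
from html.entities import name2codepoint
from xml.etree import ElementTree as ET

XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

def named_to_numeric(markup):
    """Hypothetical helper: rewrite HTML named entities as numeric references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own entities and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

source = "<p>Less than &lt; and a non-breaking space &nbsp;</p>"

# Choice 1: numeric references need no DTD.
numeric = ET.fromstring(named_to_numeric(source))

# Choice 2: declare the entity in an internal DTD subset.
doctype = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
declared = ET.fromstring(doctype + source)

# With neither, the parser rejects the undefined entity.
try:
    ET.fromstring(source)
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False

print(repr(numeric.text), repr(declared.text), undefined_ok)
```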



The question appears now to include a stack trace; that changes things. Are you sure that the string is in UTF-8? If it resolves to the single byte 0xA0, then it isn't UTF-8 but more likely cp1252 or iso-8859-1.
                                                                        
                                                        
            
            
              
                
难免孤独, 2021-02-20 15:37

HTML is not the same as XML, so HTML entities like &nbsp; will not work in an XML parser. Ideally, if you are trying to pass that information via XML, you could first XML-encode the above data, so it would look something like this:

<xml>
<mydata>
&lt;html&gt;
&lt;body&gt;
&lt;p&gt;Less than &amp;lt;&lt;/p&gt;
&lt;p&gt;Non-breaking space &amp;nbsp;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
</mydata>
</xml>


And then after parsing the XML you can HTML-unescape the string.
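
A minimal round trip of that idea might look like the following. It assumes Python 3 and uses xml.sax.saxutils.escape for the XML-encoding step and html.unescape for the final step; the element names are taken from the example above and the variable names are illustrative.

```python
import html
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

html_source = "<p>Non-breaking space &nbsp;</p>"

# XML-escape the markup so the XML parser sees it as plain character data.
wrapped = "<xml><mydata>%s</mydata></xml>" % escape(html_source)

# Parsing undoes the XML escaping and hands back the original HTML string.
recovered = ET.fromstring(wrapped).find("mydata").text

# html.unescape then resolves HTML entities such as &nbsp;.
unescaped = html.unescape(recovered)

print(wrapped)
print(repr(unescaped))
```

Note that this carries the HTML through the XML as an opaque string, so ElementTree cannot be used to navigate the <p> elements inside it.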
                                                                        
                                                        
            
            
              
                
青春惊慌失措, 2021-02-20 15:38

The byte 0xA0 is the Latin-1 encoding of a non-breaking space, not a unicode object, and the value of p.text in the loop is a str rather than unicode. To encode it as UTF-8, Python must first implicitly convert it to a unicode string (i.e. decode it). When it does this it assumes ASCII, since it wasn't told anything else. 0xA0 is not a valid ASCII byte, but it is a valid Latin-1 character.

You get Latin-1 characters instead of unicode characters because entitydefs maps entity names to Latin-1 encoded strings. You need the Unicode code point, which you can get from htmlentitydefs.name2codepoint.

The version below should fix it for you:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import name2codepoint

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

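parser = ET.XMLParser()
# Pretend the document has an external DTD, so undefined entities are looked up
# in parser.entity instead of raising a parse error
parser.parser.UseForeignDTD(True)
# Map every HTML entity name to its unicode character (not a Latin-1 byte string)
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')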
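
The difference between a code point and a Latin-1 byte can be checked in isolation. This sketch assumes Python 3, where htmlentitydefs is named html.entities and chr plays the role of unichr; it does not exercise the parser setup above, only the mapping it relies on.

```python
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

code_point = name2codepoint["nbsp"]  # the Unicode code point as an integer
char = chr(code_point)               # the character itself, independent of any encoding

latin1_bytes = char.encode("latin-1")  # one byte, 0xA0
utf8_bytes = char.encode("utf-8")      # two bytes, 0xC2 0xA0

print(code_point, latin1_bytes, utf8_bytes)
```

Starting from the code point means the choice of output encoding is made once, at serialization time, rather than being baked into the entity table.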

                                                                        
                                                        
            
            
              
                
礼貌的吻别, 2021-02-20 15:39

I think the problem you have here is not with your nbsp entity but with your print statement.

Your error is:


  UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)


I think this is because you're taking a UTF-8 string (from ET.tostring(p, encoding='utf-8')) and trying to echo it to an ASCII terminal. Python implicitly converts that string to unicode and then converts it again to ASCII. Although a non-breaking space can be represented directly in UTF-8, it cannot be represented directly in ASCII, hence the error.

Try saving the output to a file instead and seeing if you get what you expect.

Alternatively, try print ET.tostring(p, encoding='ascii'), which should cause ElementTree to use numeric character references for anything that can't be represented in ASCII.
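
As a quick check of that behaviour, here is a small sketch, assuming Python 3 (where ET.tostring returns bytes when an encoding is given). The input uses a numeric reference so that it parses without any entity setup.

```python
from xml.etree import ElementTree as ET

p = ET.fromstring("<p>Non-breaking space &#160;</p>")

# ASCII output cannot hold U+00A0, so it is written as a character reference.
ascii_out = ET.tostring(p, encoding="ascii")

# UTF-8 output carries the character itself as the bytes 0xC2 0xA0.
utf8_out = ET.tostring(p, encoding="utf-8")

print(ascii_out)
print(utf8_out)
```

The ASCII form is safe to print on any terminal, while the UTF-8 form is better written to a file opened in binary mode.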
                                                                        
                                                        
            
            
              
                