ParseError: not well-formed (invalid token) using cElementTree


I receive XML strings from an external source that can contain unsanitized, user-contributed content.

The following XML string gave a ParseError in cElementTree:


                      
13 answers

天涯浪人, 2020-12-16 11:27

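It seems to be complaining about \x08 (the backspace control character), which is not allowed in XML 1.0. You will need to escape or remove it before parsing.

Edit:

Alternatively, you can have the parser ignore such errors by using lxml's recover option:

from lxml import etree

# recover=True makes lxml try to parse through broken XML instead of raising
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xmlstring, parser=parser)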
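
If lxml is not an option, another approach is to strip the characters that XML 1.0 forbids before handing the string to the standard-library parser. The following is only a minimal sketch, not the answerer's method: the helper name strip_invalid_xml_chars and the sample string are made up for illustration, and note that stripping silently drops data from the input.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 does not allow (everything below 0x20
# except tab, line feed and carriage return), plus U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text):
    """Return text with characters that are illegal in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical input containing a stray backspace (\x08).
raw = "<root><msg>bad\x08char</msg></root>"
clean = strip_invalid_xml_chars(raw)
root = ET.fromstring(clean)
print(root.find("msg").text)
```

Whether dropping the characters is acceptable depends on the data; replacing them with a placeholder such as "?" is a one-word change to the sub() call.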

                                                                        
                                                        
            
            
              
                
终归单人心, 2020-12-16 11:28

This is most probably an encoding error. For example, I had an XML file encoded as UTF-8 with a BOM (checked from the Notepad++ Encoding menu) and got a similar error message.

The workaround (Python 3.6):

import io
from xml.etree import ElementTree as ET

# `file` is the path to the XML file; 'utf-8-sig' strips a leading BOM if present
with io.open(file, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    root = ET.fromstring(contents)  # fromstring returns the root Element


Check the encoding of your XML file. If it uses a different encoding, change 'utf-8-sig' accordingly.
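
To see what 'utf-8-sig' actually changes, here is a small sketch that decodes the same bytes with both codecs. The sample document is invented for illustration; the only point is that plain 'utf-8' keeps the BOM as a leading U+FEFF character, while 'utf-8-sig' removes it.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical file contents: a UTF-8 BOM followed by a tiny document.
data = codecs.BOM_UTF8 + b"<root><item>1</item></root>"

text_with_bom = data.decode("utf-8")      # BOM survives as U+FEFF
text_clean = data.decode("utf-8-sig")     # BOM is stripped

print(repr(text_with_bom[0]))
print(text_clean[:6])

root = ET.fromstring(text_clean)
print(root.find("item").text)
```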
                                                                        
                                                        
            
            
              
                
天涯浪人, 2020-12-16 11:29

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree, as follows:

from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed
with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')


Then you can search the tree like this:

soup.find_all('mytag')

                                                                        
                                                        
            
            
              
                
暖寄归人, 2020-12-16 11:29

I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:

import re

def ParseXmlTagContents(source, tag, tagContentsRegex):
    # Match <tag>...</tag> and capture only the text between the two tags.
    pattern = "<" + re.escape(tag) + ">(" + tagContentsRegex + ")</" + re.escape(tag) + ">"
    found = re.search(pattern, source)
    if found:
        return found.group(1)
    return ""


For example, with the following XML in xmlString:

<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>

the call below returns "123":

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")

                                                                        
                                                        
            
            
              
                
余生分开走, 2020-12-16 11:30

The only thing that worked for me was specifying the mode and encoding explicitly when opening the file, like below:

with open(filenames[0], mode='r', encoding='utf-8') as f:
    readFile()


Otherwise it failed every time with an invalid token error when I simply did this:

f = open(filenames[0], 'r')
readFile()

                                                                        
                                                        
            
            
              
                
情歌与酒, 2020-12-16 11:31

Here is a gotcha that bit me when using Python's ElementTree. The following raises the invalid token error:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

xmltest = ET.fromstring(xml.encode("utf-8"))


However, it works once a hyphen is added to the encoding name:

<?xml version='1.0' encoding='utf-8'?>


Most odd. Someone found this footnote in the Python docs:


  The encoding string included in XML output should conform to the
  appropriate standards. For example, “UTF-8” is valid, but “UTF8” is
  not.
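
If the XML comes from a source you cannot fix, one possible workaround is to rewrite the declaration before parsing. This is a rough sketch rather than an established recipe: the helper name normalize_utf8_declaration is invented here, it only handles the "utf8" spelling at the very start of the data, and the sample document is a shortened stand-in for the one above.

```python
import re
import xml.etree.ElementTree as ET

# Matches encoding="utf8" / encoding='utf8' inside a leading XML declaration.
_DECL_UTF8 = re.compile(
    rb"""^(<\?xml[^>]*encoding=["'])utf8(["'])""", re.IGNORECASE
)


def normalize_utf8_declaration(data):
    """Rewrite a leading 'utf8' encoding declaration to the standard 'utf-8'."""
    return _DECL_UTF8.sub(rb"\g<1>utf-8\g<2>", data, count=1)


# Hypothetical input with the non-standard encoding name.
raw = (
    "<?xml version='1.0' encoding='utf8'?>"
    "<osm><tag k='test' v='Съешь'/></osm>"
).encode("utf-8")

fixed = normalize_utf8_declaration(raw)
root = ET.fromstring(fixed)
print(root.find("tag").get("v"))
```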

                                                                        
                                                        
            
            
              
                