ParseError: not well-formed (invalid token) using cElementTree

前端未结

关注

 13  1087

I receive xml strings from an external source that can contains unsanitized user contributed content.

The following xml string gave a ParseError in cElementTre


                      
              相关标签:


      
      
        
          13条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  伪装坚强ぢ        
                
              
                            
                2020-12-16 11:35
              
            
            
                                                                       
I have been in stuck with similar problem. Finally figured out the what was the root cause in my particular case. If you read the data from multiple XML files that lie in same folder you will parse also .DS_Store file.
Before parsing add this condition

for file in files:
    if file.endswith('.xml'):
       run_your_code...


This trick helped me as well
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2020-12-16 11:36
              
            
            
                                                                       
After lots of searching through the entire WWW, I only found out that you have to escape certain characters if you want your XML parser to work! Here's how I did it and worked for me:

escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)


And use it like you'd normally do:

ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  独厮守ぢ        
                
              
                            
                2020-12-16 11:39
              
            
            
                                                                       
What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222
But wasn't enough - after struggling I found out that an XML file needs to be saved with UTF-8 without BOM encoding.

The solution wasn't working for "normal" UTF-8.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2020-12-16 11:40
              
            
            
                                                                       
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)


EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲哀的现实        
                
              
                            
                2020-12-16 11:43
              
            
            
                                                                       
lxml solved the issue, in my case

from lxml import etree

for _, elein etree.iterparse(xml_file, tag='tag_i_wanted', unicode='utf-8'):
    print(ele.tag, ele.text)  


in another case,

parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')


Thanks to theeastcoastwest

Python 2.7
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  鱼传尺愫        
                
              
                            
                2020-12-16 11:45
              
            
            
                                                                       
See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity &#8; and cannot occur plainly.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
3
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复