Python to parse non-standard XML file

野趣味 2021-01-05 11:24
My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the structure below:

      
      
        
3 answers

臣服心动 (original poster) 2021-01-05 12:02

I'd opt for parsing each chunk of XML separately. 

You seem to already be doing that in your sample code. Here's my take on your code:

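from xml.dom import minidom

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

# `file` is an already-open file object for the concatenated input
buffer = [file.readline()]  # initialise with the first line
for line in file:
    # Assumption: each appended document begins with an XML declaration
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)  # parse the chunk collected so far
        buffer = []  # reset buffer
    buffer.append(line)
parse_xml_buffer(buffer)  # parse the final chunk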


Once you've broken the file down to individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options are lxml, minidom, elementtree, expat, BeautifulSoup, etc.
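
As a rough illustration of the ElementTree option, here is a minimal sketch that splits a concatenated stream and parses each block with the standard library. The function name `split_documents`, the `<?xml` marker, and the two-document sample are assumptions made for the example, not details taken from the actual Google Patents file.

```python
import io
import xml.etree.ElementTree as ET

def split_documents(stream, marker="<?xml"):
    """Yield one XML document string per declaration found in `stream`."""
    buffer = []
    for line in stream:
        # A new declaration means the previous document is complete.
        if line.startswith(marker) and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)  # last document has no following declaration

# Hypothetical sample: two tiny documents appended into one stream.
sample = io.StringIO(
    '<?xml version="1.0"?>\n<a><b>1</b></a>\n'
    '<?xml version="1.0"?>\n<c><b>2</b></c>\n'
)

# Each chunk is now well-formed on its own, so a normal parser accepts it.
roots = [ET.fromstring(doc) for doc in split_documents(sample)]
root_tags = [root.tag for root in roots]
```

Because the splitter is a generator, only one document is held in memory at a time, which matters if the combined file is large.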



Update:

Starting from scratch, here's how I would do it (using BeautifulSoup):

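#!/usr/bin/env python
from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"

def separated_xml(infile):
    with open(infile, "r") as file:
        buffer = [file.readline()]
        for line in file:
            # Assumption: each appended document begins with an XML declaration
            if line.startswith("<?xml "):
                yield "".join(buffer)  # emit the chunk collected so far
                buffer = []
            buffer.append(line)
        yield "".join(buffer)  # emit the final chunk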


This returns:

D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
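
The values listed above look like document numbers. The following is a minimal sketch of how such values might be collected from each block, using the standard library's ElementTree rather than BeautifulSoup so it runs without extra packages. The element name `doc-number`, the function name `iter_doc_numbers`, and the sample data are all assumptions for illustration; the real tag name depends on the file's schema.

```python
import io
import xml.etree.ElementTree as ET

def iter_doc_numbers(stream, tag="doc-number", marker="<?xml"):
    """Yield the text of every `tag` element, one XML document at a time."""
    buffer = []

    def flush():
        # Parse the buffered document and return the matching element texts.
        root = ET.fromstring("".join(buffer))
        return [el.text for el in root.iter(tag)]

    for line in stream:
        if line.startswith(marker) and buffer:
            yield from flush()
            buffer.clear()
        buffer.append(line)
    if buffer:
        yield from flush()  # final document

# Hypothetical sample data; the tag name and values are made up.
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<patent><doc-number>X0000001</doc-number></patent>\n'
    '<?xml version="1.0"?>\n'
    '<patent><doc-number>X0000002</doc-number>'
    '<doc-number>X0000003</doc-number></patent>\n'
)

numbers = list(iter_doc_numbers(sample))
for number in numbers:
    print(number)
```

With a real file, the `io.StringIO` sample would be replaced by an open file object, for example `with open(path) as f: ...`, where `path` is whatever the combined file is called.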

    
             
                                                        
            

            
              
                