So I'm attempting to parse a 400k+ line XML file using Nokogiri.
The XML file has this basic format:
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as a libxml2 document tree, and that in-memory structure will probably be larger than the raw XML file.
Then you do things like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer into the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally iterate over that array:
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
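Even if you stay with the DOM approach, you could at least skip the intermediate array and write each row as you walk the NodeSet. A minimal sketch, where the XPath and the per-node lookups are guesses since the original expressions aren't shown:

@doc.xpath('//DisorderSign').each do |node|
  # Hypothetical paths: adjust both to your actual document structure.
  name   = node.at_xpath('ClinicalSign/Name').text
  signid = node.at_xpath('ClinicalSign/@id').text
  Symptom.where(:name => name, :signid => Integer(signid)).first_or_create
end

That saves the extra array but not the in-memory document, so it only trims part of the overhead.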
I find that SAX parsers work better with large data sets, but they are more cumbersome to work with. You could try creating your own SAX document handler, something like this:
require 'nokogiri'

class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }          # start collecting a new record
    elsif(name == 'ClinicalSign')
      @key = :sign         # the next Name text belongs to :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key = :freq         # the next Name text belongs to :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true      # only capture text inside a Name element
    end
  end

  def characters(str)
    # SAX can deliver an element's text in several chunks, so append.
    @data[@key] += str if(@key && @in_name)
  end

  def end_element(name)
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end
The structure should be pretty clear: you watch for the opening of the elements you're interested in and do a bit of bookkeeping setup when they open, then cache the strings while you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
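To run it, you wrap an instance of the handler in Nokogiri's SAX parser and point it at the file; the filename below is just a placeholder:

parser = Nokogiri::XML::SAX::Parser.new(D.new)
parser.parse_file('symptoms.xml')  # hypothetical path to your XML

The parser streams through the file and calls your handler's methods as it goes, so memory use stays roughly constant no matter how big the file is.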
This structure also makes it pretty easy to keep track of how far you've gone, so you can stop and restart the import with some small modifications to your script.
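For example, here is a minimal sketch of one way to do that, assuming you record a count of imported DisorderSign records between runs; the @skip value and how you persist it are hypothetical:

class RestartableD < D
  def initialize(skip = 0)
    @skip = skip   # records already imported on a previous run
    @seen = 0      # running count of DisorderSign records
  end

  def start_element(name, attrs = [ ])
    @seen += 1 if name == 'DisorderSign'
    super
  end

  def end_element(name)
    # Skip records handled on an earlier run; after each successful
    # insert, persist @seen so the next run knows where to resume.
    super unless name == 'DisorderSign' && @seen <= @skip
  end
end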