Parsing Large XML with Nokogiri

Question:


So I'm attempting to parse a 400k+ line XML file using Nokogiri.

The XML file has this basic format:

<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
 <DisorderList count="6760">

  *** Repeated Many Times ***
  <Disorder id="17601">
    <OrphaNumber>166024</OrphaNumber>
    <Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
    <DisorderSignList count="18">
      <DisorderSign>
        <ClinicalSign id="2040">
          <Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
        </ClinicalSign>
        <SignFreq id="640">
          <Name lang="en">Very frequent</Name>
        </SignFreq>
      </DisorderSign>
    </DisorderSignList>
  </Disorder>
  *** Repeated Many Times ***

 </DisorderList>
</JDBOR>

Here is the code I've written to parse the file and save each DisorderSign id and name into a database:

require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close
symptomsList = []

@doc.xpath("//DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text
  name   = x.at('ClinicalSign').element_children.text
  symptomsList.push([signId, name])
end

symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end

This works perfectly on the test files I've used, although they were much smaller, around 10,000 lines.

When I attempt to run this on the large XML file, it simply does not finish. I left it on overnight and it seemed to just lock up. Is there any fundamental reason the code I've written would make this very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.

Thank you for any help.


Answer 1:


I see a few possible problems. First of all, this:

@doc = Nokogiri::XML(sympFile)

will slurp the whole XML file into memory as some sort of libxml2 data structure and that will probably be larger than the raw XML file.

Then you do things like this:

@doc.xpath(...).each

That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.

Then you make your copy of what you're interested in:

symptomsList.push([signId, name])

and finally iterate over that array:

symptomsList.each do |x|
    Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
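
Even before switching parsers, note that first_or_create issues at least one query per row, and each INSERT runs in its own transaction. A minimal sketch, assuming ActiveRecord (as the Symptom model suggests), that batches all the writes into a single transaction:

# A hedged sketch assuming ActiveRecord: one transaction for all the rows
# avoids a COMMIT per record, which is often the real bottleneck.
Symptom.transaction do
  symptomsList.each do |sign_id, name|
    Symptom.where(:name => name, :signid => Integer(sign_id)).first_or_create
  end
end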

I find that SAX parsers work better with large data sets, but they are more cumbersome to work with. You could try creating your own SAX handler, something like this:

class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }                 # start collecting a new record
    elsif(name == 'ClinicalSign')
      @key        = :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key        = :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true             # only capture text inside <Name>
    end
  end

  def characters(str)
    @data[@key] += str if(@key && @in_name)
  end

  def end_element(name)
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end
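
To run it, wrap the handler in Nokogiri's SAX parser and stream the file through it; this API is part of Nokogiri itself:

# Only one DisorderSign's worth of data is held in memory at a time.
parser = Nokogiri::XML::SAX::Parser.new(D.new)
parser.parse(File.open("Temp.xml"))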

The structure should be pretty clear: you watch for the opening of the elements that you're interested in and do a bit of bookkeeping setup when they open, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the

# Dump @data into the database here.

comment.

This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
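
For example, a hypothetical tweak to start_element that remembers the current Disorder id (in Nokogiri's SAX interface, attrs arrives as an array of [name, value] pairs):

def start_element(name, attrs = [ ])
  # Remember the id of the enclosing <Disorder> so a crashed import
  # can log it and later skip everything up to that point.
  @current_disorder_id = Hash[attrs]['id'] if name == 'Disorder'
  # ... plus the bookkeeping from the class above ...
end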




Answer 2:


A SAX parser is definitely what you want to be using. If you're anything like me and can't get on with the Nokogiri documentation, there is an awesome gem called Saxerator that makes this process really easy.

An example of what you are trying to do:

require 'saxerator'

parser = Saxerator.parser(File.new("Temp.xml"))

parser.for_tag(:DisorderSign).each do |sign|
  signId = sign[:ClinicalSign][:id]
  name   = sign[:ClinicalSign][:name]
  Symptom.create!(:name => name, :signid => signId)
end
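
Saxerator streams the document with a SAX parser under the hood and only builds an in-memory structure for one matching element at a time, so memory use stays flat regardless of file size.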



Answer 3:


You're likely running out of memory because symptomsList is growing too large. Why not perform the SQL within the xpath loop?

require 'nokogiri'

sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close

@doc.xpath("//DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text
  name   = x.at('ClinicalSign').element_children.text
  Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end

It's also possible that the file is simply too large to parse into a DOM at once. In that case you could chop it up into smaller temp files and process them individually.




Answer 4:


You can also use Nokogiri::XML::Reader. It's more memory intensive than the Nokogiri::XML::SAX parser, but you get to keep the XML structure, e.g.:

class NodeHandler < Struct.new(:node)
  def process
    # Node processing logic, e.g.:
    signId = node.at('ClinicalSign').attribute('id').text
    name   = node.at('ClinicalSign').element_children.text
  end
end


Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
  if node.name == 'DisorderSign' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    NodeHandler.new(
      Nokogiri::XML(node.outer_xml).at('./DisorderSign')
    ).process
  end
end
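
As with the SAX approach, the database write can go directly inside process so nothing accumulates in memory. A minimal sketch of that method body, reusing the Symptom model from the question:

def process
  clinical_sign = node.at('ClinicalSign')
  signId = clinical_sign.attribute('id').text
  name   = clinical_sign.element_children.text
  # Persist immediately instead of collecting pairs into an array.
  Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end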

Based on this blog post.



Source: https://stackoverflow.com/questions/19866226/parsing-large-xml-with-nokogiri
