Processing large XML file with libxml-ruby chunk by chunk


Question


I'd like to read a large XML file containing over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new

  i = 0
  while dblp.read do
    case dblp.name
    when 'article', 'inproceedings', 'book'
      pub = pubFactory.create(dblp.expand)  # expand the whole record subtree
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0      # progress indicator
      dblp.next                             # skip past the expanded subtree
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end

The key here is that dblp.expand reads an entire subtree (such as an <article> record), which is then passed as an argument to a factory for further processing. Is this the right approach?

Within the factory method I then use high-level XPath-like expressions to extract the content of elements, as shown below. Again, is this viable?

def first(root, node)
  x = root.find(node).first   # first match of the XPath expression under root
  x ? x.content : nil
end

pub.pages = first(node, 'pages')  # node is the expanded record from dblp.expand
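For context, a factory built around that helper might look roughly like the sketch below. The question doesn't show PubFactory's contents, so the Pub struct and the chosen fields here are assumptions for illustration only:

# Hypothetical sketch only: the Pub struct and the extracted fields are
# illustrative, not taken from the question.
Pub = Struct.new(:title, :year, :pages)

class PubFactory
  def create(node)
    pub = Pub.new
    pub.title = first(node, 'title')
    pub.pages = first(node, 'pages')
    year = first(node, 'year')
    pub.year = year && year.to_i
    pub
  end
end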

Answer 1:


When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:

  • Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction, and they are often implemented with case ... when ... constructs, as in the sketch below.
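For example, with libxml-ruby's SAX interface a push-style handler could look like this minimal sketch; the handler class and the idea of collecting article titles are purely illustrative:

require 'xml'  # libxml-ruby

# Minimal push-parser (SAX) sketch: collect the title of each <article>.
class DblpCallbacks
  include XML::SaxParser::Callbacks

  def on_start_element(name, _attributes)
    @title = '' if name == 'article'
    @in_title = true if name == 'title'
  end

  def on_characters(chars)
    @title << chars if @in_title && @title
  end

  def on_end_element(name)
    @in_title = false if name == 'title'
    puts @title if name == 'article'
  end
end

parser = XML::SaxParser.file('dblp.xml')
parser.callbacks = DblpCallbacks.new
parser.parse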

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.




Answer 2:


When processing XML, the two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory and can consume a large amount of it. The event-based approach uses no additional memory but doesn't do anything unless you write your own handler logic.

The event-based model is employed by SAX-style parsers and derivative implementations.
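As a rough sketch, an event-based listener with REXML (from Ruby's standard library) might look like this; counting <article> records is just an illustration:

require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Event-based (stream) parsing with REXML: memory stays flat because
# no document tree is built.
class RecordCounter
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'article'
  end
end

listener = RecordCounter.new
File.open('dblp.xml') do |io|
  REXML::Parsers::StreamParser.new(io, listener).parse
end
puts listener.count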

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html




Answer 3:


I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like this:

my_node = dblp.expand
# ... do what you have to do with my_node ...
dblp.next
my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.

Memory usage wasn't great even with this hack, but at least it didn't keep on growing.
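Applied to the loop from the question, the pattern would look something like the sketch below (assuming the same PubFactory as above; whether remove! is still needed may depend on your libxml-ruby version):

require 'xml'  # libxml-ruby

File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new  # defined as in the question

  while dblp.read do
    case dblp.name
    when 'article', 'inproceedings', 'book'
      node = dblp.expand   # expand the whole record subtree
      puts pubFactory.create(node)
      dblp.next            # move the reader past the subtree first
      node.remove!         # then detach the node so libxml can free it
    end
  end
end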



Source: https://stackoverflow.com/questions/2000118/processing-large-xml-file-with-libxml-ruby-chunk-by-chunk
