Reading Huge XML File using StAX and XPath


Question


The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick individual transaction XML fragments based on user input and send them to a processing system.

Sample content of the file:

<transactions>
    <txn id="1">
      <name> product 1</name>
      <price>29.99</price>
    </txn>

    <txn id="2">
      <name> product 2</name>
      <price>59.59</price>
    </txn>
</transactions>

The (technical) user is expected to give an input tag name such as <txn>.

We would like this solution to be more generic. The file content might be different, and users may give an XPath expression like "//transactions/txn" to pick individual transactions.

There are a few technical things we have to consider here:

  • The file can be in a shared location or FTP
  • Since the file size is huge, we can't load the entire file into the JVM

Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.

Looking for suggestions. Thanks in advance.


Answer 1:


StAX and XPath are very different things. StAX allows you to parse a streaming XML document in a forward direction only. XPath allows navigation in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.

Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?




Answer 2:


If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]

As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:

  Document document = getParser().parse( source );

After this your 10GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a really strong justification for XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: it would also build a DOM. If you truly need a DOM you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API, builds its DOM over StAX so it's fast, and uses Jaxen for its XPath implementation. Jaxen requires some kind of DOM (W3C DOM, DOM4J, or JDOM), and this will be true of all XPath implementations, so if you don't truly need XPath, I'd recommend sticking with just the event parser.
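
To make the memory point concrete, here is a minimal sketch of the standard javax.xml.xpath usage (the file name is a placeholder): even though the source is handed over as a streaming InputSource, the JDK implementation materializes the whole document as a DOM before the expression is evaluated.

  import javax.xml.xpath.XPath;
  import javax.xml.xpath.XPathConstants;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.NodeList;
  import org.xml.sax.InputSource;

  public class XPathOverDom {
      public static void main(String[] args) throws Exception {
          XPath xpath = XPathFactory.newInstance().newXPath();
          // The expression from the question; the JDK parses the entire file
          // into a W3C DOM Document before evaluating it, so a 10GB input
          // would be held in memory.
          NodeList txns = (NodeList) xpath.evaluate(
                  "//transactions/txn",
                  new InputSource("transactions.xml"),   // placeholder file name
                  XPathConstants.NODESET);
          System.out.println("Matched " + txns.getLength() + " txn elements");
      }
  }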

SAX is the old streaming API, with StAX newer and a great deal faster. Whether using the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked against the user's matching values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events to build a mini-DOM from them if you find the result easier to navigate. But that might be overkill if the markup is simple.

This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
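
A minimal sketch of that filtering approach with the JDK StAX cursor API, assuming the element name and file path come from the user (both are placeholders here); each matching <txn> subtree is copied out with an identity Transformer, so only a small window of the document is ever in memory.

  import java.io.FileInputStream;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;
  import javax.xml.transform.OutputKeys;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stax.StAXSource;
  import javax.xml.transform.stream.StreamResult;

  public class TxnExtractor {
      public static void main(String[] args) throws Exception {
          String file = "transactions.xml";   // placeholder; could be a shared-location or FTP stream
          String tagName = "txn";             // element name supplied by the user

          XMLInputFactory factory = XMLInputFactory.newInstance();
          Transformer copier = TransformerFactory.newInstance().newTransformer();
          copier.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

          try (FileInputStream in = new FileInputStream(file)) {
              XMLStreamReader reader = factory.createXMLStreamReader(in);
              while (reader.hasNext()) {
                  // Advance the cursor one event at a time.
                  if (reader.next() == XMLStreamConstants.START_ELEMENT
                          && reader.getLocalName().equals(tagName)) {
                      // Serialize just this subtree; the identity transform consumes
                      // the reader up to the matching end element.
                      copier.transform(new StAXSource(reader), new StreamResult(System.out));
                      System.out.println();
                  }
              }
              reader.close();
          }
      }
  }

Passing the element (and, if needed, attribute) names in rather than hard-coding them keeps the matching configurable, as suggested above.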




Answer 3:


We regularly parse 1GB+ complex XML files using a SAX parser which does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.

I blogged about it here. It uses a SAX rather than a StAX parser, but it may be worth a look.
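
The blogged code isn't reproduced here, but a rough sketch of the same idea, assuming a fixed <txn> element name, could look like this: a SAX handler buffers only the current <txn> subtree into a tiny DOM and runs XPath against that fragment alone.

  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.parsers.SAXParserFactory;
  import javax.xml.xpath.XPathFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.Node;
  import org.xml.sax.Attributes;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.DefaultHandler;

  /** Buffers each <txn> subtree into a tiny DOM and evaluates XPath on that fragment only. */
  public class PartialDomHandler extends DefaultHandler {
      private final Document doc;
      private Node current;    // null while we are outside a <txn> subtree

      public PartialDomHandler() throws Exception {
          doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      }

      @Override
      public void startElement(String uri, String local, String qName, Attributes atts) {
          if (current == null && !qName.equals("txn")) {
              return;    // ignore everything outside the subtrees of interest
          }
          Element e = doc.createElement(qName);
          for (int i = 0; i < atts.getLength(); i++) {
              e.setAttribute(atts.getQName(i), atts.getValue(i));
          }
          if (current != null) {
              current.appendChild(e);
          }
          current = e;
      }

      @Override
      public void characters(char[] ch, int start, int length) {
          if (current != null) {
              current.appendChild(doc.createTextNode(new String(ch, start, length)));
          }
      }

      @Override
      public void endElement(String uri, String local, String qName) throws SAXException {
          if (current == null) {
              return;
          }
          Node parent = current.getParentNode();
          if (parent == null) {    // a complete <txn> subtree has just been closed
              try {
                  String price = XPathFactory.newInstance().newXPath()
                          .evaluate("price", current);    // XPath over the small fragment only
                  System.out.println("txn " + ((Element) current).getAttribute("id")
                          + " price=" + price);
              } catch (Exception ex) {
                  throw new SAXException(ex);
              }
          }
          current = parent;    // drops back to null once the subtree is finished
      }

      public static void main(String[] args) throws Exception {
          SAXParserFactory.newInstance().newSAXParser()
                  .parse("transactions.xml", new PartialDomHandler());    // placeholder path
      }
  }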




Answer 4:


It's definitely a use case for XProc, with a streaming and parallel-processing implementation like QuiXProc (http://code.google.com/p/quixproc).

In this situation, you will have to use

  <p:for-each>
    <p:iteration-source select="//transactions/txn"/>
    <!-- your processing on each small fragment -->
  </p:for-each>

You can even wrap each of the resulting transformations with a single line of XProc:

  <p:wrap-sequence wrapper="transactions"/>

Hope this helps




Answer 5:


A fun solution for processing huge XML files (>10GB):

  1. Use ANTLR to create byte offsets for the parts of interest. This will save some memory compared with a DOM-based approach.
  2. Use JAXB to read the parts at those byte positions (a sketch follows below).

Details, with Wikipedia dumps (17GB) as the example, are in this SO answer: https://stackoverflow.com/a/43367629/1485527
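
A hedged sketch of step 2 only, assuming the byte offsets and lengths from the ANTLR pass are already known, and using a made-up JAXB-annotated Txn class and method names for illustration:

  import java.io.ByteArrayInputStream;
  import java.io.RandomAccessFile;
  import javax.xml.bind.JAXBContext;
  import javax.xml.bind.Unmarshaller;
  import javax.xml.bind.annotation.XmlAttribute;
  import javax.xml.bind.annotation.XmlRootElement;

  public class OffsetReader {

      @XmlRootElement(name = "txn")
      public static class Txn {
          @XmlAttribute public String id;
          public String name;
          public double price;
      }

      /** Reads the bytes of one <txn> element at [offset, offset + length) and unmarshals it. */
      static Txn readTxn(String file, long offset, int length) throws Exception {
          byte[] buf = new byte[length];
          try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
              raf.seek(offset);    // jump straight to the fragment; the 10GB file is never parsed as a whole
              raf.readFully(buf);
          }
          Unmarshaller um = JAXBContext.newInstance(Txn.class).createUnmarshaller();
          return (Txn) um.unmarshal(new ByteArrayInputStream(buf));
      }
  }

Step 1, computing those offsets, is the part the linked answer covers in detail.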




Answer 6:


Streaming Transformations for XML (STX) might be what you need.




Answer 7:


Do you need to process it quickly, or do you need fast lookups in the data? These requirements call for different approaches.

For reading through the whole data set quickly, StAX will be fine.

If you need fast lookups, then you may need to load it into a database, e.g. Berkeley DB XML.



Source: https://stackoverflow.com/questions/7215931/reading-huge-xml-file-using-stax-and-xpath
