How can I efficiently parse 200,000 XML files in Java?


Question


I have 200,000 XML files I want to parse and store in a database.

Here is an example of one: https://gist.github.com/902292

This is about as complex as the XML files get. This will also run on a small VPS (Linode), so memory is tight.

What I am wondering is:

1) Should I use a DOM or SAX parser? DOM seems easier and faster, since each XML file is small.

2) Where is a simple tutorial on said parser? (DOM or SAX)

Thanks

EDIT

I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM and I thought that, since the average file size is about 3-4 KB, it would easily fit in memory.

However, I wrote a recursive routine to handle all 200k files, and it gets about 40% of the way through them before Java runs out of memory.

Here is part of the project. https://gist.github.com/905550#file_xm_lparser.java

Should I ditch DOM now and just use SAX? It just seems like, with such small files, DOM should be able to handle it.

Also, the speed is "fast enough": it takes about 19 seconds to parse 2,000 XML files (before the Mongo insert).
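
For reference, a minimal per-file DOM loop looks something like the sketch below. The directory name and the builder reuse are assumptions, not the gist's actual code; the point is that nothing should keep a reference to each Document after its insert, so each 3-4 KB file can be garbage-collected before the next one is parsed.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class XmlBatchParser {
        public static void main(String[] args) throws Exception {
            // Reuse one DocumentBuilder instead of building a new factory per file.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();

            for (File file : new File("xml-files").listFiles()) { // hypothetical dir
                Document doc = builder.parse(file);
                // ... extract fields and do the database insert here ...
                builder.reset(); // clear parser state between files
                // 'doc' goes out of scope each iteration and becomes eligible for GC.
            }
        }
    }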

Thanks


Answer 1:


SAX always beats DOM on speed. But since you say the XML files are small, you may proceed with a DOM parser. One thing you can do to speed things up is create a thread pool and do the database operations in it. Multithreaded updates will significantly improve performance.

— Lalith
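
A minimal sketch of that thread-pool idea (not part of the original answer; the directory name and the parseAndInsert method are hypothetical):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelInserter {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4); // tune to the VPS

            for (File file : new File("xml-files").listFiles()) {  // hypothetical dir
                pool.submit(() -> parseAndInsert(file));           // parse + insert per task
            }

            pool.shutdown();                          // stop accepting new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for the backlog to drain
        }

        static void parseAndInsert(File file) {
            // hypothetical: DOM-parse 'file' and insert the result into the database
        }
    }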



Answer 2:


Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).




Answer 3:


Divide and conquer

Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use Spring Batch if this is a recurring task, in which case you can benefit from a high-level framework.

API

Using SAX can help, but it is not necessary, since you are not going to keep the parsed model around (i.e. all you are doing is parsing and inserting, then letting go of the parsed data, at which point the objects become eligible for GC). Look into a simple API like JDOM.
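
As an illustration, reading one of these files with JDOM 2 might look like the following; the HotelID element name is a guess based on this thread, and the file name is hypothetical:

    import java.io.File;
    import org.jdom2.Document;
    import org.jdom2.Element;
    import org.jdom2.input.SAXBuilder;

    public class JdomExample {
        public static void main(String[] args) throws Exception {
            SAXBuilder builder = new SAXBuilder();               // SAX underneath, tree API on top
            Document doc = builder.build(new File("hotel.xml")); // hypothetical file
            Element root = doc.getRootElement();

            String hotelId = root.getChildText("HotelID");       // guessed element name
            System.out.println("HotelID = " + hotelId);
            // Nothing retains 'doc' after this method returns, so it can be GC'd.
        }
    }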

Other ideas

You can implement a producer/consumer model, where the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the DB. The advantage here is that you can batch the inserts to gain more performance.
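
A rough sketch of that pattern with a BlockingQueue; the Hotel POJO, parse, and insertBatch pieces are hypothetical stand-ins for the real parsing and Mongo code:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ProducerConsumerLoader {
        // POISON marks the end of the stream so the consumer knows when to stop.
        private static final Hotel POISON = new Hotel();
        private static final BlockingQueue<Hotel> queue = new ArrayBlockingQueue<>(1000);

        public static void main(String[] args) throws InterruptedException {
            Thread consumer = new Thread(ProducerConsumerLoader::consume);
            consumer.start();

            // Producer: parse each file into a POJO and hand it off.
            for (File file : new File("xml-files").listFiles()) { // hypothetical dir
                queue.put(parse(file));                           // blocks if the consumer lags
            }
            queue.put(POISON);
            consumer.join();
        }

        static void consume() {
            List<Hotel> batch = new ArrayList<>();
            try {
                while (true) {
                    Hotel h = queue.take();
                    if (h == POISON) break;
                    batch.add(h);
                    if (batch.size() == 100) {   // batch size is a tuning knob
                        insertBatch(batch);      // one round trip for 100 documents
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) insertBatch(batch); // flush the remainder
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        static Hotel parse(File f) { return new Hotel(); } // hypothetical XML-to-POJO parse
        static void insertBatch(List<Hotel> hotels) { }    // hypothetical bulk DB insert
        static class Hotel { }
    }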




Answer 4:


Go with SAX, or if you want, StAX. Forget about DOM. Use an efficient library like Aalto.

I am sure that parsing will be quite cheap compared to making the database requests.

But 200k is not such a big number if you only need to do this once.
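
For comparison, a bare-bones StAX loop is sketched below. With Aalto on the classpath, the standard javax.xml.stream factory picks it up automatically; the HotelID element name is again just a guess:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new FileInputStream("hotel.xml"));                // hypothetical file

            while (reader.hasNext()) {
                if (reader.next() == XMLStreamReader.START_ELEMENT
                        && "HotelID".equals(reader.getLocalName())) { // guessed element
                    System.out.println("HotelID = " + reader.getElementText());
                }
            }
            reader.close();
        }
    }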




Answer 5:


SAX will be faster than DOM; this could well matter when you have 200,000 files to parse.




Answer 6:


StAX is faster than SAX, and both are much faster than DOM. If performance is super critical, you can also think about building a special compiler to parse the XML files. But usually lexing and parsing is not much of an issue with StAX; the "after-processing" is.



Source: https://stackoverflow.com/questions/5545619/how-can-i-efficiently-parse-200-000-xml-files-in-java
