I'm trying to figure out how to parse some XML (for an Android app), and it seems pretty ridiculous how difficult it is to do in Java. It seems like it requires creating an
Starting with Java 5, there is an XPath library (javax.xml.xpath) in the SDK. See this tutorial for an introduction to it.
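As a rough illustration of what that looks like in practice, here is a minimal sketch that pulls one value out of an XML string with javax.xml.xpath. The class name, method name, and sample document are made up for the example; only the javax.xml.xpath and org.xml.sax calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical class name, for illustration only.
class XPathTitleExample {

    // Evaluates an XPath expression against an XML string and returns the
    // string value of the first match ("" when nothing matches).
    static String evaluate(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><title>Example Feed</title></channel></rss>";
        System.out.println(evaluate(xml, "/rss/channel/title")); // prints "Example Feed"
    }
}
```

This is convenient for grabbing a handful of values; for large documents a streaming parser is probably the better fit.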
Writing a SAX handler is the best way to go. And once you do that you will never go back to anything else. It's fast, simple and it crunches away as it goes, no sucking large parts or, god forbid, a whole DOM into memory.
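A minimal sketch of such a handler, assuming you want the text of every title element in a document. The class and method names are hypothetical; the SAX classes are the standard ones from org.xml.sax and javax.xml.parsers.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, for illustration only.
class SaxTitleExample {

    // Collects the text of every <title> element as the parser streams past it.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName) && current != null) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> titles(String xml) {
        TitleHandler handler = new TitleHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(titles("<r><title>a</title><x><title>b</title></x></r>")); // prints "[a, b]"
    }
}
```

Nothing beyond the current title's text is held in memory, which is where the low overhead comes from.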
Well, parsing XML is not an easy task.
Its basic structure is a tree in which any node can hold a container, which in turn consists of an array of more trees.
Each node in a tree contains a tag and a value but in addition can contain an arbitrary number of named attributes and an arbitrary number of children or containers.
XML parsing tasks tend to fall into three categories.
Things that can be done with "regex". E.g. you want to find the value of the first "MailTo" tag and are not interested in the contents of any other tags.
Things you can parse yourself. The XML structure is always very simple, e.g. a root node and ten well-known tags with simple values.
All the rest! Even though an XML message format can look deceptively simple, home-made parsers are easily confused by extra attributes, CDATA and unexpected children. Full-blown XML parsers can handle all of these situations. Here the basic choice is between a stream parser and a DOM parser. If you intend to use most of the entities/attributes, in whatever order you want to use them, then a DOM parser is ideal. If you are only interested in a few attributes and intend to use them in the order they are presented, if you have performance constraints, or if the XML files are large (> 500MB), then a stream parser is the way to go; the callback mechanism takes a bit of "grokking" but it's actually quite simple to program once you get the hang of it.
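To make the DOM side of that trade-off concrete, here is a minimal sketch using the JDK's DocumentBuilder. The element name "item", the attribute name "id", and the class and method names are all invented for the example. The point is only that once the tree is in memory you can visit nodes in any order you like, here back to front.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name, for illustration only.
class DomOrderExample {

    // Returns "id:text" for every <item>, deliberately in reverse document order.
    static String itemsReversed(String xml) {
        try {
            // The whole document is parsed into an in-memory tree up front.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getDocumentElement().getElementsByTagName("item");
            StringBuilder out = new StringBuilder();
            for (int i = items.getLength() - 1; i >= 0; i--) {
                Element item = (Element) items.item(i);
                if (out.length() > 0) {
                    out.append(',');
                }
                out.append(item.getAttribute("id")).append(':').append(item.getTextContent());
            }
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<list><item id=\"1\">a</item><item id=\"2\">b</item></list>";
        System.out.println(itemsReversed(xml)); // prints "2:b,1:a"
    }
}
```

A stream parser could not walk backwards like this without buffering the items itself, which is the cost you pay for its lower memory use.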
Kyle,
(Please excuse the self-promotey nature of this post... I've been working on this library for months and it's all open source/Apache 2, so not that self-serving, just trying to help).
I just released a library I'm calling SJXP or "Simple Java XML Parser" http://www.thebuzzmedia.com/software/simple-java-xml-parser-sjxp/
It is a very small/tight (4 classes) abstraction layer that sits on top of any spec-compliant XML Pull Parser.
On Android and non-Android Java platforms, pull parsing is probably one of the most performant (both in speed and low memory overhead) methods of parsing. Unfortunately, coding directly against a pull parser ends up looking a lot like any other XML parsing code (e.g. SAX) -- you have exception handlers, parser state to maintain, error checking, event handling, value parsing, etc.
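To give a feel for that boilerplate, here is a rough sketch of reading /rss/channel/title by hand with a pull-style parser. It uses the JDK's StAX XMLStreamReader as a stand-in, because org.xmlpull does not ship with desktop Java; an XmlPullParser loop on Android has much the same shape. The class and method names are made up, and the sketch assumes the title element contains only text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
class PullTitleExample {

    // Returns the text of /rss/channel/title, or null if there is no such element.
    static String channelTitle(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                Deque<String> path = new ArrayDeque<>(); // hand-maintained location state
                StringBuilder text = null;               // non-null while inside the title
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            path.addLast(reader.getLocalName());
                            if (String.join("/", path).equals("rss/channel/title")) {
                                text = new StringBuilder();
                            }
                            break;
                        case XMLStreamConstants.CHARACTERS:
                        case XMLStreamConstants.CDATA:
                            if (text != null) {
                                text.append(reader.getText());
                            }
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            if (text != null) {
                                return text.toString(); // stop as soon as the title closes
                            }
                            path.removeLast();
                            break;
                        default:
                            break; // comments, processing instructions, etc.
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><title>Hi</title><item><title>No</title></item></channel></rss>";
        System.out.println(channelTitle(xml)); // prints "Hi"
    }
}
```

That is a lot of ceremony for one value, and every additional path you care about adds more state to track.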
What SJXP does is allow you to define XPath-like "paths" to the elements or attributes in a document whose values you want, like:
/rss/channel/title
and it will invoke your callback, with the value, when that rule matches. The API is really straightforward and has intuitive support for namespace-qualified elements if that is what you are trying to parse.
The code for a standard parser would look something like this (an example that parses an RSS2 feed title):
IRule titleRule = new DefaultRule(Type.CHARACTER, "/rss/channel/title") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text) {
        // Store the title in a DB or something fancy
    }
};
then you just create an XMLParser instance and give it all the rules you want it to care about:
XMLParser parser = new XMLParser(titleRule);
parser.parse(xmlStream);
And that's it; the parser will invoke the handler method every time the rule matches. You can stop parsing at any time by calling parser.stop() if you want.
Additionally (and this is the real win of this library), matching namespace-qualified elements and attributes is dead easy: you just add their namespace URI inside brackets, prefixing the name of the element in your path.
As an example, say you want the value of the dc:language element from an RSS feed so you can tell what language it is in (ref: http://web.resource.org/rss/1.0/modules/dc/). You just use the unique namespace URI for that 'language' element with the 'dc' prefix, and the rule path ends up looking like this:
/rss/channel/[http://purl.org/dc/elements/1.1/]language
The same goes for namespace-qualified attributes as well.
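For comparison, here is roughly what matching on a namespace URI (rather than on whatever prefix the feed happens to use) looks like without a path rule. This is a sketch against the JDK's StAX API, not SJXP code; the class and method names are made up, and unlike the path above it does not check that the element sits under /rss/channel.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
class NamespaceMatchExample {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the text of the first Dublin Core "language" element, or null if absent.
    static String dcLanguage(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Compare the namespace URI and local name, never the prefix.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && DC_NS.equals(reader.getNamespaceURI())
                            && "language".equals(reader.getLocalName())) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss xmlns:dc=\"" + DC_NS + "\"><channel>"
                + "<language>plain</language><dc:language>en-us</dc:language>"
                + "</channel></rss>";
        System.out.println(dcLanguage(xml)); // prints "en-us"
    }
}
```

The un-prefixed language element is skipped because it has no namespace, which is the same distinction the bracketed URI in the rule path expresses.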
With all that ease, the only overhead you add to the parsing process is an O(1) hash lookup at each location of the XML document and a few hundred bytes, maybe 1k, for the internal location state of the parser.
The library works on Android with no additional dependencies (because the platform provides an org.xmlpull impl already) and in any other Java runtime by adding the XPP3 dependency.
This library is the result of many months of writing custom pull parsers for every kind of feed XML out there in every language and realizing (over time) that about 90% of parsing can be distilled down into this really basic paradigm.
I hope you find it handy.
A couple of weeks ago I battered out a small library (a wrapper around javax.xml.stream.XMLEventReader) allowing one to parse XML in a similar fashion to a hand-written recursive descent parser. The source is available on github, and a simple usage example is below. Unfortunately Android doesn't support this API, but it is very similar to the XmlPullParser API, which is supported, and porting wouldn't be too time-consuming.
accept("tilesets");
while (atTag("tileset")) {
    String filename = attrib("file");
    File tilesetFile = new File(filename);
    // Resolve relative tileset paths against the directory of `file`
    if (!tilesetFile.isAbsolute()) {
        tilesetFile = new File(FilenameUtils.concat(file.getParent(), filename));
    }
    int tilesize = Integer.parseInt(attrib("tilesize"));
    Tileset t = new Tileset(tilesetFile, tilesize);
    t.setID(attrib("id"));
    tilesets.add(t);
    accept();
    close();
}
close();
expect("map");
int width = Integer.parseInt(attrib("width"));
int height = Integer.parseInt(attrib("height"));
int tilesize = Integer.parseInt(attrib("tilesize"));
You can try this
http://xml.jcabi.com/
It is an extra layer on top of DOM that allows simple parsing, printing, and transforming of XML documents and nodes.