I used to be a hardcore DOM-style man, and dom4j remains my favourite XML API, but these days I’ve been doing a lot of performance- and load-sensitive work in the middle tier, which means I tend to go for streaming solutions like XML pull and SAX. My basic lesson has been that streaming = performance. The numbers just don’t lie: in my benchmarking, streaming is at least 3 times faster than any DOM-based solution, and often up to 10 times faster, depending on what you’re doing. The other big plus for SAX is that you get it for free with J2SE 1.4, so you don’t need to bundle anything.

The painful thing about SAX (and XML pull), which puts most people off, is that you end up tracking your own state while processing things. I used to set a whole bunch of booleans (“nowInsideName = true”, “nowInsideAddress = true”), until I bumped into a guy at our local JUG who gave me a killer tip: “Why don’t you use a Stack for state tracking?” Duh! What a great idea.

So here’s the gist:

  • When you hit a start tag, push it onto the stack;
  • When you hit an end tag, pop the stack;
  • When you hit a “characters” or “text” entry, do a peek of the stack to see what element you’re in.

When I wrote the little app that converted my MoveableType XML over to Pebble blog format, I used this “Stack” type approach. I used XML pull, but exactly the same concepts work for SAX. Here’s an extract of the source:

        Stack tagStack = new Stack();
        int eventType = xpp.getEventType();
        while (eventType != xpp.END_DOCUMENT) {
            if (eventType == xpp.START_TAG) {
                tagStack.push(xpp.getName());

            } else if (eventType == xpp.END_TAG) {
                tagStack.pop();
                if (xpp.getName().equalsIgnoreCase(mte.ENTRY_TAG)) {
                    mtList.add(mte);
                    mte = new MoveableTypeEntry();
                }

            } else if (eventType == xpp.TEXT) {
                String tagName = (String) tagStack.peek();
                String tagValue = xpp.getText();
                if (tagName.equalsIgnoreCase(mte.TITLE_TAG)) {
                    mte.setEntryTitle(tagValue);
                } else if (tagName.equalsIgnoreCase(mte.TEXT_TAG)) {
                    // Turn newlines into HTML line breaks so the entry text keeps its layout
                    mte.setEntryText(tagValue.replaceAll("\n", "<br />\n"));
                } else if (tagName.equalsIgnoreCase(mte.DATE_TAG)) {
                    mte.setEntryCreatedOn(tagValue);
                }
            }
            eventType = xpp.next();
        }
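
For the SAX side, here is a rough sketch of the same push/pop/peek idea. It is not taken from the converter above: TitleHandler, parseTitles and the "title" element name are made up for illustration, and it uses generics, so it needs a newer JDK than 1.4. One SAX-specific wrinkle is that characters() may deliver a single text node in several chunks, so the sketch buffers text and only reads it out at the end tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element.
class TitleHandler extends DefaultHandler {
    private final Stack<String> tagStack = new Stack<String>();
    private final StringBuilder text = new StringBuilder();
    final List<String> titles = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        tagStack.push(qName);          // start tag: push
        if ("title".equals(qName)) {
            text.setLength(0);         // start a fresh buffer for this title
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // text: peek to see which element we are in; buffer because
        // SAX is allowed to split one text node across several calls
        if ("title".equals(tagStack.peek())) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // end tag: pop, and emit the buffered text if we just closed a title
        if ("title".equals(tagStack.pop())) {
            titles.add(text.toString());
        }
    }

    // Convenience wrapper (also hypothetical): parse a string, return the titles.
    static List<String> parseTitles(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling TitleHandler.parseTitles(xml) on a document with a few title elements returns their text in document order. The sketch assumes a title holds plain text only; nested markup inside a title would need a little more bookkeeping.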

There are some constraints on this approach, of course. You have to assume that you know something about the document’s structure (validating against a DTD can help here!), and in the example above I don’t cater for elements of the same name in different parts of the document (although you can call tagStack.search(parentTag) to check whether a given parent element is currently open, if you need to confirm it).
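
To make the same-name case concrete, here is a small sketch of two ways to ask the stack where you are. TagContext, isInside and path are hypothetical helper names, not part of the converter. Stack.search returns the 1-based distance from the top of the stack, or -1 if the item is not there, so a result greater than 1 means the element is open somewhere above the current one.

```java
import java.util.Stack;

// Hypothetical helpers for telling same-named elements apart by their context.
class TagContext {

    // True if the innermost open element is `tag` and `ancestor` is open above it.
    // Assumes tag and ancestor are different names.
    static boolean isInside(Stack<String> tagStack, String tag, String ancestor) {
        if (tagStack.isEmpty() || !tagStack.peek().equals(tag)) {
            return false;
        }
        // search() gives 1 for the top of the stack, -1 if not found
        return tagStack.search(ancestor) > 1;
    }

    // Joins the open elements root-first, e.g. "/blog/author/name".
    // Stack iterates from the bottom (oldest push) to the top.
    static String path(Stack<String> tagStack) {
        StringBuilder sb = new StringBuilder();
        for (String tag : tagStack) {
            sb.append('/').append(tag);
        }
        return sb.toString();
    }
}
```

With something like this, a text handler can distinguish a name under author from a name under entry by checking isInside(tagStack, "name", "author"), or by comparing path(tagStack) against a full path when the exact nesting matters.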

Overall, I reckon this is a good way to go for SAX and pull type processing. I don’t know what other ways people do this sort of stuff, but it’s certainly a lot more maintainable than state tracking with a billion boolean fields!