I’ve recently had to parse some semi-structured XML and unmarshal it into an object graph. I’d normally use JAXB in a heartbeat, but the schemaless design of this particular large XML doc (full of random reuse of tag names inside other tags, as you’ll see) made that pretty much impossible.

So I was originally thinking of doing it all by hand in StAX using the XMLEventReader. After all, it’s built into modern Java platforms and gives you the freedom to do what you want. But there’s the small matter of writing your own state-tracking using Stacks or whatnot.
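
To give a feel for that hand-rolled state-tracking, here is a minimal sketch using only the JDK. The class name StaxPathTracker, the method textAt and the sample XML are made up for illustration (none of it comes from my actual parser), and it only copes with simple leaf elements:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: tracks the current element path by hand with a Deque.
public class StaxPathTracker {

    // Returns the text of every leaf element whose full path equals targetPath.
    public static List<String> textAt(String xml, String targetPath) {
        Deque<String> path = new ArrayDeque<>();    // open elements, outermost first
        List<String> matches = new ArrayList<>();
        StringBuilder text = new StringBuilder();   // text seen since the most recent tag
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    path.addLast(event.asStartElement().getName().getLocalPart());
                    text.setLength(0);
                } else if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                } else if (event.isEndElement()) {
                    // Same tag name can appear at many depths, so compare the whole path.
                    if (String.join("/", path).equals(targetPath)) {
                        matches.add(text.toString().trim());
                    }
                    path.removeLast();
                    text.setLength(0);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String xml = "<manual><title>Guide</title>"
                   + "<part><title>Basics</title></part></manual>";
        // Only the title nested under part matches, not the top-level one.
        System.out.println(textAt(xml, "manual/part/title"));
    }
}
```

Even this toy version needs its own path bookkeeping, and it gets worse once you start building objects along the way. That bookkeeping is what a pattern-matching framework does for you.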

The other night at our local Canberra JUG we were talking about how great Commons Digester was for this stuff back in the day. Well, I figured I’d revisit it, and it turns out Digester3 was just the ticket for my little problem. If you’re interested in a tutorial, there’s a great one here.

But one question remained. I needed to slurp out a few of the nested nodes in the XML as an XML string. The docs are pretty lean on such things, but it turns out there’s plenty of magic in the framework to help you out.

First, you’ll need to take advantage of the NodeCreateRule object, which builds a DOM Element from the matched node so that it can be handed to a method on your parsed object:

    forPattern("manual/part/chapter/section/controlsTitle/block/controls/block/content")
        .addRule(new NodeCreateRule())
        .then()
        .setNext("setDescriptionFromXml");

Once you have a handle to that bad boy, it’s just a matter of adding a method to your target bean object that takes such a beast. A little help from Stack Overflow, and I can turn that XML into a String and strip out the tags for my own nefarious reasons…

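    import java.io.StringWriter;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Element;

    // http://stackoverflow.com/questions/1219596/how-to-i-output-org-w3c-dom-element-to-string-format-in-java
    public void setDescriptionFromXml(Element element) throws Exception {
        TransformerFactory transFactory = TransformerFactory.newInstance();
        Transformer transformer = transFactory.newTransformer();
        // Leave off the <?xml ...?> header so we only get the element itself
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter buffer = new StringWriter();
        transformer.transform(new DOMSource(element), new StreamResult(buffer));
        // Strip all XML tags; [^>]* also copes with tags that span several lines
        String str = buffer.toString().replaceAll("<[^>]*>", "").trim();
        setDescription(str);
    }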
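
If you want to poke at the serialise-then-strip step without wiring up Digester at all, here is a small standalone sketch using only the JDK. The class name ElementTextDemo, the helper names and the sample XML are invented for illustration, and the tag-stripping regex is the same crude approach as in setDescriptionFromXml above, not a proper XML-to-text conversion:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: DOM Element -> XML string -> plain text.
public class ElementTextDemo {

    // Serialises a DOM element (tags included) without the <?xml ...?> declaration.
    public static String toXmlString(Element element) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter buffer = new StringWriter();
        transformer.transform(new DOMSource(element), new StreamResult(buffer));
        return buffer.toString();
    }

    // Crude tag stripper: removes anything between '<' and the next '>'.
    public static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the Element that NodeCreateRule would hand to the bean.
        String xml = "<content>Press <b>Start</b> to begin</content>";
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();

        String raw = toXmlString(root);
        System.out.println(raw);                    // the raw XML, tags and all
        System.out.println(stripTags(raw));         // text with the tags removed
        System.out.println(root.getTextContent());  // DOM's own text-only view
    }
}
```

If all you want is the text, Element.getTextContent() is probably the simpler route, since it also decodes entities such as &amp;amp; which the regex leaves alone. The Transformer detour earns its keep when you actually want the raw markup.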

Anyway, I thought it was worth writing up the process in case you ever need to use Commons Digester to get at the raw XML of a portion of your parse tree.