As part of a recent CJUG XML Smackdown we played with processing a very large (2GB) XML file using SAX & StAX. I was doing the StAX bit (using Woodstox), and I learnt some very interesting lessons about processing large XML files along the way.

The file we used for the demo was a large RDF file from the dmoz directory. Its compressed (gzip) size is 299MB, but it unzips to just under 2GB. The test was to iterate over all the elements and pull the URLs out of the “about” attribute.
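
For reference, here's a minimal sketch of the kind of StAX cursor loop involved - the file name is illustrative, and I'm assuming the dump exposes the URL in an attribute whose local name is “about” (passing null as the namespace skips the namespace check). Woodstox gets picked up automatically through the standard XMLInputFactory lookup if it's on the classpath:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxAboutExtractor {
    public static void main(String[] args) throws Exception {
        // Woodstox registers itself as the StAX implementation when it's on the classpath
        XMLInputFactory factory = XMLInputFactory.newInstance();
        InputStream in = new FileInputStream("content.rdf.u8"); // illustrative file name
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        long count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                // null namespace means "don't check the namespace", so this matches both
                // a plain about="" attribute and a namespace-qualified r:about=""
                String about = reader.getAttributeValue(null, "about");
                if (about != null) {
                    count++; // or do something more useful with the URL
                }
            }
        }
        reader.close();
        in.close();
        System.out.println(count + " about attributes found");
    }
}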

I tried the demo on three machines: a Mac Mini (290 secs), an old ThinkPad (370 secs), and a newish AMD 2.4 Linux box (80 secs). The results were very interesting - I/O was definitely the killer on the slower boxes. To experiment further, I changed the code to read from the original 300MB file using a GZIPInputStream - and the ThinkPad came in at 270 secs, wiping off nearly a third of the time! No such luck on the AMD box (which has a SATA hard disk), where throughput actually slowed when using the gzip approach.
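
Switching to the compressed file only means changing how the InputStream is built - something like the sketch below (the file name is made up, and the BufferedInputStream under the GZIPInputStream is just there to keep the disk reads large):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class GzipStaxDemo {
    public static void main(String[] args) throws Exception {
        // Decompress on the fly: burns some CPU but reads ~300MB from disk instead of ~2GB
        InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream("content.rdf.u8.gz")));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        // ... same attribute-scanning loop as above ...
        reader.close();
        in.close();
    }
}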

Interestingly, the SAX performance was nearly identical in all tests (mostly slightly faster than StAX, but sometimes slightly slower - either way not really significant given the amount of processing involved).
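
For completeness, the SAX side of the comparison boils down to a push-style handler doing the same attribute check - roughly like this (again just a sketch, with the same assumptions about the file and attribute names):

import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxAboutExtractor {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (FileInputStream in = new FileInputStream("content.rdf.u8")) {
            parser.parse(in, new DefaultHandler() {
                long count;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // with the default (non-namespace-aware) parser the attribute
                    // is looked up by its qualified name
                    String about = attributes.getValue("about");
                    if (about != null) {
                        count++;
                    }
                }

                @Override
                public void endDocument() {
                    System.out.println(count + " about attributes found");
                }
            });
        }
    }
}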

So the learning experience for me: on a machine with a slow disk, I/O dominates, and trading a bit of CPU to read the compressed file directly can claw back a lot of time - while on a fast disk the decompression overhead can actually hurt. And for plain streaming work like this, SAX and StAX are pretty much neck and neck.

It’s certainly no empirical test - really just a back-of-a-napkin kinda thing. But it was fun to demo, and it gave me my first ever chance to use a GZIPInputStream!