Tuesday, October 16, 2007

How would you compare XMLs?

My current task boils down to comparing two gigantic XML documents in some fairly wild ways.

I can think of several techniques up front, each with its own performance and maintainability trade-offs. As you might know, writing XML parsing code by hand is an activity of ancient times (hey there, are you still writing code to parse XML?); today we have a plethora of tools to parse, bind and persist XML with very little pain. I came across several XML binding libraries such as JAXB 2.0, XMLBeans and JiBX (and, given a chance, why not EMF?). JiBX seems interesting, but since I'm not free to use open source at will, I tried JAXB 2.0. The XML schema provided to me was a huge XSD document; the JAXB binding compiler spat out 550 Java classes from it.

A simple, test-driven, recursive, depth-first, reflective (oops, too many adjectives) traversal of the generated object tree was enough to identify the XML delta information. This was the most obvious solution and a pretty fast one (fast to develop, that is). The downside is that it requires maintaining 550 generated classes. They can be regenerated and kept in sync with the XJC Ant task, but the memory footprint and object creation time can still be limiting for production code.
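To give a feel for that traversal, here is a minimal sketch rather than the code I actually used. The `Order` and `Item` classes are hypothetical stand-ins for the generated classes, and the sketch assumes the object tree has no cycles and that anything from a `java.` or `javax.` package can be compared with `equals`.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Sketch only: depth-first reflective diff of two object trees (no cycle detection). */
final class ReflectiveDiff {

    // Hypothetical stand-ins for two of the generated binding classes.
    static final class Item {
        String sku;
        int quantity;
        Item(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }
    }

    static final class Order {
        String id;
        List<Item> items;
        Order(String id, List<Item> items) { this.id = id; this.items = items; }
    }

    /** Returns one "path: left -> right" entry per difference found. */
    static List<String> diff(Object left, Object right) {
        List<String> out = new ArrayList<>();
        walk("root", left, right, out);
        return out;
    }

    private static void walk(String path, Object a, Object b, List<String> out) {
        if (a == b) return; // same reference, or both null
        if (a instanceof List && b instanceof List) {
            // Lists are compared position by position, so element order matters.
            List<?> la = (List<?>) a;
            List<?> lb = (List<?>) b;
            if (la.size() != lb.size()) {
                out.add(path + ".size: " + la.size() + " -> " + lb.size());
            }
            for (int i = 0; i < Math.min(la.size(), lb.size()); i++) {
                walk(path + "[" + i + "]", la.get(i), lb.get(i), out);
            }
            return;
        }
        if (a == null || b == null || a.getClass() != b.getClass()) {
            out.add(path + ": " + a + " -> " + b);
            return;
        }
        if (isLeaf(a)) {
            // deepEquals also covers arrays such as byte[].
            if (!Objects.deepEquals(a, b)) out.add(path + ": " + a + " -> " + b);
            return;
        }
        // Anything else is treated as a bean: recurse into every instance field.
        for (Class<?> c = a.getClass(); c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                try {
                    f.setAccessible(true);
                    walk(path + "." + f.getName(), f.get(a), f.get(b), out);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot read " + f, e);
                }
            }
        }
    }

    /** Assumption: JDK types, enums and arrays are values, not beans to descend into. */
    private static boolean isLeaf(Object o) {
        String name = o.getClass().getName();
        return o instanceof Enum || o.getClass().isArray()
                || name.startsWith("java.") || name.startsWith("javax.");
    }
}
```

Calling `ReflectiveDiff.diff(left, right)` on two unmarshalled roots returns a list of path-based entries describing each difference, and an empty list when the trees match.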

The other approach I tried was computing the diffs by processing the XML directly. Among others, I found XMLUnit, a nice little utility library that does almost exactly what I want. XMLUnit is primarily a tool for unit testing XML-intensive applications. It is very small, has a clean API and is well documented (if you want to read, that is). Several of its utility classes shield you from reading and writing ugly XML processing code, and I used them to get the XML diffs. I still need to poke around in the XML with XPath, though, because of the complex requirements.
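The XPath poking can be illustrated with nothing but the JDK. The sketch below is not XMLUnit and not my production code; the class name `XPathProbe` and the sample expressions are made up for illustration. It evaluates a handful of XPath expressions against both documents and reports the ones whose string values differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Sketch only: compare the string values of selected XPath expressions in two documents. */
final class XPathProbe {

    /** Maps each expression whose value differs to "left -> right". */
    static Map<String, String> changedValues(String leftXml, String rightXml, String... expressions) {
        Document left = parse(leftXml);   // parse once, query many times
        Document right = parse(rightXml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> changes = new LinkedHashMap<>();
        for (String expression : expressions) {
            try {
                // evaluate(String, Object) returns the string value of the result,
                // or an empty string when nothing matches.
                String l = xpath.evaluate(expression, left);
                String r = xpath.evaluate(expression, right);
                if (!l.equals(r)) changes.put(expression, l + " -> " + r);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Invalid XPath: " + expression, e);
            }
        }
        return changes;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

For two small order documents that differ only in their total, `changedValues(left, right, "/order/customer", "/order/total")` returns a single entry keyed by `/order/total`. Loading both documents into a DOM is the simplest option here, but for truly gigantic inputs it may cost too much memory.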

I would have tried my favourite, XStream, as well, but its FAQ advises against it. Anyway, what would your strategy be for dealing with something like this?

9 comments:

mccoyn said...

I did have an application where I had to automatically merge two XML files as part of an upgrade (so that the new file got the new fields.) In theory it was simple. If an element was missing add it and all its sub-nodes. If an element had the same name in both files then recurse for all its children.

The problems arose when some elements actually were ordered lists. Merging would merge the lists when it really should have replaced them completely. Next were some elements that could only be uniquely identified by inspecting sub-elements or attributes. There was no way to auto-detect these situations without adding some metadata. Ultimately, the simple auto-merge turned out to be an inefficient solution.
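The naive merge described above can be sketched with the JDK's DOM API. This is an illustration under assumptions, not the code from that upgrade: the class name `NaiveMerge` is made up, elements are matched by tag name only, and attributes and text are left untouched.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Sketch only: copy elements that exist in the new defaults but not in the existing file. */
final class NaiveMerge {

    /** Adds to target every element of defaults it lacks; recurses where names match. */
    static void merge(Element target, Element defaults) {
        for (Node n = defaults.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            Element match = firstChildNamed(target, n.getNodeName());
            if (match == null) {
                // Missing element: bring it over together with all of its sub-nodes.
                target.appendChild(target.getOwnerDocument().importNode(n, true));
            } else {
                // Same name on both sides: keep the existing value and look deeper.
                merge(match, (Element) n);
            }
        }
    }

    // Matching by name alone is exactly what breaks on repeated (list-like) elements.
    private static Element firstChildNamed(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name)) {
                return (Element) n;
            }
        }
        return null;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Because every repeated child is matched against the first sibling with the same name, a list such as several `<server>` entries is neither replaced nor extended, which is one form of the list problem described above.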

Nirav Thaker said...

Thanks for the feedback Nicholas,

I knew about the ordering problems when comparing ordered node lists. Fortunately for me, though, node ordering is guaranteed by the source XML (which is generated), so I can compare it without deep node/attribute comparisons, guesswork or metadata.

XMLUnit doesn't handle this case.

Unknown said...

Hi, we had a very similar requirement on a project I was recently involved with. The project included implementing a bunch of document-based web services. We wrote unit tests that called the web services and then compared the returned document against an expected response. We couldn't guarantee element order, whitespace equivalence, etc., so we tried XMLUnit without much success. In the end, I wrote some XSLT that recursed through the document tree, comparing each element to its equivalent in the source file. The output of the stylesheet was a report of all errors encountered. Once the stylesheet was precompiled, execution time was very quick and we were perfectly happy with the solution.

Nirav Thaker said...

Hi Stuart,

XMLUnit does ignore whitespace while comparing (org.custommonkey.xmlunit.XMLUnit#setIgnoreWhitespace).

But I seem to agree with you, XSLT would be the ultimate tool for this job.

Mondain said...

I have used XMLUnit for this sort of task as well. I had to generate xml deltas and the library worked great with almost no tweaking.

Unknown said...

Hi,

I have a question you might be able to help me with: I have to compare two XMLs (let's say the reference and the new one), but I know some fields are NOT going to be the same, so I would like to ignore them.

How would you suggest doing this?

Thank you!
Lea

Nirav Thaker said...

Lea,

If you mean how you would handle that situation using XMLUnit, then I would say you have to tweak it a bit, because I don't think it supports configurable exclusions.

If you are asking in general, then it's fairly easy to ignore tags in either DOM or SAX parsing.
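As a rough sketch of the DOM route (plain JDK, no XMLUnit): the class name `IgnoringCompare` and the tag names in the usage note are made up, and the comparison assumes both documents list their elements in the same order.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Sketch only: ordered DOM comparison that skips elements with ignored tag names. */
final class IgnoringCompare {

    static boolean same(String leftXml, String rightXml, Set<String> ignoredTags) {
        return same(root(leftXml), root(rightXml), ignoredTags);
    }

    static boolean same(Element a, Element b, Set<String> ignoredTags) {
        if (!a.getNodeName().equals(b.getNodeName())) return false;
        if (!sameAttributes(a, b) || !ownText(a).equals(ownText(b))) return false;
        List<Element> ca = children(a, ignoredTags);
        List<Element> cb = children(b, ignoredTags);
        if (ca.size() != cb.size()) return false;
        for (int i = 0; i < ca.size(); i++) {
            if (!same(ca.get(i), cb.get(i), ignoredTags)) return false;
        }
        return true;
    }

    /** Child elements in document order, minus the ignored ones. */
    private static List<Element> children(Element parent, Set<String> ignoredTags) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && !ignoredTags.contains(n.getNodeName())) {
                out.add((Element) n);
            }
        }
        return out;
    }

    /** Text directly inside the element (including CDATA), with outer whitespace trimmed. */
    private static String ownText(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Text) sb.append(n.getNodeValue());
        }
        return sb.toString().trim();
    }

    private static boolean sameAttributes(Element a, Element b) {
        NamedNodeMap x = a.getAttributes();
        NamedNodeMap y = b.getAttributes();
        if (x.getLength() != y.getLength()) return false;
        for (int i = 0; i < x.getLength(); i++) {
            Node other = y.getNamedItem(x.item(i).getNodeName());
            if (other == null || !x.item(i).getNodeValue().equals(other.getNodeValue())) return false;
        }
        return true;
    }

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

With a set containing, say, `timestamp`, two documents that differ only inside `<timestamp>` elements compare as equal, while an empty set makes the same pair compare as different.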

Hope that helps.

Sathish said...

Hi,
I want to find the delta between two XML files. The delta should be in another XML file which describes the added or removed elements. How can I achieve this using XMLUnit?
Please help me!

Javin @ remove symbolic link in linux said...

How about using XMLSpy or writing a program using XPath? By the way, I have also blogged about my experience with Comparator and Comparable in Java, with an example; any advice will be highly welcome.