Tuesday, May 08, 2007

[Pub] XML Binding

In the current issue of Infoweek.ch I describe options for handling XML data with Java. The "traditional", well-known strategies are SAX (Simple API for XML) and DOM (Document Object Model) parsers. While parsers following these strategies are still in use, and there are good reasons for that, several new concepts have appeared on the scene in recent years. Let's briefly review the history of XML parsing and its (possible) future:

SAX (Simple API for XML)

SAX is the "oldest" concept and is still used, both directly and as the basis for other libraries, including many DOM libraries. SAX follows a callback strategy: a class implements a callback interface, and an instance of this class is handed over to the SAX parser. When the parser is started, it "rushes through" the XML document from top to bottom and sends events to the class by calling the callback interface methods. Typical events are "Start Document", "Start Element", "characters", "End Element", ... "End Document".
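
The callback strategy can be sketched with the SAX parser that ships with the JDK. This is only a minimal illustration: the class names `SaxEventDemo` and `EventLogger` and the sample document are made up for this post, and text is trimmed so that pure whitespace does not show up as an event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {

    /** Hypothetical callback class: the parser calls these methods while reading. */
    static class EventLogger extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("Start Document");
        }

        @Override
        public void endDocument() {
            events.add("End Document");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("Start Element: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("End Element: " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver long text in several chunks; this sketch logs each chunk.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("characters: " + text);
            }
        }
    }

    /** Hands the callback object to the parser and returns the events it received. */
    static List<String> parse(String xml) throws Exception {
        EventLogger handler = new EventLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        parse("<order><item>Book</item></order>").forEach(System.out::println);
    }
}
```

Note that the handler never sees the document as a whole; anything it needs later, it has to remember itself.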

SAX is very fast and needs little memory because it does not have to keep the whole document in memory; however, accessing the data can be somewhat awkward for many applications. It is a rather low-level interface, but it can still be useful for applications that mostly need a "linear read" and where performance matters most. Many implementations are available, including one that ships as part of the Java API.

DOM (Document Object Model)

DOM parsers like dom4j or JDOM are libraries built on top of SAX: they read the SAX events and build a generic object tree representing the XML document. Navigation is easy, and the whole document can be accessed and also changed. Some libraries also allow the use of query languages like XPath. Hence DOM parsers are "general purpose" tools, but with some disadvantages that should be considered:
  • The whole document is kept in memory, i.e., memory consumption is much higher than with SAX
  • The document is built from generic objects like "Document" and "Element"
  • Data access is typically limited to "String-type" access
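
To make these points concrete, here is a small sketch. It uses the W3C DOM and XPath APIs bundled with the JDK rather than dom4j or JDOM, simply so that it runs without extra libraries; the class name `DomDemo` and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomDemo {

    /** Reads the whole document into an in-memory tree of generic nodes. */
    static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Navigation works on generic Element objects, and the value comes back as a String. */
    static String quantityOfFirstItem(Document doc) {
        Element item = (Element) doc.getElementsByTagName("item").item(0);
        return item.getAttribute("quantity");
    }

    /** The same tree can be queried with XPath; the result is again a String. */
    static String titleViaXPath(Document doc) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate("/order/item[1]/title", doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<order><item quantity=\"2\"><title>Book</title></item></order>");
        System.out.println(quantityOfFirstItem(doc));
        System.out.println(titleViaXPath(doc));

        // The tree can also be changed in place.
        ((Element) doc.getElementsByTagName("item").item(0)).setAttribute("quantity", "3");
        System.out.println(quantityOfFirstItem(doc));
    }
}
```

Turning the quantity into a number is left to the caller, which is exactly the "String-type" limitation from the list above.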

XML Binding

The "youngest" technology is XML binding. The general idea is similar to object-relational mapping: domain objects are not "manually" transferred to XML data; instead, a binding between domain objects and XML structure(s) is defined. The XML binding library then takes care of serialisation to and deserialisation from XML data.

Typically the basis for the mapping is a W3C XML Schema that describes the XML structure, plus additional mapping information. The binding library and its tools can then help generate domain classes when needed. To give an example:



This illustration shows how SAX fires events to the callback class and how DOM builds a tree of generic "Element" objects containing the content of the XML document, whereas the binding framework creates a tree of concrete domain objects. Besides being arguably more elegant than a generic tree, this also has some tangible advantages. Direct work with domain objects is possible, so no intermediate layer has to be programmed. Binding is also type-safe: DOM and SAX libraries usually work only with String data, whereas binding frameworks allow arbitrary types.
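
The type-safety point can be shown without any binding library. The following sketch writes by hand the kind of mapping code that a binding framework would generate or perform automatically; it is not the API of XMLBeans, Castor, JiBX or JAXB. The domain class `Item`, the method `unmarshalItem` and the XML layout are hypothetical.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class BindingSketch {

    /** Hypothetical domain class with properly typed fields. */
    static final class Item {
        final String title;
        final int quantity;
        final BigDecimal price;

        Item(String title, int quantity, BigDecimal price) {
            this.title = title;
            this.quantity = quantity;
            this.price = price;
        }

        /** Business logic works on typed values, not on Strings. */
        BigDecimal total() {
            return price.multiply(BigDecimal.valueOf(quantity));
        }
    }

    /**
     * Hand-written stand-in for what a binding framework automates:
     * read the XML and convert every String value to the type of the target field.
     */
    static Item unmarshalItem(String xml) throws Exception {
        Element e = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        String title = e.getElementsByTagName("title").item(0).getTextContent();
        int quantity = Integer.parseInt(e.getAttribute("quantity"));
        BigDecimal price = new BigDecimal(e.getElementsByTagName("price").item(0).getTextContent());
        return new Item(title, quantity, price);
    }

    public static void main(String[] args) throws Exception {
        Item item = unmarshalItem(
                "<item quantity=\"2\"><title>Book</title><price>19.90</price></item>");
        System.out.println(item.title + ": " + item.total());
    }
}
```

With a real binding framework, the conversion code in `unmarshalItem` would not be written by hand; it would follow from the schema and the binding definition.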

The drawback, however, is that binding frameworks are more complex to understand than DOM libraries, and the initial effort (creating schemas, binding definitions, code generation) is somewhat higher.

Binding Frameworks

Leading XML binding frameworks in the Java domain are Apache XMLBeans, Codehaus Castor, JiBX and JAXB from Sun. Besides these "general purpose" frameworks, there are specialised binding libraries used in web service frameworks, such as ADB in Apache Axis2 and Aegis in Codehaus XFire. They are supposed to be simpler to understand than a "full-blown" binding framework like XMLBeans.

Castor has the advantage that the framework contains an O/R mapper as well as an XML binding library; hence, if both are needed in a project, Castor might be the right choice. XMLBeans, on the other hand, has received some attention in recent years, as it appears to be the most powerful library available.

In its recent version, XMLBeans not only supports binding but also has "low-level" XML interfaces named Cursor and Token that are comparable to DOM libraries. Hence it is possible to work with "different perspectives" on XML data: binding on the one hand, full XML infoset access down to whitespace and XML comments on the other. XQuery and XPath are also supported for querying XML data. The XMLBeans roadmap includes streaming XMLBeans, intended to overcome the disadvantage that full documents have to be kept in memory.

So it appears that the future of XML processing might lie in hybrid frameworks like XMLBeans, which allow the access strategy to be changed to suit the problem at hand.
