Tuesday, November 11, 2008

XML Library with Versioning

I recently created a simple Python library for converting objects to and from XML. Code samples up front: here's how you would define a few classes to represent some hierarchical XML:
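# Leaf elements are defined first so the names exist when other classes refer to them.
class Title(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}title'
    title_type = 'type'

# Content is used by Entry below; assumed to follow the same two-line pattern.
class Content(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}content'

class Url(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}url'

class Link(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}link'
    rel = 'rel'
    address = 'href'

class Entry(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}entry'
    links = [Link]
    title = Title
    content = Content

class AtomFeed(XmlElement):
    _qname = '{http://www.w3.org/2005/Atom}feed'
    title = Title
    entries = [Entry]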
Now for the whys and hows.

For the past few years I've been working with Web Services, and most of them use XML to represent the data (though I hope JSON catches on more widely). There are some great XML libraries out there, and my library is based on one of them (ElementTree). XML parsing is certainly nothing new, so why create a new library?

The Why

There are a few limitations with the XML parsing approaches I've used in Python:
  • XML structure isn't documented or available using help()
  • No autocomplete for finding elements in the XML
  • If the XML changes in a new version of the web service, my code needs to be rewritten
  • My code interacting with the XML is verbose
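To make the verbosity point concrete, here is a rough sketch of what reading an Atom feed looks like with plain ElementTree. The sample feed and the helper name alternate_links are made up for illustration; only the standard library is used.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Made-up sample feed, just enough structure to traverse.
FEED_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Example feed</title>
  <entry>
    <title type="text">First post</title>
    <link rel="alternate" href="http://example.com/1"/>
  </entry>
</feed>"""


def alternate_links(feed_xml):
    """Return (entry title, href) pairs for each entry's rel="alternate" link."""
    feed = ET.fromstring(feed_xml)
    results = []
    for entry in feed.findall(ATOM + 'entry'):
        title = entry.find(ATOM + 'title')
        for link in entry.findall(ATOM + 'link'):
            if link.get('rel') == 'alternate':
                # Every lookup repeats the namespace and the tag name as strings.
                results.append((title.text if title is not None else None,
                                link.get('href')))
    return results


print(alternate_links(FEED_XML))
```

Nothing here is wrong, but the tag names live in string literals, so help() and autocomplete have nothing to work with, and a namespace change means touching every lookup.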

Source code can provide a wealth of information, but parsed XML doesn't have the same level of information richness as source code. Between tool tips in IDEs, auto-generated documentation, and autocomplete, having classes loaded for your XML models can bring the tree traversal logic closer to your fingertips. Many software development tools are optimized for working with predefined classes rather than generic XML objects.

However, one of the biggest drawbacks to representing each type of XML element with its own class is that you end up needing to write lots of class definitions. For this reason I've tried to make the XML class definitions as compact as possible. Specifying a simple XML class only takes two lines of code. For each type of sub-element and each XML attribute, you add one more line. You don't need to declare all of the elements or attributes either: an XmlElement preserves all of the XML that it parses. If a class member corresponds to a sub-element found in the XML, the element is placed in that member. Any unspecified elements are converted to generic XmlElement instances. You can search over all XML elements (both anticipated members and unanticipated generic objects) using the get_elements method. XML attributes are handled in a similar fashion and can be searched using get_attributes.

I've saved the most distinctive feature of this library for last. Sometimes web services change their XML definition, thereby breaking your code. If the change is something small, like a different XML namespace or a renamed tag, it seems like such a waste to have to edit lines upon lines of code. To address this kind of problem, this XML library supports versioning. When you parse or generate XML, you can specify which version of the available rules you'd like to use. You can use the same objects with any version of the web service.

To use versioning, write a class definition with tuples containing the version-specific information:
# Draft is defined first so the name exists when Control refers to it.
class Draft(XmlElement):
    _qname = ('{http://purl.org/atom/app#}draft',    # v1
              '{http://www.w3.org/2007/app}draft')   # v2

class Control(XmlElement):
    _qname = ('{http://purl.org/atom/app#}control',  # v1
              '{http://www.w3.org/2007/app}control') # v2
    draft = Draft
    uri = 'atomURI'
    lang = 'atomLanguageTag'
    tag = ('control_tag', 'tag')  # v1, v2
If you create an instance of the Control element like this:
c = Control(draft=Draft('yes'), tag='test')
Then you can generate XML for each version like this:
c.to_string(1)
returns
<control xmlns="http://purl.org/atom/app#"
         control_tag="test">
  <draft>yes</draft>
</control>
while
c.to_string(2)
returns
<control xmlns="http://www.w3.org/2007/app"
         tag="test">
  <draft>yes</draft>
</control>
Note the difference in XML namespaces in the above. I also added an example of an attribute name which changed between versions, though "tag" doesn't actually belong in AtomPub control (so don't go trying to use it m'kay).
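The tuple convention boils down to "take the entry for the requested version, counting from 1". A minimal sketch of that lookup follows. The helper name for_version is hypothetical, and falling back to the last entry when the version number runs past the end of the tuple is an assumption, not something the class definitions above spell out.

```python
def for_version(value, version=1):
    """Pick the entry for a 1-based version from a tuple of per-version values.

    Plain (non-tuple) values apply to every version. Falling back to the
    last entry for an out-of-range version is an assumption of this sketch.
    """
    if isinstance(value, tuple):
        if 1 <= version <= len(value):
            return value[version - 1]
        return value[-1]
    return value


CONTROL_QNAME = ('{http://purl.org/atom/app#}control',   # v1
                 '{http://www.w3.org/2007/app}control')  # v2

print(for_version(CONTROL_QNAME, 1))
print(for_version(CONTROL_QNAME, 2))
print(for_version('atomURI', 2))  # not a tuple, so the same for every version
```

Because plain strings pass straight through, only the members that actually changed between versions need the tuple form.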

Since this library is open source, you're free to examine how it works and use it however you like. Allow me to highlight a few key points.

The How

Earlier I showed how to define XML element classes which look for specific sub-elements and attributes and convert them into member objects. I also mentioned that this XML library handles versioning, meaning that the same object can parse and produce different XML depending on a version parameter. Both of these are accomplished by creating class-level rule sets which are built up using introspection the first time an XML conversion is attempted.

In pseudo-code it works like this.
XML --> object
- find out the desired version
- is there an entry for this version in _rule_set?
    - if not, look at all XML members of this class
      in _members
    - create XML matching rules based on each member's type
      (and store in _rule_set so we don't need to generate
      the rules again)
- iterate over all sub-elements in the XML tree
    - sub-elements and attributes which are in the rule set
      are converted into the declared type
    - sub-elements and attributes which don't fit a rule are
      stored in _other_elements or _other_attributes
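The parsing steps above can be sketched in a few dozen lines on top of ElementTree. This is an illustration of the idea under simplifying assumptions, not the library's source: the Element base class, pick, _rules, and from_tree are hypothetical names, members are discovered with vars() rather than a _members list, and inherited members are ignored. Only _rule_set, _other_elements, and _other_attributes are borrowed from the pseudo-code.

```python
import xml.etree.ElementTree as ET


def pick(value, version):
    """Return the entry for a 1-based version from a tuple, or the value itself."""
    if isinstance(value, tuple):
        return value[min(version, len(value)) - 1]
    return value


class Element:
    """Hypothetical stand-in for XmlElement (parsing only)."""
    _qname = None

    def __init__(self):
        self.text = None
        self._other_elements = []    # sub-elements no rule matched
        self._other_attributes = {}  # attributes no rule matched

    @classmethod
    def _rules(cls, version):
        # Give each class its own cache rather than sharing an inherited one.
        if '_rule_set' not in vars(cls):
            cls._rule_set = {}
        if version not in cls._rule_set:
            elements, attributes = {}, {}
            for name, value in list(vars(cls).items()):
                if name.startswith('_'):
                    continue
                if isinstance(value, list) and value:      # [Class]: repeated sub-element
                    elements[pick(value[0]._qname, version)] = (name, value[0], True)
                elif isinstance(value, type) and issubclass(value, Element):
                    elements[pick(value._qname, version)] = (name, value, False)
                elif isinstance(value, (str, tuple)):      # XML attribute name(s)
                    attributes[pick(value, version)] = name
            cls._rule_set[version] = (elements, attributes)
        return cls._rule_set[version]

    @classmethod
    def from_tree(cls, tree, version=1):
        elements, attributes = cls._rules(version)
        obj = cls()
        obj._tag = tree.tag
        obj.text = tree.text
        # Declared members start out empty on the instance.
        for name, _, repeating in elements.values():
            setattr(obj, name, [] if repeating else None)
        for name in attributes.values():
            setattr(obj, name, None)
        for child in tree:
            if child.tag in elements:
                name, member_class, repeating = elements[child.tag]
                parsed = member_class.from_tree(child, version)
                if repeating:
                    getattr(obj, name).append(parsed)
                else:
                    setattr(obj, name, parsed)
            else:
                obj._other_elements.append(Element.from_tree(child, version))
        for attr_name, attr_value in tree.attrib.items():
            if attr_name in attributes:
                setattr(obj, attributes[attr_name], attr_value)
            else:
                obj._other_attributes[attr_name] = attr_value
        return obj


class Title(Element):
    _qname = '{http://www.w3.org/2005/Atom}title'
    title_type = 'type'


class Link(Element):
    _qname = '{http://www.w3.org/2005/Atom}link'
    rel = 'rel'
    address = 'href'


class Entry(Element):
    _qname = '{http://www.w3.org/2005/Atom}entry'
    title = Title
    links = [Link]


entry = Entry.from_tree(ET.fromstring(
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<title type="text">Hello</title>'
    '<link rel="alternate" href="http://example.com/1" hreflang="en"/>'
    '<summary>Not declared on Entry</summary>'
    '</entry>'))
print(entry.title.text, entry.links[0].address)
print(entry.links[0]._other_attributes, len(entry._other_elements))
```

The undeclared summary element and the undeclared hreflang attribute are not dropped; they land in _other_elements and _other_attributes, which is what lets the object round-trip XML it was never told about.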
Generating XML follows the same idea in the other direction.
object --> XML
- create an XML tree with the tag and namespace for this
  object given the desired version
- look at all members of this class in _members
    - tell each member to attach itself to the tree using
      its rules for the desired version
- iterate through _other_elements and _other_attributes
  and tell each to attach to the XML tree
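Going the other way can be sketched just as briefly. Again this is an illustration under assumptions rather than the library's code: Element, pick, and to_tree are hypothetical names, and members are found with vars() instead of _members. The Control and Draft definitions mirror the versioning example from earlier in the post.

```python
import xml.etree.ElementTree as ET


def pick(value, version):
    """Return the entry for a 1-based version from a tuple, or the value itself."""
    if isinstance(value, tuple):
        return value[min(version, len(value)) - 1]
    return value


class Element:
    """Hypothetical stand-in for XmlElement (generation only)."""
    _qname = None

    def __init__(self, text=None, **members):
        self.text = text
        self._other_elements = []
        self._other_attributes = {}
        for name, value in members.items():
            setattr(self, name, value)

    def to_tree(self, version=1):
        # Tag and namespace come from the qname for the requested version.
        tree = ET.Element(pick(self._qname, version))
        tree.text = self.text
        for name, declared in vars(type(self)).items():
            if name.startswith('_'):
                continue
            value = vars(self).get(name)  # only values set on this instance
            if value is None:
                continue
            if isinstance(declared, (str, tuple)):   # XML attribute
                tree.set(pick(declared, version), value)
            elif isinstance(declared, list):         # repeated sub-element
                for item in value:
                    tree.append(item.to_tree(version))
            elif isinstance(declared, type):         # single sub-element
                tree.append(value.to_tree(version))
        # Anything that was preserved but never declared is attached last.
        for other in self._other_elements:
            tree.append(other.to_tree(version))
        for attr_name, attr_value in self._other_attributes.items():
            tree.set(attr_name, attr_value)
        return tree

    def to_string(self, version=1):
        return ET.tostring(self.to_tree(version), encoding='unicode')


class Draft(Element):
    _qname = ('{http://purl.org/atom/app#}draft',
              '{http://www.w3.org/2007/app}draft')


class Control(Element):
    _qname = ('{http://purl.org/atom/app#}control',
              '{http://www.w3.org/2007/app}control')
    draft = Draft
    tag = ('control_tag', 'tag')  # v1, v2


c = Control(draft=Draft('yes'), tag='test')
print(c.to_string(1))
print(c.to_string(2))
```

ElementTree writes the namespace with a generated prefix (ns0:control) instead of a default xmlns declaration, so the strings won't match the earlier output character for character, but they describe the same XML.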
Armed with the above explanation, understanding the source code should be a bit easier.
