Wednesday, November 19, 2008

Mashup Camp 8

I participated in MashupCamp8 and created a simple mashup in a few hours which I have now open-sourced. It is a blog post editor, or article editor if you prefer, which finds key words in your text as you are writing and displays web search results for those key words next to the editor. Here's my elevator pitch:

When writing an article or blog post, I'll often note places where I should add a link and go back later to add them in. Sometimes searching for good resources to link to uncovers new information which I wish I had known while I was writing the article.

As I was walking around here at Mashup Camp, I saw a few web services which would allow me to do some data mining on raw text (like this blog post I'm writing) and search the web for relevant information.

I created a simple Ajax editor, which is hosted on Google App Engine. It takes the article's text, sends it securely (over HTTPS) to Open Calais which extracts key words then disambiguates, dedupes, and scores them for relevancy. Using this semantic web data from Calais, the editor then performs a web search on each of the keywords and requests results from the Yahoo Search BOSS API in the JSON format. The text analysis and web search is done automatically when the editor detects that you have stopped typing. To avoid excessive and distracting search refreshes, the editor will wait between three and eight seconds before performing the search.

You can try it out here:

Download the source code here.

This simple little mashup was well received, and I think part of the reason is the potential for growth. The fundamental idea, is to have an environment in which relevant data is brought to you as you work. You don't even need to go info hunting, our computers can find relevant information and bring it to you automagically.

For this particular mashup app, I had envisioned a few additional features, the addition of which is left as an exercise to the reader if you should so choose.
  • A WSIWYG editor to create rich HTML instead of just plain text. (I looked at tinyMCE but ran out of time.)
  • Search on combinations of keywords instead of just individual items identified by Calais.
  • Search other information sources, like news results, images, videos.
  • Search on multiple search engines. (Google has an easy to use Ajax search API, but I wanted to try something that was totally new to me.)

Tuesday, November 11, 2008

XML Library with Versioning

I recently created a simple Python library for converting objects to and from XML. Code samples up front, here's how you would define some class to represent some hierarchical XML:
class AtomFeed(XmlElement):
_qname = '{}feed'
title = Title
entries = [Entry]

class Entry(XmlElement):
_qname = '{}entry'
links = [Link]
title = Title
content = Content

class Link(XmlElement):
_qname = '{}link'
rel = 'rel'
address = 'href'

class Url(XmlElement):
_qname = '{}url'

class Title(XmlElement):
_qname = '{}title'
title_type = 'type'
Now for the whys and hows.

For the past few years I've been working with Web Services and most of them use XML to represent the data (though I hope JSON catches on more widely). There are some great XML libraries out there, and my library is based on one of them (ElementTree). XML parsing is certainly nothing new, so why create a new one?

The Why

There are a few limitations with the XML parsing approaches I've used in Python:
  • XML structure isn't documented or available using help()
  • No autocompete for finding elements in the XML
  • If the XML changes in a new version of the web service, my code needs to be rewritten
  • My code interacting with the XML is verbose

Source code can provide a wealth of information, but parsed XML doesn't have the same level of information richness as source code. Between tool tips in IDEs, auto-generated documentation, and autocomplete, having classes loaded for your XML models can bring the tree traversal logic closer to your fingertips. Many software development tools are optimized for working with predefined classes rather than generic XML objects.

However, one of the biggest drawbacks to representing each type of XML element with it's own class is that you end up needing to write lots of class definitions. For this reason I've tried to make the XML class definitions as compact as possible. Specifying a simple XML class only takes two lines of code. For each type of sub-element and each XML attribute, you can add one line of code. You don't need to declare all of the elements or attributes either. The XmlElement will preserve all of the XML which it parses. If there are class members which correspond to a specified sub-element, the element will be placed in that member. Any unspecified elements will be converted to XmlElement instances. You can search over all XML elements (both anticipated members and unanticipated generic objects) using the get_elements method. XML attributes are handled in a similar fashion and can be searched using get_attributes.

I've saved the most unique feature of this library for last: Sometimes web services change the XML definition thereby breaking your code. If it is something small like a change in XML namespace or changing a tag, it seems like such a waste to have to edit lines upon lines of code. To address this kind of problem, this XML library supports versioning. When you parse or generate XML, you can specify the version of the available rules that you'd like to use. You can use the same objects with any version of the web service.

To use versioning, write a class definition with tuples containing the version specific information:
class Control(XmlElement):
_qname = ('{}control', #v1
'{}control') #v2
draft = Draft
uri = 'atomURI'
lang = 'atomLanguageTag'
tag = ('control_tag', 'tag') # v1, v2

class Draft(XmlElement):
_qname = ('{}draft',
If you create an instance of the Control element like this:
c = Control(draft=Draft('yes'), tag='test')
Then you can generate XML for each version like this:
<control xmlns="" 
<control xmlns="" 
Note the difference in XML namespaces in the above. I also added an example of an attribute name which changed between versions, though "tag" doesn't actually belong in AtomPub control (so don't go trying to use it m'kay).

Since this library is open source, you're free to examine how it works and use it however you like. Allow me to highlight a few key points.

The How

Earlier I showed how to define XML element classes which look for specific sub elements and attributes and convert them into member objects. I also mentioned that this XML library handles versioning, meaning that the same object can parse and produce different XML depending on a version parameter. Both of these are accomplished by creating class level rule sets which are built up using introspection the first time an XML conversion is attempted.

In pseudo-code it works like this.
XML --> object
- find out the desired version
- is there an entry for this version in _rule_set?
- if not, look at all XML members of this class
in _members
- create XML matching rules based on each member's type
(and store in _rule_set so we don't need to generate
the rules again)
- iterate over all sub-elements in the XML tree
- sub-elements and attributes which are in the rule set
are converted into the declared type
- sub-elements and attributes which don't fit a rule are
stored in _other_elements or _other_attributes
When generating XML the process is similar but slightly different.
object --> XML
- create an XML tree with the tag and namespace for this
object given the desired version
- look at all members of this class in _members
- tell each member to attach itself to the tree using
it's rules for the desired version
- iterate through _other_elements and _other_attributes
and tell each to attach to the XML tree
Armed with the above explanation, understanding the source code should be a bit easier.

Wednesday, November 05, 2008

Vim Line Length

I happen to be a fan of the vi text editor. I also happen to be a fan of code style guidelines. And sometimes it is very handy to know if a line in your source code has gone over eighty characters long. Some style guides still recommend a max line length of eighty characters, a tradition dating all the way back to IBM punch cards, circa 1928. For an even more impressive feat of long term engineering influence look up how the size of Roman war chariots determined the size of NASA space shuttle booster rockets. But I digress.

To mark the characters which are over the eighty character limit, enter the following either within vim or in your .vimrc file.
:match ErrorMsg '\%>80v.\+'
I've noticed that some blogger templates tend to display fifty-five characters in a fixed width format (as in a <pre> block or <code>). To edit code which I plan to post on Blogger, I sometimes turn on
:match ErrorMsg '\%>55v.\+'