A Python RDFa Parser

If you’re starting to look at RDFa you may already have come across one of its key design concepts: that it is ‘generic’. By that we mean that once you know the handful of basic rules that make up RDFa, you can add any type of metadata you like to your XHTML documents. And because the rules are fixed, it’s clear what happens when more than one vocabulary is used in a document.

The advantage of this approach is that you only need one parser: making the metadata in your document available to a processor simply means applying the RDFa rules, without needing to know anything about what the markup is meant to mean. To give an example, in the hCard microformat you can say this:

<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>

But to understand which of the values in the class attribute apply to the @href and which to the content of the anchor (or both, or neither) you need to look at the hCard specification; in other words, you need a specialised hCard processor. The same applies to every microformat, since each one, much like GRDDL, relies on a specialised processor.
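To make the point concrete, here is a minimal Python sketch of what a specialised processor has to carry around. The table `HCARD_VALUE_SOURCE` and the class `HCardSketch` are hypothetical names invented for illustration, and the table is a hand-written simplification rather than a transcription of the hCard specification; it only handles flat, un-nested elements like the one above.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written table: which part of the element each hCard
# class name takes its value from. This knowledge lives outside the markup,
# which is why a generic parser cannot work it out for itself.
HCARD_VALUE_SOURCE = {"email": "href", "fn": "text"}


class HCardSketch(HTMLParser):
    """Toy hCard reader for flat elements only (assumption: no nesting)."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._want_text = []  # class names on the current element that use its text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        self._want_text = []
        for name in (a.get("class") or "").split():
            source = HCARD_VALUE_SOURCE.get(name)
            if source == "href" and a.get("href"):
                self.values[name] = a["href"]
            elif source == "text":
                self._want_text.append(name)

    def handle_data(self, data):
        for name in self._want_text:
            self.values[name] = self.values.get(name, "") + data

    def handle_endtag(self, tag):
        self._want_text = []


reader = HCardSketch()
reader.feed('<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>')
reader.close()
print(reader.values)
```

Every new microformat would need its own version of that table, which is exactly the cost the generic approach is trying to avoid.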

RDFa has from the beginning taken a different approach and the aim has always been to create a set of rules that are independent of any particular metadata language. The RDFa equivalent of the previous example would be:

<a rel="email" property="fn" href="mailto:jfriday@host.com">Joe Friday</a>

In XHTML the rel attribute already does the job we need it to do, namely providing some metadata about the relationship between the current document and some other resource, so it’s clear that email qualifies @href.

Unfortunately, there is nothing straightforward in XHTML that can be used to flag up the text value. The microformats approach is to use the class attribute, but whilst this logically tells us something about the object that has the class (a span or a, for example), it doesn’t feel right to say that this is also a property of the document that contains the markup. (In RDF terms, if we say that “Mark is of type author”, that doesn’t necessarily mean “This document has an author of Mark”.)

We therefore decided in RDFa to make further use of the attribute that is already used on the meta element–@property–since in RDFa we always try to build on what authors already know how to do, and are comfortable with.
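The two rules described above (@rel qualifies @href, @property qualifies the text content) can be sketched in a few lines of Python. This is only a toy under stated assumptions, not Elias’ parser or the full RDFa processing model: the class name `TinyRDFaSketch` and the subject URL are hypothetical, the input is assumed to be well-formed XHTML in which every element is closed, and things like @about and namespace prefixes are ignored entirely.

```python
from html.parser import HTMLParser


class TinyRDFaSketch(HTMLParser):
    """Toy extractor for two rules only: @rel + @href, and @property + text."""

    def __init__(self, subject):
        super().__init__()
        self.subject = subject  # the document the statements are about
        self.triples = []
        self._open = []         # one (property, text chunks) pair per open element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Rule 1: @rel qualifies @href, a link to some other resource.
        if a.get("rel") and a.get("href"):
            for rel in a["rel"].split():
                self.triples.append((self.subject, rel, a["href"]))
        self._open.append((a.get("property"), []))

    def handle_data(self, data):
        # Text counts towards every open element that carries @property.
        for prop, chunks in self._open:
            if prop:
                chunks.append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        prop, chunks = self._open.pop()
        # Rule 2: @property qualifies the element's text content.
        if prop:
            text = "".join(chunks).strip()
            for name in prop.split():
                self.triples.append((self.subject, name, text))


parser = TinyRDFaSketch("http://example.org/doc")  # hypothetical document URL
parser.feed('<a rel="email" property="fn" href="mailto:jfriday@host.com">Joe Friday</a>')
parser.close()
for triple in parser.triples:
    print(triple)
```

Note that nothing in the sketch mentions email or fn: the same code would pick up terms from any other vocabulary without modification.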

Although I said above that “you only need one parser”, I didn’t of course mean that you only need one parser! I just meant that since the process is clearly defined, and is independent of any vocabulary, once you have written your parser you don’t need to write a new one when someone creates a new set of metadata terms. Whilst we were working on the language, the only parsers for RDFa were XSLT-based. Recently, however, Ben Adida implemented a JavaScript parser (or to be more precise, a flexible parsing framework), and yesterday Elias Torres announced a Python parser.

Elias’ parser is pretty impressive in its own right, but it is particularly important because it is the first parser to be written by someone not involved in writing the specification…and has therefore shown up some glaring inconsistencies in the spec! Not content with that, Elias has also gone on to create a web service that will retrieve any URL you give it, parse the RDFa, and give you back the RDF/XML.

This is all excellent work, and a substantial taste of what RDFa will be able to deliver.
