I gather most people involved in microformats come from a background heavy in more formally structured data, e.g. RDF, XML, relational databases. I come from more of the opposite background: scraping. Recently Phil Jones described a web in which metadata resides in scraping and parsing applications, meaning documents need not be so descriptive, and Danny Ayers predictably responded with an argument for the Semantic Web, in which metadata resides in documents, meaning applications need not be so smart.

In Danny's comments, I tried to point out that the applications Phil predicts can produce the documents Danny predicts. I already do a small amount of this with all my scrapers. On Disemployed, I add location and time information to each job posting and publish that information in a regular format (HTML, RSS 1, RSS 2, or Atom). I could admittedly be structuring this information more formally to better encourage reuse, but the data is there, in any case, where it wasn't before. But this is relatively simple data to add. I know when I found each job post and where it came from, so my application doesn't need to be very smart. What are the limits of a smart application? Could a very crafty application actually make microformats unnecessary?

Let's take one microformat, hCard, and see how guessable the microformat metadata would be if it weren't there, on a scale of zero to ten:

  • fn (full name): this could at best be a guess. A name could feasibly contain pretty much any combination of letters. I'm sure someone somewhere has named a child "Asdf Jkl." Microformats are the easiest way to identify fn. 0/10.
  • n (name): same here. 0/10.
  • nickname: again, no easy way out. 0/10.
  • photo: here we have a winner, mostly. I'm guessing eight times out of ten, any image referenced within something identified as hCard information will be a photo. Depending on how lucky we feel, microformats could be dispensed with here. 7/10.
  • bday (birthday): this is a bit complicated. Dates follow very standard formats, and we could probably identify dates in a jumble of text with about 95% accuracy. But how do we know if a given date is a birthday? We can assume relatively safely based on proximity to words like "birthday." 9/10.
  • adr (address): I would have guessed this would be very hard to identify as a pattern, but Google is already doing this. Of course, Google is limiting to US addresses. 5/10.
  • label: at first, this appears to be as open-ended as names, but the variety in practice is likely very limited. I would expect a list of a few dozen words likely to occur in a label (e.g. home, domestic, etc.) would catch maybe 7/10.
  • tel (telephone): this is a bit complicated. Having an address makes it much easier to tell if a given set of numbers is likely a phone number. Capturing anything that fits the patterns (###) ###-#### or ###-###-#### would get many phone numbers, and I suspect more is possible. 6/10.
  • email: This one is easy. An email address must fit a defined pattern, so we can discover all email addresses with no microformat, as evidenced by the proliferation of junk email. 10/10.
  • mailer: At any given time, there are only so many known email clients. 8/10.
  • tz (time zone): There are only so many timezones, and not too many ways to describe them. 9/10.
  • geo: Latitude and longitude information is pretty much useless if it doesn't follow a certain pattern (decimal numbers, with latitude between -90 and 90 and longitude between -180 and 180), but that doesn't mean all numbers that follow this pattern are geo codes. 6/10.
  • title: Theoretically unlimited, but practically limited. 7/10.
  • role: Words ending in "er" would catch a lot. Check for proximity to words like "job," "work," or "professional." 5/10.
  • logo: Just like photo, only probably smaller. 7/10.
  • agent: I had to look this one up. Auto-discovery doesn't look good. 0/10.
  • org: Just like names, only worse. 0/10.
  • categories: Could be anything. 0/10.
  • note: Again, anything. 0/10.
  • rev: Dates near words like "updated" or "modified." 7/10.
  • sort-string: Usually last word in the name. 6/10.
  • sound: Sounds have defined formats. 10/10.
  • uid: Pass. 0/10.
  • url: First standard link. 7/10.
  • class: Pass. 0/10.
  • key: Keys follow patterns. 10/10.
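
To make a few of the easier guesses above concrete, here is a minimal Python sketch of pattern-based guessing for email, tel, and bday. The regular expressions, the `guess_fields` name, the hint words, and the 20-character `window` are all my own assumptions for illustration; they are deliberately loose and would need testing against real data before anyone trusted the scores above.

```python
import re

# Deliberately loose patterns: illustrative guesses, not validators.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers only: (###) ###-#### or ###-###-####
TEL_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")
# ISO-style dates only (YYYY-MM-DD); real text uses many more formats.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Hypothetical hint words; a date is only a bday guess if one is nearby.
BDAY_HINTS = ("birthday", "born")


def guess_fields(text, window=20):
    """Return a dict of guessed hCard-like fields found in free text."""
    guesses = {
        "email": EMAIL_RE.findall(text),
        "tel": TEL_RE.findall(text),
        "bday": [],
    }
    lowered = text.lower()
    for match in DATE_RE.finditer(text):
        # Look only at the few characters just before the date for a hint.
        context = lowered[max(0, match.start() - window):match.start()]
        if any(hint in context for hint in BDAY_HINTS):
            guesses["bday"].append(match.group())
    return guesses
```

Note what is missing: nothing here even attempts fn, org, or note, which is the point of the zeros in the list.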

Average: 4.8/10. In general there are some areas in which microformats are entirely unnecessary, some in which they are entirely necessary, and some in between. Of course, these are mostly rough estimates of the potential accuracy of intelligent scraping. The actual accuracy would need to be determined by writing a scraper and pitting it against some actual data.

In any case, microformats appear well worth the expense to capture that 52% (or however much) of the existing information. Even though email addresses are entirely identifiable without any microformat, as long as we're wrapping names in name tags, it makes sense to wrap the email addresses at the same time so a parser doesn't need to be any smarter.
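
For contrast with guessing, here is a rough sketch of how little a parser has to know once the class names are in the markup. It uses Python's standard `html.parser`; the `HCardReader` class, the `read_hcard` helper, the three-property subset, and the sample markup are assumptions made up for this example. It is not a complete hCard parser: it reads element text only (a real parser would, for instance, prefer the `href` of an email link) and it assumes every opened tag is closed.

```python
from html.parser import HTMLParser

# Assumed subset of hCard property class names to look for.
HCARD_PROPS = {"fn", "email", "tel"}

# Hypothetical sample markup, invented for this sketch.
SAMPLE = """
<div class="vcard">
  <span class="fn">Asdf Jkl</span>
  <a class="email" href="mailto:asdf@example.com">asdf@example.com</a>
  <span class="tel">515-555-0123</span>
</div>
"""


class HCardReader(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.found = {}    # property name -> list of text values
        self._stack = []   # one entry per open element: its matched props

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append([c for c in classes if c in HCARD_PROPS])

    def handle_endtag(self, tag):
        # Assumes balanced markup; unclosed tags such as <br> would
        # throw the stack off in this simplified version.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every enclosing element that names a property.
        for props in self._stack:
            for prop in props:
                self.found.setdefault(prop, []).append(text)


def read_hcard(html):
    """Return {property: [values]} for the properties in HCARD_PROPS."""
    reader = HCardReader()
    reader.feed(html)
    reader.close()
    return reader.found
```

Calling `read_hcard(SAMPLE)` returns the name, email address, and phone number keyed by property, with no pattern matching or proximity heuristics involved.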

While not the absolute simplest method, microformats appear to be the lowest common denominator of structured documents. So now I think I was wrong when I wrote that we're headed towards a "semantic web" in which the semantics are forced onto websites by browsers and other intermediaries. I still expect that will happen (as I notice it happening, and cause it to happen), but given the practical limits of the smart-application method of connecting the world of information, it will only work as a bridge to a semantic web composed of metadata-rich documents.

 

I recently worked on a website for the Iowa Military Veterans Band for my day job. It's a static site, which is not my primary interest. Making static websites is more interesting to me than most other tasks, but I'd much rather be working on something dynamic and functional. So I made the site functional in ways no one will ever use.

If you take any page with contact or calendar information from the IMVB site and feed it into X2V, you'll get the relevant information as vCard or iCal, which you can then import into most address book and calendar applications. This admittedly seems pretty useless at first, given how unlikely it is that anyone would want to import such information into a desktop application.

I did this mostly to test out the usefulness of microformats. I had been reading about microformats for a few weeks, so I thought this would be a good opportunity to try them out. I probably spent about 20 extra minutes adding and testing the microformats, which is relatively little given the enormous time savings for the first person who wants to import all ninety-some IMVB members into her address book.
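
To give a sense of what the address book ends up importing, here is a small Python sketch that turns already-extracted contact values into a minimal vCard 3.0 record. This is not how X2V works internally; the `to_vcard` function and its parameters are hypothetical, the name splitting is a naive assumption, and escaping of commas, semicolons, and backslashes in values is left out.

```python
def to_vcard(full_name, email=None, tel=None):
    """Build a minimal vCard 3.0 string from already-extracted values."""
    # Naive split: treat the last word as the family name. This is an
    # assumption that fails for many real names.
    given, _, family = full_name.rpartition(" ")
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{full_name}",
        # N fields: family; given; additional; prefixes; suffixes
        f"N:{family};{given};;;",
    ]
    if email:
        lines.append(f"EMAIL:{email}")
    if tel:
        lines.append(f"TEL:{tel}")
    lines.append("END:VCARD")
    # vCard content lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

Running this once per member and concatenating the results would give a single file that most address book applications can import in one step.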

And this is only what can be done with microformats today. I imagine a future in which X2V is unnecessary because microformat readers are built into browsers. Where Safari and Firefox today recognize syndication feeds and allow users to import that information into a suitable application with a single click, future browsers could do the same with various microformats.

Unfortunately, this future will likely be slow in coming, as microformats suffer from the same chicken-and-egg problem that made syndication adoption so slow: nobody wants a reader application with no content, and nobody cares to produce content with no readers. But because microformats are starting mostly with existing formats like vCard and iCal (and soon Atom), perhaps the future won't be so slow to arrive. In any case, I've done my part to spread microformats and create a more semantic web, and I see no reason not to continue doing so in the future.