I gather most people involved in microformats are coming from a background heavy in more formally structured data, e.g. RDF, XML, relational databases. I'm coming more from the opposite background: scraping. Recently Phil Jones described a web in which metadata resides in scraping/parsing applications, meaning documents need not be so descriptive, and Danny Ayers predictably responded with an argument for the Semantic Web, in which metadata resides in documents, meaning applications need not be so smart.
In Danny's comments, I tried to point out that the applications Phil predicts can produce the documents Danny predicts. I already do a small amount of this with all my scrapers. On Disemployed, I add location and time information to each job posting and publish that information in a regular format (HTML, RSS 1, RSS 2, or Atom). I could admittedly be structuring this information more formally to better encourage reuse, but the data is there, in any case, where it wasn't before. But this is relatively simple data to add. I know when I found each job post and where it came from, so my application doesn't need to be very smart. What are the limits of a smart application? Could a very crafty application actually make microformats unnecessary?
Let's take one microformat, hCard, and see how guessable the microformat metadata would be if it weren't there, on a scale of zero to ten:
- fn (full name): this could at best be a guess. A name could feasibly contain pretty much any combination of letters. I'm sure someone somewhere has named a child "Asdf Jkl." Microformats are the easiest way to identify fn. 0/10.
- n (name): same here. 0/10.
- nickname: again, no easy way out. 0/10.
- photo: here we have a winner, mostly. I'm guessing eight times out of ten, any image referenced within something identified as hCard information will be a photo. Depending on how lucky we feel, microformats could be dispensed with here. 7/10.
- bday (birthday): this is a bit complicated. Dates follow very standard formats, and we could probably identify dates in a jumble of text with about 95% accuracy. But how do we know if a given date is a birthday? We can assume relatively safely based on proximity to words like "birthday." 9/10.
- adr (address): I would have guessed this would be very hard to identify as a pattern, but Google is already doing this. Of course, Google limits this to US addresses. 5/10.
- label: at first, this appears to be as open-ended as names, but the variety in practice is likely very limited. I would expect a list of a few dozen words likely to occur in a label (e.g. home, domestic) would catch most of them. 7/10.
- tel (telephone): this is a bit complicated. Having an address makes it much easier to tell if a given set of numbers is likely a phone number. Capturing anything that fits the patterns (###) ###-#### or ###-###-#### would get many phone numbers, and I suspect more is possible. 6/10.
- email: This one is easy. An email address must fit a defined pattern, so we can discover all email addresses with no microformat, as evidenced by the proliferation of junk email. 10/10.
- mailer: At any given time, there are only so many known email clients. 8/10.
- tz (time zone): There are only so many timezones, and not too many ways to describe them. 9/10.
- geo: Latitude and longitude information is pretty much useless if it doesn't follow a certain pattern (decimal numbers between -90 and 90 for latitude and between -180 and 180 for longitude), but that doesn't mean all numbers that follow this pattern are geo codes. 6/10.
- title: Theoretically unlimited, but practically limited. 7/10.
- role: Words ending in "er" would catch a lot. Check for proximity to words like "job," "work," or "professional." 5/10.
- logo: Just like photo, only probably smaller. 7/10.
- agent: I had to look this one up. Auto-discovery doesn't look good. 0/10.
- org: Just like names, only worse. 0/10.
- categories: Could be anything. 0/10.
- note: Again, anything. 0/10.
- rev: Dates near words like "updated" or "modified." 7/10.
- sort-string: Usually last word in the name. 6/10.
- sound: Sounds have defined formats. 10/10.
- uid: Pass. 0/10.
- url: First standard link. 7/10.
- class: Pass. 0/10.
- key: Keys follow patterns. 10/10.
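To make a few of these guesses concrete, here is a minimal sketch in Python of the pattern-based approach for email, tel, bday, and rev. Everything in it is an assumption for illustration: the function names (`guess_hcard`, `dates_near`), the hint words, the 40-character proximity window, and the decision to recognize only ISO-style dates. The email pattern is deliberately loose and is not a full address validator.

```python
import re

# Loose email pattern: good enough for discovery, not strict validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# The two US-style patterns named in the list: (###) ###-#### and ###-###-####.
TEL_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")

# ISO-style dates only (YYYY-MM-DD); real pages use many more formats.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical hint words; a real scraper would need a longer list.
BDAY_HINTS = ("birthday", "born", "bday")
REV_HINTS = ("updated", "modified")


def dates_near(text, hints, window=40):
    """Return dates with one of the hint words within `window` characters."""
    lowered = text.lower()
    found = []
    for match in DATE_RE.finditer(text):
        start = max(0, match.start() - window)
        context = lowered[start:match.end() + window]
        if any(hint in context for hint in hints):
            found.append(match.group())
    return found


def guess_hcard(text):
    """Guess a few hCard properties from unmarked text."""
    return {
        "email": EMAIL_RE.findall(text),
        "tel": TEL_RE.findall(text),
        "bday": dates_near(text, BDAY_HINTS),
        "rev": dates_near(text, REV_HINTS),
    }


if __name__ == "__main__":
    sample = (
        "Jane Doe, birthday 1980-04-12. Call (555) 123-4567 or 555-987-6543, "
        "or write jane@example.com. Page last updated 2006-01-15."
    )
    print(guess_hcard(sample))
```

Note that nothing here can find "Jane Doe" itself, which is the point of the zeros in the list: the patterned fields fall out of a few regular expressions, while the name fields have no pattern to match.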
Average: 4.8/10. In general there are some areas in which microformats are entirely unnecessary, some in which they are entirely necessary, and some in between. Of course, these are mostly rough estimates of the potential accuracy of intelligent scraping. The actual accuracy would need to be determined by writing a scraper and pitting it against some actual data.
In any case, microformats appear well worth the expense to capture that 52% (or however much) of the existing information. Even though email addresses are entirely identifiable without any microformat, as long as we're wrapping names in name tags, it makes sense to wrap the email addresses at the same time so a parser doesn't need to be any smarter.
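For anyone who wants to redo the tally, the average and the three groups (entirely necessary, entirely unnecessary, in between) can be recomputed from the scores in the list. This is only a sketch: the scores are copied from the list above, while the function name `summarize` and the keys of its result are made up for illustration.

```python
from statistics import mean

# Scores copied from the list above (property -> guessability out of ten).
SCORES = {
    "fn": 0, "n": 0, "nickname": 0, "photo": 7, "bday": 9, "adr": 5,
    "label": 7, "tel": 6, "email": 10, "mailer": 8, "tz": 9, "geo": 6,
    "title": 7, "role": 5, "logo": 7, "agent": 0, "org": 0,
    "categories": 0, "note": 0, "rev": 7, "sort-string": 6, "sound": 10,
    "uid": 0, "url": 7, "class": 0, "key": 10,
}


def summarize(scores):
    """Group properties by whether a microformat looks required or optional."""
    required = sorted(name for name, score in scores.items() if score == 0)
    optional = sorted(name for name, score in scores.items() if score == 10)
    return {
        "average": round(mean(scores.values()), 1),
        "microformat_required": required,
        "microformat_optional": optional,
        "in_between": len(scores) - len(required) - len(optional),
    }


if __name__ == "__main__":
    for key, value in summarize(SCORES).items():
        print(key, value)
```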
While not the absolute simplest method, microformats appear to be the lowest common denominator of structured documents. So now I think I was wrong when I wrote that we're headed towards a "semantic web" in which the semantics are forced onto websites by browsers and other intermediaries. I still expect that will happen (as I notice it happening, and cause it to happen), but given the practical limits of the smart-application method of connecting the world of information, it will only work as a bridge to a semantic web composed of metadata-rich documents.