Yesterday I made a Greasemonkey script to detect telephone numbers in hCards and wrap them in callto: links to launch VoIP tools (e.g. Skype). This is the kind of thing I do to satisfy my own curiosity, assuming no one will ever use it. Another such project was the Google hCalendar Greasemonkey script I did a while back, except I did that one in exchange for a free book. And it's now being used by Yahoo.
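The telephone idea boils down to a small text transformation. Here is a rough sketch of it in Python rather than the JavaScript a Greasemonkey script runs as. The regular expression, the function names, and the assumption that the class attribute is exactly "tel" are simplifications for illustration, not a description of the actual script.

```python
import re

# Matches an element whose class attribute is exactly "tel" and whose content
# is plain text. Real hCards can carry several class names or nested
# type/value elements; this sketch deliberately ignores those cases.
TEL_PATTERN = re.compile(r'<(?P<tag>\w+) class="tel">(?P<number>[^<]+)</(?P=tag)>')


def normalize(number):
    """Keep only the digits, plus a leading '+' if the number started with one."""
    digits = re.sub(r"\D", "", number)
    return "+" + digits if number.strip().startswith("+") else digits


def wrap_tel(match):
    """Rebuild the matched element with its text wrapped in a callto: link."""
    tag, number = match.group("tag"), match.group("number")
    return f'<{tag} class="tel"><a href="callto:{normalize(number)}">{number}</a></{tag}>'


def add_callto_links(html):
    """Wrap every simple tel element found in the HTML string."""
    return TEL_PATTERN.sub(wrap_tel, html)


if __name__ == "__main__":
    print(add_callto_links('<span class="tel">(712) 555-0100</span>'))
```

Working on a string with a regular expression is only for brevity; in a browser the same step would walk the DOM instead.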
The Western Iowa Advantage website went "live" (no longer a placeholder) yesterday. I’ve been working on it, along with other people and among other projects, for the past month or so. Everyone at work seems pretty excited about the result. I suspect the enthusiasm is largely due to the visual look of the site. It’s pretty. People like pretty.
But what I find most interesting about the site is something no one else will ever notice: it’s very semantic. The markup describes the data. The news is all hAtom, the events are all hCalendar, and the personal and organization information is all hCard. You can run my Greasemonkey script and import the events into Google Calendar. You can run the hCards through Brian Suda's X2V and get them into your address book. You can use Chris Casciano's NetNewsWire script to subscribe to the news without bothering with a separate feed (although there is a separate feed too).
And who is going to do these things? I expect absolutely no one. Certainly no one I know of using the site. So why do I bother? I don’t know. I don’t know why I like data so much. I don’t know why people like pretty things so much. Maybe some day I’ll figure it all out. Meanwhile, I make websites.
Google hCalendar is a Firefox Greasemonkey script I made. It looks for pages with vevents and inserts a button to add each found event to Google Calendar.
I'm still working out some time zone oddities. Apparently many of the sites using hCalendar have improper time zone markup (e.g. every event is marked as UTC-7 at Upcoming.org), but it otherwise seems to work fine. Now I'm looking forward to my free book. Oddly enough, I'm actually working on another project right now in exchange for free magazines. You can keep your attention economy; I'm going back to bartering. Will code for interesting reading.
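For anyone curious what "looking for vevents" involves, here is a minimal sketch in Python (the real script is JavaScript running in the browser, so this is an illustration, not its code). It assumes events are marked up with class="vevent", a class="summary" element, and a dtstart carried in a title attribute, as in the abbr pattern. The class and function names are made up for the example. The last function shows why a wrong UTC offset in the markup matters: the offset is trusted as written.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class VEventParser(HTMLParser):
    """Collect summary and dtstart from each class="vevent" container.

    Simplification: assumes well-formed markup with no void elements
    (such as <br> or <img>) inside the vevent, since those have no end tag.
    """

    def __init__(self):
        super().__init__()
        self.events = []    # one dict per vevent found
        self._depth = 0     # nesting depth inside the current vevent (0 = outside)
        self._field = None  # property whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif "vevent" in classes:
            self._depth = 1
            self.events.append({})
        if self._depth:
            if "dtstart" in classes and attrs.get("title"):
                self.events[-1]["dtstart"] = attrs["title"]
            elif "summary" in classes:
                self._field = "summary"

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
        self._field = None

    def handle_data(self, data):
        if self._depth and self._field:
            event = self.events[-1]
            event[self._field] = event.get(self._field, "") + data


def find_events(html):
    parser = VEventParser()
    parser.feed(html)
    return parser.events


def to_utc(dtstart):
    """Convert an ISO 8601 dtstart with an explicit offset to UTC.

    A value without an offset would be treated as local time, and a wrong
    offset in the markup produces a wrong UTC time.
    """
    return datetime.fromisoformat(dtstart).astimezone(timezone.utc)
```

Calling find_events on a page's HTML returns a list of dictionaries, one per event, which is all a calendar button would need to get started.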
I took a stab at adding hAtom markup to the Microformat Base (and prettied it up a bit). I'm not sure if it's valid hAtom, because there's not yet anything to validate hAtom. But it's valid XHTML and it's structured information, so it can be parsed to syndicate this data.
For example, someone could (and should) subscribe to all Microformat Base events for 2006, and run each new page through lifelint to generate RDF or iCal files, which can then be combined to create a yearly calendar. Different calendars could be generated from different searches, and you could even pull tags out of the pages, look up the tags on Flickr, and use the related photos as monthly calendar images for printable calendars.
All the data is there, structured, waiting to be parsed and used for something interesting.
The launch of Google Base inspired a bit of armchair quarterbacking about how Google might have done it differently. One suggestion, popular (of course) among the microformats community, was that Google could use microformats to remove the need for submission to their base and leverage the distributed nature of the web.
Personally, I suspect there's just not enough microformatted content out there yet to make it worth Google's cycles parsing it. Lucky for me, my own parsing cycles aren't so valuable. Microformat Base is my attempt at a microformat-based alternative to Google Base. It's slowly crawling the web looking for microformatted content, and adding it to a structured database, searchable by microformat class names. There are plenty of improvements to be made, but it's already functional in the most basic form. You can find several vcards for people named Tantek, for example.
If anyone's interested, it's open source and will eventually be open data in some form or another. I'm not looking to start a new public search engine, just to demonstrate that someone with more time and experience than I have, and maybe an existing web crawler (*cough cough*), could do something like this. I suspect a decent search engine would inspire more microformatting, and may prove the best way to work around the chicken-egg adoption problem microformats currently face. Until someone else builds it better, I'll keep tweaking Microformat Base to that end.
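The core of an index searchable by microformat class names can be sketched in a few lines of Python. This is not the Microformat Base code. It only records which pages contain which root class names, and the set of root classes, the function names, and the in-memory dictionary standing in for a database are all assumptions made for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Root class names for hCard, hCalendar, and hAtom entries.
MICROFORMAT_ROOTS = {"vcard", "vevent", "hentry"}


class ClassCollector(HTMLParser):
    """Gather every class name used anywhere in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def index_page(index, url, html):
    """Record the URL under each microformat root class found in the page."""
    collector = ClassCollector()
    collector.feed(html)
    for root in collector.classes & MICROFORMAT_ROOTS:
        index[root].add(url)


def search(index, class_name):
    """Return the URLs known to contain the given class name, sorted."""
    return sorted(index.get(class_name, set()))


if __name__ == "__main__":
    index = defaultdict(set)
    index_page(index, "http://example.org/a", '<div class="vcard"><span class="fn">Tantek</span></div>')
    print(search(index, "vcard"))
```

A crawler would call index_page for each page it fetches; storing the property values alongside the URLs is the obvious next step.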
I gather most people involved in microformats are coming from a background heavy in more formally structured data, e.g. RDF, XML, relational databases. I'm coming more from the opposite background: scraping. Recently Phil Jones described a web in which metadata resides in scraping/parsing applications, meaning documents need not be so descriptive, and Danny Ayers predictably responded with an argument for the Semantic Web, in which metadata resides in documents, meaning applications need not be so smart.
In Danny's comments, I tried to point out that the applications Phil predicts can produce the documents Danny predicts. I already do a small amount of this with all my scrapers. On Disemployed, I add location and time information to each job posting and publish that information in a regular format (HTML, RSS 1, RSS 2, or Atom). I could admittedly be structuring this information more formally to better encourage reuse, but the data is there, in any case, where it wasn't before. But this is relatively simple data to add. I know when I found each job post and where it came from, so my application doesn't need to be very smart. What are the limits of a smart application? Could a very crafty application actually make microformats unnecessary?
Let's take one microformat, hCard, and see how guessable the microformat metadata would be if it weren't there, on a scale of zero to ten:
- fn (full name): this could at best be a guess. A name could feasibly contain pretty much any combination of letters. I'm sure someone somewhere has named a child "Asdf Jkl." Microformats are the easiest way to identify fn. 0/10.
- n (name): same here. 0/10.
- nickname: again, no easy way out. 0/10.
- photo: here we have a winner, mostly. I'm guessing eight times out of ten, any image referenced within something identified as hCard information will be a photo. Depending on how lucky we feel, microformats could be dispensed with here. 7/10.
- bday (birthday): this is a bit complicated. Dates follow very standard formats, and we could probably identify dates in a jumble of text with about 95% accuracy. But how do we know if a given date is a birthday? We can assume relatively safely based on proximity to words like "birthday." 9/10.
- adr (address): I would have guessed this would be very hard to identify as a pattern, but Google is already doing this. Of course, Google is limiting to US addresses. 5/10.
- label: at first, this appears to be as open-ended as names, but the variety in practice is likely very limited. I would expect a list of a few dozen words likely to occur in a label (e.g. home, domestic, etc.) would catch maybe 7/10.
- tel (telephone): this is a bit complicated. Having an address makes it much easier to tell if a given set of numbers is likely a phone number. Capturing anything that fits the patterns (###) ###-#### or ###-###-#### would get many phone numbers, and I suspect more is possible. 6/10.
- email: This one is easy. An email address must fit a defined pattern, so we can discover all email addresses with no microformat, as evidenced by the proliferation of junk email. 10/10.
- mailer: At any given time, there are only so many known email clients. 8/10.
- tz (time zone): There are only so many timezones, and not too many ways to describe them. 9/10.
- geo: Latitude and Longitude information is pretty much useless if it doesn't follow a certain pattern (decimal numbers between -180 and 180), but that doesn't mean all numbers that follow this pattern are geo codes. 6/10.
- title: Theoretically unlimited, but practically limited. 7/10.
- role: Words ending in "er" would catch a lot. Check for proximity to words like "job," "work," or "professional." 5/10.
- logo: Just like photo, only probably smaller. 7/10.
- agent: I had to look this one up. Auto-discovery doesn't look good. 0/10.
- org: Just like names, only worse. 0/10.
- categories: Could be anything. 0/10.
- note: Again, anything. 0/10.
- rev: Dates near words like "updated" or "modified." 7/10.
- sort-string: Usually last word in the name. 6/10.
- sound: Sounds have defined formats. 10/10.
- uid: Pass. 0/10.
- url: First standard link. 7/10.
- class: Pass. 0/10.
- key: Keys follow patterns. 10/10.
Average: about 4.8/10 (126 points across the 26 properties above). In general there are some areas in which microformats are entirely unnecessary, some in which they are entirely necessary, and some in between. Of course, these are mostly rough estimates of the potential accuracy of intelligent scraping. The actual accuracy would need to be determined by writing a scraper and pitting it against some actual data.
In any case, microformats appear well worth the expense to capture that 52% (or however much) of the existing information. Even though email addresses are entirely identifiable without any microformat, as long as we're wrapping names in name tags, it makes sense to wrap the email addresses at the same time so a parser doesn't need to be any smarter.
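To make the pattern-based guesses from the list concrete, here is a small Python sketch of the easier cases: email, US-style telephone numbers, coordinate pairs, and a date near the word "birthday" or "born". The patterns, the 40-character proximity window, and the function names are my own rough assumptions. They are nowhere near a tested scraper, and they say nothing about the actual accuracy figures guessed at above.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# The two phone shapes mentioned in the list: (###) ###-#### and ###-###-####.
US_PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")
# Two decimal numbers separated by a comma or semicolon.
COORD_PAIR = re.compile(r"(-?\d{1,3}\.\d+)\s*[,;]\s*(-?\d{1,3}\.\d+)")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def guess_geo(text):
    """Return decimal pairs that fall inside valid latitude/longitude ranges."""
    pairs = []
    for lat, lon in COORD_PAIR.findall(text):
        lat, lon = float(lat), float(lon)
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            pairs.append((lat, lon))
    return pairs


def guess_bday(text, window=40):
    """Return the first ISO date with 'birthday' or 'born' within `window` characters."""
    for match in ISO_DATE.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window].lower()
        if "birthday" in context or "born" in context:
            return match.group()
    return None


def guess_fields(text):
    """Guess a few hCard-like properties from unmarked text."""
    return {
        "email": EMAIL.findall(text),
        "tel": US_PHONE.findall(text),
        "geo": guess_geo(text),
        "bday": guess_bday(text),
    }
```

Note what is missing: nothing here can find fn, org, or note, which is exactly where the markup earns its keep.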
While not the absolute simplest method, microformats appear to be the lowest common denominator of structured documents. So now I think I was wrong when I wrote that we're headed towards a "semantic web" in which the semantics are forced onto websites by browsers and other intermediaries. I still expect that will happen (as I notice it happening, and cause it to happen), but given the practical limits of the smart-application method of connecting the world of information, it will only work as a bridge to a semantic web composed of metadata-rich documents.
I recently worked on a website for the Iowa Military Veterans Band for my day job. It's a static site, which is not my primary interest. Making static websites is more interesting to me than most other tasks, but I'd much rather be working on something dynamic and functional. So I made the site functional in ways no one will ever use.
If you take any page with contact or calendar information from the IMVB site and feed it into X2V, you'll get the relevant information as vCard or iCal, which you can then import into most address book and calendar applications. That admittedly seems pretty useless at first, given how unlikely it is that anyone would want to import such information into a desktop application.
I did this mostly to test out the usefulness of microformats. I had been reading about microformats for a few weeks, so I thought this would be a good opportunity to try them out. I probably spent about 20 extra minutes adding and testing the microformats, which is relatively little given the enormous time savings for the first person who wants to import all ninety-some IMVB members into her address book.
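The conversion step is simple once the hCard properties have been pulled out of the page. Below is a minimal Python sketch of turning them into vCard 3.0 text. It is not how X2V works internally, and the dictionary shape and function name are assumptions for the example. It also assumes a simple two-word name (given name, then family name) and skips the escaping of commas and semicolons a real converter would need.

```python
def to_vcard(card):
    """Build vCard 3.0 text from a dict with 'fn' and optional 'tel'/'email' lists."""
    # Assumes "Given Family"; anything after the first space is treated as the family name.
    given, _, family = card["fn"].partition(" ")
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{card['fn']}",
        f"N:{family};{given};;;",
    ]
    for tel in card.get("tel", []):
        lines.append(f"TEL:{tel}")
    for email in card.get("email", []):
        lines.append(f"EMAIL:{email}")
    lines.append("END:VCARD")
    # vCard lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(to_vcard({"fn": "Jane Doe", "email": ["jane@example.com"]}))
```

Run over ninety-some members and concatenated, output like this is what an address book would import in one go.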
And this is only what can be done with microformats today. I imagine a future in which X2V is unnecessary because microformat readers are built into browsers. Where Safari and Firefox today recognize syndication feeds and allow users to import that information into a suitable application with a single click, future browsers could do the same with various microformats.
Unfortunately, this future will likely be slow in coming, as microformats suffer from the same chicken-egg problem that made syndication adoption so slow: nobody wants a reader application with no content, and nobody cares to produce content with no readers. But because microformats are starting mostly with existing formats like vCard and iCal (and soon Atom), perhaps the future won't be so slow to arrive. In any case, I've done my part to spread microformats and create a more semantic web, and I see no reason not to continue doing so in the future.