Clean URLs

URLs are Uniform Resource Locators. They're the addresses for stuff on the web. They commonly start with "http://" and beyond that, they range widely in format. Uniform Resource Locators aren't very uniform. Part of the lack of uniformity comes from having multiple URLs pointing to the same resource.

To see why this is a problem, you can take a trip to Des Moines and try to find my house on 16th Street. You may end up on 16th Street in West Des Moines, or South East 16th Street in Des Moines, but mine is the one that intersects with Crocker, just off Martin Luther King. Only Crocker is named Cottage Grove where it meets Martin Luther King, which is also named 19th Street or Fleur at various points on the same street.

If you find my street, and then my house, you'll still have some trouble, as I live in a duplex, with two entrances in both the front and the back. I could tell you I live on the left side, but that may not be your left when you're standing in front of (or behind) the house. This would all be a lot easier if you could just go to http://www.scottshouse.com/. (Actually, you can, if you want to order some flowers, but that won't help you get here.)

Ideally, every resource on the web would have a single URL. Apple.com is good at working toward this goal. If you go to http://www.apple.com/tiger/ or http://www.apple.com/macos/, you will end up at http://www.apple.com/macosx/. Other sites are not so good at this. An interesting auxiliary benefit of URL-based tagging sites like del.icio.us is that we can easily see when a single resource has multiple URLs pointing at it. For example, at this moment, the del.icio.us popular page has three different URLs listed for the exact same article on Slashdot.
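To make the duplicate-URL idea concrete, here is a minimal sketch in Python of how a handful of URLs might be collapsed to one canonical form and grouped. Everything specific in it is an assumption for illustration: the example.com URLs, the IGNORED_PARAMS set, and the helper names canonicalize and group_duplicates are all hypothetical, and a real site would need its own rules about which parts of a URL actually change the content.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters assumed not to change the content served.
IGNORED_PARAMS = {"from", "ref", "utm_source"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return one normalized form for URLs that likely name the same resource."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Treat an empty path as the root.
    path = parts.path or "/"
    # Drop ignored parameters and sort the rest so their order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    # Fragments are never sent to the server, so drop them.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def group_duplicates(urls):
    """Map each canonical URL to the variants seen, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Feeding group_duplicates a list of bookmarked links would surface the same kind of pile-up the del.icio.us popular page shows, with every variant listed under the one URL it ought to have been.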

This isn't a problem if the only site you visit is Slashdot, for the same reason I don't have trouble finding my house. But if you're out wandering the web, and you follow a link to one of these URLs, and a day later you come across a different link to a slightly different URL, you won't get the visual cue most browsers and websites offer to tell you that you've already followed this link and seen this content. So you'll click it again and waste precious seconds of your life.

Many web developers may not particularly care about a random user roaming around the internet. But as Shelley recently pointed out, Google creates its index of websites by acting as a random user roaming around the internet. When Google happens upon your second or third URL pointing to the same content, it starts to think "hmm...maybe this site is just spamming the search index with the same content over and over again." If you have a Google rank as high as Slashdot's, Google will quickly dismiss this suspicion, but you probably don't want Google ever wondering if your site is spamming the search index. Not even (or maybe especially) if you are spamming the search index.

The irony is that smaller sites can least afford Google's suspicion or visitor confusion, but smaller sites can also least afford to clean URLs. One of the most useful tools in URL cleaning is Apache's mod_rewrite, yet few smaller sites have access to mod_rewrite's URL cleaning power. For those who do have access to mod_rewrite, along with a healthy (unhealthy?) knowledge of regular expressions, the task of cleaning URLs is relatively quick and easy.
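For anyone who hasn't seen the pattern-and-substitution idea behind this kind of cleaning, here is a rough sketch of it in Python. To be clear about what is assumed: mod_rewrite itself is configured with Apache directives, not Python, and the RULES list, the paths in it, and the clean function are all hypothetical. The sketch only shows how a few regular expressions can map crufty URLs onto one clean URL that the server would then redirect to.

```python
import re

# Hypothetical rules: (pattern, replacement) pairs, tried in order.
RULES = [
    # /index.html, /index.htm, /index.php -> /
    (re.compile(r"^/index\.(?:html?|php)$"), "/"),
    # /article.php?id=42 -> /articles/42/
    (re.compile(r"^/article\.php\?id=(\d+)$"), r"/articles/\1/"),
    # /articles/42 -> /articles/42/ (always use the trailing slash)
    (re.compile(r"^/articles/(\d+)$"), r"/articles/\1/"),
]


def clean(request_uri):
    """Return the clean URL to redirect to, or None if no rule applies."""
    for pattern, replacement in RULES:
        if pattern.match(request_uri):
            return pattern.sub(replacement, request_uri)
    return None
```

A request for /article.php?id=42 or /articles/42 would both be sent to /articles/42/, while /articles/42/ matches no rule and is served as is. Answering the crufty forms with a permanent redirect, instead of quietly serving the same page at every address, is what leaves visitors and search engines with a single URL to remember.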

I assume the creators of Slashdot have both the access and the know-how to clean their URLs, so they're easy targets for finger pointing. However, I also have both the access and the know-how to clean my URLs, and you'll notice no shortage of cruft around these parts. So this is as much a self-service announcement as a public service announcement. Self and public: clean your URLs.
