Saturday, March 11, 2006

RSS & Atom

As I mentioned in a previous post, spiders (e.g. Googlebot) crawl blogs differently than "regular" webpages. I didn't go into much detail, because I didn't actually know much about the process. Later I began fumbling around with Google Sitemaps, which resulted in this blog temporarily being blacklisted as a splog. (Incidentally, Blogspot corrected the problem promptly, much to my surprise.)

After using some online sitemap generators, which weren't entirely satisfactory, I stumbled onto Johannes Mueller's excellent GSiteCrawler. Whenever that program uploads a new sitemap, it pings Google Sitemaps, alerting them that new content is available. Aha! This two-way communication is the essence of web feed technology.

Suppose you have your own spider, dutifully traversing the web day in and day out. Obviously, you could cut down on the bandwidth it requires if it crawled only the pages that had changed, rather than every page. Googlebot is perfectly capable of traversing a site, but employs its own algorithm to decide whether or not to re-index a particular page, thus driving webmasters insane.
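To make the bandwidth argument concrete, here is a minimal sketch of how a home-made spider might pick which pages to fetch again. It assumes the spider already has two things that are not described in this post: a table of last-modified times reported by each site, and a table of when it last crawled each page. The function name and both tables are hypothetical, and this is certainly not how Googlebot decides.

```python
from datetime import datetime, timezone


def pages_to_recrawl(last_modified, last_crawled):
    """Return the URLs that changed after our last visit (or were never visited).

    Both arguments are hypothetical dicts mapping URL -> timezone-aware datetime.
    """
    stale = []
    for url, modified in last_modified.items():
        crawled = last_crawled.get(url)
        # A page we have never crawled is always fetched.
        if crawled is None or modified > crawled:
            stale.append(url)
    return sorted(stale)


if __name__ == "__main__":
    utc = timezone.utc
    modified = {
        "http://example.com/a": datetime(2006, 3, 10, tzinfo=utc),
        "http://example.com/b": datetime(2006, 3, 1, tzinfo=utc),
    }
    crawled = {
        "http://example.com/a": datetime(2006, 3, 5, tzinfo=utc),
        "http://example.com/b": datetime(2006, 3, 5, tzinfo=utc),
    }
    print(pages_to_recrawl(modified, crawled))
```

Everything hinges on the site reporting honest modification times, which is exactly the information a sitemap or a feed is meant to supply.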

Sitemaps don't really change this, but properly used, they give Googlebot a "heads up" as to what content is new. New content is preferred over old content by some secret amount, so frequently updating sitemaps is the key to their successful use. If you don't have enough room for sitemaps, similar benefits can be had simply by including (properly maintained) "date" metatags for each page.
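For the curious, a sitemap is just a small XML file listing each page with an optional last-modified date. The sketch below builds one with Python's standard library. It assumes the sitemaps.org 0.9 namespace (the Google Sitemaps schema in use when this was written may differ), and the URLs and dates are placeholders rather than real pages.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 schema; older Google Sitemaps used another namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs, with lastmod as 'YYYY-MM-DD'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # <lastmod> is the "heads up": it tells a crawler when the page last changed.
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("http://example.com/", "2006-03-11"),
        ("http://example.com/archive.html", "2006-02-20"),
    ]))
```

The sitemap only helps if the lastmod values are regenerated whenever a page actually changes, which is why a tool that rebuilds and re-uploads the file automatically is so convenient.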

RSS and Atom (remember them?) are two competing file specifications for this type of two-way communication, commonly referred to as web syndication. Presently the "standards" are under nearly constant revision, similar to the "browser wars" of a few years ago that made it practically impossible to use JavaScript. This site's web feed is: http://wholeed.blogspot.com/atom.xml
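Here is a rough sketch of what a feed reader does with such a file: parse it and keep only the entries updated since the last visit. It assumes an Atom 1.0 feed (the http://www.w3.org/2005/Atom namespace) with timestamps in the simple UTC "Z" form; feeds in an older Atom revision or in RSS would need different element names. The sample feed is made up for illustration and is not the contents of this site's feed.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Assumption: Atom 1.0 namespace; earlier Atom drafts used a different one.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up sample feed, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <updated>2006-03-11T12:00:00Z</updated>
  <entry>
    <title>Newer post</title>
    <updated>2006-03-11T12:00:00Z</updated>
  </entry>
  <entry>
    <title>Older post</title>
    <updated>2006-03-01T08:30:00Z</updated>
  </entry>
</feed>"""


def entries_updated_since(feed_xml, since):
    """Return titles of entries whose <updated> time is later than `since`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        stamp = entry.findtext("atom:updated", namespaces=ATOM_NS)
        # Assumes the "Z" (UTC) timestamp form; other RFC 3339 offsets are not handled.
        updated = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if updated > since:
            fresh.append(title)
    return fresh


if __name__ == "__main__":
    last_visit = datetime(2006, 3, 5, tzinfo=timezone.utc)
    print(entries_updated_since(SAMPLE_FEED, last_visit))
```

The point is the same as with sitemaps: the reader downloads one small file and learns what changed, instead of re-crawling every page.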
