Saturday, March 11, 2006

RSS & atom

As I mentioned in a previous post, spiders (e.g. googlebot) crawl blogs differently than "regular" webpages. I didn't go into much detail, because I didn't actually know much about the process. Later I began fumbling around with Google Sitemaps, which resulted in my blog temporarily being blacklisted as a splog (spam blog). (Incidentally, Blogspot corrected the problem promptly, much to my surprise.)

After using some online sitemap generators, which weren't entirely satisfactory, I stumbled onto Johannes Mueller's excellent GSiteCrawler. Whenever that program uploads a new sitemap, it pings Google Sitemaps, alerting them that new content is available. Aha! This two-way communication is the essence of web feed technology.

Suppose you have your own spider, dutifully traversing the web day in and day out. Obviously, you could cut down on the bandwidth it requires if it didn't have to crawl every page, but only the pages that had changed. Googlebot is perfectly capable of traversing a site, but it employs its own algorithm to decide whether or not to re-index a particular page, thus driving webmasters insane.
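One common way a spider avoids re-downloading unchanged pages is the HTTP conditional GET. A minimal sketch in Python (the URL and timestamp are placeholders, not real pages):

```python
import urllib.request

# A polite spider records when it last crawled a page and sends that date
# in an If-Modified-Since header. If the page hasn't changed, the server
# replies "304 Not Modified" with no body, saving the bandwidth of a
# full download.
def conditional_request(url, last_crawled):
    """Build a request that asks the server to skip unchanged content."""
    return urllib.request.Request(
        url, headers={"If-Modified-Since": last_crawled}
    )

req = conditional_request("http://example.com/page.html",
                          "Sat, 11 Mar 2006 00:00:00 GMT")
print(req.get_header("If-modified-since"))
# Calling urllib.request.urlopen(req) would then raise HTTPError with
# code 304 if the page is unchanged, or return the new body if not.
```

The same idea underlies feeds and sitemaps: tell the crawler what changed so it doesn't have to ask page by page.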

Sitemaps don't really change this, but properly used, they give googlebot a "heads up" as to what content is new. New content is preferred over old content by some secret amount, so frequently updating your sitemaps is the key to using them successfully. If you don't have room to host sitemap files, similar benefits can be had simply by including (properly maintained) "date" meta tags on each page.
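For illustration, here is what a single-page sitemap entry looks like; the `<lastmod>` field is the "date" signal discussed above. The URL and dates are made up, and the schema namespace is the 0.84 version Google was accepting around this time:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://example.com/some-post.html</loc>
    <lastmod>2006-03-11</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

The page-level equivalent would be a tag like `<meta name="date" content="2006-03-11">` in the document's head.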

RSS and Atom (remember them?) are two competing file specifications for this type of two-way communication, commonly referred to as web syndication. At present the "standards" are under nearly constant revision, much like the "browser wars" of a few years ago that made it practically impossible to use JavaScript reliably. This site's web feed is: http://wholeed.blogspot.com/atom.xml
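To make the feed side concrete, here is a minimal Python sketch that parses an Atom document the way a feed reader (or spider) might. The feed content is invented, and it assumes the Atom 1.0 namespace rather than the older 0.3 one Blogger actually served at the time:

```python
import xml.etree.ElementTree as ET

# A made-up Atom feed, shaped like the one Blogger serves at /atom.xml.
ATOM = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Whole Ed</title>
  <updated>2006-03-11T12:00:00Z</updated>
  <entry>
    <title>RSS &amp; atom</title>
    <updated>2006-03-11T12:00:00Z</updated>
    <link href="http://wholeed.blogspot.com/2006/03/rss-atom.html"/>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom"}

def entry_titles(xml_text):
    """Return the title of each <entry> in an Atom document."""
    root = ET.fromstring(xml_text)
    return [e.findtext("atom:title", namespaces=NS)
            for e in root.findall("atom:entry", NS)]

print(entry_titles(ATOM))  # → ['RSS & atom']
```

A crawler that understands this format only needs to compare each entry's `<updated>` stamp against its last visit to know what's new.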

Tuesday, March 07, 2006

Blogs Blogs Blogs

Googlebot can't seem to figure out what this blog is about, so it's loading the pages up with "How to do a blog" AdSense ads. Swell.

Very well then, herewith is a short list of free blogspace providers:

The Robots Exclusion Protocol

Well-behaved 'bots comply with the Robots Exclusion Protocol, which allows webmasters to specify pages or whole directories (Windows users: read "folders") that should not be crawled or indexed.
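A hypothetical robots.txt illustrating the idea (the file lives at the site root, e.g. http://example.com/robots.txt, and the paths here are invented):

```
# Rules for all crawlers
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

# Googlebot gets its own section; an empty Disallow means "crawl everything"
User-agent: Googlebot
Disallow:
```

Each crawler reads the section matching its own user-agent name, falling back to the `*` section.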

Most webmasters are moving heaven and earth to be included in the various search-engines, so the default behavior -- "index, follow" -- is correct. Search-engines, on the other hand, usually have very restrictive robots.txt files.
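Python's standard library happens to ship a parser for these files, which makes it easy to check what a rule set actually permits. A quick sketch using a made-up robots.txt:

```python
from urllib import robotparser

# A hypothetical robots.txt, fed to the parser as a list of lines.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Public pages are allowed; anything under /private/ is not.
print(rp.can_fetch("*", "http://example.com/index.html"))      # → True
print(rp.can_fetch("*", "http://example.com/private/a.html"))  # → False
```

This is exactly the check a well-behaved spider performs before requesting each URL.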

In part, this is to reduce the incredible traffic that would be generated by all the myriad search-engines crawling each other's sites. Another, probably more compelling reason is that a search-engine's database is its stock in trade, and most are very protective about sharing it with others.

A sampling of robots.txt files from several prominent sites is instructive for those wishing to "tweak" their own robots.txt files:

In the course of compiling this list we learned that Teoma was recently acquired by Ask (formerly "Ask Jeeves"). Now if somebody would only buy Yahoo! ...