Saturday, March 11, 2006

RSS & atom

As I mentioned in a previous post, spiders (e.g. googlebot) crawl blogs differently than "regular" webpages. I didn't go into much detail, because I didn't actually know much about the process. Later I began fumbling around with Google Sitemaps, which got this blog temporarily blacklisted as a splog. (Incidentally, Blogspot corrected the problem promptly, much to my surprise.)

After using some online sitemap generators, which weren't entirely satisfactory, I stumbled onto Johannes Mueller's excellent GSiteCrawler. Whenever that program uploads a new sitemap, it pings Google Sitemaps, alerting them that new content is available. Aha! This two-way communication is the essence of web feed technology.

Suppose you have your own spider, dutifully traversing the web day in and day out. Obviously, you could cut down on the bandwidth it requires if it had to crawl only the pages that had changed, rather than every page. Googlebot is perfectly capable of traversing a site, but employs its own algorithm to decide whether or not to re-index a particular page, thus driving webmasters insane.

Sitemaps don't really change this, but properly used, they give googlebot a "heads up" as to what content is new. New content is preferred over old content by some secret amount, so frequently updating sitemaps is the key to their successful use. If you don't have enough room for sitemaps, similar benefits can be had simply by including (properly maintained) "date" metatags for each page.
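For the curious, here is a rough Python sketch of the "heads up" idea: one function writes a sitemap with a last-modified date for each page, and another plays the part of a spider that only bothers with pages changed since its last visit. The namespace is the one used by the current sitemaps.org protocol, which is an assumption on my part; the example.com URLs, the dates, and the function names are all made up for illustration. This is certainly not how googlebot itself decides anything.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace of the current sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from a list of (url, 'YYYY-MM-DD') tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


def changed_since(sitemap_xml, last_crawl):
    """Return the URLs a spider should revisit.

    A page qualifies if its <lastmod> is later than last_crawl, or if it
    has no <lastmod> at all. Dates in YYYY-MM-DD form sort correctly as
    plain strings, so no date parsing is needed here.
    """
    ns = {"sm": SITEMAP_NS}
    root = ET.fromstring(sitemap_xml)
    stale = []
    for url in root.findall("sm:url", ns):
        loc = url.findtext("sm:loc", namespaces=ns)
        lastmod = url.findtext("sm:lastmod", namespaces=ns)
        if lastmod is None or lastmod > last_crawl:
            stale.append(loc)
    return stale


# Hypothetical site: one page updated recently, one left alone for months.
pages = [
    ("http://example.com/", "2006-03-11"),
    ("http://example.com/about.html", "2006-01-15"),
]
sitemap_xml = build_sitemap(pages)
print(changed_since(sitemap_xml, "2006-03-08"))
```

The point of the exercise is that the spider never has to download about.html to learn that it hasn't changed.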

RSS and Atom (remember them?) are two competing file specifications for this type of two-way communication, commonly referred to as web syndication. Presently the "standards" are under nearly constant revision, much like the "browser wars" of a few years ago that made it practically impossible to use JavaScript. This site's web-feed is: http://wholeed.blogspot.com/atom.xml
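To give a feel for what a feed reader does with such a file, here is a minimal Python sketch that picks out the entries updated after a given moment. It assumes the Atom 1.0 namespace; given the constant revisions mentioned above, a real feed may well use a different one. The sample feed is invented (the times of day in particular), and entries_newer_than is just a name I made up.

```python
import xml.etree.ElementTree as ET

# Assumption: the Atom 1.0 namespace. Other feed versions use other namespaces.
ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample feed, modeled loosely on the posts on this page.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <updated>2006-03-11T12:00:00Z</updated>
  <entry>
    <title>RSS &amp; atom</title>
    <updated>2006-03-11T12:00:00Z</updated>
  </entry>
  <entry>
    <title>Blogs Blogs Blogs</title>
    <updated>2006-03-07T09:30:00Z</updated>
  </entry>
</feed>"""


def entries_newer_than(feed_xml, cutoff):
    """Return (title, updated) pairs for entries updated after cutoff.

    Both cutoff and the feed's timestamps are assumed to be UTC strings of
    the form YYYY-MM-DDTHH:MM:SSZ, which compare correctly as plain text.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        updated = entry.findtext(ATOM + "updated")
        if updated is not None and updated > cutoff:
            fresh.append((title, updated))
    return fresh


print(entries_newer_than(SAMPLE_FEED, "2006-03-08T00:00:00Z"))
```

A reader that remembers when it last looked only needs to fetch this one small file to find out whether anything is new.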

Tuesday, March 07, 2006

Blogs Blogs Blogs

Googlebot can't seem to figure out what this blog is about, so it's loading the pages up with "How to do a blog" AdSense ads. Swell.

Very well then, herewith is a short list of free blogspace providers:

The Robots Exclusion Protocol

Well-behaved 'bots comply with the Robots Exclusion Protocol, which allows webmasters to specify pages or whole directories (Windows users, read "folders") that should not be crawled or indexed.
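Here is roughly what "well-behaved" looks like from the 'bot's side, as a small Python sketch using the standard library's urllib.robotparser. The rules, the example.com URLs, and the "examplebot" user-agent are all hypothetical, and a real spider would fetch robots.txt from the site rather than from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every 'bot out of one directory and one page.
RULES = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/secret.html
"""


def allowed(rules_text, user_agent, url):
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
    "http://example.com/drafts/secret.html",
):
    print(url, allowed(RULES, "examplebot", url))
```

Note that nothing enforces any of this. The 'bot has to ask the question and then honor the answer, which is why the protocol only works for the well-behaved ones.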

Most webmasters are moving heaven and earth to be included in various search-engines, so the default behavior -- "index, follow" -- is correct. Search-engines, on the other hand, usually have very restrictive robots.txt files.

In part, this is to reduce the incredible traffic that would be generated by all the myriad search-engines crawling each other's sites. Another, probably more compelling reason is that a search-engine's database is its stock in trade, and most are very proprietary about sharing it with others.

A sampling of robots.txt files from several prominent sites is instructive for those wishing to "tweak" their own robots.txt files:

In the course of compiling this list we learned that Teoma was recently acquired by Ask (formerly "Ask Jeeves"). Now if somebody would only buy Yahoo! ...

Saturday, February 25, 2006

Pop-Culture-Palooza

After about a month of twiddling, it has become obvious that a blog needs to be promoted like any other site for Google AdSense to make any money. Therefore, I've started what I hope will be an immensely popular blog pandering to the lowest common denominator in the internet search universe, and called it Pop-Culture-Palooza. (Yes, I'm sure "Palooza" is sooo '80s, but that's the inside joke.)

This site, which is of course public and indexed, won't be promoted much, being more or less a "what's new" vehicle for other sites. Much like the old "Radio Free Huntsville" page, it may or may not "get off the ground." Generally, I wind up trying to put out fires and update links a lot more than I do chronicling the same.

It has also occurred to me that my ongoing battles with googlebot would be more appropriate for an SEO service, in which case they'd be "trade secrets." For now, at least, I'm going to "save them for the book." Those canny enough to "read between the lines" may still find my ineffectual struggles amusing.

As always, Read!