Tuesday, March 07, 2006

The Robots Exclusion Protocol

Well-behaved 'bots comply with the Robots Exclusion Protocol, which allows webmasters to specify pages or whole directories (Windows users: read "folders") that should not be crawled or indexed.
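As a rough illustration of what "complying" looks like from the 'bot's side, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents, the "ExampleBot" name, and the example.com URLs are all made up for this example; a real crawler would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/secret.html
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/page.html", "/drafts/secret.html"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A polite 'bot would run a check like this before every request and simply skip any URL for which it comes back False.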

Most webmasters are moving heaven and earth to be included in various search-engines, so the default behavior -- "index, follow" -- is correct. Search-engines, on the other hand, usually have very restrictive robots.txt files.
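To make the contrast concrete, the sketch below compares a site with no rules at all against one that shuts every 'bot out of everything. Both rule sets, the "ExampleBot" name, and the URL are assumptions chosen for illustration; they are not copied from any particular search-engine's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, agent="ExampleBot"):
    """Check url against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# No rules at all: everything may be crawled.
PERMISSIVE = []

# A blanket ban: every 'bot is kept out of the whole site.
RESTRICTIVE = ["User-agent: *", "Disallow: /"]

# Hypothetical URL, used only for the comparison.
URL = "http://example.com/search?q=robots"

if __name__ == "__main__":
    print("permissive: ", can_crawl(PERMISSIVE, URL))
    print("restrictive:", can_crawl(RESTRICTIVE, URL))
```

Most sites that want to be found sit close to the first case; a site guarding its own database sits much closer to the second.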

In part, this is to reduce the incredible traffic that would be generated by all the myriad search-engines crawling each other's sites. Another, probably more compelling, reason is that a search-engine's database is its stock in trade, and most are very protective about sharing it with others.

A sampling of robots.txt files from several prominent sites is instructive for those wishing to "tweak" their own robots.txt files:

In the course of compiling this list we learned that Teoma was recently acquired by Ask (formerly "Ask Jeeves"). Now if somebody would only buy Yahoo! ...
