Every few days, a crawler from Inktomi Corporation (maker, I believe, of Yahoo’s search engine) visits this site and downloads hundreds of pages, consuming a large share of my bandwidth. It hasn’t yet gobbled so much as to shut me down, but it is coming nearer and nearer. Here is how Sitemeter recorded one of its recent depredations:
Domain Name yahoo.com ? (Commercial)
IP Address 68.142.211.# (Inktomi Corporation)
ISP Inktomi Corporation
Location
Continent : North America
Country : United States
State : California
City : Sunnyvale
Lat/Long : 37.1894, -121.7053 (Map)
Language English (United States)
Operating System FreeBSD Unknown
Browser Default 3.2
YahooSeeker/CafeKelsa (compatible; Konqueror/3.2; FreeBSD; http://help.yahoo.com/help/hotjobs/webmaster) (KHTML, like Gecko)
Javascript enabled
Monitor Resolution : 1280 x 1024
Color Depth : 4 bits
Time of Visit Oct 14 2005 3:50:05 am
Last Page View Oct 14 2005 9:52:00 am
Visit Length 6 hours 1 minute 55 seconds
Page Views 353
Referring URL [blank]
Visit Entry Page http://stromata.type...pharmaceutical_.html
Visit Exit Page http://stromata.type...searching_for_m.html
Time Zone UTC-9:00
YST - Yukon Standard Time
AKST - Alaska Standard Time
AKDT - Alaska Standard Daylight Saving Time
Visitor's Time Oct 14 2005 1:50:05 am
Typepad tried to put a stop to these shenanigans with a robots.txt file, which reads:
User-agent: *
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app
# weird MSIE thing that keeps hammering
User-agent: Active Cache Request
Disallow: *
I’ve also tried denying access to the Yahoo spiders “slurp” and “yahooseeker”. Nothing has done any good. I sent e-mail to Yahoo and got a response stating that their crawler obeys robots.txt files and that they have detected no abnormal behavior.
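One thing worth noting: `Disallow: *` is not standard robots.txt syntax. The robots exclusion standard expects a path prefix after `Disallow:`, so `Disallow: /` is the usual way to block everything for a given agent, and a non-standard wildcard line may simply be ignored. A quick way to check what a well-behaved crawler ought to obey is Python's built-in robots.txt parser. A sketch, using the rules quoted above plus a `Slurp` stanza of the kind I have tried; the page paths are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The rules Typepad publishes, plus a stanza naming Yahoo's
# "Slurp" crawler explicitly.  "Disallow: /" blocks everything
# for the named agent; the page paths below are invented.
rules = """\
User-agent: *
Disallow: /t/comments
Disallow: /t/stats
Disallow: /t/app

User-agent: Slurp
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Slurp", "/2005/10/some-post.html"))      # blocked by name
print(rp.can_fetch("Googlebot", "/t/stats/report.html"))     # blocked for everyone
print(rp.can_fetch("Googlebot", "/2005/10/some-post.html"))  # an ordinary page is allowed
```

Of course, this only shows what a crawler that honors robots.txt should do; it is no help against one that ignores the file.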
So, in desperation, I ask whether anybody who stumbles across this site has any ideas about the cause of or, better, cure for this malady. I’m beginning to wonder whether there’s a cause of action for theft of bandwidth or negligent trespass to caches.
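Since robots.txt is purely advisory, the only sure-fire cure would be to refuse the requests at the server itself. Typepad does not give authors that kind of access, but for anyone hosting on their own Apache server, a few lines in an .htaccess file will turn the crawler away with a 403. A sketch, assuming Apache with mod_rewrite enabled and .htaccess overrides permitted; the user-agent patterns are taken from the log above:

```
# Refuse any request whose User-Agent mentions Yahoo's crawlers.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Slurp|YahooSeeker) [NC]
RewriteRule .* - [F]
```

The `[NC]` flag makes the match case-insensitive, and `[F]` returns "403 Forbidden" instead of serving the page, so the bandwidth cost of each blocked request is a few hundred bytes rather than a full page.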
Create a blank file called "robots.txt" and put a copy of it in any directory that you do not wish to be slurped, gobbled or otherwise indexed.
Point of fact: if you place it in the root directory, it should take care of the entire site, but it will also keep people from finding you in Google (or Yahoo, or anyone else that uses spiders).
Posted by: Huw Raphael | Saturday, October 15, 2005 at 12:04 AM