Taming The Bots
Ever wonder how many bots, spiders, and crawlers are looking at your site, and at which pages? “Friendly” bots obey the robots.txt file, but some bots ignore it and spider your whole site.
Rather than go through the logs daily, I wrote a little script, which I added to the daily backup script, to simplify things:
# Start a fresh report with today's date, followed by a blank line
echo "$(date)" > /private/var/www/bts.txt
echo $'\r' >> /private/var/www/bts.txt
# One case-insensitive pass: "bot" already covers "robot"/"robbot",
# and -i also catches names such as YandexBot
grep -iE "bot|spider|crawler" /private/var/log/lighttpd/access.log >> /private/var/www/bts.txt
To keep the results private, just change
/private/var/www/bts.txt
to
/private/var/log/lighttpd/bts.txt
Here is a real-world example
The report lets you decide which bots are misbehaving or eating up too much bandwidth.
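To put numbers on "too much bandwidth", the matching log lines can be totalled per user agent. The following is only a rough sketch: it assumes lighttpd's default access log layout (status and byte count right after the quoted request, user agent as the last quoted field), and `bot_summary` is a made-up helper name, not part of the original script.

```shell
#!/bin/sh
# Summarize crawler traffic per user agent: request count and total bytes.
# Assumption: lighttpd's default access log layout, i.e.
#   host vhost user [time] "request" status bytes "referer" "user agent"
# bot_summary is a hypothetical helper; $1 is the path to the access log.
bot_summary() {
    grep -iE 'bot|spider|crawler' "$1" |
    awk -F'"' '
        {
            split($3, f, " ")    # f[1] = status, f[2] = bytes sent
            hits[$6]++           # $6 = user agent string
            bytes[$6] += (f[2] == "-" ? 0 : f[2])
        }
        END {
            for (ua in hits)
                printf "%d\t%d\t%s\n", hits[ua], bytes[ua], ua
        }' |
    sort -rn
}

# Example: bot_summary /private/var/log/lighttpd/access.log
```

Each output line is hits, bytes, and user agent, separated by tabs, with the busiest agents first. If your access log uses a custom format, the field numbers will need adjusting.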
Bad bots get blocked in the lighttpd configuration file:
/private/etc/lighttpd.conf
Note: You will still see them in the log, but they will get a 403 Forbidden response on every page.
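Since blocked bots still show up in the log, the log itself can confirm that a block is working. Here is a minimal sketch that counts 403 responses per user agent. As before, it assumes the default lighttpd access log layout, and `blocked_agents` is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# List user agents that received a 403, with hit counts, busiest first.
# Assumption: default lighttpd access log layout, where the status code
# follows the quoted request and the user agent is the last quoted field.
# blocked_agents is a hypothetical helper; $1 is the path to the access log.
blocked_agents() {
    awk -F'"' '
        {
            split($3, f, " ")            # f[1] = status, f[2] = bytes sent
            if (f[1] == "403") hits[$6]++
        }
        END {
            for (ua in hits)
                printf "%d\t%s\n", hits[ua], ua
        }
    ' "$1" | sort -rn
}

# Example: blocked_agents /private/var/log/lighttpd/access.log
```

A bot you blocked should appear in this list; if it keeps getting 200 responses instead, the pattern in lighttpd.conf probably does not match its user agent string.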
The bot report, along with the custom 404error.php (under Security), will give you a very good idea of who is misbehaving and who needs to be blocked.
See the 404 errors on Technoids here
Example of lighttpd.conf blocking bots
$HTTP["useragent"] =~ "YandexBot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "MLBot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Wget" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "curl" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Googlebot-Image" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "ia_archiver" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "duggmirror" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "R6_CommentReader" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Lynx" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "CJNetworkQuality" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Baiduspider" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "MJ12bot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Sosospider" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "80legs" { url.access-deny = ( "" ) }
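As the list grows, typing one rule per bot gets tedious. One possible approach, sketched below, is to keep the bot names in a plain text file and generate the rules from it. The helper name `make_deny_rules` and the file name `badbots.txt` are invented for this example. Note that url.access-deny requires mod_access to be loaded, and that each name is used as a regular expression, so special characters in a name would need escaping.

```shell
#!/bin/sh
# Print one lighttpd deny rule per user-agent name read from stdin.
# Blank lines and lines starting with # are skipped.
# make_deny_rules is a hypothetical helper, not part of lighttpd.
make_deny_rules() {
    while IFS= read -r name; do
        case "$name" in
            ''|'#'*) continue ;;
        esac
        printf '$HTTP["useragent"] =~ "%s" { url.access-deny = ( "" ) }\n' "$name"
    done
}

# Example (badbots.txt is an assumed file with one bot name per line):
#   make_deny_rules < badbots.txt
```

Paste the output into lighttpd.conf, then run `lighttpd -t -f /private/etc/lighttpd.conf` to check the syntax before restarting the server.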