Taming The Bots

Ever wonder how many bots, spiders, and crawlers are looking at your site, and at which pages? "Friendly" bots obey the robots.txt file, but some bots ignore it and spider your whole site.

Rather than go through the logs daily, I wrote a little script, which I added to the daily backup script, to simplify things:

# Start a fresh report with a timestamp (overwrites the previous report)
echo "$(date)" > /private/var/www/bts.txt
echo $'\r' >> /private/var/www/bts.txt
# One pass over the log, so a line is never listed twice; "bot" also matches "robot"
grep -E "bot|spider|crawler" /private/var/log/lighttpd/access.log >> /private/var/www/bts.txt
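
The opening question also asks which pages the bots are looking at. As a rough sketch, the helper below lists the paths requested by bot-like user agents, busiest first. The name bot_pages is made up for illustration and is not part of the original script. It assumes a combined-style access log in which the request line and the user agent are both wrapped in double quotes, as in the sample line in the comment; if your accesslog.format differs, the field numbers will need adjusting.

```shell
#!/bin/sh
# bot_pages LOGFILE  (hypothetical helper)
# Print "count path" for requests whose user agent looks like a bot,
# most-requested path first.
# Assumed log layout (fields split on the double-quote character):
#   1.2.3.4 host - [date] "GET /page HTTP/1.1" 200 512 "-" "SomeBot/1.0"
#   $2 = request line, $3 = status and bytes, $6 = user agent
bot_pages() {
    awk -F'"' '
        tolower($6) ~ /bot|spider|crawler/ {
            split($2, req, " ")   # req[1]=method, req[2]=path, req[3]=protocol
            print req[2]
        }
    ' "$1" | sort | uniq -c | sort -rn
}
```

For example, bot_pages /private/var/log/lighttpd/access.log >> /private/var/www/bts.txt would append the summary to the report. Matching on the user-agent field instead of the whole line avoids counting ordinary browsers that merely request robots.txt.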

To keep the results private, just change
/private/var/www/bts.txt
to
/private/var/log/lighttpd/bts.txt

Here is a real-world example

This will let you decide which bot is misbehaving or eating up too much bandwidth.
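
To put rough numbers on "too much bandwidth", the sketch below totals hits and bytes per user agent. The function name bot_usage is made up for illustration. It assumes a combined-style log line in which the status code and byte count sit between the quoted request and the quoted referer, and the user agent is the last quoted field; treat the field numbers as an assumption to check against your own log.

```shell
#!/bin/sh
# bot_usage LOGFILE  (hypothetical helper)
# Print "bytes<TAB>hits<TAB>user agent" for bot-like user agents,
# biggest bandwidth consumer first.
# Assumed log layout (fields split on the double-quote character):
#   1.2.3.4 host - [date] "GET /page HTTP/1.1" 200 512 "-" "SomeBot/1.0"
bot_usage() {
    awk -F'"' '
        tolower($6) ~ /bot|spider|crawler/ {
            split($3, f, " ")     # f[1]=status code, f[2]=bytes sent
            hits[$6]++
            bytes[$6] += f[2]     # a "-" byte count adds 0
        }
        END {
            for (ua in hits) printf "%.0f\t%d\t%s\n", bytes[ua], hits[ua], ua
        }
    ' "$1" | sort -rn
}
```

Running bot_usage /private/var/log/lighttpd/access.log puts the heaviest user agent on the first line, which makes candidates for blocking easy to spot.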

Bad Bots get put here

/private/etc/lighttpd.conf

$HTTP["useragent"] =~ "^BadBot" { url.access-deny = ( "" ) }

Note: You will still see them in the log, but they will get a 403 Forbidden response on every page.
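
Since blocked bots keep showing up in the log, the log itself can confirm that a rule is working. As a minimal sketch, the helper below counts how many requests from a given user-agent pattern were answered with 403. The name denied_hits is made up for illustration, and the field positions assume a combined-style log line like the sample in the comment.

```shell
#!/bin/sh
# denied_hits PATTERN LOGFILE  (hypothetical helper)
# Print the number of requests whose user agent matches PATTERN
# (an awk regular expression) and whose status code is 403.
# Assumed log layout (fields split on the double-quote character):
#   1.2.3.4 host - [date] "GET /page HTTP/1.1" 403 345 "-" "BadBot/1.0"
denied_hits() {
    awk -F'"' -v pat="$1" '
        $6 ~ pat {
            split($3, f, " ")     # f[1]=status code, f[2]=bytes sent
            if (f[1] == 403) denied++
        }
        END { print denied + 0 }  # prints 0 when nothing matched
    ' "$2"
}
```

For instance, denied_hits '^BadBot' /private/var/log/lighttpd/access.log should print a number greater than zero once the bot has come back after the rule was added.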

This, along with the custom 404error.php (under Security), will give you a very good idea of who is misbehaving and who needs to be blocked.

See the 404 errors on Technoids here

Example of lighttpd.conf blocking bots

$HTTP["useragent"] =~ "YandexBot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "MLBot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Wget" { url.access-deny = ( "" ) }
# curl identifies itself in lowercase ("curl/<version>") and the match is case-sensitive
$HTTP["useragent"] =~ "curl" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Googlebot-Image" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "ia_archiver" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "duggmirror" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "R6_CommentReader" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Lynx" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "CJNetworkQuality" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Baiduspider" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "MJ12bot" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "Sosospider" { url.access-deny = ( "" ) }
$HTTP["useragent"] =~ "80legs" { url.access-deny = ( "" ) }
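
As the list grows, one rule per bot gets long. Because the =~ operator takes a regular expression, the names can be folded into a single alternation. The sketch below prints such a rule from a list of names; deny_rule is a made-up helper name, and it assumes the names contain no regex metacharacters that need escaping.

```shell
#!/bin/sh
# deny_rule NAME...  (hypothetical helper)
# Print one lighttpd rule that denies every user agent matching
# any of the given names.
deny_rule() {
    # Refuse to build an empty pattern, which would match every visitor.
    [ "$#" -gt 0 ] || return 1
    # printf repeats its format for each argument, giving "A|B|C|".
    names=$(printf '%s|' "$@")
    names=${names%|}              # drop the trailing "|"
    printf '$HTTP["useragent"] =~ "(%s)" { url.access-deny = ( "" ) }\n' "$names"
}
```

For example, deny_rule YandexBot MLBot Baiduspider prints a single line that can be pasted into lighttpd.conf in place of three separate rules.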
