Blocking malicious crawlers or scrapers in Apache


Occasionally we see a customer with a popular website that people try to crawl and copy wholesale. This has the unfortunate side effect of hammering the site.

Dynamic pages and crawl loops make this worse, and in some cases it can literally take down a server. Often you can slow crawlers down by putting something like this in a robots.txt in the DocumentRoot:

User-agent: *
Crawl-delay: 5

You can also use various GeoIP blocking techniques and firewall rules, though these are more complex to set up.

If you are unlucky you need to take another approach. You can manually block offending IPs when you see them in the logs, but if you are getting hit by them a lot it may pay to automate the blocking.
In one such case a user had taken every option he could to block them, including firewalling entire countries from his server. This is what we ended up resorting to.
I put the following code into a script called crawlerblock.sh and ran it from cron every 5 minutes.

#!/bin/bash
# Threshold of requests in the current log at which an IP gets blocked
threshold=2500
# Logfiles to parse (space-separated, so the for loop can split them)
apachelogfiles="/var/www/vhosts/site1.com/statistics/logs/access_log /var/www/vhosts/site2.com/statistics/logs/access_log"
 
# Cache of already-blocked IPs, so we never insert duplicate rules
if [ ! -f /tmp/cb.txt ]; then
        touch /tmp/cb.txt
fi
 
timestamp=$(date)
for logfile in $apachelogfiles ; do
        # Count requests per client IP, busiest last; tail keeps the top 10
        /usr/bin/awk '{print $1}' "${logfile}" | /usr/bin/sort | /usr/bin/uniq -c | /usr/bin/sort -n | /usr/bin/tail | while read num ip
        do
                # echo Num ${num} and IP ${ip}
                if [ "${num}" -gt "${threshold}" ]; then
                        if ! /bin/grep -Fxq "${ip}" /tmp/cb.txt
                        then
                                echo "${timestamp} detected bot from ${ip} - blocking" >>/var/log/messages
                                /sbin/iptables -I INPUT -s "${ip}" -j REJECT
                                echo "${ip}" >>/tmp/cb.txt
                        fi
                fi
        done
 
done

This script basically searches the current log for any client IP that has hit the server more than 2500 times. That threshold is easy to change if you want more or less leeway, and it would be just as easy to adapt the script to ignore local IPs or similar: add a grep -v 127.0.0.1 into the pipeline that extracts the addresses.
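To illustrate that counting pipeline and the local-IP exclusion, here is the same set of commands run against a throw-away sample log (the file path and addresses below are made up purely for demonstration):

```shell
# Build a tiny sample access log: three hits from one outside IP,
# one from localhost (paths and IPs are illustrative only)
printf '%s\n' \
  '203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512' \
  '203.0.113.9 - - [10/Oct/2023:13:55:37 +0000] "GET /a HTTP/1.1" 200 512' \
  '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "GET /b HTTP/1.1" 200 512' \
  '127.0.0.1 - - [10/Oct/2023:13:55:39 +0000] "GET /status HTTP/1.1" 200 64' \
  > /tmp/sample_access_log

# Same pipeline as the script, with local traffic filtered out before counting
awk '{print $1}' /tmp/sample_access_log \
  | grep -v '^127\.0\.0\.1$' \
  | sort | uniq -c | sort -n | tail
```

This prints a single count line for 203.0.113.9 and nothing for localhost, which is exactly the shape of data the while loop in the script then reads.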

If you use this regularly it would help to empty the IP cache in /tmp/cb.txt every now and again, and to save your iptables rules so the blocks survive a reboot.
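As a sketch, the root crontab for the whole arrangement might look something like this; the script path, the rules file location, and the exact schedules are all assumptions, so adjust them to suit your setup:

```shell
# Run the blocker every 5 minutes (script path is an assumption)
*/5 * * * * /root/crawlerblock.sh
# Persist the current iptables rules nightly so blocks survive a reboot
0 3 * * * /sbin/iptables-save > /etc/iptables.rules
# Empty the IP cache weekly so the list does not grow forever
0 4 * * 0 cp /dev/null /tmp/cb.txt
```

Note that saving the rules is only half the job: on Debian-based systems you would still need something like the iptables-persistent package, or an iptables-restore call at boot, to reload them after a restart.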

Let us know if you need help setting this up on your VPS by dropping an email to support.

Note: this script was written for Debian-based systems; the paths may need tweaking for other distros.