| 26 May 2004 | Block those bots! | |
There are lots of bots (automated ro-bots, also often called ‘spiders’ since they crawl over the web) active on the internet, downloading the content of websites. Many are benign and useful, some are badly designed, others are a positive menace. This little section gives the PHP code for a Badly-Behaved Bot Prevention Routine that was implemented yesterday and proved its value just hours later by frustrating some Dutch prat who was trying to scrape Modem-Help at a thousand pages a minute. Google currently shows Modem-Help to have about 10,100 pages. It has taken 5 years of intensive work to get the site to its current position and, quite apart from the irritation that someone would want to rip that off in 10 minutes, just one uncontrolled bot can bring the server to its knees, shut out humans trying to legitimately browse the site, and send my bandwidth costs soaring. This BBBP routine (hurrah! another acronym!) is transparent to both well-behaved bots and normal browsing, and has the wonderful side-effect of a Smug Smile of Satisfaction (SSS) at being able to give the finger to all those rip-off merchants. |
The routine is drawn from webmasterworld.com (they have had to go pay-me, so the link goes via Google). My additions are principally to stop the log file growing forever, plus a separate reporting routine. The code is tested and working. Here is the code:

    // -------------- Start blocking badly-behaved bots -------
    $itime     = 10; // Check interval, seconds
    $imaxvisit = 61; // Maximum visits
    $ipenalty  = 60; // Seconds before visitor is allowed back
    $iplogfile = _B_DIRECTORY._B_LOGFILE;
    // One zero-byte file per group of IPs; its modification time is the counter
    $ipfile    = _B_DIRECTORY.substr( md5($_SERVER["REMOTE_ADDR"]), -2 );
    $time      = time();
    $oldtime   = ( file_exists( $ipfile ))
       ?filemtime( $ipfile )
       :0;
    if( $oldtime < $time ) { $oldtime = $time; }
    $newtime   = $oldtime + $itime; // every visit pushes the timestamp $itime secs ahead
    if( $newtime >= $time + $itime*$imaxvisit ) {
       // Timestamp is too far ahead of the clock: set the penalty and refuse the page
       touch( $ipfile, $time + $itime*( $imaxvisit-1 ) + $ipenalty );
       header("HTTP/1.0 503 Service Temporarily Unavailable");
       header("Connection: close");
       header("Content-Type: text/html");
       echo "<html><body><p><b>Server under heavy load</b><br />";
       echo "More than $imaxvisit visits from your IP-Address within the last $itime secs. ";
       echo "Please wait $ipenalty secs before retrying.</p></body></html>";
       if( $fp = fopen( $iplogfile, "a" )) {
          $useragent = ( isset( $_SERVER["HTTP_USER_AGENT"]))
             ?$_SERVER["HTTP_USER_AGENT"]
             :"<unknown user agent>";
          $logline = $_SERVER["REMOTE_ADDR"]." ".
             date("d/m/Y H:i:s")." ".$useragent."\n";
          $log = file( $iplogfile );
          if( count( $log ) >= _B_LOGMAXLINES ) { // otherwise grows like Topsy
             // Log is full: drop the oldest line, add the new one, rewrite the file
             fclose( $fp );
             array_shift( $log );
             array_push( $log, $logline );
             $logline = implode("", $log);
             $fp = fopen( $iplogfile, "w" );
          }
          fputs( $fp, $logline );
          fclose( $fp );
       }
       exit();
    }
    touch( $ipfile, $newtime );
    // -------------- Stop blocking badly-behaved bots --------
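
To see how $itime, $imaxvisit and $ipenalty interact, the timing arithmetic of the routine can be replayed in memory, without touching the filesystem. The sketch below is an illustration only: `simulateVisits()` is a hypothetical helper (not part of the routine), a plain variable stands in for the modification time of a single $ipfile, and the visit times are passed in as integers instead of coming from time().

```php
<?php
// Hypothetical helper: replays the routine's timing logic for one $ipfile.
// $visits is a list of visit times in seconds; the result holds, for each
// visit in order, true (page served) or false (blocked with the 503 page).
function simulateVisits(array $visits, int $itime = 10, int $imaxvisit = 61, int $ipenalty = 60): array
{
    $mtime  = 0;   // stands in for filemtime($ipfile); 0 means "no file yet"
    $served = [];
    foreach ($visits as $time) {
        $oldtime = max($mtime, $time);   // same as: if ($oldtime < $time) $oldtime = $time;
        $newtime = $oldtime + $itime;
        if ($newtime >= $time + $itime * $imaxvisit) {
            // Blocked: the timestamp is reset to include the penalty
            $mtime    = $time + $itime * ($imaxvisit - 1) + $ipenalty;
            $served[] = false;
        } else {
            $mtime    = $newtime;
            $served[] = true;
        }
    }
    return $served;
}

// 100 requests arriving within the same second
$burst = simulateVisits(array_fill(0, 100, 1000));
echo count(array_filter($burst)), " served\n";                       // 60 served

// One request every 10 seconds for an hour
$steady = simulateVisits(range(1000, 4600, 10));
echo count(array_filter($steady)), " of ", count($steady), " served\n"; // 361 of 361 served
```

With the default settings, the burst is cut off after 60 pages, while a visitor who asks for one page every $itime seconds is never blocked.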
The routine wants to appear on every single PHP page, near the top. There are a number of ways to do this (left as an exercise for the reader). _B_DIRECTORY and _B_LOGFILE are define()s also used in the reporting routine (below). _B_LOGMAXLINES here is also a define, but could be placed as another variable at the top of the routine. $itime, $imaxvisit & $ipenalty together control the blocking behaviour of the routine.

_B_DIRECTORY is the directory where $ipfile & $iplogfile will be stored. As always, place this directory outside the root of the web-pages so that it cannot be viewed from the internet, and chmod it so that PHP can both read and write files into it (0777 does the job). The directory name needs to be the full machine (not internet) directory address together with a trailing slash. As written, the routine will create a maximum of 256 zero-byte $ipfiles and one $iplogfile on the fly.

So far (36 hours), this works fine on a twin-Xeon machine currently serving an average of 18,377 php-pages each day. Just one clear scumbag was blocked, and the site is crawled by Google and other spiders every day. [31 May update: after 6 days this becomes 4 scumbags blocked (300 total log-lines, representing 300 blocks across a total of 45 seconds for 4 IP-addresses), and no problems for legitimate spiders or ordinary visitors; if you own a php-site do not hesitate! implement this code NOW].

_B_LOGFILE is the name of the logfile used to store details of rogue bots. It will never be more than _B_LOGMAXLINES lines long (for me, 1,000 lines).

$itime & $imaxvisit are currently set at 10 seconds and 61 visits. I chose these with a ceiling of 366 pages a minute from a single IP-address in mind (or 93,696 per minute from the entire internet if you want to be awkward). Note, though, that as coded each visit pushes the $ipfile timestamp $itime seconds ahead of the clock, and the block fires once that lead reaches $itime * ($imaxvisit - 1) seconds (600 with these settings). What is actually allowed is therefore a burst of about 60 pages, followed by one page every 10 seconds until the backlog drains. Although I cannot now find the page, Google has previously stated that their bots crawl at a maximum of 6 pages a second. This is pretty fast! but I chose the settings to try to make sure that I do not block Google or other well-behaved bots.

Extremely busy sites may need to increase the number of $ipfiles (change the digit ‘2’ to ‘3’ in the substr() call, which raises the maximum from 256 to 4,096). |
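
The setup described above might be sketched as follows. The define names are the ones the routine uses; the directory path, the log-file name and the `bucketFile()` helper are placeholders invented for illustration, so substitute your own values.

```php
<?php
// Assumed values: adjust the path and names to your own server.
define('_B_DIRECTORY', '/home/example/botblock/');  // full machine path, trailing slash
define('_B_LOGFILE', 'blocked.log');
define('_B_LOGMAXLINES', 1000);

// Hypothetical helper: the $ipfile name that the routine derives from an
// IP address. $digits is how many trailing hex digits of the md5 hash are kept.
function bucketFile(string $ip, int $digits = 2): string
{
    return _B_DIRECTORY . substr(md5($ip), -$digits);
}

echo bucketFile('62.194.24.133'), "\n";

// Each hex digit has 16 possible values, so the file count is 16^digits.
echo 16 ** 2, " possible files with 2 digits\n";  // 256
echo 16 ** 3, " possible files with 3 digits\n";  // 4096
```

Because the file name comes from only the last digits of the hash, unrelated IP addresses can land on the same $ipfile and are then counted together; keeping more digits makes that less likely. One of the several ways to run the routine on every page is to save it in a file of its own and pull that in with require_once near the top of each page.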
My instinct and desire is always to check what is happening, particularly with a new feature, so here is the logfile-reporting routine used on my maintenance index file. You will find the hP() and other HTML functions at the 08 Nov 2002 diary page, or write your own. Here is the code:

    // An absent log file (nothing blocked yet) is treated as an empty log
    $log     = ( file_exists( _B_DIRECTORY._B_LOGFILE ))
       ?file( _B_DIRECTORY._B_LOGFILE )
       :array();
    $tot     = count( $log );
    $i       = -1;
    $display = array(); // one entry per distinct log line: array( text, repeat count )
    $ips     = array(); // number of log lines per IP-address
    rsort( $log );      // descending order, so identical lines sit together
    foreach( $log as $lineNumber => $lineText ) { // foreach replaces each(), which PHP 8 removed
       if( $lineNumber>0 && strstr( $log[ $lineNumber-1 ],$lineText )) { // repeated line
          $lineCount++;
       } else {
          $i++;
          $lineCount = 1;
       }
       $display[ $i ] = array( $lineText,$lineCount );
       $temp = explode( " ",$lineText );
       $ip   = $temp[ 0 ];
       $ips[ $ip ] = ( isset( $ips[ $ip ]))
          ?$ips[ $ip ] + 1
          :1;
    }
    $rows = ""; foreach( $ips as $ip => $lines) {
       $rows.= hLI("$ip [ <i>".gethostbyaddr( $ip )."</i> ] $lines line(s)");
    }
    $Rows = ""; foreach( $display as $line) {
       list( $text,$lines ) = $line;
       $Rows.= hTR( hTD( $text ).hTD( $lines,1,1,0,0,"","","","","num"));
    }
    // $botContent is assumed to have been initialised earlier on the maintenance page
    $botContent.=hDiv(
       hH("Block Log\n",3).
       hP("Blocked IPs:").
       hUL( $rows ).
       hP("$tot total lines in log-file.").
       hTable(
          hTHead(
             hTR(
                hTH("Log Line "._T_DESC).
                hTH("Lines")
             ),"","middle"
          ).
          hTBody($Rows),
          "",
          "",
          "",
          0,
          "specs"
       ),"","bottomContent"
    );
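
For readers without the hP() family of HTML helpers, the counting part of the report can be sketched on its own. This is an illustration under assumptions, not the routine above: `summariseLog()` is a hypothetical function, it takes the log lines as an array instead of reading _B_LOGFILE, and it groups exactly identical lines where the original compares neighbouring lines with strstr().

```php
<?php
// Hypothetical helper: summarise block-log lines of the form
// "IP d/m/Y H:i:s user-agent".
// Returns array( lines per IP-address, repeat count per distinct line ).
function summariseLog(array $log): array
{
    $ips   = [];
    $lines = [];
    foreach ($log as $lineText) {
        $lineText = rtrim($lineText, "\r\n");
        if ($lineText === '') {
            continue;                        // skip blank lines
        }
        $ip = explode(' ', $lineText)[0];    // first field is the IP-address
        $ips[$ip]         = ($ips[$ip] ?? 0) + 1;
        $lines[$lineText] = ($lines[$lineText] ?? 0) + 1;
    }
    krsort($lines, SORT_STRING);             // descending string order, like rsort() on the raw lines
    return [$ips, $lines];
}

$sample = [
    "62.194.24.133 25/05/2004 13:39:38 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)\n",
    "62.194.24.133 25/05/2004 13:39:39 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)\n",
    "62.194.24.133 25/05/2004 13:39:39 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)\n",
];
[$ips, $lines] = summariseLog($sample);
print_r($ips);    // one IP-address with 3 lines
print_r($lines);  // the 13:39:39 line (2 repeats) first, then the 13:39:38 line (1)
```

The two arrays can then be fed to whatever HTML output you prefer.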
The report output is shown below, together with a daily snapshot from my bandwidth page; in the snapshot you can clearly see the blue spike at 05:20 from uploading the routines, and a green spike at 13:39 as the scumbag got stuck at 61 PHP-pages + 183 go-away pages. Yeah!

Blocked IPs:

183 total lines in log-file.

| Log Line | Lines |
|---|---|
| 62.194.24.133 25/05/2004 13:39:39 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 15 |
| 62.194.24.133 25/05/2004 13:39:38 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 13 |
| 62.194.24.133 25/05/2004 13:39:37 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 16 |
| 62.194.24.133 25/05/2004 13:39:36 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 14 |
| 62.194.24.133 25/05/2004 13:39:35 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 16 |
| 62.194.24.133 25/05/2004 13:39:34 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 18 |
| 62.194.24.133 25/05/2004 13:39:33 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 16 |
| 62.194.24.133 25/05/2004 13:39:32 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 18 |
| 62.194.24.133 25/05/2004 13:39:31 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 15 |
| 62.194.24.133 25/05/2004 13:39:30 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 16 |
| 62.194.24.133 25/05/2004 13:39:29 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 17 |
| 62.194.24.133 25/05/2004 13:39:28 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98) | 9 |
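
Each entry in the first column follows the format that the blocking routine writes: IP-address, date, time, then the user agent. Should you want to post-process the log, a line can be split back into its fields along these lines; `parseLogLine()` is a hypothetical helper and not part of the original code.

```php
<?php
// Hypothetical helper: split one block-log line into its fields.
// Returns null when the line does not have the expected four parts.
function parseLogLine(string $line): ?array
{
    // Limit of 4 keeps the spaces inside the user-agent string intact
    $parts = explode(' ', rtrim($line, "\r\n"), 4);
    if (count($parts) < 4) {
        return null;
    }
    [$ip, $date, $clock, $agent] = $parts;
    // Same format string as the date() call in the blocking routine
    $when = DateTime::createFromFormat('d/m/Y H:i:s', "$date $clock");
    if ($when === false) {
        return null;
    }
    return ['ip' => $ip, 'when' => $when, 'agent' => $agent];
}

$entry = parseLogLine('62.194.24.133 25/05/2004 13:39:39 Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)');
echo $entry['ip'], ' at ', $entry['when']->format('Y-m-d H:i:s'), "\n";
// prints: 62.194.24.133 at 2004-05-25 13:39:39
```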