Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?
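For reference, this is the sort of directive I mean (Crawl-delay is a non-standard robots.txt extension, so whether facebookexternalhit honours it at all is the open question):

User-agent: facebookexternalhit
Crawl-delay: 5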
For a similar question, I offered a technical solution that simply rate-limits requests based on the user-agent. The code is repeated here for convenience.
Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution: in PHP, execute the following code as early as possible for every request.
define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime  = fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // Compare the current time with the time of the last recorded access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // Requests are coming too quickly -- bail out with HTTP 503 Service Unavailable
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            die;
        } else {
            // Record the microsecond timestamp of this access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        // Could not open the throttle file -- fail closed with a 503
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
}
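A note on the design: the temp file holds the timestamp of the last facebookexternalhit request and is shared across PHP processes, and the 503 tells the crawler the refusal is temporary rather than a problem with the page itself. If you want to be a little friendlier, you could also send a Retry-After header alongside the 503, e.g. header('Retry-After: 2');, so compliant clients know when to try again.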