Facebook and Crawl-delay in Robots.txt?

Asked by 旧时难觅i, 2021-01-02 03:39

Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?
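
For context, the directive being asked about would be written in robots.txt roughly like this; facebookexternalhit is the user-agent token Facebook's crawler identifies itself with, and the 2-second delay is only an illustrative value:

    User-agent: facebookexternalhit
    Crawl-delay: 2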

5 Answers
  •  情深已故 (answered 2021-01-02 04:39)

    For a similar question, I offered a technical solution that simply rate-limits load based on the user-agent.

    Code repeated here for convenience:

    Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.

    In PHP, execute the following code as early as possible on every request.

    define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit
    
    if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
        $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
        // 'c+' opens the file for reading and writing, creating it if it does not exist
        if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
            $lastTime = fread( $fh, 100 );
            $microTime = microtime( TRUE );
            // compare the current time with the time of the last facebookexternalhit request
            if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
                // requests are arriving too quickly; bail out with HTTP 503 Service Unavailable
                header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
                die;
            } else {
                // record the current time as the new last-access timestamp
                rewind( $fh );
                fwrite( $fh, $microTime );
                ftruncate( $fh, ftell( $fh ) ); // drop any leftover bytes from a longer, older timestamp
            }
            fclose( $fh );
        } else {
            // could not open the throttle file; fail closed with a 503 rather than let the hit through
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            die;
        }
    }
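
    One way to make sure this runs before any page logic (not part of the original answer, so treat it as a sketch with hypothetical file names) is to save the snippet to its own file and pull it in at the very top of the front controller, or prepend it to every request with php.ini's auto_prepend_file setting:

    <?php
    // index.php -- hypothetical front controller
    // Run the facebookexternalhit throttle before any routing or database work.
    require __DIR__ . '/fb-throttle.php'; // the snippet above (with an opening <?php tag), saved as its own file
    
    // ... normal application bootstrap continues here ...

    The equivalent php.ini line would be auto_prepend_file = /var/www/fb-throttle.php, which keeps the throttle out of application code entirely.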
    
