Facebook crawler is hitting my server hard and ignoring directives. Accessing same resources multiple times

盖世英雄少女心 2021-02-05 05:52

The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.

In some cases, it is …

8 Answers
  •  天命终不由人
    2021-02-05 06:32

    After I tried almost everything else with caching, headers and what not, the only thing that saved our servers from the "overly enthusiastic" Facebook crawler (user agent facebookexternalhit) was simply denying access and sending back an HTTP/1.1 429 Too Many Requests response when the crawler "crawled too much".

    Admittedly, we had thousands of images we wanted the crawler to crawl, but the Facebook crawler was practically DDoSing our server with tens of thousands of requests per hour (yes, the same URLs over and over). I remember it was 40,000 requests per hour from different Facebook IP addresses using the facebookexternalhit user agent at one point.

    We did not want to block the crawler entirely, and blocking by IP address was also not an option. We only needed the FB crawler to back off (quite) a bit.

    This is a piece of PHP code we used to do it:

    .../images/index.php

    <?php

    // Minimum number of seconds allowed between two hits from facebookexternalhit (tune to taste).
    define('FACEBOOK_REQUEST_THROTTLE', 60);
    // Temp file holding the timestamp of the crawler's last request.
    define('FACEBOOK_REQUEST_COOLDOWN_FILE', sys_get_temp_dir() . '/facebookexternalhit');

    if (!empty($_SERVER['HTTP_USER_AGENT'])
            && strpos($_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit') !== false) {
        // Time of the crawler's previous request; 0 if it has not hit us yet.
        $lastTime = (float) @file_get_contents(FACEBOOK_REQUEST_COOLDOWN_FILE);
        if (microtime(true) - $lastTime <= FACEBOOK_REQUEST_THROTTLE) {
            header("HTTP/1.1 429 Too Many Requests", true, 429);
            header("Retry-After: 60");
            die;
        }
        // Remember when we last let the crawler through.
        @file_put_contents(FACEBOOK_REQUEST_COOLDOWN_FILE, microtime(true));
    }

    // Everything under this comment happens only if the request is "legit".

    $filePath = $_SERVER['DOCUMENT_ROOT'] . $_SERVER['REQUEST_URI'];
    if (is_readable($filePath)) {
        header("Content-Type: image/png");
        readfile($filePath);
    }
    
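    A quick way to sanity-check a throttle like this is to impersonate the crawler's user agent with curl (example.com and the image name below are placeholders, not our real URLs):

    # First request should return 200; a second one fired within the
    # throttle window should come back as HTTP/1.1 429.
    curl -i -A "facebookexternalhit/1.1" https://example.com/images/some-image.png
    curl -i -A "facebookexternalhit/1.1" https://example.com/images/some-image.png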

    You also need to configure rewriting to pass all requests directed at your images to this PHP script:

    .../images/.htaccess (if you're using Apache)

    RewriteEngine On
    RewriteRule .* index.php [L] 
    
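    If you're on nginx instead, a roughly equivalent rewrite is sketched below; it assumes your images live under /images/ and that an existing PHP-FPM location block already handles .php files:

    # .../images/ - send every image request to the throttling script
    location /images/ {
        rewrite ^ /images/index.php last;
    }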

    It seems like the crawler "understood" this approach and effectively reduced the request rate from tens of thousands of requests per hour to hundreds/thousands of requests per hour.
