Scraper fails on files over ~390KB

Submitted by 与世无争的帅哥 on 2019-12-11 03:58:02

Question


Does Facebook's URL scraper have a size limitation? We have several books available on a website. Those with an HTML file size under a certain threshold (~390KB) are scraped and read properly, but the 4 that are larger are not. These larger items get a 200 response code, and the canonical URL opens.

All of these pages are built using the same template, the only differences being the size of the content within each book and the number of links each book makes to other pages on the site.

Steps to reproduce (a PHP sketch of the scraper's request follows the URL lists below):

  1. Click the canonical URL.
  2. Open Firebug in Firefox or the developer tools in Chrome and go to the Network tab.
  3. Note the *.html size: >~390KB for the listed failures, <~390KB for the successes.
  4. Click on "See exactly what our scraper sees for your URL".
  5. Blank page for failures; HTML present for successes.

Failures:

  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftapom.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftbgpu.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fttjc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftbdse.html

Successes:

  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fthogtc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Faabibp.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftww.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftsosw.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fsyottc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fttigtio.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Faadac.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fsiud.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftuyc.html
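To reproduce roughly what the scraper requests, here is a minimal PHP sketch. This is not Facebook's scraper itself; the user agent string is the one Facebook documents, and the URL is one of the failing books from the list above:

<?php
// Fetch a failing page with the documented scraper user agent and
// report how many bytes come back, for comparison against a success.
$context = stream_context_create(['http' => ['header' =>
    "User-Agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\r\n"
]]);
$html = file_get_contents('http://rcg.org/books/tapom.html', false, $context);
printf("Fetched %d bytes\n", strlen($html));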

Answer 1:


A solution for your problem might be to check whether a real user or the Facebook bot is visiting your page. If it is the bot, render only the necessary meta data for it. You can detect the bot via its user agent, which according to the Facebook documentation is:
"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

The code would look something like this (in PHP):

function userAgentIsFacebookBot() {
    // The User-Agent header may be absent entirely, so check before comparing
    if (!isset($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    // Strict comparison against the documented scraper user agent
    return $_SERVER['HTTP_USER_AGENT'] === "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)";
}
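
A minimal sketch of how the check might be wired in, assuming a hypothetical og-meta-only.php template that contains just the <head> with the og: meta tags (the file names and surrounding flow are assumptions, not part of the original answer):

<?php
// Hypothetical usage: serve a stripped-down page containing only the
// Open Graph meta tags to the Facebook scraper, keeping the response
// small, and serve the full book page to everyone else.
if (userAgentIsFacebookBot()) {
    include 'og-meta-only.php';  // hypothetical template: just the meta tags
    exit;
}
include 'full-book-page.php';    // hypothetical full page for real visitors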



Answer 2:


Are you sure this isn't a problem on your side? Last time I checked, the scraper requested only the first 4096 bytes of the document, which should always be ample space to retrieve the <head></head> section with the meta tags.
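
One way to sanity-check that claim against a failing page is a quick sketch like the following (the 4096-byte figure is the answerer's recollection rather than documented behavior, and the URL is one of the failing books from the question):

<?php
// Check whether the closing </head> tag, and with it the og: meta tags,
// falls inside the first 4096 bytes the scraper reportedly reads.
$html = file_get_contents('http://rcg.org/books/tapom.html');
$headEnd = strpos($html, '</head>');
if ($headEnd !== false && $headEnd + strlen('</head>') <= 4096) {
    echo "</head> ends within the first 4096 bytes.\n";
} else {
    echo "</head> is missing or past the 4096-byte mark.\n";
}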



Source: https://stackoverflow.com/questions/11915087/scraper-fails-on-files-over-390kb
