How to avoid being detected as bot on Puppeteer and Phantomjs?

谁都会走 submitted on 2020-05-24 11:32:10

Question


Puppeteer and PhantomJS are similar. The issue I'm having is happening for both, and the code is also similar.

I'd like to collect some information from a website that requires authentication to view it. I can't even access the home page because my request is flagged as "suspicious activity", as in this screenshot: https://i.imgur.com/p69OIjO.png

I found that the problem doesn't occur when I test in Postman with a Cookie header whose value is copied from the browser, but that cookie expires after some time. So I guess Puppeteer and PhantomJS are not getting the cookies, because the site is denying the headless browser access.

What can I do to bypass this?

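// Simple JavaScript example (PhantomJS)
var page = require('webpage').create();
var url = 'https://www.expertflyer.com';

page.open(url, function (status) {
    if (status === "success") {
        page.render("home.png");
    } else {
        console.log("Failed to load " + url);
    }
    // Exit in both cases; otherwise PhantomJS hangs when the load fails.
    phantom.exit();
});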

Answer 1:


The website you are trying to visit uses Distil Networks to prevent web scraping.

People have had success in the past bypassing Distil Networks by substituting the $cdc_ variable found in Chromium's call_function.js (which is used in Puppeteer).

For example:

function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  // var key = '$cdc_asdjflasutopfhvcZLmcfl_';    <-- This is the line that is changed.
  var key = '$something_different_';
  if (w3c) {
    if (!(key in doc))
      doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc))
      doc[key] = new Cache();
    return doc[key];
  }
}

Note: According to this comment, if you have been blacklisted before you make this change, you face another set of challenges, so you must "implement fake canvas fingerprinting, disable flash, change IP, and change request header order (swap language and Accept headers)."




Answer 2:


Things that can help in general:

  • Headers should be similar to those of common browsers, including:
    • User-Agent: use a recent one (see https://developers.whatismybrowser.com/useragents/explore/), or better, use a random recent one if you make multiple requests (see https://github.com/skratchdot/random-useragent)
    • Accept-Language: something like "en-US,en;q=0.5" (adapt for your language)
    • Accept: a standard one would be like "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  • If you make multiple requests, put a random delay between them
  • If you open links found in a page, set the Referer header accordingly
  • Images should be enabled
  • JavaScript should be enabled
    • Check that "navigator.plugins" and "navigator.language" are set in the page's client-side JavaScript context
  • Use proxies

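As a rough sketch of how several of these points could be applied with Puppeteer: the helper names buildHeaders, randomDelayMs, and visitInOrder are made up for illustration, and the page argument is assumed to be a Puppeteer Page created elsewhere. Only Accept-Language is overridden here, on the assumption that Chromium's own per-request Accept values are already browser-like.

```javascript
// Hypothetical helpers; the names are illustrative, not part of Puppeteer.

// Extra headers applied to every request made by the page.
function buildHeaders() {
  return { 'Accept-Language': 'en-US,en;q=0.5' };
}

// Random integer delay in [minMs, maxMs], so requests are not evenly spaced.
function randomDelayMs(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits each URL in turn, sending the previously visited URL as the Referer.
// `page` is assumed to be a Puppeteer Page; `userAgent` is a string you choose.
async function visitInOrder(page, urls, userAgent) {
  await page.setUserAgent(userAgent);
  await page.setExtraHTTPHeaders(buildHeaders());
  let referer;
  for (const url of urls) {
    await page.goto(url, referer ? { referer } : {});
    // Log what the page's own JavaScript would see.
    const seen = await page.evaluate(() => ({
      plugins: navigator.plugins.length,
      language: navigator.language,
    }));
    console.log(url, seen);
    referer = url;
    await sleep(randomDelayMs(1000, 5000));
  }
}
```

The 1 to 5 second delay range is an arbitrary choice for the sketch; nothing here guarantees that a given site will stop flagging the traffic.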


Answer 3:


From the website's perspective, you are indeed doing something suspicious. So whenever you want to bypass something like this, think about how the site's operators are thinking.

Set cookie properly

Puppeteer, PhantomJS, and similar tools drive a real browser engine, so they handle cookies more faithfully than Postman or similar tools. You just need to set the cookies properly.

You can use page.setCookie(...cookies) to set the cookies. It takes each cookie object as a separate argument, so if cookies is an array of objects, you can simply spread it:

const cookies = [{name: 'test', value: 'foo'}, {name: 'test2', value: 'foo'}]; // just as example, use real cookies here;
await page.setCookie(...cookies);
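If what you have is the raw Cookie header value copied from the browser (as in the Postman test from the question), one possible way to turn it into objects for page.setCookie is sketched below. The helper name parseCookieHeader and the cookie names in the usage comment are placeholders, not anything the site actually uses.

```javascript
// Hypothetical helper: converts a raw Cookie header value into objects
// shaped like the ones page.setCookie() accepts.
function parseCookieHeader(header, domain) {
  return header
    .split(';')
    .map((pair) => pair.trim())
    .filter((pair) => pair.includes('='))
    .map((pair) => {
      const idx = pair.indexOf('='); // split on the first '=' only; values may contain '='
      return { name: pair.slice(0, idx), value: pair.slice(idx + 1), domain };
    });
}

// Usage, assuming `page` is a Puppeteer Page:
//   const cookies = parseCookieHeader('sid=abc123; lang=en', 'www.expertflyer.com');
//   await page.setCookie(...cookies);
```

The copied cookie will still expire eventually, so this only postpones the problem described in the question.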

Try to tweak the behaviors

Turn off headless mode and see how the website behaves.

await puppeteer.launch({headless: false})

Try proxies

Some websites monitor traffic by IP address; if multiple hits come from the same IP, they block the requests. It's best to use rotating proxies in that case.
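A minimal sketch of round-robin proxy rotation follows. The proxy addresses and the helper names makeProxyRotator and launchOptionsFor are invented for illustration; --proxy-server is a Chromium command-line switch that applies to the whole browser process, so this approach assumes a fresh browser launch per proxy.

```javascript
// Hypothetical proxy list; replace with proxies you are allowed to use.
const PROXIES = ['http://proxy-a.example:8080', 'http://proxy-b.example:8080'];

// Returns a function that cycles through the list in round-robin order.
function makeProxyRotator(proxies) {
  let next = 0;
  return () => {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;
    return proxy;
  };
}

// Builds launch options that route Chromium through the given proxy.
function launchOptionsFor(proxy) {
  return { args: [`--proxy-server=${proxy}`] };
}

// Usage, assuming puppeteer is installed:
//   const nextProxy = makeProxyRotator(PROXIES);
//   const browser = await puppeteer.launch(launchOptionsFor(nextProxy()));
```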



Source: https://stackoverflow.com/questions/51731848/how-to-avoid-being-detected-as-bot-on-puppeteer-and-phantomjs
