How to deal with the captcha when doing Web Scraping in Puppeteer?

执笔经年 2020-12-03 04:17

I'm using Puppeteer for web scraping and I have just noticed that sometimes the website I'm trying to scrape asks for a captcha due to the number of visits I'm doing fro

3 answers
  •  没有蜡笔的小新
    2020-12-03 04:47

    You should use a combination of the following:

    • Use an API if the target website provides one. It is the safest option legally.
    • Increase the wait time between scraping requests; do not send requests to the server in bulk.
    • Change/rotate your IP address frequently.
    • Change the user agent, browser viewport size, and fingerprint.
    • Use third-party solutions for captchas.
    • Resolve the captcha yourself; see the answer by Thomas Dondorf. Basically, you wait for the captcha to appear in another browser and solve it from there. Third-party solutions do this for you.
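
    The throttling and user agent/viewport rotation points above could be sketched roughly as follows. This is only an illustration: the helper names (randomDelayMs, pickProfile, politeVisit), the delay bounds, and the profile objects are assumptions, while setUserAgent, setViewport, and goto are the usual Puppeteer Page methods. IP rotation and deeper fingerprint changes are not covered here.

```javascript
// Hypothetical helpers: names, delay bounds and profile shape are illustrative assumptions.

// Return a random integer delay in [minMs, maxMs] so requests are not evenly spaced.
function randomDelayMs(minMs, maxMs, rand = Math.random) {
  return Math.floor(minMs + rand() * (maxMs - minMs + 1))
}

// Cycle through a list of browser profiles (user agent + viewport).
function pickProfile(profiles, index) {
  return profiles[index % profiles.length]
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Visit each URL with a rotated profile and a randomized pause in between.
// `page` is expected to be a Puppeteer Page; only setUserAgent, setViewport
// and goto are called on it.
async function politeVisit(page, urls, profiles, { minMs = 2000, maxMs = 6000 } = {}) {
  for (let i = 0; i < urls.length; i++) {
    const profile = pickProfile(profiles, i)
    await page.setUserAgent(profile.userAgent)
    await page.setViewport(profile.viewport)
    await page.goto(urls[i])
    // No pause is needed after the last URL.
    if (i < urls.length - 1) await sleep(randomDelayMs(minMs, maxMs))
  }
}
```

    Passing the random source as a parameter keeps the delay logic easy to test; in real use the default Math.random is enough.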

    Disclaimer: Do not use anti-captcha plugins/services to misuse resources. Resources are expensive.


    Basically, the idea is to use an anti-captcha service such as 2captcha to deal with persistent reCAPTCHAs.

    You can use the puppeteer-extra-plugin-recaptcha plugin by berstend.

    // puppeteer-extra is a drop-in replacement for puppeteer,
    // it augments the installed puppeteer with plugin functionality
    const puppeteer = require('puppeteer-extra')
    
    // add recaptcha plugin and provide it your 2captcha token
    // 2captcha is the builtin solution provider but others work as well.
    const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
    puppeteer.use(
      RecaptchaPlugin({
        provider: { id: '2captcha', token: 'XXXXXXX' },
        visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
      })
    )
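
    Since the token is a paid credential, one option is to read it from the environment instead of hard-coding it. A minimal sketch, assuming a variable named TWOCAPTCHA_TOKEN (that name and the helper recaptchaOptionsFromEnv are made up for illustration, not a plugin convention):

```javascript
// Hypothetical helper; TWOCAPTCHA_TOKEN is an assumed variable name.
function recaptchaOptionsFromEnv(env = process.env) {
  const token = env.TWOCAPTCHA_TOKEN
  if (!token) {
    // Fail early rather than sending requests with an empty token.
    throw new Error('TWOCAPTCHA_TOKEN is not set')
  }
  return {
    provider: { id: '2captcha', token },
    visualFeedback: true // same option as in the snippet above
  }
}

// Usage, assuming puppeteer-extra and the plugin are loaded as above:
// puppeteer.use(RecaptchaPlugin(recaptchaOptionsFromEnv()))
```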
    

    Afterwards, you can run the browser as usual. It will pick up any reCAPTCHA on the page and attempt to solve it. You still have to find the submit button yourself; it varies from site to site and may not exist at all.

    // puppeteer usage as normal
    puppeteer.launch({ headless: true }).then(async browser => {
      const page = await browser.newPage()
      await page.goto('https://www.google.com/recaptcha/api2/demo')

      // That's it, a single line of code to solve reCAPTCHAs
      await page.solveRecaptchas()

      await browser.close()
    })
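
    Because the submit button differs per site, the solve-then-submit step might be wrapped in a small helper that takes the selector as a parameter. This is a sketch under assumptions: solveAndSubmit and submitSelector are illustrative names, and page is assumed to come from puppeteer-extra with the recaptcha plugin registered, which is what adds page.solveRecaptchas().

```javascript
// Hypothetical helper: `solveAndSubmit` and `submitSelector` are illustrative names.
async function solveAndSubmit(page, submitSelector) {
  // Provided by puppeteer-extra-plugin-recaptcha once the plugin is registered.
  const result = await page.solveRecaptchas()

  if (submitSelector) {
    // Start waiting for the navigation before clicking so it is not missed.
    await Promise.all([
      page.waitForNavigation(),
      page.click(submitSelector)
    ])
  }
  return result
}
```

    For the demo page above, the call would look like `await solveAndSubmit(page, '#recaptcha-demo-submit')`; treat that selector as an assumption and check the target page's markup. Omitting the selector only solves the captcha, for pages with no submit button.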
