Website using DataDome gets captcha blocked while scraping using Selenium and Python

猫巷女王i 2020-12-12 04:10

I'm trying to scrape car data from different websites. I've been using Selenium with the Chrome browser, but some websites block Selenium with a captcha va

2 Answers
  • 2020-12-12 04:54

    A bit more detail about your use case for scraping car data from different websites, or from https://www.leboncoin.fr/, would have helped us construct a more canonical answer. However, I was able to access the page source using Selenium as follows:

    • Code Block:

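      from selenium import webdriver
      from selenium.webdriver.chrome.service import Service
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      # Drop the "enable-automation" switch and the automation extension
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option("useAutomationExtension", False)
      # Selenium 4 takes the driver path through a Service object;
      # the older executable_path= keyword was removed in later 4.x releases
      driver = webdriver.Chrome(service=Service(r"C:\WebDrivers\chromedriver.exe"), options=options)
      driver.get("https://www.leboncoin.fr/")
      print(driver.page_source)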
      
    • Console Output:

      <html class="gServer"><head><link rel="preconnect" href="//fonts.googleapis.com" crossorigin=""><link rel="preload" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin="" as="style"><link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@400;600;700&amp;display=swap" crossorigin=""><style data-emotion-css=""></style><meta charset="utf-8"><link rel="manifest" href="/manifest.json"><link type="application/opensearchdescription+xml" rel="search" href="/opensearch.xml"><meta name="theme-color" content="#ff6e14"><meta property="og:locale" content="fr_FR"><meta property="og:site_name" content="leboncoin"><meta name="twitter:site" content="leboncoin"><meta http-equiv="P3P" content="CP=&quot;This is not a P3P policy&quot;"><meta name="viewport" content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=0"><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://tp.realytics.io/sync/se/cnktbDNiMG5jb3xyeV83NTFGRUQwMy1CMDdGLTRBQTgtOTAxRi1DNUREMDVGRjkxQTJ8?ct=1&amp;rt=1&amp;u=https%3A%2F%2Fwww.leboncoin.fr%2F&amp;r=&amp;ts=1591306049397"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-766292687&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-667462656&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://cdn-eu.realytics.net/realytics-1.2.min.js"></script><script type="text/javascript" async="" src="https://i.realytics.io/tc.js?cb=1591306047755"></script><script 
type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=DC-4167650&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-744431185&amp;l=dataLayer&amp;cx=c"></script><script type="text/javascript" async="" charset="utf-8" src="//www.googleadservices.com/pagead/conversion_async.js" id="utag_82"></script><script type="text/javascript" async="" charset="utf-8" src="//sdk.mpianalytics.com/pulse.min.js" id="utag_47"></script><script async="true" type="text/javascript" src="https://sslwidget.criteo.com/event?a=50103&amp;v=5.5.0&amp;p0=e%3Dexd%26site_type%3Dd&amp;p1=e%3Dvh&amp;p2=e%3Ddis&amp;adce=1&amp;tld=leboncoin.fr&amp;dtycbr=6569" data-owner="criteo-tag"></script><script type="text/javascript" src="//try.abtasty.com/09643a1c5bc909059579da8aac99e8f1.js"></script><script>window.dataLayer = window.dataLayer || [];
      .
      .
      .
      <iframe height="1" width="1" style="display:none" src="//4167650.fls.doubleclick.net/activityi;src=4167650;type=slbc01;cat=all-site;u1=homepage;ord=9979622847645.51?" id="utag_179_iframe"></iframe></body></html>
      

    However, it's quite evident from the DOM tree that the website is protected against bad bots by DataDome:


    DataDome

    According to DataDome's own description, the key features are as follows:

    • DataDome is the only bot protection solution delivered as-a-service.
    • DataDome requires no architecture changes or DNS rerouting.
    • DataDome's bot detection engine compares every request to the website with a massive in-memory pattern database, and uses a blend of AI and machine learning to decide in less than 2 milliseconds whether access to your pages should be granted or not.
    • DataDome detects and identifies 100% of OWASP automated threats.
    • DataDome's Custom Rules function can even allow you to block human traffic from countries you are not selling to, or to allow partner bots to access your site only in specific circumstances.
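
The observation that DataDome shows up in the DOM tree can be turned into a rough check on the page source. This is only a sketch: it assumes a protected page mentions the vendor by name somewhere in its markup, which the answer implies but does not spell out. The function name `mentions_datadome` and the sample script URL are hypothetical.

```python
def mentions_datadome(page_source: str) -> bool:
    """Return True if the markup contains 'datadome' (case-insensitive)."""
    # Assumption: a DataDome-protected page references the vendor by name
    # somewhere in its markup. A page that does not is not proven unprotected.
    return "datadome" in page_source.lower()


# Hypothetical sample markup, not taken from any real site.
sample = '<script src="https://example.invalid/DataDome-tag.js"></script>'
plain = "<html><body>hello</body></html>"

print(mentions_datadome(sample))
print(mentions_datadome(plain))
```

With a live session, the same function could be called as `mentions_datadome(driver.page_source)` right after `driver.get(...)`.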

    Outro

    Documentation on DataDome can be found at:

    • Bot detection
    • Server-side bot detection is not enough
  • 2020-12-12 05:13

    This could be happening for a myriad of reasons. Try going through the answer here, which describes some ways to prevent this problem.

    A simple solution that sometimes worked for me is to use waits or sleep calls in Selenium; see here in the docs about Waits. Sleep calls can be done like so:

    import time
    time.sleep(2)  # pause for 2 seconds
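
The waits and sleep calls mentioned above might be combined roughly as follows. This is a minimal sketch, not a known way past DataDome. The helper names `human_pause` and `wait_for_element`, the default delays, and the randomised pause (in place of the fixed two-second sleep) are all assumptions added for illustration. The explicit wait uses Selenium's `WebDriverWait` and `expected_conditions`.

```python
import random
import time


def human_pause(low=1.0, high=3.0):
    """Sleep for a random duration between low and high seconds; return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


def wait_for_element(driver, css_selector, timeout=10):
    """Block until an element matching css_selector is present in the DOM.

    Raises selenium's TimeoutException if nothing appears within timeout seconds.
    """
    # Imported lazily so human_pause() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )


# Example usage (assumes `driver` is an already-created webdriver.Chrome instance):
# driver.get("https://www.leboncoin.fr/")
# wait_for_element(driver, "body")
# human_pause()
```

An explicit wait returns as soon as the element appears, so it is usually preferable to a fixed sleep when the goal is only to let the page finish loading.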
    