Scraping content from a dynamic webpage with Selenium returns wrong content

Submitted by 本秂侑毒 on 2021-02-10 14:55:49

Question


I am trying to print the HTML of https://www.dplay.no/kanaler/. The page is geo-restricted, so you might have to use https://go.discovery.com/tv-shows/ instead, but that shouldn't matter.

Since the webpage uses JavaScript to load its HTML content, I decided to use Selenium with Python 3 to scrape it.

What I have so far is:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.dplay.no/kanaler')
    html = driver.page_source
    print(html)

I have also tried:

    html = driver.execute_script("return document.documentElement.outerHTML;")

and

    html = driver.execute_script("return document.documentElement.innerHTML;")

However, this does not seem to work because the response I get is not the HTML on the webpage.

How can I get the HTML content that is actually visible on the webpage?


Answer 1:


You are seeing the right output and the correct behavior.

I took your code, added a few options and a wait, and observed the following:

  • Code Block:

    import time

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    # executable_path works on Selenium 3; Selenium 4 expects a Service object instead.
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.dplay.no/kanaler/')
    # Crude fixed wait so the JavaScript has time to render the page.
    time.sleep(10)
    print(driver.page_source)
    
  • Console Output:

      <html lang="no"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,maximum-scale=10,minimum-scale=1,initial-scale=1"><meta name="google" value="notranslate"><title>Strøm kanaler direkte | Dplay</title><link rel="preconnect" href="https://dplay-static.disco-api.com"><link rel="preconnect" href="https://disco-api.dplay.no"><link rel="preconnect" href="https://eu1-prod-images.disco-api.com"><link rel="preconnect" href="https://connect.facebook.net"><link rel="preconnect" href="https://fonts.googleapis.com"><link rel="preconnect" href="https://assets.adobedtm.com"><link rel="preload" as="script" href="/main-1adbd0ca3d3a7141c1a5.js"><meta name="mobile-web-app-capable" content="yes"><link rel="manifest" href="/manifest.json" crossorigin="use-credentials"><link rel="icon" href="/dplay-logo-180.png"><meta name="apple-mobile-web-app-capable" content="yes"><meta name="apple-mobile-web-app-title" content="Dplay"><meta name="apple-mobile-web-app-status-bar-style" content="white"><link rel="apple-touch-icon" href="/dplay-apple-touch-icon.jpg"><link rel="apple-touch-startup-image" href="/dplay-logo-text-180x75.png"><!-- Facebook App link --><meta property="al:ios:url" content="com.discovery.dplay://facebook"><meta property="al:ios:app_store_id" content="KC4ZD2359Y.com.kanal5.play"><meta property="al:ios:app_name" content="Dplay"><meta property="al:android:url" content="com.discovery.dplay://facebook"><meta property="al:android:package" content="no.dplay"><meta property="al:android:app_name" content="Dplay"><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script src="https://secure.quantserve.com/quant.js" async="" type="text/javascript"></script>
      .
      <script src="https://assets.adobedtm.com/479fbb05b9cf/9fc1a3ab6d1b/76543fb834e9/RCea880b60a90b4cb88872a3ecb52c59e0-source.min.js" async=""></script><script src="https://assets.adobedtm.com/479fbb05b9cf/9fc1a3ab6d1b/76543fb834e9/RC5b307908f85d452bbd1cc58e00201436-source.min.js" async=""></script></head><body><div id="app"><div class="pageContainer-1eCorB4H"><div id="header-wrapper" class="sticky-1FwWG4lU"><header class="header-1l1ildAB"><div class="topHeader-zyhEIsC-"><div class="topContainer-21wWp6Os"><a class="link-_ruDcDB7 logoLink-318yvghE" href="/"><img alt="Dplay" class="logo-3IfpM36Y logo-h00c9h56" src="/a08ed345c0fe04696cf31ab3b87100dc.svg"></a><div class="navWrapper-vwKHbhW_"><div class="nav-10tSiGaY"><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/programmer"><div class="navItem-14yB0BB8">Programmer</div></a><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/kanaler"><div class="navItem-14yB0BB8">Kanaler</div></a><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/tv-guide"><div class="navItem-14yB0BB8">TV-guide</div></a><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/sport"><div class="navItem-14yB0BB8">Sport</div></a><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/kategorier"><div class="navItem-14yB0BB8">Kategorier</div></a><a class="link-_ruDcDB7 item-2iwAUPE8 navItem-3wTHBCrm favouritesEnabled-3VQzQJHh" href="/gratis"><div class="navItem-14yB0BB8">Gratis</div></a></div><div class="premiumWrapper-3DTdcxSl"><a class="premiumButton-31dbB505" href="/mydplay/products?configName=auth-prod&amp;hostUrl=disco-api.dplay.no&amp;realm=dplayno&amp;returnUrl=https%3A%2F%2Fwww.dplay.no%2Fkanaler%2F" target="_self">Registrer</a></div></div><div class="iconWrapper-3mBB7-5x"><a class="link-ear3kCaw" 
href="/mydplay/entry/login?configName=auth-prod&amp;hostUrl=disco-api.dplay.no&amp;realm=dplayno&amp;returnUrl=https%3A%2F%2Fwww.dplay.no%2Fkanaler%2F" target="_self"><div class="container-2M8eCiLJ favouritesEnabled-3pfkgJ2m"><span class="label-2g_F1Qvf">Logg inn</span><span class="SVGInline icon-1tqhFCqf icon-hn3OCBQP" style="font-size: 0px;"><svg class="SVGInline-svg icon-1tqhFCqf-svg icon-hn3OCBQP-svg" viewBox="0 0 28 28" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><title>ic_icon_login_default</title><desc>Created with Sketch.</desc><g id="ic_icon_login_default" stroke="none" stroke-width="1"><g id="Login"><rect id="Rectangle" fill="#D8D8D8" opacity="0" x="0" y="0" width="28" height="28"></rect><g id="Group" transform="translate(3.192000, 3.024000)"><path d="M10.7907276,10.976 C5.48646358,10.976 1.06106358,14.738528 0.0376635838,19.740224 C-0.196360416,20.884192 0.690455584,21.952 1.85816758,21.952 L19.7230636,21.952 C20.9033756,21.952 21.7773676,20.865488 21.5375196,19.70976 C20.5024716,14.72324 16.0841836,10.976 10.7907276,10.976 M10.7907276,13.776 C11.7565596,13.776 12.7005516,13.941984 13.5966076,14.269416 C14.4628716,14.585872 15.2652956,15.045296 15.9817036,15.634864 C17.1155916,16.568104 17.9765916,17.79008 18.4745996,19.152 L3.10685558,19.152 C3.60351958,17.793552 4.46115958,16.574544 5.59112758,15.641976 C6.30820758,15.050224 7.11175158,14.589064 7.97947158,14.27132 C8.87709558,13.942656 9.82299158,13.776 10.7907276,13.776" id="Fill-1"></path><circle id="Oval" fill-rule="nonzero" cx="10.808" cy="4.816" r="4.816"></circle></g></g></g></svg></span></div></a>
      .
      <div class="text-1Ey12L6b"><p class="paragraph-3wtxxPuR size2-34rTNEs0">Dplay bruker cookies på nettsiden for å huske dine innstillinger, lage statistikker for å forbedre nettsiden vår, og å gi deg de mest relevante annonsene. Denne informasjonen kan deles med tredjeparter. Ved å fortsette å bruke nettsiden aksepterer du vår bruk av cookies, men du kan når som helst endre denne godkjenningen ved å følge instruksene på vår <a class="" href="https://dplay.no/cookies" rel="noopener" target="_blank">Cookies-side</a>. Her kan du også lese mer om dette</p></div></div><div class="links-2-4rTI9u"></div><button class="button-b4wYudld round-1Ew9jgjq default-vjGITl8z tertiaryCTA-3nF7cF3Z button-2j5j5ldl" type="button"><div class="content-2CZAzoNK"><p class="paragraph-3wtxxPuR text-2iB55dam size3-3bK_JR3k">Ok, jeg aksepterer</p></div></button></div></div><noscript></noscript><noscript></noscript><noscript></noscript><noscript></noscript></dialog><div class="footer-2i64orTD"><footer class="footer-OP_eHgMZ"><div class="container-1KS4F4y4"><div class="base-1JDWzsKS divider-1J9xjEr7"></div><div class="links-3cRELxmJ"><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="link-_ruDcDB7" href="/brukervilkaar">Brukervilkår</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="link-_ruDcDB7" href="/personvernpolicy">Personvernpolicy</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="" href="https://dplayhelp.zendesk.com/hc/no" rel="noopener" target="_blank">Kundeservice</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="link-_ruDcDB7" href="/om-dplay">Om Dplay</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="link-_ruDcDB7" 
href="/cookies">Cookies</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="link-_ruDcDB7" href="/systemkrav">Systemkrav</a></p></div><div class="linkAligner-2mmWPhvh"><p class="paragraph-3wtxxPuR paragraph-jt9VMa_X size1-Aclz5TEc"><a class="" href="https://presse.discovery.no/" rel="noopener" target="_blank">Presse</a></p></div></div><div class="base-1JDWzsKS divider-1J9xjEr7"></div><div class="logos-2tROKQvT"><a class="link-_ruDcDB7" href="/kanaler/tvnorge"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2018/11/16/channel-28-11261681250457276.png?w=108" class="logo-1DS_OQCW" alt="TVNorge"></div></a><a class="link-_ruDcDB7" href="/kanaler/fem"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2018/11/16/channel-29-11262316210706002.png?w=108" class="logo-1DS_OQCW" alt="FEM"></div></a><a class="link-_ruDcDB7" href="/kanaler/max"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2018/11/16/channel-30-11262268785616804.png?w=108" class="logo-1DS_OQCW" alt="MAX"></div></a><a class="link-_ruDcDB7" href="/kanaler/vox"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2018/11/16/channel-31-11261733016544693.png?w=108" class="logo-1DS_OQCW" alt="VOX"></div></a><a class="link-_ruDcDB7" href="/kanaler/discovery"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2019/10/08/channel-45-314717396207329.png?w=108" class="logo-1DS_OQCW" alt="Discovery"></div></a><a class="link-_ruDcDB7" href="/kanaler/animal-planet"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2019/01/22/channel-35-17020156064294169.PNG?w=108" class="logo-1DS_OQCW" alt="Animal Planet"></div></a><a class="link-_ruDcDB7" href="/kanaler/tlc"><div class="logoAligner-3Lo3l93o"><img 
src="https://eu1-prod-images.disco-api.com/2018/11/19/channel-15-4230971263537569.png?w=108" class="logo-1DS_OQCW" alt="TLC"></div></a><a class="link-_ruDcDB7" href="/kanaler/id"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2018/11/19/channel-73-4230992926516029.png?w=108" class="logo-1DS_OQCW" alt="Investigation Discovery"></div></a><a class="link-_ruDcDB7" href="/kanaler/discovery-science"><div class="logoAligner-3Lo3l93o"><img src="https://eu1-prod-images.disco-api.com/2019/10/08/channel-71-314744145281602.png?w=108" class="logo-1DS_OQCW" alt="Discovery Science"></div></a></div><section class="AppStoreLogosWrapper"><div class="base-1JDWzsKS divider-1J9xjEr7"></div></section><div class="copyrightContainer-2T6iDmRy"><p class="paragraph-3wtxxPuR copyright-2F2sRiJ4 size4-V7KSEEpz uppercase-IgQ1hyw0">Copyright © 2019 Discovery, Inc. or its subsidiaries and affiliates. All rights reserved.</p><a class="discoveryLogo-2PuZiJgQ" href="https://corporate.discovery.com/" rel="noopener" target="_blank"><img alt="Dplay" class="logo-3IfpM36Y" src="https://eu1-prod-images.disco-api.com/2019/3/26/35fc368d-4fb8-4c39-84a8-62eb61a8aeff.png"></a></div></div></footer></div></div></div><script>_satellite["__runScript1"](function(event, target) {
    
      try {
    
      var _hj_country_ids = {
        se : "767702",
        no : "767794",
        dk : "767799",
        fi : "1018217",
        jp : "1749918",
        nl : "1749920"
      }
      var _hj_ctry = /([a-z]{2})$/.exec(document.location.host)[0];
    
      if (_hj_country_ids.hasOwnProperty(_hj_ctry)){
        (function(h,o,t,j,a,r){
          h.hj=h.hj||function(){(h.hj.q=h.hj.q||[]).push(arguments)};
          h._hjSettings={hjid:_hj_country_ids[_hj_ctry],hjsv:6};
          a=o.getElementsByTagName('head')[0];
          r=o.createElement('script');r.async=1;
          r.src=t+h._hjSettings.hjid+j+h._hjSettings.hjsv;
          a.appendChild(r);
          })(window,document,'https://static.hotjar.com/c/hotjar-','.js?sv=');
      }
    
    
        } catch (e) {}
    
      });</script><script>_satellite["__runScript2"](function(event, target) {
      try{
    
      if(/no/i.test(_satellite.getVar("Environment:CountryCode"))){
      (function(win, doc, sdk_url){
        if(win.snaptr) return;
        var tr=win.snaptr=function(){
        tr.handleRequest? tr.handleRequest.apply(tr, arguments):tr.queue.push(arguments);
      };
        tr.queue = [];
        var s='script';
        var new_script_section=doc.createElement(s);
        new_script_section.async=!0;
        new_script_section.src=sdk_url;
        var insert_pos=doc.getElementsByTagName(s)[0];
        insert_pos.parentNode.insertBefore(new_script_section, insert_pos);
      })(window, document, 'https://sc-static.net/scevent.min.js');
       snaptr('init','d3df95e4-c2a5-49f3-91ea-1b91fb1a53af')
      }
    
      } catch (e) {}
      });</script><script>_satellite["__runScript3"](function(event, target) {
      try {
        window.dataLayer = window.dataLayer || [];
        window.gtag = function() {
              dataLayer.push(arguments);
          }
          var country_id = {
          no: "UA-57600485-7",
          dk: "UA-57600485-4",
          se: "DC-8313372",
          fi: "AW-797670288",
          jp: "AW-714777410"
          }
          //This should be reworked and generalized, not all pages have the countrycode as top level domain, added else on line 24 please refactor (KN 2019-08-01)
          var pos = document.location.hostname.split(".").length - 1;
          var cc = document.location.hostname.split(".")[pos];
          if (country_id.hasOwnProperty(cc)) {
            if (!document.getElementById('google-analytics-gtag-js')) {
          var script = document.createElement('script');
          script.src = "https://www.googletagmanager.com/gtag/js?id="+country_id[cc];
          script.async = true;
          script.id = "google-analytics-gtag-js"
          document.head.appendChild(script);
          }
          }
          else {
            if (country_id.hasOwnProperty(_satellite.getVar("Environment:CountryCode"))) {
          if (!document.getElementById('google-analytics-gtag-js')) {
            var script = document.createElement('script');
            script.src = "https://www.googletagmanager.com/gtag/js?id="+country_id[_satellite.getVar("Environment:CountryCode")];
            script.async = true;
            script.id = "google-analytics-gtag-js"
            document.head.appendChild(script);
          }
            }
          }
      } catch (e) {}
    
      /////////////////////MSA Nordics Google organic 20200602
      try{
          var cc = _satellite.getVar("Environment:CountryCode")
          if (/no|dk|se|fi/i.test(cc)){
    
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
    
          gtag('config', 'DC-9232428', {
          'dc_natural_search': {
          'exclusion_parameters': ['gclid\x3d*'],
    
                  'engines': {
                  'yahoo': '468297265;273992205;x',
                  'google': '468296951;273980697;k',
                  'aol': '468307456;273972811;s',
                  'ask': '468306601;273972808;p',
                  'msn': '468291560;273653897;a'
                  }
    
          }
    
          })
      }
      } catch (e) {}
      });</script><script>_satellite["__runScript4"](function(event, target) {
      //// Script load
    
      if (!document.getElementById("userreport-launcher-script")) {
        var script = document.createElement("script");
       script.id = "userreport-launcher-script";
        script.src = "https://sak.userreport.com/discovery/launcher.js";
        script.async = true;
        document.head.appendChild(script);
      }
      });</script><iframe sandbox="allow-scripts allow-same-origin" title="Adobe ID Syncing iFrame" id="destination_publishing_iframe_discovery_0" name="destination_publishing_iframe_discovery_0_name" src="https://discovery.demdex.net/dest5.html?d_nsid=0#https%3A%2F%2Fwww.dplay.no" class="aamIframeLoaded" style="display: none; width: 0px; height: 0px;"></iframe></body></html>
    

Conclusion

The website is JavaScript-based, so you need to wait for the desired WebElement to render within the DOM tree before collecting page_source.
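
As a rough sketch of that idea, the helper below polls for one specific element instead of sleeping for a fixed ten seconds, and returns the page source only once the element exists. The name page_source_when_rendered, its timeout and poll parameters, and the selector in the usage comment are illustrative assumptions, not part of the original answer. It relies only on Selenium's find_elements(by, value) method and page_source attribute.

```python
import time


def page_source_when_rendered(driver, css_selector, timeout=15.0, poll=0.5):
    """Return driver.page_source once css_selector matches an element.

    `driver` is any object exposing Selenium's find_elements(by, value)
    and page_source, for example a webdriver.Chrome() instance.
    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        if driver.find_elements("css selector", css_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{css_selector!r} not found within {timeout} s")
        time.sleep(poll)


# Hypothetical usage with a real browser (selector chosen for illustration,
# based on the channel links visible in the console output above):
#
#     from selenium import webdriver
#     driver = webdriver.Chrome()
#     driver.get('https://www.dplay.no/kanaler/')
#     print(page_source_when_rendered(driver, 'a[href^="/kanaler/"]'))
```

Selenium ships the same pattern as WebDriverWait combined with expected_conditions.presence_of_element_located, which is usually the better choice in real code; the hand-rolled loop is shown here only to make the polling explicit.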


References

You can find a couple of relevant discussions in:

  • Is there a way with python-selenium to wait until all elements of a page has loaded?
  • Do we have any generic function to check if page has completely loaded in Selenium


Source: https://stackoverflow.com/questions/62740312/scraping-content-from-a-dynamic-webpage-with-selenium-returns-wrong-content
