Using Python requests.get to parse HTML that does not load at once

爱一瞬间的悲伤 2020-12-21 05:48

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in

2 Answers
  • 2020-12-21 06:15

    You are not correct in your assessment of the problem.

    You can check the results and see that there's a </html> right near the end. That means you've got the whole page.

    And the response's .text attribute always holds the whole page; if you want to stream it a bit at a time, you have to do so explicitly (for example, by passing stream=True to requests.get).

    Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
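One way to confirm this without reading the whole response by eye is to count the tags in the returned text. The following is a minimal sketch using only the standard library; the `sample` string is a made-up stand-in for whatever `requests.get(url).text` returns, and `TagCounter` and `count_tags` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name is opened in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html_text):
    """Return a dict mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the text returned by requests.get(url).text:
# a page shell that loads a script but contains no table of its own.
sample = """<html><head><script src="app.js"></script></head>
<body><div id="content"></div></body></html>"""

counts = count_tags(sample)
print(counts.get("table", 0), counts.get("script", 0))  # prints: 0 1
```

A complete document with zero `table` tags and one or more `script` tags is a strong hint that the table is built in the browser rather than sent by the server.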

    There are a number of general solutions to that. For example:

    • Use selenium or similar to drive an actual browser to download the page.
    • Manually work out what the JavaScript code does and do equivalent work in Python.
    • Run a headless JavaScript interpreter against a DOM that you've built up.
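The second option often comes down to finding the request that the page's JavaScript makes (visible in the browser's network tab) and calling it directly. The sketch below assumes such a request returns JSON; the URL, the payload shape, and the names `fetch_json` and `is_available` are all hypothetical and would need to be replaced with whatever the real site uses.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint; the real one has to be found in the browser's network tab.
API_URL = "https://example.com/api/products/12345"


def fetch_json(url, timeout=10):
    """Download url and decode the body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def is_available(payload, size):
    """Return True if the given size is marked as in stock.

    Assumes a made-up payload shape:
    {"sizes": [{"name": "M", "inStock": true}, ...]}
    """
    for entry in payload.get("sizes", []):
        if entry.get("name") == size:
            return bool(entry.get("inStock"))
    return False
```

With a real endpoint in place, the periodic check would be something like `is_available(fetch_json(API_URL), "M")`, which avoids running a browser at all.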
  • 2020-12-21 06:22

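    The page uses JavaScript to load the table, and that script has not run when requests gets the HTML. You are getting all of the HTML, just not the part that is generated by JavaScript. You could use selenium combined with PhantomJS for headless browsing to get the rendered HTML:

    from selenium import webdriver

    # Note: webdriver.PhantomJS was removed in Selenium 4, so this needs
    # Selenium 3.x and a phantomjs binary on PATH.
    browser = webdriver.PhantomJS()
    # Load the page and let its JavaScript run.
    browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
    html = browser.page_source
    print(html)
    browser.quit()  # shut down the PhantomJS process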
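The same idea can be sketched with headless Chrome for setups where PhantomJS is not available. This is a swapped-in browser, not what the answer above uses. It assumes Selenium 4 with a recent Chrome installed; `fetch_rendered_html` is a hypothetical helper name, and waiting for a `table` element is an assumption about what the page eventually renders.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported inside the function so the module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome release
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Wait until some <table> exists in the DOM; raises TimeoutException otherwise.
        WebDriverWait(browser, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return browser.page_source
    finally:
        browser.quit()  # always release the browser process
```

The returned string can then be passed to lxml.html and queried with xpath in the same way as the text from requests.get.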
    