Web scraping dynamic content with Python

暖寄归人 2020-11-27 16:44

I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?q

3 answers
  • 2020-11-27 17:17

    Scraping dynamic content takes more than a simple scraper: you need a full-fledged headless browser.

    dhamaniasad/HeadlessBrowsers ("A list of (almost) all headless web browsers in existence") is the most complete list of these that I've seen; it notes which languages each one has bindings for.

    (Note that more than a few of the listed projects are abandoned!)

  • 2020-11-27 17:20

    A very similar question was asked earlier. The answers there recommend Selenium, originally a testing environment for web apps.

    I usually use Chrome's Developer Tools, which IMHO already give more detail than Firefox's.
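
As a rough illustration of the Selenium approach mentioned in this answer, the sketch below loads a page in headless Chrome and polls until elements with a given class appear, then reads their `title` attributes. The helper names `collect_titles` and `scrape`, the class name `inline-text-org` (taken from another answer in this thread), and the timeout values are assumptions for illustration, not something the answer specifies. Selenium also ships its own `WebDriverWait` helper for waiting; the loop is written out here only to make the waiting step explicit.

```python
import time


def collect_titles(driver, class_name, timeout=10.0, poll=0.25):
    """Poll until elements with `class_name` exist, then return their titles.

    `driver` is any object exposing Selenium's find_elements(by, value).
    """
    deadline = time.monotonic() + timeout
    while True:
        # "class name" is the locator string behind Selenium's By.CLASS_NAME.
        elements = driver.find_elements("class name", class_name)
        if elements:
            return [el.get_attribute("title") for el in elements]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no .{class_name} elements after {timeout}s")
        time.sleep(poll)


def scrape(url, class_name="inline-text-org"):
    """Hypothetical entry point: render `url` in headless Chrome, return titles."""
    # Imported here so collect_titles() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # load the page so its JavaScript can run
        return collect_titles(driver, class_name)
    finally:
        driver.quit()  # always shut the browser down
```

Calling `scrape(...)` with the search URL would need Chrome and a matching driver on the machine; dynamically loaded content may also need a longer timeout.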

  • 2020-11-27 17:30

    Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

    If you run the following query in the Chrome console, you'll see it returns everything you want.

    document.getElementsByClassName('inline-text-org');
    

    Returns

    [<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
     <div class="inline-text-org" title="University of California Irvine">University of California ...</div>
      etc...
    

    You can run JavaScript from Python against a real, live DOM using ghost.py.

    This is really cool:

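    from ghost import Ghost

    # Early ghost.py API: open() and evaluate() are called on the Ghost object
    # itself; later releases move these calls onto a session (ghost.start()).
    ghost = Ghost()
    # Load the page in the embedded WebKit browser so its JavaScript runs.
    page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
    # Evaluate a JavaScript expression in the page and hand the result to Python.
    result, resources = ghost.evaluate(
        "document.getElementsByClassName('inline-text-org');")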
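
Once the rendered HTML is available as a string, the organization names can also be pulled out on the Python side. The sketch below uses only the standard library and assumes markup shaped like the console output shown earlier in this answer; `OrgTitleParser` and `extract_org_titles` are hypothetical names. It reads the `title` attribute rather than the element text, since the visible text may be truncated while the attribute carries the full name.

```python
from html.parser import HTMLParser


class OrgTitleParser(HTMLParser):
    """Collect the title attribute of every <div class="inline-text-org">."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr.get("class") or "").split()
        if tag == "div" and "inline-text-org" in classes and attr.get("title"):
            self.titles.append(attr["title"])


def extract_org_titles(html):
    """Return the full organization names found in a rendered HTML string."""
    parser = OrgTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Sample input modelled on the console output quoted in this answer.
sample = (
    '<div class="inline-text-org" title="University of Manchester">'
    'University of Manchester</div>'
    '<div class="inline-text-org" title="University of California Irvine">'
    'University of California ...</div>'
)
print(extract_org_titles(sample))
# prints ['University of Manchester', 'University of California Irvine']
```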
    