Tried Python BeautifulSoup and Phantom JS: STILL can't scrape websites

离开以前 2020-12-10 09:39

You may have seen my desperate frustrations on here over the past few weeks. I've been scraping some wait-time data and am still unable to grab data from these two sites.

1 answer
  •  抹茶落季
    2020-12-10 09:48

    The problem you're facing is that the elements are created by JS, and it might take some time to load them. You need a scraper which handles JS, and can wait until the required elements are created.
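
    The "wait until the required elements are created" idea can be sketched as a plain polling loop. This is only an illustration in Python 3 using the standard library: `wait_until` and `FakePage` are hypothetical names that are not part of any scraping library, and `FakePage` merely simulates a page whose span shows up after a delay.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a browser page: the span appears after `delay` seconds."""

    def __init__(self, delay):
        self._ready_at = time.monotonic() + delay

    def html(self):
        if time.monotonic() >= self._ready_at:
            return '<span class="ehc-er-digits">21</span>'
        return '<span id="placeholder"></span>'


page = FakePage(delay=0.3)


def rendered_html():
    # Return the markup only once the JS-created element is present.
    html = page.html()
    return html if 'ehc-er-digits' in html else None


html = wait_until(rendered_html, timeout=5.0)
print(html)
```

    A real scraper would replace `FakePage.html()` with a call that returns the current DOM of the rendered page; the timeout keeps the script from hanging if the element never appears.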

    You can use PyQt4. By adapting this recipe from webscraping.com and combining it with an HTML parser like BeautifulSoup, this is pretty easy:

    (After writing this, I found the webscraping library for Python. It might be worth a look.)

    import sys
    from bs4 import BeautifulSoup
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import * 
    
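    class Render(QWebPage):
        """Load a URL in a headless WebKit page and block until it has finished loading."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            # Call _loadFinished once the page (including its JS) has loaded.
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            # Run the Qt event loop; it returns when _loadFinished quits the app.
            self.app.exec_()

        def _loadFinished(self, result):
            # Keep the rendered frame so the caller can read its HTML.
            self.frame = self.mainFrame()
            self.app.quit()
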
    
    url = 'http://hcavirginia.com/home/'
    r = Render(url)
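    # Python 2: convert the QString returned by .toHtml() to unicode.
    soup = BeautifulSoup(unicode(r.frame.toHtml()))
    # In Python 3.x, don't unicode the output from .toHtml():
    #soup = BeautifulSoup(r.frame.toHtml())
    # Each wait time is the text of a span with the class 'ehc-er-digits'.
    nums = [int(span.text) for span in soup.find_all('span', class_='ehc-er-digits')]
    print(nums)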
    

    Output:

    [21, 23, 47, 11, 10, 8, 68, 56, 19, 15, 7]
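
    One thing to watch: calling int() on a span's text raises ValueError if a span ever holds something non-numeric. The sketch below shows a more defensive extraction. To keep it runnable without Qt or BeautifulSoup, it swaps in Python 3's standard-library html.parser, and both the `DigitSpanParser` class and the sample markup (including the "--" placeholder) are assumptions for illustration, not taken from the real site.

```python
from html.parser import HTMLParser


class DigitSpanParser(HTMLParser):
    """Collect the text of every <span> carrying the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.values = []
        self._depth = 0      # > 0 while inside a matching span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self._depth:
            self._depth += 1  # nested span inside a matching one
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.values.append(''.join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical rendered markup, as it might look after the page's JS has run.
rendered = (
    '<div>'
    '<span class="ehc-er-digits">21</span>'
    '<span class="ehc-er-digits"> 23 </span>'
    '<span class="ehc-er-digits">--</span>'
    '<span class="other">99</span>'
    '</div>'
)

parser = DigitSpanParser('ehc-er-digits')
parser.feed(rendered)

# Keep only purely numeric values instead of letting int() raise.
nums = [int(text) for text in parser.values if text.isdigit()]
print(nums)  # prints [21, 23]
```

    The same `if text.isdigit()` filter can be applied to the text of the spans returned by BeautifulSoup's find_all.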
    

    This was my original answer, using ghost.py:

    I managed to hack something together for you using ghost.py (tested on Python 2.7, ghost.py 0.1b3 and PyQt4-4 32-bit). I wouldn't recommend using this in production code, though!

    from ghost import Ghost
    from time import sleep
    
    ghost = Ghost(wait_timeout=50, download_images=False)
    page, extra_resources = ghost.open('http://hcavirginia.com/home/',
                                       headers={'User-Agent': 'Mozilla/4.0'})
    
    # Halt execution of the script until a span.ehc-er-digits is found in 
    # the document
    page, resources = ghost.wait_for_selector("span.ehc-er-digits")
    
    # It should be possible to simply evaluate
    # "document.getElementsByClassName('ehc-er-digits');" and extract the data from
    # the returned dictionary, but I didn't quite understand the
    # data structure - hence this inline javascript.
    nums, resources = ghost.evaluate(
        """
        elems = document.getElementsByClassName('ehc-er-digits');
        nums = []
        for (i = 0; i < elems.length; ++i) {
            nums[i] = elems[i].innerHTML;
        }
        nums;
        """)
    
    wt_data = [int(x) for x in nums]
    print(wt_data)
    sleep(30) # Sleep for a while to keep the script from crashing on exit. Weird issue!
    

    Some comments:

    • As you can see from my comments, I didn't quite figure out the structure of the dict returned by Ghost.evaluate("document.getElementsByClassName('ehc-er-digits');"). It's probably possible to find the information needed using such a query, though.

    • I also had some problems with the script crashing at the end. Sleeping for 30 seconds fixed the issue.
