Extracting data from Web

久未见 提交于 2019-12-06 04:32:21
zenpoy

I believe the problem is that these values are rendered through a javascript code which your browser runs and urllib doesn't - You should use a library that can execute javascript code.

I just googled crawler python javascript and I got the some stackoverflow questions and answers which recommends the use of selenium or webkit. You can use those libraries through scrapy. Here are two snippets:

Rendered/interactive javascript with gtk/webkit/jswebkit

Rendered Javascript Crawler With Scrapy and Selenium RC

I have been working on this same exact issue. I have been introduced to Beautifulsoup and later since learned about Scrapy. Beautifulsoup is very easy to use, especially if you're new at this. Scrapy apparently has more "features", but I believe you can accomplish your needs with Beautifulsoup.

I had the same issues about not being able to gain access to a website that loaded information through Javascript and thankfully Selenium was the savior.

A great introduction to Selenium can be found here.

Install: pip install selenium

Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(self, domain) and send the hyperlink that you are trying to parse it will return the source code of the fully loaded page when you can then put into Beautifulsoup and find the information you're looking for!

Ex:

airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'

soup = BeautifulSoup(SeleniumWebScraper.retrieve_source_code(airfare_url))

Now you can parse soup like you normally would with Beautifulsoup.

I hope that helps you!

from selenium import webdriver
import requests

class SeleniumWebScraper():

    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait 
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        self.driver.close()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            self.driver = webdriver.Firefox()
        # The driver.get method will navigate to a page given by the URL.
        #  WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        #  before returning control to your test or script.
        # It's worth nothing that if your page uses a lot of AJAX on load then
        #  WebDriver may not know when it has completely loaded.
        self.driver.get(domain)

        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!