How would you give Chrome's version of a webpage to python?

Question

I'm trying to make it easy for users to input numbers from a web page. The easiest thing I can imagine would be for them to provide a url and an xpath associated with that number. My code could then go grab the numbers. The concept of an xpath isn't well-known (to non-coders), but it's trivial to find an xpath using Chrome's Inspect and Developer tools. So that's great.

The problem is that an xpath copied from Chrome or Firefox won't always work in an HTML parser, as explained here: Why does this xpath fail using lxml in python?

Basically, browsers will change the source into a more technically correct form and then they will show this changed form to the user and base their xpaths on that form.
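A common instance of this is a table: browsers insert a tbody element into the DOM even when the page source omits it, so a copied xpath contains a /tbody step that the raw source cannot match. The following is only a small sketch of that mismatch. The two markup strings and the 'prices' id are made up for illustration, and the standard library's ElementTree stands in for lxml so the example has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source: a table written without <tbody>, as many pages are.
RAW_SOURCE = (
    "<html><body><table id='prices'>"
    "<tr><td>42</td></tr>"
    "</table></body></html>"
)

# Hypothetical browser rendition of the same page: <tbody> has been inserted.
BROWSER_DOM = (
    "<html><body><table id='prices'><tbody>"
    "<tr><td>42</td></tr>"
    "</tbody></table></body></html>"
)

# A path in the style of a browser-copied xpath, including the tbody step.
BROWSER_XPATH = ".//table[@id='prices']/tbody/tr/td"


def cell_texts(markup, path):
    """Parse the markup and return the text of every element matching path."""
    root = ET.fromstring(markup)
    return [cell.text for cell in root.findall(path)]


raw_result = cell_texts(RAW_SOURCE, BROWSER_XPATH)   # no tbody in the source, so no match
dom_result = cell_texts(BROWSER_DOM, BROWSER_XPATH)  # tbody present, so the path matches

print(raw_result)
print(dom_result)
```

The same path matches nothing in the raw source and finds the cell in the browser's version, which is why the parser needs to be fed the browser's rendition.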

This problem could be solved if there were an automatic way for your code to see not the page source, but Chrome's rendition of the page source. Is there an efficient, automatic way to do this?

One more time, more succinctly and exactly: how would I give python the altered HTML document that Chrome produces rather than the original source document to parse?

Answer 1:

The only way I see is to actually run a web engine...

With QtWebKit QWebFrame you can use setHtml, and toHtml will return the source code adapted by WebKit...

Obviously this is a big dependency, but just installing PySide will get you everything that's needed.

So this turned out to be a lot dirtier than I expected, at least the part that's needed to isolate Qt from other code. Using setHtml doesn't seem to let you use toHtml immediately; some asynchronous loading must happen...

It would probably make a lot more sense to look for some simpler WebKit bindings.

So, load_source both downloads the data from a URL and returns the HTML after modification by WebKit. It wraps Qt's event loop with its asynchronous events, and is a blocking function.

setUrl here can be replaced with setHtml, if you want to do the download separately.

from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings

qapp = QApplication([])

def load_source(url):
    page = QWebPage()
    page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
    page.mainFrame().setUrl(QUrl(url))

    class State(QObject):
        src = None
        finished = False

        @Slot()
        def loaded(self, success=True):
            self.finished = True
            if self.src is None:
                self.src = page.mainFrame().toHtml()
    state = State()

    # Optional; reacts to DOM ready, which happens before a full load
    def js():
        page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
        page.mainFrame().evaluateJavaScript('''
            document.addEventListener('DOMContentLoaded', qstate$.loaded);
        ''')
    page.mainFrame().javaScriptWindowObjectCleared.connect(js)

    page.mainFrame().loadFinished.connect(state.loaded)

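    # Pump Qt's event loop by hand until one of the handlers above has fired;
    # this is what turns the asynchronous load into a blocking call.
    while not state.finished:
        qapp.processEvents()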

    return state.src

Demonstration using the example from the linked question. Now it actually works...

from lxml import html

url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

src = load_source(url)

tree = html.fromstring(src)
text = tree.xpath(xpath)

Answer 2:

Use Selenium. https://selenium-python.readthedocs.org

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://example.com')
html_source = browser.page_source

Then you can parse html_source (the source as rendered by Chrome) with lxml.
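Putting the two steps together, a minimal sketch might look like the following. The helper names chrome_page_source and xpath_matches are made up for this example, and the headless flag is an assumption about how you want to run Chrome. It also assumes that Selenium, a matching chromedriver, and lxml are installed.

```python
from lxml import html


def chrome_page_source(url):
    """Return the HTML as serialized by Chrome after it has loaded the page.

    Hypothetical helper. Selenium is imported here so that the parsing
    helper below can be used on its own without Selenium installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # assumption: no visible window needed
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always shut the browser down, even on errors


def xpath_matches(html_source, xpath):
    """Parse browser-serialized HTML with lxml and evaluate the xpath on it."""
    tree = html.fromstring(html_source)
    return tree.xpath(xpath)


# Example use (needs Chrome and network access, so it is left commented out):
# src = chrome_page_source('http://example.com')
# print(xpath_matches(src, '//h1/text()'))
```

Because page_source reflects the browser's DOM, an xpath copied from Chrome's developer tools, including any tbody steps, should line up with what lxml sees.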

Source: https://stackoverflow.com/questions/27390108/how-would-you-give-chromes-version-of-a-webpage-to-python

Tags

python

google-chrome

web-scraping

lxml