Question
I'm trying to make it easy for users to input numbers from a web page. The easiest thing I can imagine would be for them to provide a url and an xpath associated with that number. My code could then go grab the numbers. The concept of an xpath isn't well-known (to non-coders), but it's trivial to find an xpath using Chrome's Inspect and Developer tools. So that's great.
The problem is that an xpath copied from Chrome or Firefox won't always work in an HTML parser, as explained in the question "Why does this xpath fail using lxml in python?"
Basically, browsers normalize the source into a more technically correct form (for example, by adding missing tbody elements to tables), then show this changed form to the user and base their xpaths on it.
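To make the mismatch concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, made only so the snippet runs without extra dependencies), and the table markup is made up. The parser keeps the tree exactly as written, so a browser-style path that goes through tbody finds nothing:

```python
import xml.etree.ElementTree as ET

# Made-up markup: the table has no <tbody>, as in many real pages.
SOURCE = "<html><body><table id='prices'><tr><td>42</td></tr></table></body></html>"

root = ET.fromstring(SOURCE)
table = root.find(".//table[@id='prices']")

# A browser would report table/tbody/tr/td, because it adds <tbody> to its DOM.
browser_style = table.findall("./tbody/tr/td")
# The raw source only contains table/tr/td.
source_style = table.findall("./tr/td")

print(len(browser_style))                # prints 0
print([td.text for td in source_style])  # prints ['42']
```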
This problem could be repaired if there were an automatic way for your code to see not the page source, but Chrome's rendition of the page source. Is there an efficient, automatic way to do this?
One more time, more succinctly and exactly: how would I give python the altered HTML document that Chrome produces rather than the original source document to parse?
Answer 1:
The only way I see is to actually run a web engine...
With QtWebKit's QWebFrame you can use setHtml, and toHtml will return the source code adapted by WebKit...
Obviously this is a big dependency, but just installing PySide will get you everything that's needed.
So this turned out to be a lot dirtier than I expected, at least the part that's needed to isolate Qt from other code. Using setHtml doesn't seem to let you use toHtml immediately; some asynchronous loading must happen...
It would probably make a lot more sense to look for some simpler WebKit bindings.
So, load_source both downloads the data from a URL and returns the HTML after modification by WebKit. It wraps Qt's event loop with its asynchronous events, and is a blocking function. setUrl here can be replaced with setHtml, if you want to do the download separately.
from PySide.QtCore import QObject, QUrl, Slot
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage, QWebSettings

# One QApplication for the whole process
qapp = QApplication([])

def load_source(url):
    page = QWebPage()
    # Images are not needed to get the HTML
    page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
    page.mainFrame().setUrl(QUrl(url))

    class State(QObject):
        src = None
        finished = False

        @Slot()
        def loaded(self, success=True):
            self.finished = True
            # Keep the first snapshot if both signals fire
            if self.src is None:
                self.src = page.mainFrame().toHtml()
    state = State()

    # Optional; reacts to DOM ready, which happens before a full load
    def js():
        page.mainFrame().addToJavaScriptWindowObject('qstate$', state)
        page.mainFrame().evaluateJavaScript('''
            document.addEventListener('DOMContentLoaded', qstate$.loaded);
        ''')
    page.mainFrame().javaScriptWindowObjectCleared.connect(js)

    page.mainFrame().loadFinished.connect(state.loaded)

    # Block here, pumping Qt events until the page reports it is loaded
    while not state.finished:
        qapp.processEvents()

    return state.src
Demonstration using the example from the linked question. Now it actually works...
from lxml import html
url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'
src = load_source(url)
tree = html.fromstring(src)
text = tree.xpath(xpath)
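Since the goal in the question is to end up with a number, the list of strings returned by tree.xpath(...) still has to be converted. A minimal sketch of that last step follows; first_number is a hypothetical helper, not part of the answer above, and it assumes commas are thousands separators and the dot is the decimal mark:

```python
import re

# Hypothetical helper: matches digits with optional thousands commas and decimals.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")

def first_number(texts):
    """Return the first number found in a list of strings, or None."""
    for text in texts:
        match = _NUMBER.search(text)
        if match:
            # Drop thousands separators before converting.
            return float(match.group().replace(",", ""))
    return None
```

With the demonstration above, this would be called as first_number(text).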
Answer 2:
Use Selenium. https://selenium-python.readthedocs.org
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://example.com')
html_source = browser.page_source
Then you can parse html_source (Chrome browser source) with lxml.
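The snippet above opens a visible browser window and never closes it. A possible wrapper is sketched below; rendered_source is a hypothetical name, and it assumes a Selenium version whose webdriver.Chrome accepts an options argument, plus a working Chrome and driver install. The driver can also be passed in, which makes the function usable with other browsers:

```python
def rendered_source(url, driver=None):
    """Return the browser's serialized DOM for url.

    driver is any object with get(), page_source and quit(); if omitted,
    a headless Chrome is started (needs selenium and a Chrome install).
    """
    own_driver = driver is None
    if own_driver:
        from selenium import webdriver  # third-party, imported only when needed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # no visible window
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        if own_driver:
            driver.quit()  # always close the browser we started
```

The returned string can be handed to lxml's html.fromstring as in the first answer. Note that page_source reflects the DOM at the moment it is read, so content that scripts add later may need an explicit wait first.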
Source: https://stackoverflow.com/questions/27390108/how-would-you-give-chromes-version-of-a-webpage-to-python