Question
I am building a spider and I am using Beautiful Soup to parse the content of a particular URL. Some sites use JavaScript to show dynamic content that only appears after some action happens (a click, or a timeout). Beautiful Soup parses only the static content, i.e. the page as it exists before any JavaScript has run, but I want the content after the JavaScript runs. Is there any way to do this?
I can think of one way: grab the URL, open it in a browser so the JavaScript runs, and then pass the resulting page to Beautiful Soup, which would then see the dynamic content the JavaScript produced. However, if I am crawling millions of links this solution is not practical. Is there a built-in module available that can generate the dynamic content of an HTML page beforehand?
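To make the questioner's problem concrete, here is a minimal sketch using only the standard library (not Beautiful Soup, and with made-up page markup): a static parser treats the `<script>` body as plain text, so the placeholder element it would fill stays empty.

```python
# A static parser only sees the HTML as served; the <script> body is just
# text to it, so the placeholder div is still empty. The page markup is a
# hypothetical example, and the parser is the stdlib HTMLParser rather
# than Beautiful Soup.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <div id="result"></div>
  <script>
    document.getElementById('result').textContent = 'filled in by JavaScript';
  </script>
</body></html>
"""

class DivCollector(HTMLParser):
    """Records the text found inside <div id="result">."""
    def __init__(self):
        super().__init__()
        self.in_div = False
        self.div_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("id", "result") in attrs:
            self.in_div = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_div = False

    def handle_data(self, data):
        if self.in_div:
            self.div_text += data

parser = DivCollector()
parser.feed(PAGE)
print(repr(parser.div_text.strip()))  # prints '' -- the script never ran
```

Only a JavaScript-capable renderer would ever produce the "filled in by JavaScript" text, which is exactly the gap the answers below address.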
Answer 1:
Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page in a browser engine. Luckily, there are ways to automate this in Python.
The method I've had the most success with is the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify executing JavaScript in the page context.
Another option is PyQt4's QtWebKit module, which I've only used for experimentation.
Here is a working example of using pywebkitgtk and jswebkit together to extract data from a WebKit-rendered page. In a production environment you'd want to run several of these processors in parallel, each rendering into its own instance of the X virtual framebuffer (Xvfb).
import lxml.html

import pygtk
pygtk.require('2.0')  # must be called before importing gtk
import gtk
import jswebkit
import webkit

def load_finished(view, frame):
    # called when the document finishes loading
    if frame != view.get_main_frame():
        return
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
gtk.gdk.threads_init()

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open the Google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break
Example output:
http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]
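The parallel fan-out mentioned above (several renderer processes, each with its own Xvfb instance) can be sketched with the standard multiprocessing module. Note that `render()` here is only a stub standing in for the WebKit processor; in a real crawler it would drive a WebKit view as in the example above.

```python
# Sketch of the parallel set-up described in the answer: a pool of worker
# processes, each of which would own one WebKit view rendering into its own
# Xvfb display. render() is a stub; a real worker would load the URL in a
# browser engine, wait for load-finished, and return the post-JavaScript DOM.
import multiprocessing

def render(url):
    # Placeholder for: attach to an Xvfb display, load `url` in a WebKit
    # view, run the page's JavaScript, return document.body.innerHTML.
    return "<html>rendered %s</html>" % url

def render_all(urls, workers=4):
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(render, urls, chunksize=len(urls) // workers or 1)

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(8)]
    pages = render_all(urls)
    print(len(pages))  # 8
```

Because each worker is a separate OS process, each one can safely own its own GTK main loop and display, which is not possible with threads in a single process.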
Source: https://stackoverflow.com/questions/5738024/how-to-parse-java-script-containsdynamic-on-web-pagehtml-using-python