How to parse JavaScript-generated (dynamic) content on a web page (HTML) using Python?

Submitted by 偶尔善良 on 2019-12-08 04:34:37

Question


I am building a spider and using Beautiful Soup to parse the content of a given URL. Some sites use JavaScript to display dynamic content, which is shown to the user only after some event (a click or a timer) occurs. Beautiful Soup parses only the static content, as it exists before any JavaScript has run. I want the content as it is after the JavaScript has run. Is there any way to do this?

I can think of one way: take the URL, open it in a browser so that the JavaScript runs, and then pass the resulting page to Beautiful Soup, which can then see the content that the JavaScript produced. However, if I am crawling millions of links, this solution is not practical. Is there a built-in module that can generate the dynamic content of an HTML page beforehand?


Answer 1:


Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page via a browser engine. Luckily there are ways to automate this in Python.
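
To see why a browser engine is needed, here is a minimal sketch (Python 3, standard library only) of what a static parser sees. The HTML snippet, the `DivText` class and the `static_div_text` helper are made up for illustration and are not part of the original question or answer. The parser reads the markup but never executes the script, so the element the script would fill in stays empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the source and would only be
# filled in when a browser executes the <script>.
PAGE = """
<html><body>
<div id="out"></div>
<script>document.getElementById('out').textContent = 'loaded by JS';</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="out">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "out":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script source also arrives here, but only after the <div> has closed.
        if self.inside:
            self.text.append(data)


def static_div_text(html):
    """Return the text of <div id="out"> as a static parser sees it."""
    parser = DivText()
    parser.feed(html)
    return "".join(parser.text)


print(repr(static_div_text(PAGE)))  # prints ''
```

Any parser that does not run scripts, Beautiful Soup included, is in the same position, which is why the rendering step has to happen in a browser engine first.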

The method I've had the most success with is to use the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify execution of JavaScript in the page context.

Another option is PyQt4's QtWebKit module, which I've only used for experimentation.

Here is a working example of using pywebkitgtk and jswebkit together to extract data from a WebKit-rendered page. In a production environment you'll want to run several of these processors in parallel, each rendering to its own instance of the X virtual framebuffer (Xvfb).

import pygtk
pygtk.require('2.0')  # select the GTK+ 2.x bindings; must precede "import gtk"

import gtk
import jswebkit
import lxml.html
import webkit

def load_finished(view, frame):
    # called when the document finishes loading
    if frame != view.get_main_frame():
        return
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
pygtk.require20()
gtk.gdk.threads_init()

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break

Example output:

http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]
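
The note above about running several processors in parallel, each on its own Xvfb instance, could be sketched roughly as follows. This is only an illustration under stated assumptions: `render_worker.py` is a hypothetical script holding the pywebkitgtk code, `python2` is an assumed interpreter name, and the display numbers starting at 99 are arbitrary. The sketch starts one Xvfb server per worker and points each worker at it through the `DISPLAY` environment variable.

```python
import os
import subprocess


def xvfb_command(display, width=1024, height=768, depth=24):
    """Build the argv for one Xvfb server on X display number `display`."""
    return ["Xvfb", ":%d" % display, "-screen", "0",
            "%dx%dx%d" % (width, height, depth)]


def worker_env(display, base_env=None):
    """Copy the environment and point DISPLAY at the given Xvfb server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display
    return env


def launch_workers(count, first_display=99,
                   interpreter="python2", script="render_worker.py"):
    """Start `count` Xvfb servers and one renderer process per server.

    `interpreter` and `script` are assumptions for illustration. A real
    setup should also wait until each Xvfb server accepts connections
    before starting its worker; that step is omitted here.
    """
    procs = []
    for display in range(first_display, first_display + count):
        xvfb = subprocess.Popen(xvfb_command(display))
        worker = subprocess.Popen([interpreter, script],
                                  env=worker_env(display))
        procs.append((xvfb, worker))
    return procs
```

Calling `launch_workers(4)` would start four Xvfb servers on displays `:99` to `:102`, each paired with one renderer process. The caller is responsible for terminating both processes of each pair when the crawl finishes.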


Source: https://stackoverflow.com/questions/5738024/how-to-parse-java-script-containsdynamic-on-web-pagehtml-using-python
