Mechanize and Javascript

断了今生、忘了曾经 提交于 2019-11-26 22:14:04

I've played with this new alternative to Mechanize (which I love) called Phantom JS.

It is a full web kit browser like Safari or Chrome but is headless and scriptable. You script it with javascript, not python (as far as I know at least).

There are some example scripts to get you started. It's a lot like using Firebug. I've only spent a few min using it but I found I was quite productive right from the start.

From http://wwwsearch.sourceforge.net/mechanize/faq.html#general

If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.

Figure out what the JavaScript is doing and emulate it in your Python code: for example, by manually adding cookies to your CookieJar instance, calling methods on HTMLForms, calling urlopen, etc. See above re forms.

Use Java’s HtmlUnit or HttpUnit from Jython, since they know some JavaScript.

Instead of using mechanize, automate a browser instead. For example use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.

Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla’s JavaScript interpreter, for instance). This is what HtmlUnit and httpunit do. I did a spike along these lines some years ago, but I think it would (still) be quite a lot of work to do well.

Basically if you want something that deals with javascript then you need a real javascript engine, these invariably involve automating a real browser (I'm including headless ones in this).

Java’s HtmlUnit doesn't do a very good job as it doesn't use a javascript engine from an actual browser. Phantom JS sounds ideal (as newz2000 points out) however I find that when manipulating pages with javascript it can be very difficult to debug your script if you can't actually see the page you're dealing with.

This leads to solutions such as Selenium Webdriver which has a full python API to automate various browsers, however you must run a java jar and it actually launches the browser, so not the pure python solution you're after (but I think this is as close as you can get).

You can use Selenium with Python. You can then scrape JavaScript-generated content as well as manipulate the page with additional JavaScript (as well as Python).

# In your virtualenv: pip install selenium
from selenium import webdriver

# Launch Firefox GUI
browser = webdriver.Firefox()

# Alternatively, you can drive PhantomJS without a GUI
# With Node.js installed: `npm install -g phantomjs`
# browser = webdriver.PhantomJS()

# Fetch a webpage
browser.get('http://example.com')

# If you need the whole HTML document
# just like inspecting the rendered page with the console
html = browser.page_source

# Get an element, even if it was created with JS
button = browser.find_element_by_css_selector('div.some-class > \
                                               input.the-submit-button')

# Click on something
button.click()

# Execute some JavaScript (assumes jQuery is loaded on the page)
browser.execute_script("$('html, body').animate({ scrollTop: 500 }, 50);")

You can run the code in a Python REPL and use autocomplete to discover the methods available on browser or whatever element you have selected. Or do something like print(dir(browser)) to see what is available.

An example how to use PyV8, to run JS on a DOM with python can be found here:

https://github.com/buffer/thug

This should be fairly easy to make it run together with mechanize.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!