Download HTML of URL with Python - but with JavaScript enabled

Submitted by 天涯浪子 on 2019-12-04 21:58:46

@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one that's already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically, the Python one is QtWebKit plus Python.

They're both headless browsers which you can control directly from JavaScript. The Python version also has a plug-in system that lets you extend the core with additional functionality, should you need it.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
        // return so the save step below is not attempted after a failed load
        return;
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});

Useful links:
API reference | How to write plugins

I'd look into using the QtWebKit module in the PyQt4 library. The module lets the JS code on the page run, and once it's done, I believe you can save the HTML using standard methods.

Otherwise, Selenium is the way to go. It lets you control a web browser from your Python script to pull up the page and then extract all the DOM stuff.
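
A minimal sketch of the Selenium approach, assuming the `selenium` package plus Firefox and its driver are installed. The helper names `fetch_rendered_html` and `save_rendered_html`, the fixed two-second wait, and the output path are illustrative assumptions, not part of the original answer:

```python
import time


def fetch_rendered_html(driver, url, settle_seconds=2.0):
    """Load `url` in a WebDriver-like object and return the page source
    as the browser sees it after scripts have had time to run."""
    driver.get(url)
    # crude fixed wait (an assumption); pages that load data slowly may need longer
    time.sleep(settle_seconds)
    return driver.page_source


def save_rendered_html(url, path, settle_seconds=2.0):
    """Open Firefox through Selenium, fetch the rendered page, write it to `path`."""
    # imported here so fetch_rendered_html stays usable without Selenium installed
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        html = fetch_rendered_html(driver, url, settle_seconds)
    finally:
        # always close the browser, even if loading the page failed
        driver.quit()
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Calling `save_rendered_html('http://example.com', 'myfile.html')` would then write the rendered page to disk. The fixed sleep is the simplest possible wait; waiting for a specific element to appear is usually more reliable.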

Once you want JavaScript enabled, what you're asking for is very close to a browser. You can use Jython and then use HtmlUnit, which is a headless Java-based browser. It's pretty fast but not very stable (because it imitates a browser and isn't really a browser). I think the fastest and easiest way is to use Selenium (IDE or preferably RC). Selenium gives you the ability to control your favorite browser (FF, IE, Chrome, ...). Although it's meant for testing purposes, it'll probably work for you. It's stable and pretty fast (I think it's even faster than HtmlUnit).

You can use htql at http://htql.net.

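import time

import htql

# open a browser session (the argument 2 is kept from the original answer)
browser = htql.Browser(2)
page, url = browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default')

# give the page's JavaScript a moment to run, then fetch the updated page
time.sleep(2)
page, url = browser.getUpdatedPage()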

BTW, you will need to install IRobot from http://irobotsoft.com/
