Download html of URL with Python - but with javascript enabled

Question

I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, because the site has detected that JavaScript is not enabled.

Is there a way to download the HTML of a URL with JavaScript enabled in Python?

Answer 1:

@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically the Python one is QtWebKit and Python.

They're both headless browsers which you can control directly from JavaScript. The Python version also has a plug-in system which lets you extend the core with additional functionality should you need it.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
        return; // don't fall through to the save step
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});
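
Once the script above has written the rendered page to myfile.txt, the result can be processed from Python like any other HTML file. The following is only a minimal sketch using the standard library's html.parser; the names LinkCollector and extract_links are made up for illustration, and collecting links is just a stand-in for whatever the search results actually look like on the target page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Usage (assumes the saved file is UTF-8 encoded):
#     with open("myfile.txt", encoding="utf-8") as f:
#         links = extract_links(f.read())
```

BeautifulSoup would work just as well here; the point is only that the saved file now contains the markup produced after the page's scripts ran.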

Useful links:
API reference | How to write plugins

Answer 2:

I'd look into using the QtWebKit module in the PyQt4 library. The module lets the JS code run on the page, and once it's done you should be able to save the HTML using standard methods.

Otherwise, Selenium is the way to go. It lets you control a web browser from your Python script to pull up the page and then extract all the DOM stuff.
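
As a rough illustration of the Selenium route, here is a minimal sketch. It assumes Selenium 4 or later and a local Chrome install; the helper names make_headless_chrome and fetch_rendered_html are hypothetical and not part of Selenium. It relies only on the WebDriver methods get, execute_script, page_source and quit.

```python
import time


def make_headless_chrome():
    """Build a headless Chrome driver (assumes selenium>=4 and Chrome are installed)."""
    # Imported lazily so fetch_rendered_html can be used with any driver object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver, timeout=10.0, poll=0.25):
    """Load url in the given driver and return the DOM after scripts have run."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    # Poll until the browser reports the document as fully loaded, or give up.
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            break
        time.sleep(poll)
    # page_source is the browser's serialization of the current DOM,
    # so it includes nodes inserted by JavaScript.
    return driver.page_source


# Usage (needs a real browser, so it is left commented out):
#     driver = make_headless_chrome()
#     try:
#         html = fetch_rendered_html("https://example.com/", driver)
#     finally:
#         driver.quit()
```

The returned string can then be handed to BeautifulSoup as usual. If the search results are loaded by a later AJAX call, readyState alone may not be enough, and waiting for a specific element to appear is probably the safer choice.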

Answer 3:

Once you want JavaScript enabled, what you're asking for is very close to a browser. You can use Jython and then use HtmlUnit, which is a headless Java-based browser. It's pretty fast but not very stable (because it imitates a browser and isn't really a browser). I think the fastest and easiest way is to use Selenium (IDE or, preferably, RC). Selenium gives you the ability to control your favorite browser (Firefox, IE, Chrome, ...). Although it's meant for testing purposes, it'll probably work for you. It's stable and pretty fast (I think it's even faster than HtmlUnit).

Answer 4:

You can use htql at http://htql.net.

import time

import htql

# open a browser session and load the page
browser = htql.Browser(2)
page, url = browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default')

# give the page's JavaScript a moment to run, then fetch the updated page
time.sleep(2)
page, url = browser.getUpdatedPage()

BTW, you will need to install IRobot from http://irobotsoft.com/.

Source: https://stackoverflow.com/questions/6630214/download-html-of-url-with-python-but-with-javascript-enabled

Tags

python

screen-scraping