Download html of URL with Python - but with javascript enabled

Submitted by 限于喜欢 on 2019-12-22 01:31:34

Question


I am trying to download this page so that I can scrape the search results. However, when I download the page and process it with BeautifulSoup, parts of the page (for example, the search results) are missing because the site has detected that JavaScript is not enabled.

Is there a way to download the HTML of a URL with JavaScript enabled in Python?


Answer 1:


@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one that's already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically, the Python one is built on QtWebKit and Python.

They're both headless browsers that you control directly from JavaScript. The Python version also has a plug-in system that lets you extend the core with additional functionality should you need it.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
        return; // don't fall through to the save step
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});

Useful links:
API reference | How to write plugins
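
Since the question asks for the HTML on the Python side, one way to tie the script above back into Python is to launch the headless browser as a subprocess and read the file it saves. This is only a minimal sketch: the helper name `render_with_headless_script` is hypothetical, and the command you pass in is an assumption that depends on how PhantomJS or PyPhantomJS is installed on your machine.

```python
import subprocess
from pathlib import Path


def render_with_headless_script(command, output_path, timeout=60):
    """Run a headless-browser command, then return the HTML it saved.

    command     -- argument list that launches the browser with its script
                   (assumption: the script writes the page to output_path)
    output_path -- file the script saves the rendered page to
    """
    # check=True raises CalledProcessError if the browser exits non-zero,
    # which is what the script above does via phantom.exit(1) on a failed load.
    subprocess.run(command, check=True, timeout=timeout)
    return Path(output_path).read_text(encoding="utf-8")
```

With the script above saved as, say, save_page.js, a call might look like `render_with_headless_script(["phantomjs", "save_page.js"], "myfile.txt")`; substitute whatever command starts your PyPhantomJS install. The returned string can then be handed to BeautifulSoup as usual.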




Answer 2:


I'd look into using the QtWebKit module in the PyQt4 library. The module lets the page's JavaScript run, and once it has finished you should be able to save the resulting HTML using standard methods, I believe.

Otherwise, Selenium is the way to go. It lets you control a web browser from your Python script, pull up the page, and then extract whatever you need from the DOM.
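
A minimal sketch of the Selenium route, assuming the selenium package and Firefox are installed. The function names and the fixed two-second pause are illustrative choices, not anything the answer prescribes; a site that loads results slowly may need a longer or smarter wait.

```python
import time


def fetch_rendered_html(driver, url, wait_seconds=2.0):
    """Load url in a browser driver and return the DOM as HTML."""
    driver.get(url)            # returns once the initial page load completes
    time.sleep(wait_seconds)   # crude pause so late-running scripts can finish
    return driver.page_source  # serialized DOM, including script-generated nodes


def save_with_firefox(url, path):
    """Hypothetical helper: render url in Firefox and write the HTML to path."""
    # Imported here so the function above stays usable with any driver object.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        html = fetch_rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even if loading fails
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
```

Calling `save_with_firefox(url, "rendered.html")` opens a real browser window, saves the page after its scripts have run, and leaves a file you can parse with BeautifulSoup.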




Answer 3:


Once you want JavaScript enabled, what you're asking for is very close to a browser. You can use Jython together with HtmlUnit, which is a headless Java-based browser. It's pretty fast but not very stable (because it imitates a browser and isn't really one). I think the fastest and easiest way is to use Selenium (IDE or, preferably, RC). Selenium gives you the ability to control your favorite browser (Firefox, IE, Chrome, ...). Although it's meant for testing purposes, it'll probably work for you. It's stable and pretty fast (I think it's even faster than HtmlUnit).




Answer 4:


You can use htql at http://htql.net.

import time

import htql

browser = htql.Browser(2)
page, url = browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default')
time.sleep(2)  # give the page's scripts time to run
page, url = browser.getUpdatedPage()

BTW, you will also need to install IRobot from http://irobotsoft.com/



Source: https://stackoverflow.com/questions/6630214/download-html-of-url-with-python-but-with-javascript-enabled
