web scraping a problem site

问题

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.

Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.

A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT

I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

回答1:

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542

See the corresponding javascript code on the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:fals   e,type:"once"});
</script>

回答2:

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.

Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

回答3:

The reason why is because it's performing AJAX calls after it loads. You will need to account for searching out those URLs to scrape it's content as well.

回答4:

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.

The class gives you full access to the DOM tree, so you can do whatever you want with it.

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx

来源：https://stackoverflow.com/questions/1546089/web-scraping-a-problem-site

标签

python

screen-scraping