Question
I have used Jsoup for scraping, and it works perfectly as long as AJAX and JavaScript play no part in displaying the page content.
Does anyone have a clue how to scrape content that is displayed by AJAX or JavaScript after the page has finished loading?
Thanks in advance!
Answer 1:
You can use a headless browser such as PhantomJS.
PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
To make the work easier, you could use CasperJS.
CasperJS is a companion for PhantomJS which brings a greatly improved API to ease the creation of scraping and automation workflows.
These tools are very useful when you have to scrape websites with dynamic content, for instance, websites where the content is displayed only after some JavaScript has run (sometimes including AJAX calls).
You can see an example of how CasperJS works here:
CasperJs and Jquery with chained Selects
Answer 2:
You can't do it directly with JSoup. You'll need a headless browser, which is a much more complex thing. There are headless versions of Firefox, Safari, and others. Searches for "headless X" (where X is the browser engine you want to use) should turn up some useful projects.
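To illustrate why a static fetch-and-parse tool cannot see this content, here is a minimal sketch using only the Java standard library. The page markup, the `content` id, and the `StaticHtmlDemo` class are all hypothetical, and the regex is just a stand-in for a static parser (it is not how Jsoup works internally). The point is only that the server-sent markup contains an empty container, and the text appears only once a browser engine runs the script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticHtmlDemo {
    // Hypothetical page exactly as the server sends it:
    // the div stays empty until a browser executes the script.
    static final String RAW_HTML =
        "<html><body>"
        + "<div id=\"content\"></div>"
        + "<script>document.getElementById('content').textContent = 'Loaded via JS';</script>"
        + "</body></html>";

    // Returns the text found inside <div id="content"> in the raw markup,
    // or null if the div is not present. No JavaScript is executed here.
    static String contentDivText(String html) {
        Matcher m = Pattern.compile("<div id=\"content\">(.*?)</div>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = contentDivText(RAW_HTML);
        // The container is empty in the static markup.
        System.out.println("Text seen without running JavaScript: '" + text + "'");
    }
}
```

A headless browser loads the same page, runs the script, and only then exposes the resulting DOM, which is why it can return the text that the static approach misses.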
Source: https://stackoverflow.com/questions/16852660/how-to-scrape-ajax-loaded-content-with-jsoup