Using Nutch, how do I crawl the dynamic content of web pages that use Ajax?

Submitted by 馋奶兔 on 2019-12-23 15:46:14

Question


I am using Apache Nutch 1.10 to crawl web pages and extract their contents. Some of the pages contain dynamic content that is loaded by Ajax calls, and Nutch is not able to crawl and extract that content. How can I solve this? If there is a solution, please help me with your answers.

Thanks in advance.


Answer 1:


Most web crawler libraries do not offer JavaScript rendering out of the box. You usually have to plug in another library or product that provides JavaScript rendering, such as Selenium or PhantomJS.

Here is a tutorial using Nutch and Selenium.




Answer 2:


Check out the latest Nutch 1.11 trunk, which includes a new plugin, protocol-interactiveselenium. (https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium)

This plugin allows you to write your own handler and execute JavaScript to get the dynamic content.
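
As a rough sketch of how the plugin might be switched on: Nutch picks its active plugins from the `plugin.includes` property, which you can override in `conf/nutch-site.xml`. The fragment below swaps the usual `protocol-http` entry for `protocol-interactiveselenium`. The rest of the value is only an assumed typical plugin list, not something taken from this thread, so copy the actual list from the `nutch-default.xml` of your own Nutch version and change just the protocol entry.

```xml
<!-- conf/nutch-site.xml (fragment) -->
<!-- Assumption: every entry except protocol-interactiveselenium is a
     placeholder for your installation's existing plugin list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-interactiveselenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The plugin drives a real browser through Selenium, so the machine running the fetcher also needs a browser and a matching WebDriver set up. The plugin's README in the repository linked above covers the remaining settings.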



Source: https://stackoverflow.com/questions/32966642/using-nutch-how-to-crawl-the-dynamic-content-of-web-page-that-are-uisng-ajax
