Sending cookies in a request with crawler4j?


Question


I need to grab some links that depend on the cookies sent with the GET request. So when I crawl the page with crawler4j, I need to send some cookies along with the request to get the correct page back.

Is this possible (I searched the web for it but didn't find anything useful)? Or is there a Java crawler out there that is capable of doing this?

Any help appreciated.


Answer 1:


It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p-crawler4j-

There are several alternatives:

  • Nutch
  • Heritrix
  • WebSPHINX
  • JSpider
  • WebEater
  • WebLech
  • Arachnid
  • JoBo
  • Web-Harvest
  • Ex-Crawler
  • Bixo

I would say that Nutch and Heritrix are the best ones, and I would put special emphasis on Nutch, because it is probably one of the few crawlers designed to scale well and actually perform a large crawl.




Answer 2:


Coming late to this thread, but crawler4j actually does a good job of handling cookies. You can even inspect cookie values, because you can get hold of the underlying Apache HTTP client. For example:

import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.DefaultHttpClient;

import edu.uci.ics.crawler4j.crawler.Page;

@Override
public void visit(Page page) {
    super.visit(page);

    // The page fetcher wraps an Apache DefaultHttpClient; its cookie
    // store holds every cookie received during the crawl.
    DefaultHttpClient httpClient =
            (DefaultHttpClient) getMyController().getPageFetcher().getHttpClient();
    for (Cookie cookie : httpClient.getCookieStore().getCookies()) {
        if (cookie.getName().equals("somename")) {
            String value = cookie.getValue();
            // inspect or log the value here
        }
    }
}
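
The same handle on the underlying client can also be used to seed cookies before the crawl starts, which is what the original question asks for. A minimal sketch, assuming the same getPageFetcher()/getHttpClient() accessors used above (these may differ across crawler4j versions); the cookie name, value, and domain are placeholders:

import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.cookie.BasicClientCookie;

import edu.uci.ics.crawler4j.crawler.CrawlController;

// Sketch: add a session cookie to the crawler's cookie store before
// calling controller.start(...). Name, value, and domain are
// placeholders for whatever the target site expects.
void seedCookie(CrawlController controller) {
    DefaultHttpClient httpClient =
            (DefaultHttpClient) controller.getPageFetcher().getHttpClient();

    BasicClientCookie cookie = new BasicClientCookie("somename", "some-value");
    cookie.setDomain("example.com");
    cookie.setPath("/");
    httpClient.getCookieStore().addCookie(cookie);
}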

I looked briefly at Nutch, but crawler4j seemed simpler to integrate (about 5 minutes via a Maven dependency) and was perfect for my needs (I was testing that a session cookie is maintained on my site across a large number of requests).
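
For reference, pulling crawler4j in via Maven is just the following dependency (the version shown is illustrative; check Maven Central for the current release):

<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>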



Source: https://stackoverflow.com/questions/8536557/sending-cookies-in-request-with-crawler4j
