Apache Nutch doesn't crawl website

Posted by 久未见 on 2020-01-05 07:14:36

Question


I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt:

User-Agent: *
Disallow: /

Is there any way to crawl this website with Apache Nutch?


Answer 1:


In nutch-site.xml, set protocol.plugin.check.robots to false.
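For reference, here is a minimal sketch of what that entry could look like in conf/nutch-site.xml (standard Hadoop-style configuration format; the property name is taken from this answer, so verify it against your Nutch version):

      <?xml version="1.0"?>
      <configuration>
        <!-- Skip the robots.txt check in the protocol plugins
             (assumes this property is still honored by your Nutch version). -->
        <property>
          <name>protocol.plugin.check.robots</name>
          <value>false</value>
        </property>
      </configuration>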

OR

Alternatively, you can comment out the code where the robots.txt check is done. In Fetcher.java, lines 605-614 perform the check; comment out that entire block:

      if (!rules.isAllowed(fit.u)) {
        // unblock
        fetchQueues.finishFetchItem(fit, true);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Denied by robots.txt: " + fit.url);
        }
        output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
        reporter.incrCounter("FetcherStatus", "robots_denied", 1);
        continue;
      }



Answer 2:


You can set the property referenced by Protocol.CHECK_ROBOTS (i.e. protocol.plugin.check.robots, as in the first answer) to false in nutch-site.xml to ignore robots.txt.



Source: https://stackoverflow.com/questions/11842913/apache-nutch-dont-crawl-website
