Apache Nutch does not crawl a website
Question

I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt:

    User-Agent: *
    Disallow: /

Is there any way to crawl this website with Apache Nutch?

Answer 1:

In nutch-site.xml, set protocol.plugin.check.robots to false.

Alternatively, you can comment out the code where the robots check is done. In Fetcher.java, lines 605-614 perform the check. Comment out that entire block:

    if (!rules.isAllowed(fit.u)) {
      // unblock
      fetchQueues.finishFetchItem(fit, true);
      if (LOG
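
For the first option in the answer above (the nutch-site.xml setting), the override would presumably follow the usual Hadoop-style property layout that Nutch configuration files use. This is a minimal sketch: the property name is the one given in the answer, and whether a particular Nutch release still honors it is an assumption to check against that release's conf/nutch-default.xml.

```xml
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <!-- Property name taken from the answer above; assumed to be
       supported by the installed Nutch version. -->
  <property>
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
  </property>
</configuration>
```

If the property is not listed in nutch-default.xml for the installed version, this override most likely has no effect, which leaves the source change described in the answer as the remaining option.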