nutch

How to modify the search result page given by Solr?

岁酱吖の submitted on 2020-01-15 06:34:19
Question: I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies. I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Nutch and Solr on my Ubuntu system. I was also successful in injecting the seed URL into the webdb and performing the crawl. Using the Solr interface at http://localhost:8983/solr/admin, I can also query the
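In Solr 3.x the admin page is just a thin form over the `/select` request handler, so changing what the "result page" shows usually means querying `/select` yourself (picking fields with `fl` and a response format with `wt`) and rendering the response in your own frontend. A minimal sketch of building such a query URL; the host, port, and the field name `content` are assumptions taken from the question, not fixed Solr names:

```java
import java.net.URLEncoder;

public class SolrSelectUrl {
    // Builds a /select query URL against a Solr 3.x instance.
    // fl limits returned fields; wt=json picks the JSON response writer.
    public static String selectUrl(String baseUrl, String query) throws Exception {
        return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=url,title&wt=json&rows=10";
    }

    public static void main(String[] args) throws Exception {
        // Example query against the field "content" (assumed schema field).
        System.out.println(selectUrl("http://localhost:8983/solr", "content:nutch"));
    }
}
```

A frontend would fetch this URL and format the JSON hits however it likes, which is typically easier than modifying Solr's built-in pages.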

How to bypass robots.txt with Apache Nutch 2.2.1

北城余情 submitted on 2020-01-07 06:46:10
Question: Can anyone please tell me if there is any way for Apache Nutch to ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that "RobotRulesParser.java" (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file to ignore robots.txt and go on with crawling? Or is there any other way to achieve the same? Answer 1: First of all, we should respect the
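For reference, the idea behind patching the parser is simply to hand the fetcher a rule set that allows everything, so no URL is ever skipped. A standalone sketch of that allow-everything object; the class and method names below are illustrative and mirror only the shape of the real Nutch/crawler-commons API, they are not that API:

```java
// Hypothetical sketch: a rule object whose isAllowed() always says yes,
// which is what a robots.txt-ignoring patch effectively substitutes in.
public class AllowAllRobotRules {
    public boolean isAllowed(String url) {
        return true; // ignore whatever robots.txt said
    }

    public static void main(String[] args) {
        AllowAllRobotRules rules = new AllowAllRobotRules();
        System.out.println(rules.isAllowed("http://example.com/private/"));
    }
}
```

As the truncated answer starts to say, respecting robots.txt is the polite default; bypassing it should be reserved for sites you own or have permission to crawl.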

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

守給你的承諾、 submitted on 2020-01-06 23:43:25
Question: I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3. Setup went well, but when I executed the following command, no URLs were fetched. ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2 Below are my settings. nutch-site.xml http://ideone.com/H8MPcl regex-urlfilter.txt +^http://([a-z0-9]*\.)*nutch.apache.org/ hadoop.log http://ideone.com/LnpAw4 I don't see any errors in the log file. I am really lost. Any help would be appreciated.
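A common cause of "0 results with no errors" is that the `regex-urlfilter.txt` rule silently rejects every seed URL at inject time, so there is nothing to fetch. The rule quoted above only accepts URLs under nutch.apache.org; any other seed is dropped. A small sketch checking URLs against that same pattern (the regex is copied verbatim from the question, unescaped dots and all):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // The "+" rule from the question's regex-urlfilter.txt, minus the leading +.
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Mirrors the accept behavior: the URL passes if the pattern matches.
    public static boolean accepted(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://nutch.apache.org/"));  // accepted
        System.out.println(accepted("http://example.com/"));       // rejected: host not covered
    }
}
```

If the seeds in urls/seed.txt are for a different site, either adjust the rule to cover that host or replace it with a broader accept rule.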

Nutch - Getting "Error: JAVA_HOME is not set." when trying to crawl

依然范特西╮ submitted on 2020-01-06 20:35:34
Question: First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to execute a crawl using the following command: sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5 I get Error: JAVA_HOME is not set. If I run the command without 'sudo' I get: Injector: starting at 2014-07-16 02:12:24 Injector: crawlDb: urls/crawldb Injector: urlDir: crawl Injector: Converting injected urls to crawl db entries. Injector: org
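The usual explanation for this symptom is that sudo resets the environment by default, dropping the caller's JAVA_HOME, so the crawl script no longer sees it. Two common invocation sketches; the JDK path below is only an example (adjust it to your machine), and whether -E is honored depends on the sudoers configuration:

```shell
# Option 1: pass JAVA_HOME through explicitly on the sudo command line.
sudo JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 bin/crawl urls -dir crawl -depth 3 -topN 5

# Option 2: ask sudo to preserve the existing environment.
sudo -E bin/crawl urls -dir crawl -depth 3 -topN 5
```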

Apache Nutch 2.3.1 Website home page handling

情到浓时终转凉″ submitted on 2020-01-06 04:45:48
Question: I have configured Nutch 2.3.1 to crawl some news websites. Since website homepages change from one day to the next, I want to handle homepages differently: for a homepage, only the main categories should be crawled instead of the text, because the text will change after some time (I have observed similar behavior in Google). For the rest of the pages it is working fine (crawling text etc.). Answer 1: At the moment Nutch doesn't offer any special treatment for homepages; a homepage is just one more URL to crawl. If
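Since Nutch itself does not distinguish homepages, any special handling has to start by deciding, from the URL alone, whether a page is a homepage. One simple heuristic (a sketch of a helper one might use inside a custom plugin; the name `isHomepage` is ours, not a Nutch API) is to treat a URL with an empty or root path as a homepage:

```java
import java.net.URI;

public class HomepageCheck {
    // Hypothetical helper: a URL whose path is empty or "/" is a homepage.
    public static boolean isHomepage(String url) {
        String path = URI.create(url).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    public static void main(String[] args) {
        System.out.println(isHomepage("http://example.com/"));                // true
        System.out.println(isHomepage("http://example.com/news/story.html")); // false
    }
}
```

A custom parse or index filter could use such a check to index only link anchors (the "main categories") for homepages while indexing full text for article pages.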

Apache Nutch doesn't crawl website

久未见 submitted on 2020-01-05 07:14:36
Question: I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt: User-Agent: * Disallow: / Is there any way to crawl this website with Apache Nutch? Answer 1: In nutch-site.xml, set protocol.plugin.check.robots to false, OR you can comment out the code where the robots check is done. In Fetcher.java, lines 605-614 are doing the check; comment out that entire block: if (!rules.isAllowed(fit.u)) { // unblock fetchQueues.finishFetchItem(fit, true); if (LOG
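The answer's first option can be written as a nutch-site.xml fragment. Note the hedge: this property existed in older Nutch releases and was removed later, so check that your version still honors it before relying on it:

```xml
<!-- nutch-site.xml fragment (sketch): disable the robots.txt check.
     Only effective on Nutch versions that still read this property. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
```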

Do the cancel() and interrupt() methods do the same job?

我是研究僧i submitted on 2020-01-04 10:58:11
Question: I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing? Instruction 1: t.interrupt(); Instruction 2: task.cancel(true); The source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content) is: ParseCallable pc = new ParseCallable(p, content); FutureTask<ParseResult> task = new FutureTask<ParseResult>(pc); ParseResult res = null; Thread t = new Thread(task); t.start(); try { res = task.get(MAX
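The two calls overlap but are not identical: task.cancel(true) both interrupts the thread currently running the task and flips the task into the cancelled state (so a later get() throws CancellationException), while t.interrupt() only delivers the interrupt and leaves the task state untouched. A standalone sketch (not Nutch code) showing that cancel(true) does both jobs at once:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.FutureTask;

public class CancelVsInterrupt {
    static volatile boolean sawInterrupt = false;
    static FutureTask<String> task;

    public static void main(String[] args) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        task = new FutureTask<>(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000);    // stands in for a long-running parse
            } catch (InterruptedException e) {
                sawInterrupt = true;     // cancel(true) delivered an interrupt
            }
            return "done";
        });

        Thread t = new Thread(task);
        t.start();
        started.await();                 // make sure the task is actually running

        task.cancel(true);               // interrupts the runner AND marks the task cancelled
        t.join();

        System.out.println("cancelled=" + task.isCancelled()
                + " sawInterrupt=" + sawInterrupt);
    }
}
```

Calling both, as ParseUtil does after a timeout, is defensive: cancel(true) only interrupts the thread if the task is still running, so the explicit t.interrupt() covers the window where the task state has already changed.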

Nutch-Cygwin How to set JAVA_HOME

霸气de小男生 submitted on 2020-01-03 11:11:49
Question: I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl I get: cygpath: can't convert empty path bin/nutch: line 268: /cygdrive/f/program: No such file or directory bin/nutch: line 268: exec: /cygdrive/f/program: cannot execute: No such file or directory It appears that the space between "program" and "files" causes the problem: /cygdrive/f/**program files**/java/jdk1
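The space in "program files" is indeed the problem: the path is expanded unquoted inside bin/nutch, so it splits at the space. One workaround is to avoid the space entirely by using the Windows 8.3 short name; the short name below is a typical guess, so confirm the actual one with cygpath first:

```shell
# Verify the 8.3 short name for the JDK directory (hypothetical drive/path):
cygpath -d 'F:\Program Files\java\jdk1.6.0_21'

# Then point JAVA_HOME at a space-free path, e.g. the short name...
export JAVA_HOME=/cygdrive/f/PROGRA~1/java/jdk1.6.0_21

# ...or reinstall the JDK to a directory without spaces:
export JAVA_HOME=/cygdrive/f/java/jdk1.6.0_21
```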
