nutch

How to modify the search result page given by Solr?

岁酱吖の submitted on 2020-01-15 06:34:19
Question: I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies. I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Nutch and Solr on my Ubuntu system. I was also successful in injecting the seed URL into the webdb and performing the crawl. Using the Solr interface at http://localhost:8983/solr/admin, I can also query the
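In Solr 3.x the admin page is just a thin form over the `/select` request handler, so changing what the "result page" shows usually means querying `/select` yourself (picking fields with `fl` and a response format with `wt`) and rendering the response in your own frontend. A minimal sketch of building such a query URL; the host, port, and the field name `content` are assumptions taken from the question, not fixed Solr names:

```java
import java.net.URLEncoder;

public class SolrSelectUrl {
    // Builds a /select query URL against a Solr 3.x instance.
    // fl limits returned fields; wt=json picks the JSON response writer.
    public static String selectUrl(String baseUrl, String query) throws Exception {
        return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=url,title&wt=json&rows=10";
    }

    public static void main(String[] args) throws Exception {
        // Example query against the field "content" (assumed schema field).
        System.out.println(selectUrl("http://localhost:8983/solr", "content:nutch"));
    }
}
```

A frontend would fetch this URL and format the JSON hits however it likes, which is typically easier than modifying Solr's built-in pages.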

How to bypass robots.txt with Apache Nutch 2.2.1

北城余情 submitted on 2020-01-07 06:46:10
Question: Can anyone please tell me if there is any way for Apache Nutch to ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that "RobotRulesParser.java" (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file to ignore robots.txt and go on with crawling? Or is there any other way to achieve the same? Answer 1: First of all, we should respect the
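For reference, the idea behind patching the parser is simply to hand the fetcher a rule set that allows everything, so no URL is ever skipped. A standalone sketch of that allow-everything object; the class and method names below are illustrative and mirror only the shape of the real Nutch/crawler-commons API, they are not that API:

```java
// Hypothetical sketch: a rule object whose isAllowed() always says yes,
// which is what a robots.txt-ignoring patch effectively substitutes in.
public class AllowAllRobotRules {
    public boolean isAllowed(String url) {
        return true; // ignore whatever robots.txt said
    }

    public static void main(String[] args) {
        AllowAllRobotRules rules = new AllowAllRobotRules();
        System.out.println(rules.isAllowed("http://example.com/private/"));
    }
}
```

As the truncated answer starts to say, respecting robots.txt is the polite default; bypassing it should be reserved for sites you own or have permission to crawl.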

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

守給你的承諾、 submitted on 2020-01-06 23:43:25
Question: I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3. Setup went well, but when I executed the following command, no URLs were fetched. ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2 Below are my settings. nutch-site.xml http://ideone.com/H8MPcl regex-urlfilter.txt +^http://([a-z0-9]*\.)*nutch.apache.org/ hadoop.log http://ideone.com/LnpAw4 I don't see any errors in the log file. I am really lost. Any help would be appreciated.
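A common cause of "0 results with no errors" is that the `regex-urlfilter.txt` rule silently rejects every seed URL at inject time, so there is nothing to fetch. The rule quoted above only accepts URLs under nutch.apache.org; any other seed is dropped. A small sketch checking URLs against that same pattern (the regex is copied verbatim from the question, unescaped dots and all):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // The "+" rule from the question's regex-urlfilter.txt, minus the leading +.
    static final Pattern RULE =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Mirrors the accept behavior: the URL passes if the pattern matches.
    public static boolean accepted(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://nutch.apache.org/"));  // accepted
        System.out.println(accepted("http://example.com/"));       // rejected: host not covered
    }
}
```

If the seeds in urls/seed.txt are for a different site, either adjust the rule to cover that host or replace it with a broader accept rule.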

Nutch - Getting "Error: JAVA_HOME is not set." when trying to crawl

依然范特西╮ submitted on 2020-01-06 20:35:34
Question: First and foremost, I'm a Nutch/Hadoop newbie. I have installed Cassandra, and I have installed Nutch on the master node of my EMR cluster. When I attempt to execute a crawl using the following command: sudo bin/crawl crawl urls -dir crawl -depth 3 -topN 5 I get Error: JAVA_HOME is not set. If I run the command without 'sudo' I get: Injector: starting at 2014-07-16 02:12:24 Injector: crawlDb: urls/crawldb Injector: urlDir: crawl Injector: Converting injected urls to crawl db entries. Injector: org
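The usual explanation for this symptom is that sudo resets the environment by default, dropping the caller's JAVA_HOME, so the crawl script no longer sees it. Two common invocation sketches; the JDK path below is only an example (adjust it to your machine), and whether -E is honored depends on the sudoers configuration:

```shell
# Option 1: pass JAVA_HOME through explicitly on the sudo command line.
sudo JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 bin/crawl urls -dir crawl -depth 3 -topN 5

# Option 2: ask sudo to preserve the existing environment.
sudo -E bin/crawl urls -dir crawl -depth 3 -topN 5
```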

Apache Nutch 2.3.1 Website home page handling

情到浓时终转凉″ submitted on 2020-01-06 04:45:48
Question: I have configured Nutch 2.3.1 to crawl some news websites. Since website homepages change from one day to the next, I want to handle homepages differently: for a homepage, only the main categories should be crawled instead of the text, because the text will change after some time (I have observed similar behavior in Google). For the rest of the pages it is working fine (crawling text etc.). Answer 1: At the moment Nutch doesn't offer any special treatment for homepages; a homepage is just one more URL to crawl. If
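Since Nutch itself does not distinguish homepages, any special handling has to start by deciding, from the URL alone, whether a page is a homepage. One simple heuristic (a sketch of a helper one might use inside a custom plugin; the name `isHomepage` is ours, not a Nutch API) is to treat a URL with an empty or root path as a homepage:

```java
import java.net.URI;

public class HomepageCheck {
    // Hypothetical helper: a URL whose path is empty or "/" is a homepage.
    public static boolean isHomepage(String url) {
        String path = URI.create(url).getPath();
        return path == null || path.isEmpty() || path.equals("/");
    }

    public static void main(String[] args) {
        System.out.println(isHomepage("http://example.com/"));                // true
        System.out.println(isHomepage("http://example.com/news/story.html")); // false
    }
}
```

A custom parse or index filter could use such a check to index only link anchors (the "main categories") for homepages while indexing full text for article pages.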

Apache Nutch doesn't crawl website

久未见 submitted on 2020-01-05 07:14:36
Question: I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt: User-Agent: * Disallow: / Is there any way to crawl this website with Apache Nutch? Answer 1: In nutch-site.xml, set protocol.plugin.check.robots to false, OR you can comment out the code where the robots check is done. In Fetcher.java, lines 605-614 are doing the check; comment out that entire block: if (!rules.isAllowed(fit.u)) { // unblock fetchQueues.finishFetchItem(fit, true); if (LOG
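The answer's first option can be written as a nutch-site.xml fragment. Note the hedge: this property existed in older Nutch releases and was removed later, so check that your version still honors it before relying on it:

```xml
<!-- nutch-site.xml fragment (sketch): disable the robots.txt check.
     Only effective on Nutch versions that still read this property. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
```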

Do the cancel() and interrupt() methods do the same job?

我是研究僧i submitted on 2020-01-04 10:58:11
Question: I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing? Instruction 1: t.interrupt(); Instruction 2: task.cancel(true); The source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content) is: ParseCallable pc = new ParseCallable(p, content); FutureTask<ParseResult> task = new FutureTask<ParseResult>(pc); ParseResult res = null; Thread t = new Thread(task); t.start(); try { res = task.get(MAX
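The two calls overlap but are not identical: task.cancel(true) both interrupts the thread currently running the task and flips the task into the cancelled state (so a later get() throws CancellationException), while t.interrupt() only delivers the interrupt and leaves the task state untouched. A standalone sketch (not Nutch code) showing that cancel(true) does both jobs at once:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.FutureTask;

public class CancelVsInterrupt {
    static volatile boolean sawInterrupt = false;
    static FutureTask<String> task;

    public static void main(String[] args) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        task = new FutureTask<>(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000);    // stands in for a long-running parse
            } catch (InterruptedException e) {
                sawInterrupt = true;     // cancel(true) delivered an interrupt
            }
            return "done";
        });

        Thread t = new Thread(task);
        t.start();
        started.await();                 // make sure the task is actually running

        task.cancel(true);               // interrupts the runner AND marks the task cancelled
        t.join();

        System.out.println("cancelled=" + task.isCancelled()
                + " sawInterrupt=" + sawInterrupt);
    }
}
```

Calling both, as ParseUtil does after a timeout, is defensive: cancel(true) only interrupts the thread if the task is still running, so the explicit t.interrupt() covers the window where the task state has already changed.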

Nutch-Cygwin How to set JAVA_HOME

霸气de小男生 submitted on 2020-01-03 11:11:49
Question: I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/cygdrive/f/program files/java/jdk1.6.0_21' When I run the nutch command $ bin/nutch crawl I get: cygpath: can't convert empty path bin/nutch: line 268: /cygdrive/f/program: No such file or directory bin/nutch: line 268: exec: /cygdrive/f/program: cannot execute: No such file or directory It appears that the space between "program" and "files" causes the problem: /cygdrive/f/**program files**/java/jdk1
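The space in "program files" is indeed the problem: the path is expanded unquoted inside bin/nutch, so it splits at the space. One workaround is to avoid the space entirely by using the Windows 8.3 short name; the short name below is a typical guess, so confirm the actual one with cygpath first:

```shell
# Verify the 8.3 short name for the JDK directory (hypothetical drive/path):
cygpath -d 'F:\Program Files\java\jdk1.6.0_21'

# Then point JAVA_HOME at a space-free path, e.g. the short name...
export JAVA_HOME=/cygdrive/f/PROGRA~1/java/jdk1.6.0_21

# ...or reinstall the JDK to a directory without spaces:
export JAVA_HOME=/cygdrive/f/java/jdk1.6.0_21
```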
