nutch

Nutch 1.11 JAVA_HOME is not set Error.

≯℡__Kan透↙ submitted on 2019-12-10 12:24:16
Question: While performing a crawl operation (sudo bin/nutch inject crawl/crawldb dmoz) I am getting the error: Error: JAVA_HOME is not set. I am using java-1.8-oracle. Can anyone suggest how to resolve this error? Answer 1: You have to set your JAVA_HOME variable first. If you are on a Linux-based distro, e.g. Ubuntu, then follow these steps: How to set JAVA_HOME for Java. Make sure you have the Java JDK installed: sudo apt-get install default-jdk. If you are on Windows then you have to set up JAVA_HOME through …
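On a Linux machine, the steps above boil down to exporting JAVA_HOME before invoking bin/nutch. A minimal sketch, assuming the Oracle Java 8 package landed in the usual Ubuntu location (the path is an example; verify it on your own system, e.g. with readlink -f "$(which java)"):

```shell
# Example JDK location for an Oracle Java 8 install on Ubuntu -- adjust to
# wherever your JDK actually lives.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
```

To make the setting survive new shells, append the two export lines to ~/.bashrc or ~/.profile; bin/nutch reads JAVA_HOME from the environment when it starts.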

Crawling websites which ask for authentication

允我心安 submitted on 2019-12-10 12:18:54
Question: I followed the https://wiki.apache.org/nutch/HttpAuthenticationSchemes link for crawling a few websites by providing a username and password. Workaround: i) I have set the auth configuration in the httpclient-auth.xml file: <auth-configuration> <credentials username="xyz" password="xyz"> <default realm="domain" /> <authscope host="www.gmail.com" port="80"/> </credentials> </auth-configuration> ii) Define the httpclient property in both nutch-site.xml and nutch-default.xml: <property> <name>plugin.includes< …
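For reference, a complete conf/httpclient-auth.xml in the shape the HttpAuthenticationSchemes wiki page describes might look like the sketch below. The host, port, realm, and credentials are all placeholders; note also that protocol-httpclient must be listed in plugin.includes, or this file is never read.

```xml
<!-- conf/httpclient-auth.xml (illustrative values throughout) -->
<auth-configuration>
  <credentials username="xyz" password="xyz">
    <default realm="domain"/>
    <authscope host="www.example.com" port="80" realm="domain"/>
  </credentials>
</auth-configuration>
```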

Nutch Raw Html Saving

偶尔善良 submitted on 2019-12-09 22:26:59
Question: I'm trying to get the raw HTML of crawled pages into separate files, each named after the URL of the page. Is it possible with Nutch to save the raw HTML pages in different files while ruling out the indexing part? Answer 1: There is no direct way to do that. You will have to make a few code modifications. See this and this. Source: https://stackoverflow.com/questions/10142592/nutch-raw-html-saving
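Whichever route the code modification takes, writing one file per page means deriving a filename from each URL, since raw URLs contain characters that are illegal in filenames. A small illustrative helper (not part of Nutch itself; the sanitization rules are an assumption you may want to tighten, e.g. to cap the length for your filesystem):

```python
import re

def url_to_filename(url):
    """Turn a URL into a safe flat filename for a per-page HTML dump."""
    # Drop the scheme ("http://", "https://", ...).
    name = re.sub(r'^[a-z]+://', '', url)
    # Replace every character that is unsafe in a filename with "_".
    name = re.sub(r'[^A-Za-z0-9._-]', '_', name)
    return name + '.html'
```

Usage: url_to_filename("http://example.com/a/b") yields "example.com_a_b.html".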

Nutch in Windows: Failed to set permissions of path

青春壹個敷衍的年華 submitted on 2019-12-09 17:17:32
Question: I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error: Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700. From a lot of threads I learned that Hadoop, which seems to be used by Nutch, does some chmod magic that works on Unix machines but not on Windows. This problem has existed for more than a year now. I found one thread where the code line is shown and a fix proposed …

Removing menu's from html during crawl or indexing with nutch and solr

久未见 submitted on 2019-12-09 13:26:35
Question: I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, there are several menu structures across the site that get indexed and spoil the results of a query. Each of these menus is clearly defined in a DIV, e.g. <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and several others. I need to, at some point, delete the content in these DIVs. I am guessing that the right place is during indexing by Solr but cannot work out how. A pattern …
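One way to approach this outside of Solr is a small HTML-stripping pass over the fetched content before it is indexed. The sketch below uses Python's standard html.parser to drop every <div> whose id is on a block list, nested divs included. It illustrates the idea only; in Nutch itself this logic would more naturally live in a custom parse filter, and the ids "RHBOX" and "calendar" are taken from the question.

```python
from html.parser import HTMLParser

class DivStripper(HTMLParser):
    """Re-emit HTML, skipping the contents of <div>s with blocked ids."""
    def __init__(self, blocked):
        super().__init__()
        self.blocked = set(blocked)
        self.out = []
        self.skip_depth = 0  # >0 while inside a blocked div

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == 'div':
                self.skip_depth += 1  # track nested divs so we close correctly
            return
        if tag == 'div' and dict(attrs).get('id') in self.blocked:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == 'div':
                self.skip_depth -= 1
            return
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def strip_divs(html, blocked=('RHBOX', 'calendar')):
    p = DivStripper(blocked)
    p.feed(html)
    p.close()
    return ''.join(p.out)
```

For example, strip_divs('<p>keep</p><div id="calendar"><b>x</b></div>') returns '<p>keep</p>'.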

Apache Nutch 2.1 different batch id (null)

烂漫一生 submitted on 2019-12-09 06:09:34
Question: I crawl a few sites with Apache Nutch 2.1. While crawling I see the following message on a lot of pages, e.g.: Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (null). What causes this error? How can I resolve this problem, since the pages with different batch id (null) are not stored in the database? The site that I crawled is based on Drupal, but I have tried many other non-Drupal sites. Answer 1: I think the message is not a problem. batch_id is not assigned to …

OutOfMemoryError for bin/nutch elasticindex <$cluser> -all (Nutch 2.1)

耗尽温柔 submitted on 2019-12-08 11:53:28
Question: I have been following the instructions at http://wiki.apache.org/nutch/Nutch2Tutorial to see if I can get a Nutch installation running with ElasticSearch. I have successfully done a crawl with no real issues, but when I try to load the results into ElasticSearch I run into trouble. I issue the command: bin/nutch elasticindex <$cluser> -all. It waits around for a long time and then comes back with an error: Exception in thread "main" java.lang.RuntimeException: job failed: name …
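A common first response to an OutOfMemoryError in a Nutch job is to give the JVM more heap. The bin/nutch launcher reads the NUTCH_HEAPSIZE environment variable (a size in MB), so before rerunning elasticindex you can try something like the following; 4000 is an arbitrary example value, not a recommendation:

```shell
# NUTCH_HEAPSIZE is picked up by the bin/nutch launch script (value in MB).
export NUTCH_HEAPSIZE=4000
echo "$NUTCH_HEAPSIZE"
```

With the variable exported, rerun the same bin/nutch elasticindex command in that shell.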

Crawling Issue with Apache Nutch 1.12

随声附和 submitted on 2019-12-08 07:11:03
Question: I am new to crawling. I was using https://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website to perform crawling with Nutch 1.12. I did the setup using Cygwin on Windows. The "bin/nutch" command is running fine, but to crawl I made the following changes. This is my conf/nutch-site.xml file: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name< …
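The one property that must be set before any crawl works is http.agent.name; with it, a minimal conf/nutch-site.xml looks like this (the value "MyCrawler" is a placeholder -- any non-empty string identifying your crawler will do):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
</configuration>
```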

Nutch problems executing crawl

十年热恋 submitted on 2019-12-08 06:03:46
Question: I am trying to get Nutch 1.11 to execute a crawl. I am using Cygwin to run these commands in Windows 7. Nutch is running, and I am getting results from running bin/nutch, but I keep getting error messages when I try to run a crawl. I get the following error when I try to execute a crawl with Nutch: Error running: /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject TestCrawl/crawldb C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls …

How to compile Nutch 2.3.1 with Hbase 1.2.6

荒凉一梦 submitted on 2019-12-08 05:42:10
Question: I have to set up a Hadoop stack with Nutch 2.3.1. The supported HBase version for Hadoop 2.7.4 is 1.2.6, which I have configured and tested successfully. But when I compiled Nutch and crawled a sample page, I got the following error: /usr/local/nutch/runtime/local/bin/nutch inject urls/ -crawlId kics InjectorJob: starting at 2017-09-21 14:20:10 InjectorJob: Injecting urlDir: urls Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT at org.apache.hadoop.hbase.client …
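The NoSuchFieldError strongly suggests an HBase client-version mismatch: Nutch 2.3.1 depends on Apache Gora's gora-hbase module, which was built against the HBase 0.98.x client API, not the 1.x line. The stock dependency in ivy/ivy.xml looks roughly like the fragment below; the rev shown is for orientation and should be confirmed in your own source tree.

```xml
<!-- ivy/ivy.xml in Nutch 2.3.1 (illustrative; check the rev in your checkout).
     gora-hbase at this rev pulls in an HBase 0.98.x client, which is why
     running against HBase 1.2.6 raises NoSuchFieldError at runtime. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default"/>
```

The usual resolutions are to run an HBase 0.98.x server with Nutch 2.3.1, or to move to a Nutch/Gora combination whose gora-hbase targets the HBase 1.x client.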