gora

Nutch 2 with Cassandra as storage is not crawling data properly

我怕爱的太早我们不能终老 posted on 2019-12-25 03:07:43
Question: I am using Nutch 2.x with Cassandra as storage. Currently I am crawling only one website, and the data is being loaded into Cassandra in byte code format. When I use the readdb command in Nutch, I do not get any useful crawl data. Below are the details of the different files and the output I am getting:

========== command to run crawler =====================
bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

======================== seed.txt data ==========================
http://www.ft.com

===

How to compile Nutch 2.3.1 with Hbase 1.2.6

荒凉一梦 posted on 2019-12-08 05:42:10
Question: I have to set up a Hadoop stack with Nutch 2.3.1. The supported HBase version for Hadoop 2.7.4 is 1.2.6, which I have configured and tested successfully. But when I compile Nutch and crawl a sample page, I get this error:

/usr/local/nutch/runtime/local/bin/nutch inject urls/ -crawlId kics
InjectorJob: starting at 2017-09-21 14:20:10
InjectorJob: Injecting urlDir: urls
Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
at org.apache.hadoop.hbase.client

Apache Nutch: FetcherJob throws NoSuchElementException deep in Gora

一个人想着一个人 posted on 2019-11-28 08:34:26
Question: I'm running Apache Nutch 2.3.1 out of the box, which uses Gora 0.6.1. I've followed the instructions here: http://wiki.apache.org/nutch/RunNutchInEclipse

It ran fine with the InjectorJob. Now I'm running the FetcherJob, and Gora uses MemStore as the data store. My gora.properties contains:

gora.datastore.default=org.apache.gora.memory.store.MemStore

This throws:

2016-10-02 22:55:54,605 ERROR mapreduce.GoraRecordReader (GoraRecordReader.java:nextKeyValue(121)) - Error reading Gora records